Обсуждение: Improve monitoring of shared memory allocations
Hi,
The 0001* patch improved the accounting for the shared memory allocated for a
hash table during hash_create.
pg_shmem_allocations tracks the memory allocated by ShmemInitStruct, which,
for shared hash tables, only covers memory allocated for the hash  directory 
and control structure via ShmemInitHash. The hash segments and buckets
are allocated using ShmemAllocNoError, which does not attribute the allocation
to the hash table and also does not add it to ShmemIndex.
and control structure via ShmemInitHash. The hash segments and buckets
are allocated using ShmemAllocNoError, which does not attribute the allocation
to the hash table and also does not add it to ShmemIndex.
Therefore, these allocations are not tracked in pg_shmem_allocations. 
To improve this, include the allocation of segments and buckets in the initial
To improve this, include the allocation of segments and buckets in the initial
allocation of the shared memory for the hash table, in ShmemInitHash. 
This will result in pg_shmem_allocations representing the total size of the initial
hash table, including all the buckets and elements, instead of just the directory
size.
Like ShmemAllocNoError, the shared memory allocated by ShmemAlloc is not
tracked by pg_shmem_allocations.
The 0002* patch replaces most of the calls to ShmemAlloc with ShmemInitStruct
to associate a name with the allocations and ensure that they get tracked by
pg_shmem_allocations.
I observed an improvement in total memory allocation by consolidating initial shared
memory allocations for the hash table. For ex. the allocated size for the LOCK hash
hash_create decreased from 801664 bytes to 799616 bytes. Please find the attached
patches, which I will add to the March Commitfest.
Thank you,
Rahila Syed
This will result in pg_shmem_allocations representing the total size of the initial
hash table, including all the buckets and elements, instead of just the directory
size.
Like ShmemAllocNoError, the shared memory allocated by ShmemAlloc is not
tracked by pg_shmem_allocations.
The 0002* patch replaces most of the calls to ShmemAlloc with ShmemInitStruct
to associate a name with the allocations and ensure that they get tracked by
pg_shmem_allocations.
I observed an improvement in total memory allocation by consolidating initial shared
memory allocations for the hash table. For ex. the allocated size for the LOCK hash
hash_create decreased from 801664 bytes to 799616 bytes. Please find the attached
patches, which I will add to the March Commitfest.
Thank you,
Rahila Syed
Вложения
Hi, Thanks for sending these, the issues addressed here have been bugging me for a long while. On 2025-03-01 10:19:01 +0530, Rahila Syed wrote: > The 0001* patch improved the accounting for the shared memory allocated for > a hash table during hash_create. pg_shmem_allocations tracks the memory > allocated by ShmemInitStruct, which, for shared hash tables, only covers > memory allocated for the hash directory and control structure via > ShmemInitHash. The hash segments and buckets are allocated using > ShmemAllocNoError, which does not attribute the allocation to the hash table > and also does not add it to ShmemIndex. > > Therefore, these allocations are not tracked in pg_shmem_allocations. To > improve this, include the allocation of segments and buckets in the initial > allocation of the shared memory for the hash table, in ShmemInitHash. > > This will result in pg_shmem_allocations representing the total size of the > initial hash table, including all the buckets and elements, instead of just > the directory size. I think this should also be more efficient. Less space wasted on padding and fewer indirect function calls. > Like ShmemAllocNoError, the shared memory allocated by ShmemAlloc is not > tracked by pg_shmem_allocations. The 0002* patch replaces most of the calls > to ShmemAlloc with ShmemInitStruct to associate a name with the allocations > and ensure that they get tracked by pg_shmem_allocations. In some of these cases it may be better to combine the allocation with a prior ShmemInitStruct instead of doing a separate ShmemInitStruct() for the allocations that are doing ShmemAlloc() right now. cfbot found a few compiler warnings: https://cirrus-ci.com/task/6526903542087680 [16:47:46.964] make -s -j${BUILD_JOBS} clean [16:47:47.452] time make -s -j${BUILD_JOBS} world-bin [16:49:10.496] lwlock.c: In function ‘CreateLWLocks’: [16:49:10.496] lwlock.c:467:22: error: unused variable ‘found’ [-Werror=unused-variable] [16:49:10.496] 467 | bool found; [16:49:10.496] | ^~~~~ [16:49:10.496] cc1: all warnings being treated as errors [16:49:10.496] make[4]: *** [<builtin>: lwlock.o] Error 1 [16:49:10.496] make[3]: *** [../../../src/backend/common.mk:37: lmgr-recursive] Error 2 [16:49:10.496] make[3]: *** Waiting for unfinished jobs.... [16:49:11.881] make[2]: *** [common.mk:37: storage-recursive] Error 2 [16:49:11.881] make[2]: *** Waiting for unfinished jobs.... [16:49:20.195] dynahash.c: In function ‘hash_create’: [16:49:20.195] dynahash.c:643:37: error: ‘curr_offset’ may be used uninitialized [-Werror=maybe-uninitialized] [16:49:20.195] 643 | curr_offset = (((char *)curr_offset) + (temp * elementSize)); [16:49:20.195] | ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [16:49:20.195] dynahash.c:588:23: note: ‘curr_offset’ was declared here [16:49:20.195] 588 | void *curr_offset; [16:49:20.195] | ^~~~~~~~~~~ [16:49:20.195] cc1: all warnings being treated as errors [16:49:20.196] make[4]: *** [<builtin>: dynahash.o] Error 1 > From c13a7133ed455842b685426217a7b5079e6fc869 Mon Sep 17 00:00:00 2001 > From: Rahila Syed <rahilasyed.90@gmail.com> > Date: Fri, 21 Feb 2025 15:08:12 +0530 > Subject: [PATCH 1/2] Account for initial shared memory allocated during > hash_create. > > pg_shmem_allocations tracks the memory allocated by ShmemInitStruct, > which, in case of shared hash tables, only covers memory allocated > to the hash directory and control structure. The hash segments and > buckets are allocated using ShmemAllocNoError which does not attribute > the allocations to the hash table name. Thus, these allocations are > not tracked in pg_shmem_allocations. > > Improve this include the alocation of segments and buckets or elements > in the initial allocation of shared hash directory. Since this adds numbers > to existing hash table entries, the resulting tuples in pg_shmem_allocations > represent the total size of the initial hash table including all the > buckets and the elements they contain, instead of just the directory size. > diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c > index cd5a00132f..5203f5b30b 100644 > --- a/src/backend/utils/hash/dynahash.c > +++ b/src/backend/utils/hash/dynahash.c > @@ -120,7 +120,6 @@ > * a good idea of the maximum number of entries!). For non-shared hash > * tables, the initial directory size can be left at the default. > */ > -#define DEF_SEGSIZE 256 > #define DEF_SEGSIZE_SHIFT 8 /* must be log2(DEF_SEGSIZE) */ > #define DEF_DIRSIZE 256 Why did you move this to the header? Afaict it's just used in hash_get_shared_size(), which is also in dynahash.c? > @@ -265,7 +264,7 @@ static long hash_accesses, > */ > static void *DynaHashAlloc(Size size); > static HASHSEGMENT seg_alloc(HTAB *hashp); > -static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx); > +static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx, HASHELEMENT *firstElement); > static bool dir_realloc(HTAB *hashp); > static bool expand_table(HTAB *hashp); > static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx); > @@ -276,11 +275,10 @@ static void hash_corrupted(HTAB *hashp) pg_attribute_noreturn(); > static uint32 hash_initial_lookup(HTAB *hashp, uint32 hashvalue, > HASHBUCKET **bucketptr); > static long next_pow2_long(long num); > -static int next_pow2_int(long num); > static void register_seq_scan(HTAB *hashp); > static void deregister_seq_scan(HTAB *hashp); > static bool has_seq_scans(HTAB *hashp); > - > +static int find_num_of_segs(long nelem, int *nbuckets, long num_partitions, long ssize); > > /* > * memory allocation support You removed a newline here that probably shouldn't be removed. > @@ -468,7 +466,11 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags) > else > hashp->keycopy = memcpy; > > - /* And select the entry allocation function, too. */ > + /* > + * And select the entry allocation function, too. XXX should this also > + * Assert that flags & HASH_SHARED_MEM is true, since HASH_ALLOC is > + * currently only set with HASH_SHARED_MEM * > + */ > if (flags & HASH_ALLOC) > hashp->alloc = info->alloc; > else > @@ -518,6 +520,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags) > > hashp->frozen = false; > > + /* Initializing the HASHHDR variables with default values */ > hdefault(hashp); > > hctl = hashp->hctl; I assume these were just observations you made while looking into this? They seem unrelated to the change itself? > @@ -582,7 +585,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags) > freelist_partitions, > nelem_alloc, > nelem_alloc_first; > - > + void *curr_offset; > + > /* > * If hash table is partitioned, give each freelist an equal share of > * the initial allocation. Otherwise only freeList[0] is used. > @@ -592,6 +596,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags) > else > freelist_partitions = 1; > > + /* > + * If table is shared, calculate the offset at which to find the > + * the first partition of elements > + */ > + if (hashp->isshared) > + { > + int nsegs; > + int nbuckets; > + nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize); > + > + curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) + > + + (sizeof(HASHBUCKET) * hctl->ssize * nsegs)); > + } > + Why only do this for shared hashtables? Couldn't we allocate the elments together with the rest for non-share hashtables too? > nelem_alloc = nelem / freelist_partitions; > if (nelem_alloc <= 0) > nelem_alloc = 1; > @@ -609,11 +627,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags) > for (i = 0; i < freelist_partitions; i++) > { > int temp = (i == 0) ? nelem_alloc_first : nelem_alloc; > - > - if (!element_alloc(hashp, temp, i)) > + HASHELEMENT *firstElement; > + Size elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize); > + > + /* Memory is allocated as part of initial allocation in ShmemInitHash */ > + if (hashp->isshared) > + firstElement = (HASHELEMENT *) curr_offset; > + else > + firstElement = NULL; > + > + if (!element_alloc(hashp, temp, i, firstElement)) > ereport(ERROR, > (errcode(ERRCODE_OUT_OF_MEMORY), > errmsg("out of memory"))); > + curr_offset = (((char *)curr_offset) + (temp * elementSize)); > } > } Seems a bit ugly to go through element_alloc() when pre-allocating. Perhaps it's the best thing we can do to avoid duplicating code, but it seems worth checking if we can do better. Perhaps we could split element_alloc() into element_alloc() and element_add() or such? With the latter doing everything after hashp->alloc(). > @@ -693,7 +720,7 @@ init_htab(HTAB *hashp, long nelem) > int nbuckets; > int nsegs; > int i; > - > + > /* > * initialize mutexes if it's a partitioned table > */ Spurious change. > @@ -701,30 +728,11 @@ init_htab(HTAB *hashp, long nelem) > for (i = 0; i < NUM_FREELISTS; i++) > SpinLockInit(&(hctl->freeList[i].mutex)); > > - /* > - * Allocate space for the next greater power of two number of buckets, > - * assuming a desired maximum load factor of 1. > - */ > - nbuckets = next_pow2_int(nelem); > - > - /* > - * In a partitioned table, nbuckets must be at least equal to > - * num_partitions; were it less, keys with apparently different partition > - * numbers would map to the same bucket, breaking partition independence. > - * (Normally nbuckets will be much bigger; this is just a safety check.) > - */ > - while (nbuckets < hctl->num_partitions) > - nbuckets <<= 1; > + nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize); > > hctl->max_bucket = hctl->low_mask = nbuckets - 1; > hctl->high_mask = (nbuckets << 1) - 1; > > - /* > - * Figure number of directory segments needed, round up to a power of 2 > - */ > - nsegs = (nbuckets - 1) / hctl->ssize + 1; > - nsegs = next_pow2_int(nsegs); > - > /* > * Make sure directory is big enough. If pre-allocated directory is too > * small, choke (caller screwed up). A function called find_num_of_segs() that also sets nbuckets seems a bit confusing. I also don't like "find_*", as that sounds like it's searching some datastructure, rather than just doing a bit of math. > @@ -1689,6 +1726,7 @@ seg_alloc(HTAB *hashp) > HASHSEGMENT segp; > > CurrentDynaHashCxt = hashp->hcxt; > + > segp = (HASHSEGMENT) hashp->alloc(sizeof(HASHBUCKET) * hashp->ssize); > > if (!segp) Spurious change. > @@ -1816,7 +1854,7 @@ next_pow2_long(long num) > } > > /* calculate first power of 2 >= num, bounded to what will fit in an int */ > -static int > +int > next_pow2_int(long num) > { > if (num > INT_MAX / 2) > @@ -1957,3 +1995,31 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth) > } > } > } Why export this? > diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h > index 932cc4f34d..5e16bd4183 100644 > --- a/src/include/utils/hsearch.h > +++ b/src/include/utils/hsearch.h > @@ -151,7 +151,7 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status); > extern void hash_freeze(HTAB *hashp); > extern Size hash_estimate_size(long num_entries, Size entrysize); > extern long hash_select_dirsize(long num_entries); > -extern Size hash_get_shared_size(HASHCTL *info, int flags); > +extern Size hash_get_shared_size(HASHCTL *info, int flags, long init_size); > extern void AtEOXact_HashTables(bool isCommit); > extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth); It's imo a bit weird that we have very related logic in hash_estimate_size() and hash_get_shared_size(). Why do we need to duplicate it? > --- a/src/backend/storage/lmgr/predicate.c > +++ b/src/backend/storage/lmgr/predicate.c > @@ -1244,7 +1244,7 @@ PredicateLockShmemInit(void) > PredXact->HavePartialClearedThrough = 0; > requestSize = mul_size((Size) max_table_size, > sizeof(SERIALIZABLEXACT)); > - PredXact->element = ShmemAlloc(requestSize); > + PredXact->element = ShmemInitStruct("SerializableXactList", requestSize, &found); > /* Add all elements to available list, clean. */ > memset(PredXact->element, 0, requestSize); > for (i = 0; i < max_table_size; i++) > @@ -1299,9 +1299,11 @@ PredicateLockShmemInit(void) > * probably OK. > */ > max_table_size *= 5; > + requestSize = mul_size((Size) max_table_size, > + RWConflictDataSize); > > RWConflictPool = ShmemInitStruct("RWConflictPool", > - RWConflictPoolHeaderDataSize, > + RWConflictPoolHeaderDataSize + requestSize, > &found); > Assert(found == IsUnderPostmaster); > if (!found) > @@ -1309,9 +1311,7 @@ PredicateLockShmemInit(void) > int i; > > dlist_init(&RWConflictPool->availableList); > - requestSize = mul_size((Size) max_table_size, > - RWConflictDataSize); > - RWConflictPool->element = ShmemAlloc(requestSize); > + RWConflictPool->element = (RWConflict) ((char *)RWConflictPool + RWConflictPoolHeaderDataSize); > /* Add all elements to available list, clean. */ > memset(RWConflictPool->element, 0, requestSize); > for (i = 0; i < max_table_size; i++) These I'd just combine with the ShmemInitStruct("PredXactList"), by allocating the additional space. The pointer math is a bit annoying, but it makes much more sense to have one entry in pg_shmem_allocations. > diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c > index 49204f91a2..e112735d93 100644 > --- a/src/backend/storage/lmgr/proc.c > +++ b/src/backend/storage/lmgr/proc.c > @@ -218,11 +218,11 @@ InitProcGlobal(void) > * how hotly they are accessed. > */ > ProcGlobal->xids = > - (TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids)); > + (TransactionId *) ShmemInitStruct("Proc Transaction Ids", TotalProcs * sizeof(*ProcGlobal->xids), &found); > MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids)); > - ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates)); > + ProcGlobal->subxidStates = (XidCacheStatus *) ShmemInitStruct("Proc Sub-transaction id states", TotalProcs * sizeof(*ProcGlobal->subxidStates),&found); > MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates)); > - ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags)); > + ProcGlobal->statusFlags = (uint8 *) ShmemInitStruct("Proc Status Flags", TotalProcs * sizeof(*ProcGlobal->statusFlags),&found); > MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags)); > > /* Same. Although here I'd say it's worth padding the size of each separate "allocation" by PG_CACHE_LINE_SIZE. > @@ -233,7 +233,7 @@ InitProcGlobal(void) > fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64)); > fpRelIdSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(Oid) * FP_LOCK_SLOTS_PER_GROUP); > > - fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize)); > + fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found); > MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize)); > > /* For asserts checking we did not overflow. */ This one might actually make sense to keep separate, depending on the configuration it can be reasonably big (max_connection = 1k, max_locks_per_transaction=1k results in ~5MB).. Greetings, Andres Freund
Hi,
Thank you for the review.
cfbot found a few compiler warnings:
https://cirrus-ci.com/task/6526903542087680
[16:47:46.964] make -s -j${BUILD_JOBS} clean
[16:47:47.452] time make -s -j${BUILD_JOBS} world-bin
[16:49:10.496] lwlock.c: In function ‘CreateLWLocks’:
[16:49:10.496] lwlock.c:467:22: error: unused variable ‘found’ [-Werror=unused-variable]
[16:49:10.496] 467 | bool found;
[16:49:10.496] | ^~~~~
[16:49:10.496] cc1: all warnings being treated as errors
[16:49:10.496] make[4]: *** [<builtin>: lwlock.o] Error 1
[16:49:10.496] make[3]: *** [../../../src/backend/common.mk:37: lmgr-recursive] Error 2
[16:49:10.496] make[3]: *** Waiting for unfinished jobs....
[16:49:11.881] make[2]: *** [common.mk:37: storage-recursive] Error 2
[16:49:11.881] make[2]: *** Waiting for unfinished jobs....
[16:49:20.195] dynahash.c: In function ‘hash_create’:
[16:49:20.195] dynahash.c:643:37: error: ‘curr_offset’ may be used uninitialized [-Werror=maybe-uninitialized]
[16:49:20.195] 643 | curr_offset = (((char *)curr_offset) + (temp * elementSize));
[16:49:20.195] | ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[16:49:20.195] dynahash.c:588:23: note: ‘curr_offset’ was declared here
[16:49:20.195] 588 | void *curr_offset;
[16:49:20.195] | ^~~~~~~~~~~
[16:49:20.195] cc1: all warnings being treated as errors
[16:49:20.196] make[4]: *** [<builtin>: dynahash.o] Error 1
 Fixed these. 
> diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
> index cd5a00132f..5203f5b30b 100644
> --- a/src/backend/utils/hash/dynahash.c
> +++ b/src/backend/utils/hash/dynahash.c
> @@ -120,7 +120,6 @@
> * a good idea of the maximum number of entries!). For non-shared hash
> * tables, the initial directory size can be left at the default.
> */
> -#define DEF_SEGSIZE 256
> #define DEF_SEGSIZE_SHIFT 8 /* must be log2(DEF_SEGSIZE) */
> #define DEF_DIRSIZE 256
Why did you move this to the header? Afaict it's just used in
hash_get_shared_size(), which is also in dynahash.c?
Yes. This was accidentally left behind by the previous version of the
patch, so I undid the change. 
> static void register_seq_scan(HTAB *hashp);
> static void deregister_seq_scan(HTAB *hashp);
> static bool has_seq_scans(HTAB *hashp);
> -
> +static int find_num_of_segs(long nelem, int *nbuckets, long num_partitions, long ssize);
>
> /*
> * memory allocation support
You removed a newline here that probably shouldn't be removed.
Fixed this. 
> @@ -468,7 +466,11 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
> else
> hashp->keycopy = memcpy;
>
> - /* And select the entry allocation function, too. */
> + /*
> + * And select the entry allocation function, too. XXX should this also
> + * Assert that flags & HASH_SHARED_MEM is true, since HASH_ALLOC is
> + * currently only set with HASH_SHARED_MEM *
> + */
> if (flags & HASH_ALLOC)
> hashp->alloc = info->alloc;
> else
> @@ -518,6 +520,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>
> hashp->frozen = false;
>
> + /* Initializing the HASHHDR variables with default values */
> hdefault(hashp);
>
> hctl = hashp->hctl;
I assume these were just observations you made while looking into this? They
seem unrelated to the change itself?
Yes. I removed the first one and left the second one as a code comment. 
 
> @@ -582,7 +585,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
> freelist_partitions,
> nelem_alloc,
> nelem_alloc_first;
> -
> + void *curr_offset;
> +
> /*
> * If hash table is partitioned, give each freelist an equal share of
> * the initial allocation. Otherwise only freeList[0] is used.
> @@ -592,6 +596,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
> else
> freelist_partitions = 1;
>
> + /*
> + * If table is shared, calculate the offset at which to find the
> + * the first partition of elements
> + */
> + if (hashp->isshared)
> + {
> + int nsegs;
> + int nbuckets;
> + nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
> +
> + curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
> + + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
> + }
> +
Why only do this for shared hashtables? Couldn't we allocate the elments
together with the rest for non-share hashtables too?
I think it is possible to consolidate the allocations for non-shared hash tables
too. However, initial elements are much smaller in non-shared hash tables due to
their ease of expansion. Therefore, there is probably less benefit in trying to do
that for non-shared tables.
In addition, the proposed changes are targeted to improve the monitoring in
pg_shmem_allocations which won't be applicable to non-shared hashtables. 
While I believe it is feasible, I am uncertain about the utility of such a change.
While I believe it is feasible, I am uncertain about the utility of such a change.
Seems a bit ugly to go through element_alloc() when pre-allocating. Perhaps
it's the best thing we can do to avoid duplicating code, but it seems worth
checking if we can do better. Perhaps we could split element_alloc() into
element_alloc() and element_add() or such? With the latter doing everything
after hashp->alloc().
Makes sense. I split the element_alloc() into element_alloc() and element_add().
> -
> +
> /*
> * initialize mutexes if it's a partitioned table
> */
Spurious change.
Fixed. 
A function called find_num_of_segs() that also sets nbuckets seems a bit
confusing. I also don't like "find_*", as that sounds like it's searching
some datastructure, rather than just doing a bit of math.
 I renamed it to compute_buckets_and_segs(). I am open to better suggestions. 
> segp = (HASHSEGMENT) hashp->alloc(sizeof(HASHBUCKET) * hashp->ssize);
>
> if (!segp)
Spurious change.
Fixed. 
> -static int
> +int
> next_pow2_int(long num)
> {
> if (num > INT_MAX / 2)
> @@ -1957,3 +1995,31 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
> }
> }
> }
Why export this?
It was a stale change, I removed it now 
> diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
> index 932cc4f34d..5e16bd4183 100644
> --- a/src/include/utils/hsearch.h
> +++ b/src/include/utils/hsearch.h
> @@ -151,7 +151,7 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
> extern void hash_freeze(HTAB *hashp);
> extern Size hash_estimate_size(long num_entries, Size entrysize);
> extern long hash_select_dirsize(long num_entries);
> -extern Size hash_get_shared_size(HASHCTL *info, int flags);
> +extern Size hash_get_shared_size(HASHCTL *info, int flags, long init_size);
> extern void AtEOXact_HashTables(bool isCommit);
> extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
It's imo a bit weird that we have very related logic in hash_estimate_size()
and hash_get_shared_size(). Why do we need to duplicate it?
hash_estimate_size() estimates using default values and hash_get_shared_size()
calculates using specific values depending on the flags associated with the hash
table. For instance, segment_size used by the former is DEF_SEGSIZE and
the latter uses info->ssize which is set when the HASH_SEGMENT flag is true.
Hence, they might return different values for shared memory sizes.
calculates using specific values depending on the flags associated with the hash
table. For instance, segment_size used by the former is DEF_SEGSIZE and
the latter uses info->ssize which is set when the HASH_SEGMENT flag is true.
Hence, they might return different values for shared memory sizes.
These I'd just combine with the ShmemInitStruct("PredXactList"), by allocating
the additional space. The pointer math is a bit annoying, but it makes much
more sense to have one entry in pg_shmem_allocations.
Fixed accordingly.
 
> - (TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
> + (TransactionId *) ShmemInitStruct("Proc Transaction Ids", TotalProcs * sizeof(*ProcGlobal->xids), &found);
> MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
> - ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
> + ProcGlobal->subxidStates = (XidCacheStatus *) ShmemInitStruct("Proc Sub-transaction id states", TotalProcs * sizeof(*ProcGlobal->subxidStates), &found);
> MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
> - ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
> + ProcGlobal->statusFlags = (uint8 *) ShmemInitStruct("Proc Status Flags", TotalProcs * sizeof(*ProcGlobal->statusFlags), &found);
> MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
>
> /*
Same.
Although here I'd say it's worth padding the size of each separate
"allocation" by PG_CACHE_LINE_SIZE.
Made this change.
> - fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
> + fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
> MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
>
> /* For asserts checking we did not overflow. */
This one might actually make sense to keep separate, depending on the
configuration it can be reasonably big (max_connection = 1k,
max_locks_per_transaction=1k results in ~5MB)..
OK 
PFA the rebased patches with the above changes.
Kindly let me know your views.
Thank you,
Rahila Syed
PFA the rebased patches with the above changes.
Kindly let me know your views.
Thank you,
Rahila Syed
Вложения
Hi Andres,
> + if (hashp->isshared)
> + {
> + int nsegs;
> + int nbuckets;
> + nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
> +
> + curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
> + + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
> + }
> +
Why only do this for shared hashtables? Couldn't we allocate the elments
together with the rest for non-share hashtables too?
I have now made the changes uniformly across shared and non-shared hash tables,
making the code fix look cleaner. I hope this aligns with your suggestions. 
Please find attached updated and rebased versions of both patches.
Kindly let me know your views.
Thank you,
Kindly let me know your views.
Thank you,
Rahila Syed
Вложения
Hi,
On Wed, 12 Mar 2025 at 13:46, Rahila Syed <rahilasyed90@gmail.com> wrote:
> I have now made the changes uniformly across shared and non-shared hash tables,
> making the code fix look cleaner. I hope this aligns with your suggestions.
> Please find attached updated and rebased versions of both patches.
Thank you for working on this!
I have a couple of comments, I have only reviewed 0001 so far.
You may need to run pgindent, it makes some changes.
diff --git a/src/backend/utils/hash/dynahash.c
b/src/backend/utils/hash/dynahash.c
index cd5a00132f..3bdf3d6fd5 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
+        /*
+         * If table is shared, calculate the offset at which to find the the
+         * first partition of elements
+         */
+
+        nsegs = compute_buckets_and_segs(nelem, &nbuckets,
hctl->num_partitions, hctl->ssize);
Blank line between the comment and the code.
+    bool        element_alloc = true;
+    Size        elementSize = MAXALIGN(sizeof(HASHELEMENT)) +
MAXALIGN(info->entrysize);
It looks odd to me that camelCase and snake_case are used together but
it is already used like that in this file; so I think it should be
okay.
 /*
  * allocate some new elements and link them into the indicated free list
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
Comment needs an update. This function no longer links elements into
the free list.
+static int
+compute_buckets_and_segs(long nelem, int *nbuckets, long
num_partitions, long ssize)
+{
...
+    /*
+     * In a partitioned table, nbuckets must be at least equal to
+     * num_partitions; were it less, keys with apparently different partition
+     * numbers would map to the same bucket, breaking partition independence.
+     * (Normally nbuckets will be much bigger; this is just a safety check.)
+     */
+    while ((*nbuckets) < num_partitions)
+        (*nbuckets) <<= 1;
I have some worries about this function, I am not sure what I said
below has real life implications as you already said 'Normally
nbuckets will be much bigger; this is just a safety check.'.
1- num_partitions is long and nbuckets is int, so could there be any
case where num_partition is bigger than MAX_INT and cause an infinite
loop?
2- Although we assume both nbuckets and num_partition initialized as
the same type, (*nbuckets) <<= 1 will cause an infinite loop if
num_partition is bigger than MAX_TYPE / 2.
So I think that the solution is to confirm that num_partition <
MAX_NBUCKETS_TYPE / 2, what do you think?
--
Regards,
Nazir Bilal Yavuz
Microsoft
			
		Hi Bilal,
I have a couple of comments, I have only reviewed 0001 so far.
Thank you for reviewing!
You may need to run pgindent, it makes some changes.
Attached v4-patch has been updated after running pgindent.
+ * If table is shared, calculate the offset at which to find the the
+ * first partition of elements
+ */
+
+ nsegs = compute_buckets_and_segs(nelem, &nbuckets,
hctl->num_partitions, hctl->ssize);
Blank line between the comment and the code.
Removed this.
 
/*
* allocate some new elements and link them into the indicated free list
*/
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
Comment needs an update. This function no longer links elements into
the free list.
Updated this and few other comments in the attached v4-patch.
+static int
+compute_buckets_and_segs(long nelem, int *nbuckets, long
num_partitions, long ssize)
+{
...
+ /*
+ * In a partitioned table, nbuckets must be at least equal to
+ * num_partitions; were it less, keys with apparently different partition
+ * numbers would map to the same bucket, breaking partition independence.
+ * (Normally nbuckets will be much bigger; this is just a safety check.)
+ */
+ while ((*nbuckets) < num_partitions)
+ (*nbuckets) <<= 1;
I have some worries about this function, I am not sure what I said
below has real life implications as you already said 'Normally
nbuckets will be much bigger; this is just a safety check.'.
1- num_partitions is long and nbuckets is int, so could there be any
case where num_partition is bigger than MAX_INT and cause an infinite
loop?
2- Although we assume both nbuckets and num_partition initialized as
the same type, (*nbuckets) <<= 1 will cause an infinite loop if
num_partition is bigger than MAX_TYPE / 2.
So I think that the solution is to confirm that num_partition <
MAX_NBUCKETS_TYPE / 2, what do you think?
Your concern is valid. This has been addressed in the existing code by 
calling next_pow2_int() on num_partitions before running the function.
Additionally, I am not adding any new code to the compute_buckets_and_segs
function. I am simply moving part of the init_tab() code into a separate function
for reuse.
Please find attached the updated and rebased patches.
Thank you,
Rahila Syed
calling next_pow2_int() on num_partitions before running the function.
Additionally, I am not adding any new code to the compute_buckets_and_segs
function. I am simply moving part of the init_tab() code into a separate function
for reuse.
Please find attached the updated and rebased patches.
Thank you,
Rahila Syed
Вложения
Hi,
I did a review on v3 of the patch. I see there's been some minor changes
in v4 - I haven't tried to adjust my review, but I assume most of my
comments still apply.
Most of my suggestions are about formatting and/or readability. Some of
the likes (e.g. the pointer arithmetics) got so long pgindent would have
to wrap them, but more importantly really hard to decipher what it does.
I've added a couple "review" commits, actually doing most of what I'm
going to suggest.
1) I don't quite understand why hash_get_shared_size() got renamed to
hash_get_init_size()? Why is that needed? Does it cover only some
initial allocation, and there's additional memory allocated later? And
what's the point of the added const?
2) I find it more readable when ShmemInitStruct() is formatted on
multiple lines. Yes, it's a matter of choice, but also other places do
it this way, I think.
3) The changes in hash_create() are a good example of readability issues
I already mentioned. Consider this line:
curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (hctl->dsize *
sizeof(HASHSEGMENT)) + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
First, I'd point out this is not an offset but a pointer, so the
variable name is a bit misleading. But more importantly, I envy those
who can parse this in their head and see if it's correct.
I think it's much better to define a couple macros to calculate parts of
this, essentially a tiny "language" expressing this in a concise way.
The 0002 patch defines
- HASH_ELEMENTS
- HASH_ELEMENT_NEXT
- HASH_SEGMENT_PTR
- HASH_SEGMENT_SIZE
- ...
and then uses that in hash_create(). I believe it's way more readable
like this.
4) I find it a bit strange when a function calculates multiple values,
and then returns them in different ways - one as a return value, one
through a pointer, the way compute_buckets_and_segs() did.
There are cases when it makes sense (e.g. when one of the values is
returned only optionally), but otherwise I think it's better to return
both values in the same way through a pointer. The 0002 patch adjusts
compute_buckets_and_segs() to it like this.
We have a precedent for this - ExecChooseHashTableSize(), which is doing
a very similar thing for sizing hashjoin hash table.
5) Isn't it wrong that PredicateLockShmemInit() removes the ShmemAlloc()
along with calculating the per-element requestSize, but then still does
    memset(PredXact->element, 0, requestSize);
and
    memset(RWConflictPool->element, 0, requestSize);
with requestSize for the whole allocation? I haven't seen any crashes,
but this seems wrong to me. I believe we can simply zero the whole
allocation right after ShmemInitStruct(), there's no need to do this for
each individual element.
5) InitProcGlobal is another example of hard-to-read code. Admittedly,
it wasn't particularly readable before, but making the lines even longer
does not make it much better ...
I guess adding a couple macros similar to hash_create() would be one way
to improve this (and there's even a review comment in that sense), but I
chose to just do maintain a simple pointer. But if you think it should
be done differently, feel free to do so.
6) I moved the PGProcShmemSize() to be right after ProcGlobalShmemSize()
which seems to be doing a very similar thing, and to not require the
prototype. Minor detail, feel free to undo.
7) I think the PG_CACHE_LINE_SIZE is entirely unrelated to the rest of
the patch, and I see no reason to do it in the same commit. So 0005
removes this change, and 0006 reintroduces it separately.
FWIW I see no justification for why the cache line padding is useful,
and it seems quite unlikely this would benefit anything, consiering it
adds padding between fairly long arrays. What kind of contention can we
get there? I don't get it.
Also, why is the patch adding padding after statusFlags (the last array
allocated in InitProcGlobal) and not between allProcs and xids?
regards
-- 
Tomas Vondra
			
		Вложения
I did a review on v3 of the patch. I see there's been some minor changes
in v4 - I haven't tried to adjust my review, but I assume most of my
comments still apply.
Thank you for your interest and review.
1) I don't quite understand why hash_get_shared_size() got renamed to
hash_get_init_size()? Why is that needed? Does it cover only some
initial allocation, and there's additional memory allocated later? And
what's the point of the added const?
Earlier, this function was used to calculate the size for shared hash tables only.
Now, it also calculates the size for a non-shared hash table. Hence the change
of name.
If I don't change the argument to const, I get a compilation error as follows:
"passing argument 1 of ‘hash_get_init_size’ discards ‘const’ qualifier from pointer target type."
This is due to this function being called from hash_create which considers HASHCTL to be
a const.
5) Isn't it wrong that PredicateLockShmemInit() removes the ShmemAlloc()
along with calculating the per-element requestSize, but then still does
memset(PredXact->element, 0, requestSize);
and
memset(RWConflictPool->element, 0, requestSize);
with requestSize for the whole allocation? I haven't seen any crashes,
but this seems wrong to me. I believe we can simply zero the whole
allocation right after ShmemInitStruct(), there's no need to do this for
each individual element.
Good catch.  Thanks for updating.
 
5) InitProcGlobal is another example of hard-to-read code. Admittedly,
it wasn't particularly readable before, but making the lines even longer
does not make it much better ...
I guess adding a couple macros similar to hash_create() would be one way
to improve this (and there's even a review comment in that sense), but I
chose to just do maintain a simple pointer. But if you think it should
be done differently, feel free to do so.
LGTM, long lines have been reduced by the ptr approach. 
6) I moved the PGProcShmemSize() to be right after ProcGlobalShmemSize()
which seems to be doing a very similar thing, and to not require the
prototype. Minor detail, feel free to undo.
LGTM.
 
7) I think the PG_CACHE_LINE_SIZE is entirely unrelated to the rest of
the patch, and I see no reason to do it in the same commit. So 0005
removes this change, and 0006 reintroduces it separately.
OK.
 
FWIW I see no justification for why the cache line padding is useful,
and it seems quite unlikely this would benefit anything, consiering it
adds padding between fairly long arrays. What kind of contention can we
get there? I don't get it.
I have done that to address a review comment upthread by Andres and
the because comment above that code block also mentions need for
padding.
Also, why is the patch adding padding after statusFlags (the last array
allocated in InitProcGlobal) and not between allProcs and xids?
I agree it makes more sense this way, so changing accordingly.
*
+ * XXX can we rely on this matching the calculation in hash_get_shared_size?
+ * or could/should we add some asserts? Can we have at least some sanity checks
+ * on nbuckets/nsegs?
 Both places rely on compute_buckets_and_segs() to calculate nbuckets and nsegs,
 hence the probability of mismatch is low.  I am open to adding some asserts to verify this.
 Do you have any suggestions in mind?  
Please find attached updated patches after merging all your review comments except
a few discussed above.
Please find attached updated patches after merging all your review comments except
a few discussed above.
Thank you,
Rahila Syed
Вложения
Hi, On 3/27/25 13:02, Rahila Syed wrote: > > Hi Tomas, > > > I did a review on v3 of the patch. I see there's been some minor changes > in v4 - I haven't tried to adjust my review, but I assume most of my > comments still apply. > > > Thank you for your interest and review. > > > 1) I don't quite understand why hash_get_shared_size() got renamed to > hash_get_init_size()? Why is that needed? Does it cover only some > initial allocation, and there's additional memory allocated later? And > what's the point of the added const? > > > Earlier, this function was used to calculate the size for shared hash > tables only. > Now, it also calculates the size for a non-shared hash table. Hence the > change > of name. > > If I don't change the argument to const, I get a compilation error as > follows: > "passing argument 1 of ‘hash_get_init_size’ discards ‘const’ qualifier > from pointer target type." > This is due to this function being called from hash_create which > considers HASHCTL to be > a const. > OK, makes sense. I haven't tried removing the const, but if removing it would trigger a compiler warning, it's probably needed. > ... > > FWIW I see no justification for why the cache line padding is useful, > and it seems quite unlikely this would benefit anything, consiering it > adds padding between fairly long arrays. What kind of contention can we > get there? I don't get it. > > > I have done that to address a review comment upthread by Andres and > the because comment above that code block also mentions need for > padding. > Neither that seems like a sufficient justification for the change. The comment is more of a speculation, it doesn't demonstrate the benefits are real. It's true Andres tends to be right about these things, but still, I'd like to see some argument for why this helps. How exactly does padding before/after the whole array help anything? In fact, is this the padding Andres (or the comment) suggested? The comment says: * XXX: It might make sense to increase padding for these arrays, given * how hotly they are accessed. Doesn't "padding of array" actually suggest padding each element, not the array as a whole? In any case, if the 0003 patch addresses the padding, it should also remove the XXX comment. > > Also, why is the patch adding padding after statusFlags (the last array > allocated in InitProcGlobal) and not between allProcs and xids? > > > I agree it makes more sense this way, so changing accordingly. > > > > * > + * XXX can we rely on this matching the calculation > in hash_get_shared_size? > + * or could/should we add some asserts? Can we have > at least some sanity checks > + * on nbuckets/nsegs? > > > Both places rely on compute_buckets_and_segs() to calculate nbuckets > and nsegs, > hence the probability of mismatch is low. I am open to adding some > asserts to verify this. > Do you have any suggestions in mind? > > Please find attached updated patches after merging all your review > comments except > a few discussed above. > OK, I don't have any other comments for 0001 and 0002. I'll do some more review and polishing on those, and will get them committed soon. I don't plan to push 0003, unless someone can actually explain and demonstrate the benefits of the proposed padding, regards -- Tomas Vondra
On 3/27/25 13:56, Tomas Vondra wrote:
> ...
> 
> OK, I don't have any other comments for 0001 and 0002.  I'll do some
> more review and polishing on those, and will get them committed soon.
> 
Actually ... while polishing 0001 and 0002, I noticed a couple more
details that I'd like to ask about. I also ran pgindent and tweaked the
formatting, so some of the diff is caused by that.
1) alignment
There was a comment with a question whether we need to MAXALIGN the
chunks in dynahash.c, which were originally allocated by ShmemAlloc, but
now it's part of one large allocation, which is then cut into pieces
(using pointer arithmetics).
I was not sure whether we need to enforce some alignment, we briefly
discussed that off-list. I realize you chose to add the alignment, but I
haven't noticed any comment in the patch why it's needed, and it seems
to me it may not be quite correct.
Let me explain what I had in mind, and why I think the way v5 doesn't
actually do that. It took me a while before I understood what alignment
is about, and for a while it was haunting my patches, so hopefully this
will help others ...
The "alignment" is about pointers (or addresses), and when a pointer is
aligned it means the address is a multiple of some number. For example
4B-aligned pointer is a multiple of 4B, so 0x00000100 is 4B-aligned,
while 0x00000101 is not. Sometimes we use data types to express the
alignment, e.g. int-aligned is 4B-aligned, but that's a detail. AFAIK
the alignment is always 2^k, so 1, 2, 4, 8, ...
The primary reason for alignment is that some architectures require the
pointers to be well-aligned for a given data type. For example (int*)
needs to be int-aligned. If you have a pointer that's not 4B-aligned,
it'll trigger SIGBUS or maybe SIGSEGV. This was true for architectures
like powerpc, I don't think x86/arm64 have this restriction, i.e. it'd
work, even if there might be a minor performance impact. Anyway, we
still enforce/expect correct alignment, because we may still support
some of those alignment-sensitive platforms, and it's "tidy".
The other reason is that we sometimes use alignment to add padding, to
reduce contention when accessing elements in hot arrays. We want to
align to cacheline boundaries, so that a struct does not require
accessing more cachelines than really necessary. And also to reduce
contention - the more cachelines, the higher the risk of contention.
Now, back to the patch. The code originally did this in ShmemInitStruct
    hashp = ShmemInitStruct(...)
to allocate the hctl, and then
    firstElement = (HASHELEMENT *) ShmemAlloc(nelem * elementSize);
in element_alloc(). But this means the "elements" allocation is aligned
to PG_CACHE_LINE_SIZE, i.e. 128B, because ShmemAllocRaw() does this:
    size = CACHELINEALIGN(size);
So it distributes memory in multiples of 128B, and I believe it starts
at a multiple of 128B.
But the patch reworks this to allocate everything at once, and thus it
won't get this alignment automatically. AFAIK that's not intentional,
because no one explicitly mentioned this. And it's may not be quite
desirable, judging by the comment in ShmemAllocRaw().
I mentioned v5 adds alignment, but I think it does not quite do that
quite correctly. It adds alignment by changing the macros from:
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+    (sizeof(HASHHDR) + \
+     ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+     ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
to
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+    (MAXALIGN(sizeof(HASHHDR)) + \
+     ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+     ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))
First, it uses MAXALIGN, but that's mostly my fault, because my comment
suggested that - the ShmemAllocRaw however and makes the case for using
CACHELINEALIGN.
But more importantly, it adds alignment to all hctl field, and to every
element of those arrays. But that's not what the alignment was supposed
to do - it was supposed to align arrays, not individual elements. Not
only would this waste memory, it would actually break direct access to
those array elements.
So something like this might be more correct:
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+    (MAXALIGN(sizeof(HASHHDR)) + \
+     MAXALIGN((hctl)->dsize * izeof(HASHSEGMENT)) + \
+     MAXALIGN((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
But there's another detail - even before this patch, most of the stuff
was allocated at once by ShmemInitStruct(). Everything except for the
elements, so to replicate the alignment we only need to worry about that
last part. So I think this should do:
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+    CACHELINEALIGN(sizeof(HASHHDR) + \
+     ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+     ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
This is what the 0003 patch does. There's still one minor difference, in
that we used to align each segment independently - each element_alloc()
call allocated a new CACHELINEALIGN-ed chunk, while now have just a
single chunk. But I think that's OK.
In fact, I'm starting to wonder if we need to worry about this at all.
Maybe it's not that significant for dynahash after all - otherwise we
actually should align/pad the individual elements, not just the array as
a whole, I think.
2) I do think the last patch should use CACHELINEALIGN() too, instead of
adding PG_CACHE_LINE_SIZE to the sizes (which does not ensure good
alignment, it just means there's gap between the allocated parts). But I
still don't know what the intention was, so hard to say ...
3) I find the comment before hash_get_init_size a bit unclear/confusing.
It says this:
 * init_size should match the total number of elements allocated during
 * hash table creation, it could be zero for non-shared hash tables
 * depending on the value of nelem_alloc. For more explanation see
 * comments within this function.
 *
 * nelem_alloc parameter is not relevant for shared hash tables.
What does "should match" mean here? Doesn't it *determine* the number of
elements allocated? What if it doesn't match?
AFAICS it means the hash table is sized to expect init_size elements,
but only nelem_alloc elements are actually pre-allocated, right? But the
comment says it's init_size which determines the number of elements
allocated during creation. Confusing.
It says "it could be zero ... depending on the value of nelem_alloc".
Depending how? What's the relationship.
Then it says "nelem_alloc parameter is not relevant for shared hash
tables". What does "not relevant" mean? It should say it's ignored.
The bit "For more explanation see comments within this function" is not
great, if only because there are not many comments within the function,
so there's no "more explanation". But if there's something important, it
should be in the main comment, preferably.
regards
-- 
Tomas Vondra
			
		Вложения
- v6-0001-Account-for-all-the-shared-memory-allocated-by-ha.patch
- v6-0002-review.patch
- v6-0003-align-using-CACHELINESIZE-to-match-ShmemAllocRaw.patch
- v6-0004-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch
- v6-0005-review.patch
- v6-0006-Add-cacheline-padding-between-heavily-accessed-ar.patch
- v6-0007-review.patch
Hi Tomas,
1) alignment
There was a comment with a question whether we need to MAXALIGN the
chunks in dynahash.c, which were originally allocated by ShmemAlloc, but
now it's part of one large allocation, which is then cut into pieces
(using pointer arithmetics).
I was not sure whether we need to enforce some alignment, we briefly
discussed that off-list. I realize you chose to add the alignment, but I
haven't noticed any comment in the patch why it's needed, and it seems
to me it may not be quite correct.
I have added MAXALIGN to specific allocations, such as HASHHDR and
HASHSEGMENT, with the expectation that allocations in multiples of this,
like dsize * HASHSEGMENT, would automatically align.
Let me explain what I had in mind, and why I think the way v5 doesn't
actually do that. It took me a while before I understood what alignment
is about, and for a while it was haunting my patches, so hopefully this
will help others ...
The "alignment" is about pointers (or addresses), and when a pointer is
aligned it means the address is a multiple of some number. For example
4B-aligned pointer is a multiple of 4B, so 0x00000100 is 4B-aligned,
while 0x00000101 is not. Sometimes we use data types to express the
alignment, e.g. int-aligned is 4B-aligned, but that's a detail. AFAIK
the alignment is always 2^k, so 1, 2, 4, 8, ...
The primary reason for alignment is that some architectures require the
pointers to be well-aligned for a given data type. For example (int*)
needs to be int-aligned. If you have a pointer that's not 4B-aligned,
it'll trigger SIGBUS or maybe SIGSEGV. This was true for architectures
like powerpc, I don't think x86/arm64 have this restriction, i.e. it'd
work, even if there might be a minor performance impact. Anyway, we
still enforce/expect correct alignment, because we may still support
some of those alignment-sensitive platforms, and it's "tidy".
The other reason is that we sometimes use alignment to add padding, to
reduce contention when accessing elements in hot arrays. We want to
align to cacheline boundaries, so that a struct does not require
accessing more cachelines than really necessary. And also to reduce
contention - the more cachelines, the higher the risk of contention.
Thank you for your explanation. I had a similar understanding. However, 
I believed that MAXALIGN and CACHEALIGN are primarily performance optimizations
that do not impact the correctness of the code. This assumption is based on the fact
that I have not observed any failures on GitHub CI, even when changing the alignment
in this part of the code.
I believed that MAXALIGN and CACHEALIGN are primarily performance optimizations
that do not impact the correctness of the code. This assumption is based on the fact
that I have not observed any failures on GitHub CI, even when changing the alignment
in this part of the code.
Now, back to the patch. The code originally did this in ShmemInitStruct
hashp = ShmemInitStruct(...)
to allocate the hctl, and then
firstElement = (HASHELEMENT *) ShmemAlloc(nelem * elementSize);
in element_alloc(). But this means the "elements" allocation is aligned
to PG_CACHE_LINE_SIZE, i.e. 128B, because ShmemAllocRaw() does this:
size = CACHELINEALIGN(size);
So it distributes memory in multiples of 128B, and I believe it starts
at a multiple of 128B.
But the patch reworks this to allocate everything at once, and thus it
won't get this alignment automatically. AFAIK that's not intentional,
because no one explicitly mentioned this. And it's may not be quite
desirable, judging by the comment in ShmemAllocRaw().
Yes, the patch reworks this to allocate all the shared memory at once.
It uses ShmemInitStruct which internally calls ShmemAllocRaw. So the whole chunk
of memory allocated is still CACHEALIGNed.
I mentioned v5 adds alignment, but I think it does not quite do that
quite correctly. It adds alignment by changing the macros from:
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+ (sizeof(HASHHDR) + \
+ ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+ ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
to
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+ (MAXALIGN(sizeof(HASHHDR)) + \
+ ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+ ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))
First, it uses MAXALIGN, but that's mostly my fault, because my comment
suggested that - the ShmemAllocRaw however and makes the case for using
CACHELINEALIGN.
Good catch. For a shared hash table, allocations need to be
CACHELINEALIGNED.
I think hash_get_init_size does not need to call CACHELINEALIGNED
explicitly as ShmemInitStruct already does this.
In that case, the size returned by hash_get_init_size just needs to
MAXALIGN required structs as per hash_create() requirements and CACHELINEALIGN
will be taken care of in ShmemInitStruct at the time of allocating the entire chunk.
But more importantly, it adds alignment to all hctl field, and to every
element of those arrays. But that's not what the alignment was supposed
to do - it was supposed to align arrays, not individual elements. Not
only would this waste memory, it would actually break direct access to
those array elements.
I think existing code has occurrences of both i,.e aligning individual elements and
arrays.
A similar precedent exists in the function hash_estimate_size(), which only
applies maxalignment to the individual structs like HASHHDR, HASHELEMENT,
entrysize, but also an array of HASHBUCKET headers.
I agree with you that perhaps we don't need maxalignment for all of these structures. 
For ex, HASHBUCKET is a pointer to a linked list of elements, it might not require alignment
if the elements it points to are already aligned.
For ex, HASHBUCKET is a pointer to a linked list of elements, it might not require alignment
if the elements it points to are already aligned.
But there's another detail - even before this patch, most of the stuff
was allocated at once by ShmemInitStruct(). Everything except for the
elements, so to replicate the alignment we only need to worry about that
last part. So I think this should do:
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+ CACHELINEALIGN(sizeof(HASHHDR) + \
+ ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+ ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
This is what the 0003 patch does. There's still one minor difference, in
that we used to align each segment independently - each element_alloc()
call allocated a new CACHELINEALIGN-ed chunk, while now have just a
single chunk. But I think that's OK.
Before this patch, following structures were allocated separately using ShmemAllocRaw
directory, each segment(seg_alloc) and a chunk of elements (element_alloc). Hence, 
I don't understand why v-0003* CACHEALIGNs  in the manner it does.
I think if we want to emulate the current behaviour we should do something like:
CACHELINEALIGN(sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)) +
+ CACHELINEALIGN(sizeof(HASHBUCKET) * ssize) * nsegs
+ CACHELINEALIGN(init_size * elementSize);
Like you mentioned the only difference would be that we would be aligning all elements
at once instead of aligning individual partitions of elements.
I think if we want to emulate the current behaviour we should do something like:
CACHELINEALIGN(sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)) +
+ CACHELINEALIGN(sizeof(HASHBUCKET) * ssize) * nsegs
+ CACHELINEALIGN(init_size * elementSize);
Like you mentioned the only difference would be that we would be aligning all elements
at once instead of aligning individual partitions of elements.
3) I find the comment before hash_get_init_size a bit unclear/confusing.
It says this:
* init_size should match the total number of elements allocated during
* hash table creation, it could be zero for non-shared hash tables
* depending on the value of nelem_alloc. For more explanation see
* comments within this function.
*
* nelem_alloc parameter is not relevant for shared hash tables.
What does "should match" mean here? Doesn't it *determine* the number of
elements allocated? What if it doesn't match?
by should match I mean - init_size  here  *is* equal to nelem in hash_create() .
AFAICS it means the hash table is sized to expect init_size elements,
but only nelem_alloc elements are actually pre-allocated, right?
No. All the init_size elements are pre-allocated for shared hash table irrespective of 
nelem_alloc value.
For non-shared hash tables init_size elements are allocated only
if it is less than nelem_alloc, otherwise they are allocated as part of expansion.
 
nelem_alloc value.
For non-shared hash tables init_size elements are allocated only
if it is less than nelem_alloc, otherwise they are allocated as part of expansion.
But the
comment says it's init_size which determines the number of elements
allocated during creation. Confusing.
It says "it could be zero ... depending on the value of nelem_alloc".
Depending how? What's the relationship.
The relationship is defined in this comment:
 /*
* For a shared hash table, preallocate the requested number of elements.
* This reduces problems with run-time out-of-shared-memory conditions.
*
* For a non-shared hash table, preallocate the requested number of
* elements if it's less than our chosen nelem_alloc. This avoids wasting
* space if the caller correctly estimates a small table size.
*/
hash_create code is confusing because the nelem_alloc named variable is used
in two different cases, In the above case nelem_alloc refers to the one
returned by choose_nelem_alloc function.
The other nelem_alloc determines the number of elements in each partition
for a partitioned hash table. This is not what is being referred to in the above
comment.
* For a shared hash table, preallocate the requested number of elements.
* This reduces problems with run-time out-of-shared-memory conditions.
*
* For a non-shared hash table, preallocate the requested number of
* elements if it's less than our chosen nelem_alloc. This avoids wasting
* space if the caller correctly estimates a small table size.
*/
hash_create code is confusing because the nelem_alloc named variable is used
in two different cases, In the above case nelem_alloc refers to the one
returned by choose_nelem_alloc function.
The other nelem_alloc determines the number of elements in each partition
for a partitioned hash table. This is not what is being referred to in the above
comment.
The bit "For more explanation see comments within this function" is not
great, if only because there are not many comments within the function,
so there's no "more explanation". But if there's something important, it
should be in the main comment, preferably.
I will improve the comment in the next version.
Thank you,
Rahila Syed
Thank you,
Rahila Syed
On 3/28/25 12:10, Rahila Syed wrote:
> Hi Tomas,
> 
> 
>     1) alignment
> 
>     There was a comment with a question whether we need to MAXALIGN the
>     chunks in dynahash.c, which were originally allocated by ShmemAlloc, but
>     now it's part of one large allocation, which is then cut into pieces
>     (using pointer arithmetics).
> 
>     I was not sure whether we need to enforce some alignment, we briefly
>     discussed that off-list. I realize you chose to add the alignment, but I
>     haven't noticed any comment in the patch why it's needed, and it seems
>     to me it may not be quite correct.
> 
>  
> 
> I have added MAXALIGN to specific allocations, such as HASHHDR and
> HASHSEGMENT, with the expectation that allocations in multiples of this,
> like dsize * HASHSEGMENT, would automatically align.
> 
Yes, assuming the original allocation is aligned, maxalign-ing the
smaller allocations would ensure that. But there's still the issue with
aligning array elements.
> 
> 
>     Let me explain what I had in mind, and why I think the way v5 doesn't
>     actually do that. It took me a while before I understood what alignment
>     is about, and for a while it was haunting my patches, so hopefully this
>     will help others ...
> 
>     The "alignment" is about pointers (or addresses), and when a pointer is
>     aligned it means the address is a multiple of some number. For example
>     4B-aligned pointer is a multiple of 4B, so 0x00000100 is 4B-aligned,
>     while 0x00000101 is not. Sometimes we use data types to express the
>     alignment, e.g. int-aligned is 4B-aligned, but that's a detail. AFAIK
>     the alignment is always 2^k, so 1, 2, 4, 8, ...
> 
>     The primary reason for alignment is that some architectures require the
>     pointers to be well-aligned for a given data type. For example (int*)
>     needs to be int-aligned. If you have a pointer that's not 4B-aligned,
>     it'll trigger SIGBUS or maybe SIGSEGV. This was true for architectures
>     like powerpc, I don't think x86/arm64 have this restriction, i.e. it'd
>     work, even if there might be a minor performance impact. Anyway, we
>     still enforce/expect correct alignment, because we may still support
>     some of those alignment-sensitive platforms, and it's "tidy".
> 
>     The other reason is that we sometimes use alignment to add padding, to
>     reduce contention when accessing elements in hot arrays. We want to
>     align to cacheline boundaries, so that a struct does not require
>     accessing more cachelines than really necessary. And also to reduce
>     contention - the more cachelines, the higher the risk of contention.
> 
>  
> Thank you for your explanation. I had a similar understanding. However,
> I believed that MAXALIGN and CACHEALIGN are primarily performance
> optimizations
> that do not impact the correctness of the code. This assumption is based
> on the fact
> that I have not observed any failures on GitHub CI, even when changing
> the alignment
> in this part of the code.
> 
I hope it didn't come over as explaining something you already know ...
Apologies if that's the case.
As for why the CI doesn't fail on this, that's because it runs on x86,
which tolerates alignment issues. AFAIK all "modern" platforms do. You'd
need some old platform (I recall we saw SIGBUG on powerpc and s390, or
something like that). It'd probably matter even on x86 with something
like AVX, but that's irrelevant for this patch.
I don't even know what is the performance impact on x86. I think it was
measurable years ago, but it got mostly negligible for recent CPUs.
Another reason is that some of those structs may be "naturally aligned"
because they happen to be multiples of MAXALIGN. For example:
HASHHDR -> 96 bytes
HASHELEMENT -> 16 bytes (this covers HASHBUCKET / HASHSEGMENT too)
> 
>     Now, back to the patch. The code originally did this in ShmemInitStruct
> 
>         hashp = ShmemInitStruct(...)
> 
>     to allocate the hctl, and then
> 
>         firstElement = (HASHELEMENT *) ShmemAlloc(nelem * elementSize);
> 
>     in element_alloc(). But this means the "elements" allocation is aligned
>     to PG_CACHE_LINE_SIZE, i.e. 128B, because ShmemAllocRaw() does this:
> 
>         size = CACHELINEALIGN(size);
> 
>     So it distributes memory in multiples of 128B, and I believe it starts
>     at a multiple of 128B.
> 
>     But the patch reworks this to allocate everything at once, and thus it
>     won't get this alignment automatically. AFAIK that's not intentional,
>     because no one explicitly mentioned this. And it's may not be quite
>     desirable, judging by the comment in ShmemAllocRaw().
> 
> 
> Yes, the patch reworks this to allocate all the shared memory at once.
> It uses ShmemInitStruct which internally calls ShmemAllocRaw. So the
> whole chunk
> of memory allocated is still CACHEALIGNed. 
> 
>     I mentioned v5 adds alignment, but I think it does not quite do that
>     quite correctly. It adds alignment by changing the macros from:
> 
>     +#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
>     +       (sizeof(HASHHDR) + \
>     +        ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
>     +        ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
> 
>     to
> 
>     +#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
>     +       (MAXALIGN(sizeof(HASHHDR)) + \
>     +        ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
>     +        ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))
> 
>     First, it uses MAXALIGN, but that's mostly my fault, because my comment
>     suggested that - the ShmemAllocRaw however and makes the case for using
>     CACHELINEALIGN.
> 
> 
> Good catch. For a shared hash table, allocations need to be
> CACHELINEALIGNED.  
> I think hash_get_init_size does not need to call CACHELINEALIGNED
> explicitly as ShmemInitStruct already does this.
> In that case, the size returned by hash_get_init_size just needs to
> MAXALIGN required structs as per hash_create() requirements and
> CACHELINEALIGN
> will be taken care of in ShmemInitStruct at the time of allocating the
> entire chunk.
> 
> 
>     But more importantly, it adds alignment to all hctl field, and to every
>     element of those arrays. But that's not what the alignment was supposed
>     to do - it was supposed to align arrays, not individual elements. Not
>     only would this waste memory, it would actually break direct access to
>     those array elements.
> 
> 
> I think existing code has occurrences of both i,.e aligning individual
> elements and 
> arrays.
> A similar precedent exists in the function hash_estimate_size(), which only
> applies maxalignment to the individual structs like HASHHDR, HASHELEMENT,
> entrysize, but also an array of HASHBUCKET headers. 
> 
> I agree with you that perhaps we don't need maxalignment for all of
> these structures.
> For ex, HASHBUCKET is a pointer to a linked list of elements, it might
> not require alignment
> if the elements it points to are already aligned.
> 
> 
>     But there's another detail - even before this patch, most of the stuff
>     was allocated at once by ShmemInitStruct(). Everything except for the
>     elements, so to replicate the alignment we only need to worry about that
>     last part. So I think this should do:
> 
>  
> 
>     +#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
>     +    CACHELINEALIGN(sizeof(HASHHDR) + \
>     +     ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
>     +     ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
> 
>     This is what the 0003 patch does. There's still one minor difference, in
>     that we used to align each segment independently - each element_alloc()
>     call allocated a new CACHELINEALIGN-ed chunk, while now have just a
>     single chunk. But I think that's OK.
> 
>  
> Before this patch, following structures were allocated separately using
> ShmemAllocRaw
> directory, each segment(seg_alloc) and a chunk of elements
> (element_alloc). Hence, 
> I don't understand why v-0003* CACHEALIGNs  in the manner it does.
> 
I may not have gotten it quite right. The intent was to align it the
same way as before, and the allocation used to be sized like this:
Size
hash_get_shared_size(HASHCTL *info, int flags)
{
    Assert(flags & HASH_DIRSIZE);
    Assert(info->dsize == info->max_dsize);
    return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
}
But I got confused a bit. I think you're correct here:
> I think if we want to emulate the current behaviour we should do
> something like:
> CACHELINEALIGN(sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)) +
>                 + CACHELINEALIGN(sizeof(HASHBUCKET) * ssize) * nsegs
>                 + CACHELINEALIGN(init_size * elementSize);
> 
> Like you mentioned the only difference would be that we would be
> aligning all elements
> at once instead of aligning individual partitions of elements.
> 
Right. I'm still not convinced if this makes any difference, or whether
this alignment was merely a consequence of using ShmemAlloc(). I don't
want to make this harder to understand unnecessarily.
Let's keep this simple - without additional alignment. I'll think about
it a bit more, and maybe add it before commit.
> 
>     3) I find the comment before hash_get_init_size a bit unclear/confusing.
>     It says this:
> 
>      * init_size should match the total number of elements allocated during
>      * hash table creation, it could be zero for non-shared hash tables
>      * depending on the value of nelem_alloc. For more explanation see
>      * comments within this function.
>      *
>      * nelem_alloc parameter is not relevant for shared hash tables.
> 
>     What does "should match" mean here? Doesn't it *determine* the number of
>     elements allocated? What if it doesn't match?
> 
>  
> by should match I mean - init_size  here  *is* equal to nelem in
> hash_create() .
> 
> 
>     AFAICS it means the hash table is sized to expect init_size elements,
>     but only nelem_alloc elements are actually pre-allocated, right?
> 
>  
> No. All the init_size elements are pre-allocated for shared hash table
> irrespective of 
> nelem_alloc value.
> For non-shared hash tables init_size elements are allocated only
> if it is less than nelem_alloc, otherwise they are allocated as part of
> expansion.
>  
> 
>     But the
>     comment says it's init_size which determines the number of elements
>     allocated during creation. Confusing.
> 
>     It says "it could be zero ... depending on the value of nelem_alloc".
>     Depending how? What's the relationship.
> 
>  
> The relationship is defined in this comment:
>  /*
> * For a shared hash table, preallocate the requested number of elements.
> * This reduces problems with run-time out-of-shared-memory conditions.
> *
> * For a non-shared hash table, preallocate the requested number of
> * elements if it's less than our chosen nelem_alloc.  This avoids wasting
> * space if the caller correctly estimates a small table size.
> */
> 
> hash_create code is confusing because the nelem_alloc named variable is used
> in two different cases, In  the above case  nelem_alloc  refers to the one 
> returned by choose_nelem_alloc function.
> 
> The other nelem_alloc determines the number of elements in each partition
> for a partitioned hash table. This is not what is being referred to in
> the above 
> comment.
> 
>     The bit "For more explanation see comments within this function" is not
>     great, if only because there are not many comments within the function,
>     so there's no "more explanation". But if there's something important, it
>     should be in the main comment, preferably.
> 
>  
> I will improve the comment in the next version.
> 
OK. Do we even need to pass nelem_alloc to hash_get_init_size? It's not
really used except for this bit:
+    if (init_size > nelem_alloc)
+        element_alloc = false;
Can't we determine before calling the function, to make it a bit less
confusing?
regards
-- 
Tomas Vondra
			
		Hi Tomas,
Right. I'm still not convinced if this makes any difference, or whether
this alignment was merely a consequence of using ShmemAlloc(). I don't
want to make this harder to understand unnecessarily.
Yeah, it makes sense. 
Let's keep this simple - without additional alignment. I'll think about
it a bit more, and maybe add it before commit.
OK.
> I will improve the comment in the next version.
>
OK. Do we even need to pass nelem_alloc to hash_get_init_size? It's not
really used except for this bit:
+ if (init_size > nelem_alloc)
+ element_alloc = false;
Can't we determine before calling the function, to make it a bit less
confusing?
Yes, we could determine whether the pre-allocated elements are zero before
calling the function, I have fixed it accordingly in the attached 0001 patch.
Now, there's no need to pass `nelem_alloc` as a parameter. Instead, I've
passed this information as a boolean variable-initial_elems. If it is false,
no elements are pre-allocated.
Please find attached the v7-series, which incorporates your review patches
and addresses a few remaining comments.
Thank you,
Rahila Syed
Вложения
On 3/31/25 01:01, Rahila Syed wrote: > ... > > > > I will improve the comment in the next version. > > > > OK. Do we even need to pass nelem_alloc to hash_get_init_size? It's not > really used except for this bit: > > + if (init_size > nelem_alloc) > + element_alloc = false; > > Can't we determine before calling the function, to make it a bit less > confusing? > > > Yes, we could determine whether the pre-allocated elements are zero before > calling the function, I have fixed it accordingly in the attached 0001 > patch. > Now, there's no need to pass `nelem_alloc` as a parameter. Instead, I've > passed this information as a boolean variable-initial_elems. If it is > false, > no elements are pre-allocated. > > Please find attached the v7-series, which incorporates your review patches > and addresses a few remaining comments. > I think it's almost committable. Attached is v8 with some minor review adjustments, and updated commit messages. Please read through those and feel free to suggest changes. I still found the hash_get_init_size() comment unclear, and it also referenced init_size, which is no longer relevant. I improved the comment a bit (I find it useful to mimic comments of nearby functions, so I did that too here). The "initial_elems" name was a bit confusing, as it seemed to suggest "number of elements", but it's a simple flag. So I renamed it to "prealloc", which seems clearer to me. I also tweaked (reordered/reformatted) the conditions a bit. For the other patch, I realized we can simply MemSet() the whole chunk, instead of resetting the individual parts. regards -- Tomas Vondra
Вложения
Hi,
I think it's almost committable. Attached is v8 with some minor review
adjustments, and updated commit messages. Please read through those and
feel free to suggest changes.
The changes look good to me. 
About the following question.
/* XXX what about segment size? should check have HASH_SEGMENT? */
Do you mean for a shared hash table should the caller have specified HASH_SEGMENT
in flags?
It appears that the current code does not require this change. All the shared hash tables seem
to have the default segment size.
I left the comment as it is as I am not sure if you intend to remove it or not.
 
About the following question.
/* XXX what about segment size? should check have HASH_SEGMENT? */
Do you mean for a shared hash table should the caller have specified HASH_SEGMENT
in flags?
It appears that the current code does not require this change. All the shared hash tables seem
to have the default segment size.
I left the comment as it is as I am not sure if you intend to remove it or not.
I still found the hash_get_init_size() comment unclear, and it also
referenced init_size, which is no longer relevant. I improved the
comment a bit (I find it useful to mimic comments of nearby functions,
so I did that too here). The "initial_elems" name was a bit confusing,
as it seemed to suggest "number of elements", but it's a simple flag. So
I renamed it to "prealloc", which seems clearer to me. I also tweaked
(reordered/reformatted) the conditions a bit.
I appreciate your edtis, the comment and code are clearer now.
PFA the patches after merging the review patches.
Thank you,
PFA the patches after merging the review patches.
Thank you,
Rahila Syed
Вложения
Thanks! I've pushed the first two parts, improving the shmem accounting. I've done a couple additional adjustments before commit, namely: 1) I've renamed hash_get_init_size() to just hash_get_size(). We've not indicated that it's "initial" allocation before, I don't think we need to indicate to do that now. So I've just removed the "shared" part, as that's no longer relevant. 2) I've removed one of the compute_buckets_and_segs() in hash_create, because AFAICS we don't need to recalculate the nsegs - we already have the correct value in the hctl, so I just used that. The fact that we call compute_buckets_and_segs() in multiple places, relying on getting the exact same result seemed a bit fragile to me. With this we now have just two callers (instead of 3), and I don't think we can do better - the first call is in hash_get_size(), and I don't see a convenient way to pass the info init_hash() where we do the 2nd call. 3) In 0002, I've reworked how ProcGlobalShmemSize() calculates the size of the required memory - the earlier patch added PGProcShmemSize(), but the ProcGlobalShmemSize() redid the calculation all over. I changed that to call the new function, and also added a new FastPathLockShmemSize() to do the same for the fast-path lock arrays (which I realized is now allocated as a separate area, so it'll be shown separately). 4) I realized PGProcShmemSize() did the calculation slightly differently from ProcGlobalShmemSize() - by directly multiplying the values, not using mul_size(). I don't recall why we use mul_size(), but it seems like the "usual" way. So I did it that way. I notice we still do the plain multiplication later when setting the ProcGlobal fields. Seems a bit weird, but we always did that - the patch does not really change that. I'll now mark this as committed. I haven't done about the alignment. My conclusion from the discussion was we don't quite need to do that, but if we do I think it's a matter for a separate patch - perhaps something like the 0003. Thanks for the patch, reviews, etc. -- Tomas Vondra
Hi,
I believe it's acceptable to allocate everything in a single block provided we are not trying
to individually free any of these, which we do only for the directory pointer in dir_realloc.
Additional segment allocation goes through seg_alloc and element allocations through
element_alloc, which do not free existing chunks but instead allocate new ones with
pointers in existing directories and segments.
Thus, as long as we don't reallocate the directory, which we avoid in the case of shared
hash tables, it should be safe to proceed with this change.
Please find attached the patch which removes the changes for non-shared hash tables
and keeps them for shared hash tables.
I tested this by running make-check, make-check world and the reproducer script shared
by David. I also ran pgbench to test creation and expansion of some of the
shared hash tables.
Thank you,
Rahila Syed
Analysis of the Bug in 0002 reported by David Rowley :
The 0001* patch allocates memory for the hash header, directory, segments,
and elements collectively for both shared and non-shared hash tables. While
this approach works well for shared hash tables, it presents an issue for non-
shared hash tables. Specifically, during the expand_table() process,
non-shared hash tables may reallocate a new directory and free the old one.
Since the directory's memory is no longer allocated individually, it cannot be freed
separately. This results in the following error:
ERROR: pfree called with invalid pointer 0x60a15edc44e0 (header 0x0000002000000008)
These allocations are done together to improve reporting of shared memory allocations
The 0001* patch allocates memory for the hash header, directory, segments,
and elements collectively for both shared and non-shared hash tables. While
this approach works well for shared hash tables, it presents an issue for non-
shared hash tables. Specifically, during the expand_table() process,
non-shared hash tables may reallocate a new directory and free the old one.
Since the directory's memory is no longer allocated individually, it cannot be freed
separately. This results in the following error:
ERROR: pfree called with invalid pointer 0x60a15edc44e0 (header 0x0000002000000008)
These allocations are done together to improve reporting of shared memory allocations
for shared hash tables. Similar change is done for non-shared hash tables only to maintain
consistency since hash_create code is shared between both types of hash tables.
consistency since hash_create code is shared between both types of hash tables.
One solution could be separating allocation of directory from rest of the allocations for 
the non-shared hash tables, but this approach would undermine the purpose of
doing the change for a non-shared hash table.
A better/safer solution would be to do this change only for shared hash tables and
the non-shared hash tables, but this approach would undermine the purpose of
doing the change for a non-shared hash table.
A better/safer solution would be to do this change only for shared hash tables and
exclude the non-shared hash tables.
I believe it's acceptable to allocate everything in a single block provided we are not trying
to individually free any of these, which we do only for the directory pointer in dir_realloc.
Additional segment allocation goes through seg_alloc and element allocations through
element_alloc, which do not free existing chunks but instead allocate new ones with
pointers in existing directories and segments.
Thus, as long as we don't reallocate the directory, which we avoid in the case of shared
hash tables, it should be safe to proceed with this change.
Please find attached the patch which removes the changes for non-shared hash tables
and keeps them for shared hash tables.
I tested this by running make-check, make-check world and the reproducer script shared
by David. I also ran pgbench to test creation and expansion of some of the
shared hash tables.
Thank you,
Rahila Syed
Вложения
Hi, I'm sorry, but I'm afraid this won't make it into PG18 :-( AFAICS the updated patch is correct / not buggy for non-shared hash tables, but considering I missed a pretty serious flaw before pushing the original patch, and that the tests didn't catch that either ... The risk/benefit is not really worth it for PG18. I think this needs a bit more work/testing. The dynahash is a bit too critical piece, I don't want to be reverting a patch a second time. Sorry, I realize it's not a great outcome ... On 4/4/25 17:45, Rahila Syed wrote: > Hi, > > Analysis of the Bug <https://www.postgresql.org/message- > id/3d36f787-8ba5-4aa2-abb1-7ade129a7b0e%40vondra.me> in 0002 reported by > David Rowley : > The 0001* patch allocates memory for the hash header, directory, segments, > and elements collectively for both shared and non-shared hash tables. While > this approach works well for shared hash tables, it presents an issue > for non- > shared hash tables. Specifically, during the expand_table() process, > non-shared hash tables may reallocate a new directory and free the old one. > Since the directory's memory is no longer allocated individually, it > cannot be freed > separately. This results in the following error: > > ERROR: pfree called with invalid pointer 0x60a15edc44e0 (header > 0x0000002000000008) > > These allocations are done together to improve reporting of shared > memory allocations > for shared hash tables. Similar change is done for non-shared hash > tables only to maintain > consistency since hash_create code is shared between both types of hash > tables. > Thanks, that matches my conclusions after looking into this. > One solution could be separating allocation of directory from rest of > the allocations for > the non-shared hash tables, but this approach would undermine the > purpose of > doing the change for a non-shared hash table. > I don't follow. Why would that undermine the change for non-shared hash tables? Why should this affect non-shared hash tables at all? The goal was to improve accounting for shared hash tables. > A better/safer solution would be to do this change only for shared hash > tables and > exclude the non-shared hash tables. > > I believe it's acceptable to allocate everything in a single block > provided we are not trying > to individually free any of these, which we do only for the directory > pointer in dir_realloc. > Additional segment allocation goes through seg_alloc and element > allocations through > element_alloc, which do not free existing chunks but instead allocate > new ones with > pointers in existing directories and segments. > Thus, as long as we don't reallocate the directory, which we avoid in > the case of shared > hash tables, it should be safe to proceed with this change. > Right, I believe that's correct. But I admit I find the current dynahash code a bit confusing, especially in how it mixes handling of non-shared and shared hash tables. (Not the fault of this patch, ofc.) > Please find attached the patch which removes the changes for non-shared > hash tables > and keeps them for shared hash tables. > > I tested this by running make-check, make-check world and the reproducer > script shared > by David. I also ran pgbench to test creation and expansion of some of the > shared hash tables. > Thanks. I'm afraid the existing regression tests can't tell us much, considering it didn't fail once with the reverted patch :-( I did check the coverage in: https://coverage.postgresql.org/src/backend/utils/hash/dynahash.c.gcov.html and sure enough, dir_realloc() is not executed once. And there's a couple more pieces that have the same issue (e.g. hash_freeze or parts of get_hash_entry). I realize that's not the fault of this patch, but would you be willing to improve that? It'd certainly make me less concerned if we improved this first. regards -- Tomas Vondra
On Fri, Apr 4, 2025 at 9:15 PM Rahila Syed <rahilasyed90@gmail.com> wrote:
>
> Please find attached the patch which removes the changes for non-shared hash tables
> and keeps them for shared hash tables.
>
  /* Allocate initial segments */
+ i = 0;
  for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
  {
- *segp = seg_alloc(hashp);
- if (*segp == NULL)
- return false;
+ /* Assign initial segments, which are also pre-allocated */
+ if (hashp->isshared)
+ {
+ *segp = (HASHSEGMENT) HASH_SEGMENT_PTR(hashp, i++);
+ MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
+ }
+ else
+ {
+ *segp = seg_alloc(hashp);
+ i++;
+ }
In the non-shared hash table case, previously, we used to return false
from here when seg_alloc() fails, but now it seems to be proceeding
without returning false. Is there a reason for the same?
> I tested this by running make-check, make-check world and the reproducer script shared
> by David. I also ran pgbench to test creation and expansion of some of the
> shared hash tables.
>
This covers the basic tests for this patch. I think we should do some
low-level testing of both shared and non-shared hash tables by having
a contrib module or such (we don't need to commit such a contrib
module, but it will give us confidence that the low-level data
structure allocation change is thoroughly tested). We also need to
focus on negative tests where there is insufficient memory in the
system.
--
With Regards,
Amit Kapila.