Обсуждение: Safer hash table initialization macro
Hi hackers,
Currently to create a hash table we do things like:
A) create a struct, say:
typedef struct SeenRelsEntry
{
Oid rel_id;
int list_index;
} SeenRelsEntry;
where the first member is the hash key, and then later:
B)
ctl.keysize = sizeof(Oid);
ctl.entrysize = sizeof(SeenRelsEntry);
ctl.hcxt = CurrentMemoryContext;
seen_rels = hash_create("find_all_inheritors temporary table",
32, /* start small and extend */
&ctl,
I can see 2 possible issues:
1)
We manually specify the type for keysize, which could become incorrect (from the
start) or if the key member's type changes.
2)
It may be possible to remove the key member without the compiler noticing it.
Take this example and remove:
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 929bb53b620..eb11976afef 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -36,7 +36,6 @@
*/
typedef struct SeenRelsEntry
{
- Oid rel_id; /* relation oid */
int list_index; /* its position in output list(s) */
} SeenRelsEntry;
That would compile without any issues because this rel_id member is not
referenced in the code (for this particular example). That's rare but possible.
But then, on my machine, during make check:
TRAP: failed Assert("!found"), File: "nodeModifyTable.c", Line: 5157, PID: 140430
The reason is that the struct member access is done only for bytes level
operations (within the hash related macros). So it's easy to think that this
member is unused (because it is not referenced in the code).
I'm thinking about what kind of safety we could put in place to better deal with
1) and 2).
What about adding a macro that:
- requests the key member name
- ensures that it is at offset 0
- computes the key size based on the member
Something like:
"
#define HASH_ELEM_INIT(ctl, entrytype, keymember) \
do { \
StaticAssertStmt(offsetof(entrytype, keymember) == 0, \
#keymember " must be first member in " #entrytype); \
(ctl).keysize = sizeof(((entrytype *)0)->keymember); \
(ctl).entrysize = sizeof(entrytype); \
} while (0)
"
That way:
- The key member is explicitly referenced in the code (preventing "unused"
false positives)
- The key size is automatically computed from the actual member type (preventing
type mismatches)
- We enforce that the key is at offset 0
An additional benefit: it avoids repeating the "keysize =" followed by "entrysize ="
in a lot of places in the code (currently about 100 times).
If that sounds like a good idea, I could work on a patch doing so.
Thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, 1 Dec 2025 at 14:45, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Thoughts?
I think the hashtable creation API in postgres is so terrible that it
actively discourages usage. At Citus we definitely had the problem
that we would use Lists for cases where a hashtable was preferable
perf wise, just because the API was so terrible.
That's why I eventually implemented some wrapper helper functions to
make it less verbose and error prone to use in by far the most common
patterns we had[1].
So I'm definitely in favor of improving this API (probably by adding a
few new functions). I have a few main thoughts on what could be
improved:
1. Automatically determine keysize and entrysize given a keymember and
entrytype (like you suggested).
2. Autodect most of the flags.
a. HASH_PARTITION, HASH_SEGMENT, HASH_FUNCTION, HASH_DIRSIZE,
HASH_COMPARE, HASH_KEYCOPY, HASH_ALLOC can all be simply be detected
based on the relevant fields from HASHCTL. Passing them in explicitly
is just duplication that causes code noise and is easy to forget
accidentally.
b. HASH_ELEM is useless noise because it is required
c. HASH_BLOBS could be the default (and maybe use HASH_STRINGS by
default if the keytype is char*)
3. Default to CurrentMemoryContext instead of TopMemoryContext. Like
all of the other functions that allocate, because right now it's way
too easy to accidentally use TopMemoryContext when you did not intend
to.
4. Have a creation function that doesn't require HASHCTL at all (just
takes entrytype and keymember and maybe a version of this that takes a
memorycontext).
[1]:
https://github.com/citusdata/citus/blob/ae2eb65be082d52db646b68a644474e24bc6cea1/src/include/distributed/hash_helpers.h#L74
Hi, On Mon, Dec 01, 2025 at 03:44:41PM +0100, Jelte Fennema-Nio wrote: > On Mon, 1 Dec 2025 at 14:45, Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > Thoughts? > > I think the hashtable creation API in postgres is so terrible that it > actively discourages usage. Thanks for sharing your thoughts! > So I'm definitely in favor of improving this API (probably by adding a > few new functions). I have a few main thoughts on what could be > improved: > > 1. Automatically determine keysize and entrysize given a keymember and > entrytype (like you suggested). PFA a patch introducing and using the new macro. Note that it also introduces HASH_ELEM_INIT_FULL for the rare cases where the whole struct is the key. Also one case remains untouched: $ git grep "entrysize = sizeof" "*.c" src/backend/replication/logical/relation.c: ctl.entrysize = sizeof(LogicalRepRelMapEntry); That's because the key is a member of a nested struct so that the new macro can not be used. As there is only one occurrence of it, I think we can keep it as it is. But we could create a dedicated macro for those cases if we feel the need. Now that I'm writing this, that might be a better idea: that way we'd avoid any "entrysize/keysize = " in the .c files. Also a nice side effect of using the macros: 138 insertions(+), 203 deletions(-) > 2. Autodect most of the flags. > a. HASH_PARTITION, HASH_SEGMENT, HASH_FUNCTION, HASH_DIRSIZE, > HASH_COMPARE, HASH_KEYCOPY, HASH_ALLOC can all be simply be detected > based on the relevant fields from HASHCTL. Passing them in explicitly > is just duplication that causes code noise and is easy to forget > accidentally. > b. HASH_ELEM is useless noise because it is required > c. HASH_BLOBS could be the default (and maybe use HASH_STRINGS by > default if the keytype is char*) > 3. Default to CurrentMemoryContext instead of TopMemoryContext. Like > all of the other functions that allocate, because right now it's way > too easy to accidentally use TopMemoryContext when you did not intend > to. > 4. Have a creation function that doesn't require HASHCTL at all (just > takes entrytype and keymember and maybe a version of this that takes a > memorycontext). Thanks for the above suggestions! I did not think so deep as you did during your Citus time, but will think about those too. I suggest we move forward one step at a time, first step being the new macros. Does that make sense to you? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com