Discussion: Should HashSetOp go away
Many years ago I ran into some problems doing maintenance tasks checking for short identifiers which existed in one table but not another. The planner would choose HashSetOp, which was memory inefficient, and it was also unaware of how memory inefficient it was, leading it to blow well past work_mem many times over. So if you set work_mem to a large value in order to get maintenance operations over with quickly while the system is basically single-user, it could cause crashes. It certainly isn't the only part of PostgreSQL which is memory inefficient, but it did seem particularly egregious.
I noticed some changes in this code in v18, so I wanted to revisit the issue. Under commit 27627929528e, it looks like it got 25% more memory efficient, but the planner thinks it got 40% more efficient, so the memory use got better but the estimation actually got worse.
Using the data:
create table jj as select lpad(x::text,15,'0') from generate_series(1,10000000) f(x);
And the dummy query:
select lpad from jj except select lpad from jj;
In v17 it needs a work_mem of at least 270MB to choose HashSetOp, and then actually uses 1.3GB; in 18devel at commit 276279 it needs a work_mem of at least 160MB while actually using 1.0GB. (Actual usage is determined from 'max resident size' in the log_executor_stats output, which is not perfect but should be pretty close -- I intentionally used a small shared_buffers so that it didn't contribute much to the memory usage.) Under 18.0, the numbers are slightly, but not meaningfully, different from commit 276279.
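(For reference, the measurement can be reproduced roughly like this; the exact session settings aren't given above, so the work_mem value here is just an example.)

-- log_executor_stats reports getrusage output, including 'max resident size',
-- at LOG level; raising client_min_messages makes it visible in psql
set log_executor_stats = on;
set client_min_messages = log;
set work_mem = '160MB';
select lpad from jj except select lpad from jj;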
I was thinking of ways to improve the memory usage (or at least its estimation) but decided maybe it would be better if HashSetOp went away entirely. As far as I can tell HashSetOp has nothing to recommend it other than the fact that it already exists. If we instead used an elaboration on Hash Anti Join, then it would automatically get spilling to disk, parallel operations, better estimation, and the benefits of whatever micro optimizations people lavish on the highly used HashJoin machinery but not the obscure, little-used HashSetOp.
It would need to elaborate the HashAntiJoin so that it can deduplicate one input (in the case of EXCEPT) or count the other input (in the case of EXCEPT ALL).
Is there some reason this is not feasible?
Yes, I could (and did) rewrite my query to force it to use the AntiJoin, but why should people need to do that when the planner can do it instead?
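(The exact rewrite isn't shown above; one possible form, assuming the column has no NULLs, is this.)

-- equivalent to the EXCEPT query above for NULL-free data; the DISTINCT
-- replaces the duplicate elimination that EXCEPT does, and the planner is
-- free to implement the NOT EXISTS as a Hash Anti Join
select distinct lpad
from jj a
where not exists (select 1 from jj b where b.lpad = a.lpad);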
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> I noticed some changes in this code v18, so wanted to revisit the issue.
> Under commit 27627929528e, it looks like it got 25% more memory efficient,
> but it thinks it got 40% more efficient, so the memory use got better but
> the estimation actually got worse.
Hmm, so why not fix that estimation?
> I was thinking of ways to improve the memory usage (or at least its
> estimation) but decided maybe it would be better if HashSetOp went away
> entirely. As far as I can tell HashSetOp has nothing to recommend it other
> than the fact that it already exists. If we instead used an elaboration on
> Hash Anti Join, then it would automatically get spilling to disk, parallel
> operations, better estimation, and the benefits of whatever micro
> optimizations people lavish on the highly used HashJoin machinery but not
> the obscure, little-used HashSetOp.
This seems like a pretty bad solution. It would imply exporting the
complexities of duplicate-counting for EXCEPT ALL and INTERSECT ALL
modes into the hash-join logic. We don't need that extra complexity
there (it's more than enough of a mess already), and we don't need
whatever performance hit ordinary hash joins would take.
Also, I doubt the problem is confined to nodeSetOp. I think this is
fundamentally a complaint about BuildTupleHashTable and friends being
unable to spill to disk. Since we also use that logic for hashed
aggregates, RecursiveUnion, and hashed SubPlans, getting rid of
nodeSetOp isn't going to move the needle very far.
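(For concreteness, the other BuildTupleHashTable users can be reached with queries along these lines; these examples are not from the mail above.)

-- when the planner thinks the subquery result fits in memory, the NOT IN
-- becomes a hashed SubPlan, which builds an in-memory tuple hash table
select * from jj where lpad not in (select lpad from jj);

-- the UNION (as opposed to UNION ALL) deduplication in a recursive CTE is
-- done by RecursiveUnion, again with an in-memory tuple hash table
with recursive t(n) as (
    values (1)
    union
    select n + 1 from t where n < 100
)
select count(*) from t;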
regards, tom lane
I wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> I noticed some changes in this code v18, so wanted to revisit the issue.
>> Under commit 27627929528e, it looks like it got 25% more memory efficient,
>> but it thinks it got 40% more efficient, so the memory use got better but
>> the estimation actually got worse.
> Hmm, so why not fix that estimation?
So I poked at this a little bit, and found a few factors affecting it:
* Tuple hash tables are typically given license to use twice work_mem;
see hash_mem_multiplier. Not sure if you were accounting for that
in your test.
* create_setop_path's required-space estimate of entrysize * numGroups
was rather lame before commit 5dfc19814, and it's even more so
afterwards. It's basically only accounting for the tuples themselves,
and not either the hashtable overhead or the SetOpStatePerGroupData
counter space. With wide tuples that might disappear into the noise,
but with only 16-ish data bytes per tuple it's all about the overhead.
On my machine this example uses 80 bytes per tuple in the "SetOp hash
table" context and another 16 or more in the simplehash hashtable.
So about triple what the planner thought.
* We can buy some of this back nearly for free, by switching the
SetOp hash table context to be a BumpContext not an AllocSetContext.
That doesn't move the needle too much in this example with
MEMORY_CONTEXT_CHECKING on, but with it off, the SetOp hash table
consumption drops to 48 bytes/tuple. (Also, for other tuple widths,
AllocSetContext's round-up-to-power-of-2 behavior could hurt a lot
more than it does here.)
* To do better, we probably need to take this computation out of the
planner and have execGrouping.c expose a function to estimate the
TupleHashTable size for N entries and such-and-such average data
width. That in turn will require simplehash.h to expose a function
for estimating the size of its tables, because AFAICS its callers are
not supposed to know such details.
Attached is a quick-hack patch to use a BumpContext for this purpose
in nodeSetOp.c. We should probably look at whether other users of
BuildTupleHashTable can do similarly. I haven't thought hard about
what better-factorized space estimation would look like.
regards, tom lane
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 4068481a523..ce42ea6a549 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -589,13 +589,15 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     /*
      * If hashing, we also need a longer-lived context to store the hash
      * table.  The table can't just be kept in the per-query context because
-     * we want to be able to throw it away in ExecReScanSetOp.
+     * we want to be able to throw it away in ExecReScanSetOp.  We can use a
+     * BumpContext to save storage, because we will have no need to delete
+     * individual table entries.
      */
     if (node->strategy == SETOP_HASHED)
         setopstate->tableContext =
-            AllocSetContextCreate(CurrentMemoryContext,
-                                  "SetOp hash table",
-                                  ALLOCSET_DEFAULT_SIZES);
+            BumpContextCreate(CurrentMemoryContext,
+                              "SetOp hash table",
+                              ALLOCSET_DEFAULT_SIZES);
 
     /*
      * initialize child nodes
On Mon, 27 Oct 2025 at 07:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Jeff Janes <jeff.janes@gmail.com> writes:
> > I was thinking of ways to improve the memory usage (or at least its
> > estimation) but decided maybe it would be better if HashSetOp went away
> > entirely. As far as I can tell HashSetOp has nothing to recommend it other
> > than the fact that it already exists. If we instead used an elaboration on
> > Hash Anti Join, then it would automatically get spilling to disk, parallel
> > operations, better estimation, and the benefits of whatever micro
> > optimizations people lavish on the highly used HashJoin machinery but not
> > the obscure, little-used HashSetOp.
>
> This seems like a pretty bad solution. It would imply exporting the
> complexities of duplicate-counting for EXCEPT ALL and INTERSECT ALL
> modes into the hash-join logic. We don't need that extra complexity
> there (it's more than enough of a mess already), and we don't need
> whatever performance hit ordinary hash joins would take.

If Hash Joins did support IS NOT DISTINCT FROM clauses, then at least
the non-ALL cases could be done with Hash Semi Join and Hash Anti Join
for INTERSECT and EXCEPT, respectively, followed by a HashAgg. I doubt
it would be any faster for the general case, but at least it would
allow those setop queries to run when the inputs don't fit in memory.
It's not ideal though, as when the planner underestimates, Hashed
Setops could still blow up.

David
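(A sketch of the plan shapes being described, not taken from the mail above; today the planner cannot produce them as hash joins, precisely because IS NOT DISTINCT FROM is not hashable as a join clause.)

-- a EXCEPT b ~ anti join on IS NOT DISTINCT FROM, then deduplicate
select distinct a.lpad
from jj a
where not exists (select 1 from jj b where b.lpad is not distinct from a.lpad);

-- a INTERSECT b ~ semi join on IS NOT DISTINCT FROM, then deduplicate
select distinct a.lpad
from jj a
where exists (select 1 from jj b where b.lpad is not distinct from a.lpad);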
David Rowley <dgrowleyml@gmail.com> writes:
> If Hash Joins did support IS NOT DISTINCT FROM clauses, then at least
> the non-ALL cases could be done with Hash Semi Join and Hash Anti Join
> for INTERSECT and EXCEPT, respectively, followed by a HashAgg. I doubt
> it would be any faster for the general case, but at least it would
> allow those setop queries to run when the inputs don't fit in memory.
> It's not ideal though, as when the planner underestimates, Hashed
> Setops could still blow up.
Yeah. As I hinted before, I think a better answer would be to teach
TupleHashTables to be able to spill to disk at need. No idea how
much work that would be, but it would fix all users of that code
not just one of them.
regards, tom lane