Discussion: scale parallel_tuple_cost by tuple width
While investigating a performance issue, I found that it was extremely difficult to get a parallel plan in some cases due to the fixed parallel_tuple_cost. But this cost is not really fixed - it's going to be larger for larger tuples. So this proposal adjusts the cost used according to how large we expect the results to be.

The result is that in the common case where, say, you're getting a group id and some aggregates, a parallel plan is more likely to be chosen. By contrast, queries that generate very wide results will be less likely to choose parallel plans.

The formula chosen does have a fixed-cost piece built into it, which accounts for the shm_mq_sendv() and shm_mq_receive() synchronization that occurs regardless of width.

The patch itself is pretty simple. Also attached is a benchmark report that I had Claude create. Its main result shows a speedup of about 2.7x.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments
Andrew Dunstan <andrew@dunslane.net> writes:
> While investigating a performance issue, I found that it was extremely
> difficult to get a parallel plan in some cases due to the fixed
> parallel_tuple_cost. But this cost is not really fixed - it's going to
> be larger for larger tuples. So this proposal adjusts the cost used
> according to how large we expect the results to be.
Unfortunately, I'm afraid that this is going to produce mostly
"garbage in, garbage out" estimates, because our opinion of how wide
tuples-in-flight are is pretty shaky. (See get_expr_width and
particularly get_typavgwidth, and note that we only have good
statistics-based numbers for plain Vars not function outputs.)
I agree that it could be useful to have some kind of adjustment here,
but I fear that linear scaling is putting way too much faith in the
quality of the data.
How many cases have you checked with this modified code? Did it
make the plan worse in any cases?
regards, tom lane
On Tue, 31 Mar 2026 at 03:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > While investigating a performance issue, I found that it was extremely
> > difficult to get a parallel plan in some cases due to the fixed
> > parallel_tuple_cost. But this cost is not really fixed - it's going to
> > be larger for larger tuples. So this proposal adjusts the cost used
> > according to how large we expect the results to be.
>
> Unfortunately, I'm afraid that this is going to produce mostly
> "garbage in, garbage out" estimates, because our opinion of how wide
> tuples-in-flight are is pretty shaky. (See get_expr_width and
> particularly get_typavgwidth, and note that we only have good
> statistics-based numbers for plain Vars not function outputs.)
>
> I agree that it could be useful to have some kind of adjustment here,
> but I fear that linear scaling is putting way too much faith in the
> quality of the data.

(I suspect you're saying this because of the "Benchmark 2" in the text file, which contains aggregates that return a varlena type, whose width we won't estimate very well...)

Sure, it's certainly true that there are cases where we don't get the width estimate right, but that's not stopped us using them before. So why is this case so much more critical? We now also have GROUP BY before join abilities in the planner, which I suspect must also be putting trust in the very same thing. Also, varlena-returning Aggrefs aren't going to be in the Gather/GatherMerge targetlist, so why avoid making improvements in this area because we're not great at estimating one of the things that could be in the targetlist?

For the patch and the analysis: This reminds me of [1], where some reverse-engineering of costs from query run-times was done, which ended up figuring out what we set APPEND_CPU_COST_MULTIPLIER to.
To get that for this case would require various tests with different tuple widths, ensuring that the costs scale linearly with the run-time of the query with the patched version. Of course, the test query would have to have perfect width estimates, but that could be easy enough to do by populating a text-typed GROUP BY column with text of a single fixed width for each test, increasing the width for the next test, and using a fixed-width aggregate each time, e.g. count(*).

The "#define PARALLEL_TUPLE_COST_REF_WIDTH 100" does seem to be quite a round number. It would be good to know how close this is to reality. Ideally, it would be good to see results from an Apple M<something> chip and recent x86. In my experience, these produce very different performance results, so it might be nice to find a value that is somewhere in the middle of what we get from those machines.

I suspect having the GROUP BY column with text widths from 8 to 1024, increasing in powers of two, would be enough data points.

David

[1] https://postgr.es/m/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B=ZRh-rxy9qxfPA5Gw@mail.gmail.com
David Rowley <dgrowleyml@gmail.com> writes:
> On Tue, 31 Mar 2026 at 03:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Unfortunately, I'm afraid that this is going to produce mostly
>> "garbage in, garbage out" estimates, because our opinion of how wide
>> tuples-in-flight are is pretty shaky.
> Sure, it's certainly true that there are cases where we don't get the
> width estimate right, but that's not stopped us using them before. So
> why is this case so much more critical?
What I'm concerned about is that the estimated cost's dependency on
tuple width may be much stronger here than it has been in other uses.
That impression might be false, of course.
> For the patch and the analysis: This reminds me of [1], where some
> reverse-engineering of costs from query run-times was done, which
> ended up figuring out what we set APPEND_CPU_COST_MULTIPLIER to. To
> get that for this case would require various tests with different
> tuple widths and ensuring that the costs scale linearly with the
> run-time of the query with the patched version.
Right, I think we really need more test cases to base these magic
numbers on. Such testing would likely also alleviate (or not...)
my discomfort with the width estimates being so poor.
regards, tom lane
On Tue, 31 Mar 2026 at 12:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What I'm concerned about is that the estimated cost's dependency on
> tuple width may be much stronger here than it has been in other uses.
> That impression might be false, of course.

I think it's good to be concerned, but I think this is far from the worst place to put trust in the width estimates. We also use them in Memoize, and if we underestimate there, then we might end up with a Nested Loop -> Memoize plan instead of a Hash or Merge Join. If the actual Memoize cache hit ratio ends up much worse than expected due to wider-than-expected tuples, then the chosen plan might be well off being the optimal one. The execution costs of running a poorly chosen Nested Loop with a poorly caching Memoize can become quadratic. I think the parallel vs non-parallel problem is much more linear.

I'm more concerned about the opposite problem of being too liberal and choosing parallel plans too often, resulting in worker exhaustion and poorer performance as a result of serially executing parallel plans. I suppose people could fix that by bumping up the parallel_setup_cost so that the planner favours reserving parallel workers for plans that get much bigger benefits from parallelisation.

David