Discussion: Disk-based hash aggregate's cost model
We have a Postgres 13 open item for Disk-based hash aggregate, which is
the only non-trivial open item. There is a general concern about how
often we get disk-based hash aggregation when work_mem is particularly
low, and recursion seems unavoidable. This is generally thought to be a
problem in the costing.

Tomas Vondra did some testing of the patch which led to discussion of
this on the hash agg GUC megathread, here:

https://www.postgresql.org/message-id/20200724012248.y77rpqc73agrsvb3@development

I am starting this new thread to deal with the open item. Any progress
on this, Jeff? Please correct or expand on my summary of the open item
if I got something wrong.

Thanks
--
Peter Geoghegan
On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote:
> We have a Postgres 13 open item for Disk-based hash aggregate, which
> is the only non-trivial open item. There is a general concern about
> how often we get disk-based hash aggregation when work_mem is
> particularly low, and recursion seems unavoidable. This is generally
> thought to be a problem in the costing.
We discussed two approaches to tweaking the cost model:
1. Penalize HashAgg disk costs by a constant amount. It seems to be
chosen a little too often, and we can reduce the number of plan
changes.
2. Treat recursive HashAgg spilling skeptically, and heavily penalize
recursive spilling.
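To make the two options concrete, here is a minimal standalone sketch
of the two penalty shapes (illustrative only -- the function names, the
1.5X constant, and the page counts are assumptions for the example, not
the actual cost_agg() code):

    #include <stdio.h>

    /* Approach #1: flat penalty on all of HashAgg's spill I/O. */
    static double
    disk_cost_flat(double pages_read, double pages_written,
                   double seq_page_cost)
    {
        return 1.5 * seq_page_cost * (pages_read + pages_written);
    }

    /* Approach #2: penalty grows with the number of spill passes, so a
     * single-pass spill stays cheap but recursion is heavily penalized. */
    static double
    disk_cost_recursive(double pages_read, double pages_written,
                        double seq_page_cost, double depth)
    {
        return depth * seq_page_cost * (pages_read + pages_written);
    }

    int
    main(void)
    {
        /* e.g. ~1M pages spilled and read back, seq_page_cost = 1.0 */
        printf("flat:      %.0f\n", disk_cost_flat(1e6, 1e6, 1.0));
        printf("depth = 3: %.0f\n", disk_cost_recursive(1e6, 1e6, 1.0, 3.0));
        return 0;
    }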
The problem with approach #2 is that we have a default hash mem of 4MB,
and real systems have a lot more than that. In this scenario, recursive
spilling can beat Sort by a lot.
For instance:
Data:
create table text10m(t text collate "C.UTF-8", i int, n numeric);
insert into text10m
select s.g::text, s.g, s.g::numeric
from (
select (random()*1000000000)::int as g
from generate_series(1,10000000)) s;
vacuum (freeze,analyze) text10m;
Query: explain analyze select distinct t from text10m;
HashAgg: 10.5s
Sort+Distinct: 46s
I'm inclined toward option #1 for simplicity unless you feel strongly
about option #2. Specifically, I was thinking of a 1.5X penalty for
HashAgg disk costs.
Regards,
Jeff Davis
On Fri, Aug 28, 2020 at 06:32:38PM -0700, Jeff Davis wrote:
>On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote:
>> We have a Postgres 13 open item for Disk-based hash aggregate, which
>> is the only non-trivial open item. There is a general concern about
>> how often we get disk-based hash aggregation when work_mem is
>> particularly low, and recursion seems unavoidable. This is generally
>> thought to be a problem in the costing.
>
>We discussed two approaches to tweaking the cost model:
>
>1. Penalize HashAgg disk costs by a constant amount. It seems to be
>chosen a little too often, and we can reduce the number of plan
>changes.
>
>2. Treat recursive HashAgg spilling skeptically, and heavily penalize
>recursive spilling.
>
>The problem with approach #2 is that we have a default hash mem of 4MB,
>and real systems have a lot more than that. In this scenario, recursive
>spilling can beat Sort by a lot.
>

I think the main issue is that we're mostly speculating about what's
wrong. I've shared some measurements and symptoms, and we've discussed
what might be causing it, but I'm not sure we know for certain.

I really dislike (1) because it seems more like a "we don't know what's
wrong, so we'll penalize hashagg" kind of solution. A much more
principled solution would be to tweak the costing accordingly, not just
by adding some constant. For (2), it really depends on whether recursive
spilling is actually the problem here. In the examples I shared, the
number of partitions/batches was very different, but the query duration
was mostly independent (almost constant).

FWIW I still haven't seen any explanation of why the current code spills
more data than the CP_SMALL_TLIST patch (which was reverted).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Aug 30, 2020 at 02:26:20AM +0200, Tomas Vondra wrote:
>On Fri, Aug 28, 2020 at 06:32:38PM -0700, Jeff Davis wrote:
>>On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote:
>>>We have a Postgres 13 open item for Disk-based hash aggregate, which
>>>is the only non-trivial open item. There is a general concern about
>>>how often we get disk-based hash aggregation when work_mem is
>>>particularly low, and recursion seems unavoidable. This is generally
>>>thought to be a problem in the costing.
>>
>>We discussed two approaches to tweaking the cost model:
>>
>>1. Penalize HashAgg disk costs by a constant amount. It seems to be
>>chosen a little too often, and we can reduce the number of plan
>>changes.
>>
>>2. Treat recursive HashAgg spilling skeptically, and heavily penalize
>>recursive spilling.
>>
>>The problem with approach #2 is that we have a default hash mem of 4MB,
>>and real systems have a lot more than that. In this scenario, recursive
>>spilling can beat Sort by a lot.
>>
>
>I think the main issue is that we're mostly speculating about what's
>wrong. I've shared some measurements and symptoms, and we've discussed
>what might be causing it, but I'm not sure we know for certain.
>
>I really dislike (1) because it seems more like a "we don't know what's
>wrong, so we'll penalize hashagg" kind of solution. A much more
>principled solution would be to tweak the costing accordingly, not just
>by adding some constant. For (2), it really depends on whether recursive
>spilling is actually the problem here. In the examples I shared, the
>number of partitions/batches was very different, but the query duration
>was mostly independent (almost constant).
>
I've decided to look at the costing a bit more closely today, to see
why the hashagg costing is so much lower compared to sort/groupagg.
I've used the same 32GB dataset and query as in [1].
I've repeated the tests for all the work_mem values, and the number of
batches is much lower, probably thanks to the HLL improvement:

    work_mem   Planned   Batches (old)   Batches (new)
    ---------------------------------------------------
    2MB             64            4160            2977
    4MB            128           16512            1569
    8MB            256           21488            1289
    64MB            32            2720             165
    256MB            8               8              41
The impact on duration of the queries seems pretty negligible, though.
The plans with work_mem=64MB look like this:
1) hashagg
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=11267293.86..11267293.86 rows=1 width=36) (actual time=213127.515..213127.517 rows=0 loops=1)
   ->  HashAggregate  (cost=10229061.10..11267293.86 rows=6718533 width=36) (actual time=100862.623..212948.642 rows=6400000 loops=1)
         Group Key: lineitem.l_partkey
         Planned Partitions: 32  Batches: 165  Memory Usage: 67857kB  Disk Usage: 6802432kB
         ->  Seq Scan on lineitem  (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.418..20344.631 rows=192000551 loops=1)
 Planning Time: 0.064 ms
 Execution Time: 213441.986 ms
(7 rows)
2) groupagg
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=36755617.81..36755617.81 rows=1 width=36) (actual time=180029.594..180029.595 rows=0 loops=1)
   ->  GroupAggregate  (cost=35214909.30..36755617.81 rows=6718533 width=36) (actual time=94017.788..179857.683 rows=6400000 loops=1)
         Group Key: lineitem.l_partkey
         ->  Sort  (cost=35214909.30..35694886.14 rows=191990736 width=9) (actual time=94017.750..151138.727 rows=192000551 loops=1)
               Sort Key: lineitem.l_partkey
               Sort Method: external merge  Disk: 3742208kB
               ->  Seq Scan on lineitem  (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.008..26831.264 rows=192000551 loops=1)
 Planning Time: 0.063 ms
 Execution Time: 180209.580 ms
(9 rows)
I still don't understand why the hashagg is costed so low compared to
the sort (only about 1/3 of it), because both cases use exactly the same
estimates internally. cost_tuplesort ends up with
npages = 937455
nruns = 114.435396
input_bytes = 7679629440
log_runs = 1.0
while cost_agg uses
pages_read = 937455
pages_written = 937455
relation_size = 7679629440;
yet we end up with a much lower estimate for hashagg. It does seem to
me, however, that this is mostly due to non-I/O costs, which
cost_tuplesort considers and cost_agg perhaps ignores. In particular,
most of the sort cost comes from this:
*startup_cost = comparison_cost * tuples * LOG2(tuples);
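As a back-of-the-envelope check of that term (assuming the default
cpu_operator_cost of 0.0025; cost_tuplesort uses comparison_cost =
2 * cpu_operator_cost, and the tuple count is the row estimate from the
plans above):

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double cpu_operator_cost = 0.0025;      /* planner default */
        double comparison_cost = 2.0 * cpu_operator_cost;
        double tuples = 191990736.0;            /* rows from the plans above */

        /* the dominant CPU term in cost_tuplesort */
        printf("%.0f\n", comparison_cost * tuples * log2(tuples));
        /* prints ~26.4 million -- most of the Sort's ~35M startup cost */
        return 0;
    }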
So I'm wondering if the hashagg is not ignoring similar non-I/O costs
for the spilling case. In particular, the initial section computing
startup_cost seems to ignore that we may need to do some of the stuff
repeatedly - for example we'll repeat hash lookups for spilled tuples,
and so on.
The other thing is that sort seems to be doing only about half the
physical I/O (as measured by iosnoop) compared to hashagg, even though
the estimates of pages / input_bytes are exactly the same. For hashagg
the iosnoop shows 5921MB reads and 7185MB writes, while sort only does
2895MB reads and 3655MB writes. Which kinda matches the observed sizes
of temp files in the two cases, so the input_bytes for sort seems to be
a bit overestimated.
regards
[1] https://www.postgresql.org/message-id/20200724012248.y77rpqc73agrsvb3@development
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, 2020-08-30 at 17:03 +0200, Tomas Vondra wrote:
> So I'm wondering if the hashagg is not ignoring similar non-I/O costs
> for the spilling case. In particular, the initial section computing
> startup_cost seems to ignore that we may need to do some of the stuff
> repeatedly - for example we'll repeat hash lookups for spilled
> tuples,
> and so on.
To fix that, we'd also need to change the cost of in-memory HashAgg,
right?
> The other thing is that sort seems to be doing only about half the
> physical I/O (as measured by iosnoop) compared to hashagg, even
> though
> the estimates of pages / input_bytes are exactly the same. For
> hashagg
> the iosnoop shows 5921MB reads and 7185MB writes, while sort only
> does
> 2895MB reads and 3655MB writes. Which kinda matches the observed
> sizes
> of temp files in the two cases, so the input_bytes for sort seems to
> be
> a bit overestimated.
Hmm, interesting.
How reasonable is it to be making these kinds of changes to the cost
model right now? I think your analysis is solid, but I'm worried about
making more intrusive changes very late in the cycle.
I had originally tried to limit the cost model changes to the new plans
I am introducing -- that is, HashAgg plans expected to require disk.
That's why I came up with a rather arbitrary penalty.
Regards,
Jeff Davis
On Mon, Aug 31, 2020 at 11:34:34PM -0700, Jeff Davis wrote:
>On Sun, 2020-08-30 at 17:03 +0200, Tomas Vondra wrote:
>> So I'm wondering if the hashagg is not ignoring similar non-I/O costs
>> for the spilling case. In particular, the initial section computing
>> startup_cost seems to ignore that we may need to do some of the stuff
>> repeatedly - for example we'll repeat hash lookups for spilled
>> tuples,
>> and so on.
>
>To fix that, we'd also need to change the cost of in-memory HashAgg,
>right?
>

Why? I don't think we need to change costing of in-memory HashAgg. My
assumption was we'd only tweak startup_cost for cases with spilling by
adding something like (cpu_operator_cost * npartitions * ntuples).

>> The other thing is that sort seems to be doing only about half the
>> physical I/O (as measured by iosnoop) compared to hashagg, even
>> though
>> the estimates of pages / input_bytes are exactly the same. For
>> hashagg
>> the iosnoop shows 5921MB reads and 7185MB writes, while sort only
>> does
>> 2895MB reads and 3655MB writes. Which kinda matches the observed
>> sizes
>> of temp files in the two cases, so the input_bytes for sort seems to
>> be
>> a bit overestimated.
>
>Hmm, interesting.
>

FWIW I suspect some of this difference may be due to logical vs.
physical I/O. iosnoop only tracks physical I/O sent to the device, but
maybe we do much more logical I/O and it simply does not expire from
page cache for the sort. It might behave differently for larger data
set, longer query, ...

>How reasonable is it to be making these kinds of changes to the cost
>model right now? I think your analysis is solid, but I'm worried about
>making more intrusive changes very late in the cycle.
>
>I had originally tried to limit the cost model changes to the new plans
>I am introducing -- that is, HashAgg plans expected to require disk.
>That's why I came up with a rather arbitrary penalty.
>

I don't know. I certainly understand the desire not to change things
this late. OTOH I'm worried that we'll end up receiving a lot of poor
plans post release.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 2020-09-01 at 11:19 +0200, Tomas Vondra wrote:
> Why? I don't think we need to change costing of in-memory HashAgg. My
> assumption was we'd only tweak startup_cost for cases with spilling
> by
> adding something like (cpu_operator_cost * npartitions * ntuples).
The code above (the in-memory case) has a clause:
startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
which seems to account only for the hash calculation, because it's
multiplying by the number of grouping columns.
Your calculation would also use cpu_operator_cost, but just for the
lookup. I'm OK with that, but it's a little inconsistent to only count
it for the tuples that spill to disk.
But why multiply by the number of partitions? Wouldn't it be the depth?
A wide fanout will not increase the number of lookups.
> FWIW I suspect some of this difference may be due to logical vs.
> physical I/O. iosnoop only tracks physical I/O sent to the device,
> but
> maybe we do much more logical I/O and it simply does not expire from
> page cache for the sort. It might behave differently for larger data
> set, longer query, ...
That would suggest something like a penalty for HashAgg for being a
worse IO pattern. Or do you have another suggestion?
> I don't know. I certainly understand the desire not to change things
> this late. OTOH I'm worried that we'll end up receiving a lot of poor
> plans post release.
I was reacting mostly to changing the cost of Sort. Do you think
changes to Sort are required or did I misunderstand?
Regards,
Jeff Davis
On Tue, Sep 01, 2020 at 12:58:59PM -0700, Jeff Davis wrote:
>On Tue, 2020-09-01 at 11:19 +0200, Tomas Vondra wrote:
>> Why? I don't think we need to change costing of in-memory HashAgg. My
>> assumption was we'd only tweak startup_cost for cases with spilling
>> by
>> adding something like (cpu_operator_cost * npartitions * ntuples).
>
>The code above (the in-memory case) has a clause:
>
>  startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
>
>which seems to account only for the hash calculation, because it's
>multiplying by the number of grouping columns.
>
>Your calculation would also use cpu_operator_cost, but just for the
>lookup. I'm OK with that, but it's a little inconsistent to only count
>it for the tuples that spill to disk.
>
>But why multiply by the number of partitions? Wouldn't it be the depth?
>A wide fanout will not increase the number of lookups.
>

Yeah, I think you're right it should be depth, not the number of
partitions. FWIW I don't know if this is enough to "fix" the costing,
it's just something I noticed while looking at the code.

>> FWIW I suspect some of this difference may be due to logical vs.
>> physical I/O. iosnoop only tracks physical I/O sent to the device,
>> but
>> maybe we do much more logical I/O and it simply does not expire from
>> page cache for the sort. It might behave differently for larger data
>> set, longer query, ...
>
>That would suggest something like a penalty for HashAgg for being a
>worse IO pattern. Or do you have another suggestion?
>

Possibly, yes. I think it'd be good to measure logical I/O (e.g. by
adding some instrumentation to LogicalTapeSet) to see if this
hypothesis is actually true.

FWIW any thoughts about the difference in temp size compared to
CP_SMALL_TLIST?

>> I don't know. I certainly understand the desire not to change things
>> this late. OTOH I'm worried that we'll end up receiving a lot of poor
>> plans post release.
>
>I was reacting mostly to changing the cost of Sort. Do you think
>changes to Sort are required or did I misunderstand?
>

Not sure I'm following. I don't think anyone proposed changing the
costing for Sort. Or did I miss something?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 1, 2020 at 2:19 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> FWIW I suspect some of this difference may be due to logical vs.
> physical I/O. iosnoop only tracks physical I/O sent to the device, but
> maybe we do much more logical I/O and it simply does not expire from
> page cache for the sort. It might behave differently for larger data
> set, longer query, ...

There is also the fact that the LogicalTapeSetBlocks() instrumentation
is known to have problems that we still need to fix:

https://www.postgresql.org/message-id/CAH2-Wzn5PCBLUrrds=hD439LtWP+PD7ekRTd=8LdtqJ+KO5D1Q@mail.gmail.com

I'm not suggesting that this is a significant factor here. But I can't
rule it out just yet either.

> I don't know. I certainly understand the desire not to change things
> this late. OTOH I'm worried that we'll end up receiving a lot of poor
> plans post release.

I think that this needs to get fixed before release.

--
Peter Geoghegan
On Sun, 2020-08-30 at 17:03 +0200, Tomas Vondra wrote:
> So I'm wondering if the hashagg is not ignoring similar non-I/O costs
> for the spilling case. In particular, the initial section computing
> startup_cost seems to ignore that we may need to do some of the stuff
> repeatedly - for example we'll repeat hash lookups for spilled
> tuples,
> and so on.
I tried to analyze this using a slightly different approach: cost units
per second of runtime. Obviously this will vary based on the machine,
but it's interesting anyway.
All of the following fit in system memory. Schema, data, and queries
are at the end of this email.
A low value of cost-units/second-runtime means "more likely to be
chosen incorrectly" and a high value means "more likely to be missed
incorrectly".
  Plan    | work_mem | 10M  | 100M INT4    | 100M | 10M
          |          | INT4 | (10M groups) | INT4 | TEXT
  --------+----------+------+--------------+------+-----
  HashAgg |      4MB |   88 |           63 |   82 |   78
  HashAgg |      1TB |   41 |           37 |   33 |   38
  Sort    |      4MB |  182 |          188 |  174 |   37
  Sort    |      1TB |  184 |          231 |  189 |   30
Sort is consistent for a given datatype, but it seems that the
comparison cost is far from proportionate between data types.
HashAgg is consistent between the data types, but the disk costs play a
larger role (in this case, there is no actual disk access to worry
about, because it fits in system memory).
At least for these simple queries, Sort is punished unfairly for INT4,
but gets an unfair boost for TEXT.
It seems that we should make a change here, but to be conservative for
13, we should limit changes to the new plans, which are the first line
(HashAgg that spills).
The attached patch makes two adjustments: a 2X disk penalty for
HashAgg, and I also add:
spill_cost = depth * input_tuples * 2.0 * cpu_tuple_cost
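(A hedged sketch of that term, not the attached patch itself; in
particular, deriving depth as the number of spill passes at a uniform
partition fanout is an assumption:)

    #include <math.h>
    #include <stdio.h>

    static double
    spill_cost(double input_tuples, double nbatches, double num_partitions,
               double cpu_tuple_cost)
    {
        /* assumed: passes needed to fully partition nbatches batches */
        double depth = ceil(log(nbatches) / log(num_partitions));

        return depth * input_tuples * 2.0 * cpu_tuple_cost;
    }

    int
    main(void)
    {
        /* e.g. 192M input tuples, 165 batches at a planned fanout of 32 */
        printf("%.0f\n", spill_cost(191990736.0, 165.0, 32.0, 0.01));
        return 0;
    }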
The new results:
  Plan    | work_mem | 10M  | 100M INT4    | 100M | 10M
          |          | INT4 | (10M groups) | INT4 | TEXT
  --------+----------+------+--------------+------+-----
  HashAgg |      4MB |  192 |          131 |  178 |  166
That's much more comparable to Sort for INT4, but makes the gap wider
for TEXT. Fortunately, at least for my simple queries, it just barely
avoids a plan change to Sort for the TEXT case (which is nearly 5X
slower than HashAgg).
I think this approach is reasonable for 13: it only changes the costing
for HashAgg plans that are expected to spill, it's fairly conservative
so we will not get lots of new bad plans, and it still seems to use
HashAgg for cases where it clearly should.
Note: in-memory HashAgg is still unfairly favored over Sort, at least
for these cases.
Regards,
Jeff Davis
create table t10m(i int);
insert into t10m select (random()*1000000000)::int from
generate_series(1,10000000);
explain analyze select distinct i from t10m;
create table t100m(i int);
insert into t100m select (random()*2000000000)::int from
generate_series(1,100000000);
explain analyze select distinct i from t100m;
-- 100m tuples, 10m groups, 10tuples/group
create table t100m_10(i int);
insert into t100m_10 select (random()*10000000)::int from
generate_series(1,100000000);
explain analyze select distinct i from t100m_10;
create table text10m(t text collate "C.UTF-8", i int, n numeric);
insert into text10m select s.g::text, s.g, s.g::numeric from (select
(random()*1000000000)::int as g from generate_series(1,10000000)) s;
explain analyze select distinct t from text10m;
On Wed, Sep 2, 2020 at 5:18 PM Jeff Davis <pgsql@j-davis.com> wrote:
> create table text10m(t text collate "C.UTF-8", i int, n numeric);
> insert into text10m select s.g::text, s.g, s.g::numeric from (select
> (random()*1000000000)::int as g from generate_series(1,10000000)) s;
> explain analyze select distinct t from text10m;

Note that you won't get what Postgres considers to be the C collation
unless you specify "collate C" -- "C.UTF-8" is the C collation exposed
by glibc. The difference matters a lot, because only the former can use
abbreviated keys (unless you manually #define TRUST_STRXFRM). And even
without abbreviated keys it's probably still significantly faster for
other reasons.

This doesn't undermine your point, because we don't take the difference
into account in cost_sort() -- even though abbreviated keys will
regularly make text sorts 2x-3x faster. My point is only that it would
be more accurate to say that the costing unfairly boosts sorts on
collated text specifically. Though maybe not when an ICU collation is
used (since abbreviated keys will be enabled generally).

--
Peter Geoghegan
On Wed, 2020-09-02 at 17:35 -0700, Peter Geoghegan wrote:
> On Wed, Sep 2, 2020 at 5:18 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > create table text10m(t text collate "C.UTF-8", i int, n numeric);
> > insert into text10m select s.g::text, s.g, s.g::numeric from
> > (select
> > (random()*1000000000)::int as g from generate_series(1,10000000))
> > s;
> > explain analyze select distinct t from text10m;
>
> Note that you won't get what Postgres considers to be the C collation
> unless you specify "collate C" -- "C.UTF-8" is the C collation
> exposed
> by glibc. The difference matters a lot, because only the former can
> use abbreviated keys (unless you manually #define TRUST_STRXFRM). And
> even without abbreviated keys it's probably still significantly
> faster
> for other reasons.
Thank you. I reran with:
create table text10m2(t text collate "C", i int, n numeric);
-- same data, same queries
And the new table is:
  Plan     | work | 10M  | 100M INT4 | 100M | 10M  | 10M
           | mem  | INT4 | 10M grps  | INT4 | TEXT | TEXTC
  ---------+------+------+-----------+------+------+------
  HashAgg  | 4MB  |   88 |        63 |   82 |   78 |    80
  HashAgg  | 1TB  |   41 |        37 |   33 |   38 |    43
  Sort     | 4MB  |  182 |       188 |  174 |   37 |   146
  Sort     | 1TB  |  184 |       231 |  189 |   30 |   149
  HashAgg* | 4MB  |  192 |       131 |  178 |  166 |   176
*: patched
For the 'COLLATE "C"' case, the costs still come out almost the
same between HashAgg and Sort, but the runtimes are much closer. So
even if it did flip the plan from HashAgg to Sort, it goes from 9.5s
(HashAgg) to 12s (Sort), which is not so bad.
So the patched version looks good to me at this point. It accounts for
Tomas's observations about IO:
"The other thing is that sort seems to be doing only about half the
physical I/O (as measured by iosnoop) compared to hashagg, even though
the estimates of pages / input_bytes are exactly the same."
by penalizing HashAgg disk costs by 2X.
The patch also accounts for his other observation about missing CPU
costs by costing the spilling. Tomas framed the CPU costs as the cost
of the extra lookups, but the extra lookups are only in the cases where
it misses in the hash table and needs to spill. So I think it's
reasonable to consider the extra lookups as a part of the spill cost.
The remaining problems are:
* comparison costs for Sort should be adjusted to make them relatively
consistent between data types
* in-memory HashAgg is unfairly favored in a lot of cases
I don't think either of those problems need to be (or should be) fixed
in 13, but we can revisit in 14 if there are reports of bad plans.
Regards,
Jeff Davis
On Tue, 2020-09-01 at 23:19 +0200, Tomas Vondra wrote:
> FWIW any thoughts about the difference in temp size compared to
> CP_SMALL_TLIST?
Are you referring to results from a while ago? In this thread I don't
see what you're referring to.
I tried in a simple case on REL_13_STABLE, with and without the
CP_SMALL_TLIST change, and I saw only a tiny difference. Do you have a
current case that shows a larger difference?
The only thing I can think of that might change is the size of the null
bitmap or how fields are aligned.
Regards,
Jeff Davis
On Thu, Sep 03, 2020 at 05:53:43PM -0700, Jeff Davis wrote:
>On Tue, 2020-09-01 at 23:19 +0200, Tomas Vondra wrote:
>> FWIW any thoughts about the difference in temp size compared to
>> CP_SMALL_TLIST?
>
>Are you referring to results from a while ago? In this thread I don't
>see what you're referring to.
>
>I tried in a simple case on REL_13_STABLE, with and without the
>CP_SMALL_TLIST change, and I saw only a tiny difference. Do you have a
>current case that shows a larger difference?
>

I'm referring to the last charts in the message from July 27 [1],
comparing behavior with the CP_SMALL_TLIST fix vs. master (which
reverted/replaced the CP_SMALL_TLIST bit).

Those charts show that the CP_SMALL_TLIST resulted in smaller temp
files (per EXPLAIN ANALYZE the difference is ~25%) and also lower query
durations (also in the ~25% range). I can repeat those tests, if
needed.

[1] https://www.postgresql.org/message-id/20200724012248.y77rpqc73agrsvb3@development

>The only thing I can think of that might change is the size of the null
>bitmap or how fields are aligned.
>

Maybe. Not sure.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 2020-09-04 at 14:56 +0200, Tomas Vondra wrote:
> Those charts show that the CP_SMALL_TLIST resulted in smaller temp
> files
> (per EXPLAIN ANALYZE the difference is ~25%) and also lower query
> durations (also in the ~25% range).
I was able to reproduce the problem, thank you.
Only two attributes are needed, so the CP_SMALL_TLIST projected schema
only needs a single-byte null bitmap.
But if we just set the attributes to NULL rather than projecting them,
the null bitmap size is based on all 16 attributes, which bumps the
bitmap size to two bytes.
MAXALIGN(23 + 1) = 24
MAXALIGN(23 + 2) = 32
I think that explains it. It's not ideal, but projection has a cost as
well, so I don't think we necessarily need to do something here.
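A quick standalone check of that arithmetic (with MAXALIGN
reimplemented here for an 8-byte maximum alignment, rather than taken
from the Postgres headers):

    #include <stdio.h>

    /* 8-byte maximum alignment, as on typical 64-bit platforms */
    #define MAXALIGN(LEN) (((unsigned long) (LEN) + 7) & ~((unsigned long) 7))

    int
    main(void)
    {
        printf("%lu\n", MAXALIGN(23 + 1));  /* 24: header + 1-byte bitmap */
        printf("%lu\n", MAXALIGN(23 + 2));  /* 32: header + 2-byte bitmap */
        return 0;
    }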
If we are motivated to improve this in v14, we could potentially have a
different schema for spilled tuples, and perform real projection at
spill time. But I don't know if that's worth the extra complexity.
Regards,
Jeff Davis
On Fri, Sep 04, 2020 at 11:31:36AM -0700, Jeff Davis wrote:
>On Fri, 2020-09-04 at 14:56 +0200, Tomas Vondra wrote:
>> Those charts show that the CP_SMALL_TLIST resulted in smaller temp
>> files
>> (per EXPLAIN ANALYZE the difference is ~25%) and also lower query
>> durations (also in the ~25% range).
>
>I was able to reproduce the problem, thank you.
>
>Only two attributes are needed, so the CP_SMALL_TLIST projected schema
>only needs a single-byte null bitmap.
>
>But if we just set the attributes to NULL rather than projecting them,
>the null bitmap size is based on all 16 attributes, which bumps the
>bitmap size to two bytes.
>
>MAXALIGN(23 + 1) = 24
>MAXALIGN(23 + 2) = 32
>
>I think that explains it. It's not ideal, but projection has a cost as
>well, so I don't think we necessarily need to do something here.
>
>If we are motivated to improve this in v14, we could potentially have a
>different schema for spilled tuples, and perform real projection at
>spill time. But I don't know if that's worth the extra complexity.
>

Thanks for the investigation and explanation.

Wouldn't it be enough to just use a slot with a smaller tuple
descriptor? All we'd need to do is create the descriptor in ExecInitAgg
after calling find_hash_columns, and use it for rslot/wslot, and then
"map" the attributes in hashagg_spill_tuple (which already almost does
that, so the extra cost should be 0) and when reading the spilled
tuples. So I'm not quite buying the argument that this would make a
measurable difference ...

That being said, I won't insist on fixing this in v13 - at least we
know what the issue is and we can fix it later. The costing seems like
a more serious open item. OTOH I don't think this example is
particularly extreme, and I wouldn't be surprised if we see even worse
examples in practice - tables tend to be quite wide and aggregation of
just a few columns seems likely.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 2020-09-04 at 21:01 +0200, Tomas Vondra wrote:
> Wouldn't it be enough to just use a slot with a smaller tuple
> descriptor?
> All we'd need to do is create the descriptor in ExecInitAgg after
> calling find_hash_columns, and use it for rslot/wslot, and then
> "map" the attributes in hashagg_spill_tuple (which already almost
> does that, so the extra cost should be 0) and when reading the
> spilled
> tuples.
That's a good point, it's probably not much code to make it work.
> So I'm not quite buying the argument that this would make a
> measurable difference ...
I meant "projection of all input tuples" (i.e. CP_SMALL_TLIST) has a
cost. If we project only at spill time, it should be fine.
Regards,
Jeff Davis
Hi,
I've tested the costing changes on the simplified TPC-H query, on two
different machines, and it seems like a clear improvement.
This is using the same cost/duration measure, which I think is a pretty
neat way to look at this. Sure, it's imperfect (it depends on which costs
and durations you actually take etc.), but it makes the comparisons
easier and for simple queries it does not matter that much.
The costing clearly depends on parameters like random_page_cost and how
it matches the hardware, but for the machine with SSD and default
random_page_cost the effect looks like this:
    work_mem   sort   master   patched
    -----------------------------------
    1MB         249       95       215
    2MB         256       89       187
    4MB         233       90       192
    8MB         227       70       124
    16MB        245       67       118
    32MB        261       63       111
    64MB        256       59       104
    256MB       266       55       102
and with random_page_cost reduced to 1.5 it looks like this:
    work_mem   sort   master   patched
    -----------------------------------
    1MB         221       63       150
    2MB         227       64       133
    4MB         214       65       137
    8MB         214       57        95
    16MB        232       53        90
    32MB        247       50        85
    64MB        249       47        80
    256MB       258       46        77
And on a machine with SATA RAID storage it looks like this:
    work_mem   sort   master   patched
    -----------------------------------
    1MB         102       41        94
    2MB         101       34        77
    4MB          99       35        78
    8MB          98       35        79
    16MB         98       25        50
    32MB        106       26        51
    64MB        106       26        51
    256MB       105       29        50
So yeah, the patched costing is much closer to sort (from the point of
this cost/duration metric), although for higher work_mem values there's
still a clear gap where the hashing seems to be under-costed by a factor
of ~2 or more.
I think this is simply showing that the effect of increasing work_mem
is much more pronounced for sort/groupagg than for hashagg. For
example, on the SSD machine the duration changes like this:

    work_mem   hashagg   groupagg
    ------------------------------
    1MB            217        201
    2MB            178        195
    4MB            176        186
    8MB            160        176
    16MB           168        163
    32MB           180        153
    64MB           189        143
    256MB          204        138
and the SATA RAID storage seems to behave in a similar way (although the
difference is smaller).
So in general I think this costing change is reasonable. It might not go
far enough, but it certainly makes it easier to tweak the rest by
changing random_page_cost etc. I'm not sure if we need/should tweak the
costing further to reduce the effect of work_mem (on hashagg).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, 2020-09-06 at 23:21 +0200, Tomas Vondra wrote:
> I've tested the costing changes on the simplified TPC-H query, on two
> different machines, and it seems like a clear improvement.
Thank you. Committed.
> So yeah, the patched costing is much closer to sort (from the point
> of
> this cost/duration metric), although for higher work_mem values
> there's
> still a clear gap where the hashing seems to be under-costed by a
> factor
> of ~2 or more.
There seems to be a cliff right after 4MB. Perhaps lookup costs on a
larger hash table?
Regards,
Jeff Davis
On Mon, Sep 07, 2020 at 01:55:28PM -0700, Jeff Davis wrote:
>On Sun, 2020-09-06 at 23:21 +0200, Tomas Vondra wrote:
>> I've tested the costing changes on the simplified TPC-H query, on two
>> different machines, and it seems like a clear improvement.
>
>Thank you. Committed.
>
>> So yeah, the patched costing is much closer to sort (from the point
>> of
>> this cost/duration metric), although for higher work_mem values
>> there's
>> still a clear gap where the hashing seems to be under-costed by a
>> factor
>> of ~2 or more.
>
>There seems to be a cliff right after 4MB. Perhaps lookup costs on a
>larger hash table?
>
I assume you mean higher costs due to the hash table outgrowing some
sort of CPU cache (L2/L3), right? Good guess - the CPU has ~6MB of
cache - but no.
This seems to be merely due to costing, because the raw cost/duration
looks like this:
    work_mem        cost   duration
    ---------------------------------
    1MB         20627403     216861
    2MB         15939722     178237
    4MB         15939722     176296
    8MB         11252041     160479
    16MB        11252041     168304
    32MB        11252041     179567
    64MB        11252041     189410
    256MB       11252041     204206
This is unpatched master; with the costing patch it looks similar,
except that the cost is about 2x higher. On the SATA RAID machine, it
looks like this:
    work_mem         cost   duration
    -----------------------------------
    1MB         108358461    1147269
    2MB          77381688    1004895
    4MB          77381688     994853
    8MB          77381688     980071
    16MB         46404915     930511
    32MB         46404915     902167
    64MB         46404915     908757
    256MB        46404915     926862
So roughly the same - the cost drops to less than 50%, but the duration
really does not. This is what I referred to when I said "Not sure if we
need/should tweak the costing to reduce the effect of work_mem (on
hashagg)."
For sort this seems to behave a bit more nicely - the cost and duration
(with increasing work_mem) are correlated quite well, I think.
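As a sanity check of the metric, dividing these raw costs by the
durations reproduces the cost/duration values reported earlier - e.g.
the unpatched-master SSD numbers above give back the "master" column of
the SSD table (a trivial standalone computation; the column pairing is
taken from the two tables):

    #include <stdio.h>

    int
    main(void)
    {
        /* (cost, duration) pairs for unpatched master on the SSD machine */
        const double cost[]     = {20627403, 15939722, 15939722, 11252041};
        const double duration[] = {  216861,   178237,   176296,   160479};
        const char  *wm[]       = {"1MB", "2MB", "4MB", "8MB"};

        for (int i = 0; i < 4; i++)
            printf("%-5s %.0f\n", wm[i], cost[i] / duration[i]);
        /* prints 95, 89, 90, 70 -- the "master" column of the SSD table */
        return 0;
    }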
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services