JIT performance bug/regression & JIT EXPLAIN
От | Andres Freund |
---|---|
Тема | JIT performance bug/regression & JIT EXPLAIN |
Дата | |
Msg-id | 20190927072053.njf6prdl3vb7y7qb@alap3.anarazel.de обсуждение исходный текст |
Ответы |
Re: JIT performance bug/regression & JIT EXPLAIN
(Andres Freund <andres@anarazel.de>)
Re: JIT performance bug/regression & JIT EXPLAIN (Maciek Sakrejda <m.sakrejda@gmail.com>) Re: JIT performance bug/regression & JIT EXPLAIN (Robert Haas <robertmhaas@gmail.com>) |
Список | pgsql-hackers |
Hi, Unfortunately I found a performance regression for JITed query compilation introduced in 12, compared to 11. Fixed in one of the attached patches (v1-0009-Fix-determination-when-tuple-deforming-can-be-JIT.patch - which needs a better commit message). The first question is when to push that fix. I'm inclined to just do so now - as we still do JITed tuple deforming in most cases, as well as doing so in 11 in the places this patch fixes, the risk of that seems low. But I can also see an arguments for waiting after 12.0. For me the bigger question is about how to make sure we can write tests determining which parts of the querytree are JIT compiled and which are not. There's the above bug, and I'm also hunting a regression introduced somewhere during 11's lifetime, which suggests to me that we need better coverage. I also want to add new JIT logic, making this even more important. The reason that 11 didn't have tests verifying that certain parts of the plan tree are JIT compiled is that EXPLAIN doesn't currently show the relevant information, and it's not that trivial to do so. What I'd like to do is to add additional, presumably optional, output to EXPLAIN showing additional information about expressions. There's two major parts to doing so: 1) Find a way to represent the additional information attached to expressions, and provide show_expression et al with the ExprState to be able to do so. The additional information I think is necessary is a) is expression jit compiled b-d) is scan/outer/inner tuple deforming necessary, and if so, JIT compiled. We can't unconditionally JIT compile for tuple deforming, because there's a number of cases where the source slot doesn't have precisely the same tuple desc, and/or doesn't have the same type. 2) Expand EXPLAIN output to show expressions that currently aren't shown. Performance-wise the ones most critical that aren't currently visible, and that I know about, are: - Agg's combined transition function, we also currently don't display in any understandable way how many passes over the input we do (for grouping sets), nor how much memory is needed. - Agg's hash comparator (separate regression referenced above) - Hash/HashJoin's hashkeys/hjclauses For 1) think we need to change show_expression()/show_qual() etc to also pass down the corresponding ExprState if available (not available in plenty of cases, most of which are not particularly important). That's fairly mechanical. Then we need to add information about JIT to individual expressions. In the attached WIP patchset I've made that dependent on the new "jit_details" EXPLAIN option. When specified, new per-expression information is shown: - JIT-Expr: whether the expression was JIT compiled (might e.g. not be the case because no parent was provided) - JIT-Deform-{Scan,Outer,Inner}: wether necessary, and whether JIT accelerated. I don't like these names much, but ... For the deform cases I chose to display a) the function name if JIT compiled b) "false" if the expression is JIT compiled, deforming is necessary, but deforming is not JIT compiled (e.g. because the slot type wasn't fixed) c) "null" if not necessary, with that being omitted in text mode. So e.g in json format this looks like: "Filter": { "Expr": "(lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone)", "JIT-Expr": "evalexpr_0_2", "JIT-Deform-Scan": "deform_0_3", "JIT-Deform-Outer": null, "JIT-Deform-Inner": null } and in text mode: Filter: (lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone); JIT-Expr: evalexpr_0_2, JIT-Deform-Scan:deform_0_3 For now I chose to make Filter a group when both, not in text mode and jit_details on - otherwise it's unclear what the JIT fields would apply to. But that's pretty crappy, because it means that the 'shape' of the output depends on the jit_details option. I think if we were starting from scratch it'd make sense to alway have the Expression as it's own sub-node, so interpreting code doesn't have to know all the places an expression can be referenced from. But it's probably not too attractive to change that today? Somewhat independently the series also contains a patch that renames verbose mode's "Output" to project if the node projects. I find it pretty hard to interpret whether a node projects otherwise, and it's confusing when jit_details shows details only for some node's Output, but not for others. But the compat break due to that change is not small - perhaps we could instead mark that in another way? For 2) I've only started to improve the situation, but it's a pretty number of pretty crucial pieces. I first focussed adding information for Agg nodes, as a) those are typically performance sensitive in cases where JIT is beneficial b) the current instrumentation is really insufficient, especially in cases where multiple grouping sets are computed at the same time - I think it's effectilvey not interpretable. In verbose mode explain now shows per-phase output about the transition computation. E.g. for a grouping set query that can't be computed in one pass, it now displays something like MixedAggregate (cost=6083420.07..14022888.98 rows=10011685 width=64) Project: avg((l_linenumber)::bigint), count((l_partkey)::bigint), sum(l_quantity), l_linenumber, l_partkey, l_quantity Filter: (sum(lineitem.l_quantity) IS NOT NULL) Phase 2 using strategy "Sort": Sort Key: lineitem.l_partkey, lineitem.l_quantity Transition Function: 2 * int8_avg_accum(TRANS, (l_linenumber)::bigint), 2 * int8inc_any(TRANS, (l_partkey)::bigint),2 * float8pl(TRANS, l_quantity) Sorted Group: lineitem.l_partkey, lineitem.l_quantity Sorted Group: lineitem.l_partkey Phase 1 using strategy "Sorted Input & All & Hash": Transition Function: 6 * int8_avg_accum(TRANS, (l_linenumber)::bigint), 6 * int8inc_any(TRANS, (l_partkey)::bigint),6 * float8pl(TRANS, l_quantity) Sorted Input Group: lineitem.l_linenumber, lineitem.l_partkey, lineitem.l_quantity Sorted Input Group: lineitem.l_linenumber, lineitem.l_partkey Sorted Input Group: lineitem.l_linenumber All Group Hash Group: lineitem.l_quantity Hash Group: lineitem.l_quantity, lineitem.l_linenumber -> Sort (cost=6083420.07..6158418.50 rows=29999372 width=16) ... The N * indicates how many of the same transition functions are computed during that phase. I'm not sure that 'TRANS' is the best placeholder for the transition value here. Maybe $TRANS would be clearer? For a parallel aggregate the upper level looks like: Finalize HashAggregate (cost=610681.93..610682.02 rows=9 width=16) Project: l_tax, sum(l_quantity) Phase 0 using strategy "Hash": Transition Function: float8pl(TRANS, (PARTIAL sum(l_quantity))) Hash Group: lineitem.l_tax -> Gather (cost=610677.11..610681.70 rows=45 width=16) Output: l_tax, (PARTIAL sum(l_quantity)) Workers Planned: 5 -> Partial HashAggregate (cost=609677.11..609677.20 rows=9 width=16) Project: l_tax, PARTIAL sum(l_quantity) I've not done that yet, but I think it's way past time that we also add memory usage information to Aggregate nodes (both for the hashtable(s), and for internal sorts if those are performed for grouping sets). Which would also be very hard in the "current" format, as there's no representation of passes. With jit_details enabled, we then can show information about the aggregation function, and grouping functions: Phase 0 using strategy "Hash": Transition Function: float8pl(TRANS, (PARTIAL sum(l_quantity))); JIT-Expr: evalexpr_0_11, JIT-Deform-Outer: false Hash Group: lineitem.l_tax; JIT-Expr: evalexpr_0_8, JIT-Deform-Outer: deform_0_10, JIT-Deform-Inner: deform_0_9 Currently the "new" format is used when either grouping sets are in use (as the previous explain output was not particularly useful, and information about the passes is important), or if VERBOSE or JIT_DETAILS are specified. For HashJoin/Hash I've added 'Outer Hash Key' and 'Hash Key' for each key, but only in verbose mode. That's somewhat important because for HashJoins those currently are often the performance critical bit, because they'll commonly be the expressions that deform the slots from below. That display is somewhat redundant with HashJoins "Hash Cond", but they're evaluated separately. Under verbose that seems OK to me. With jit_details enabled, this e.g. looks like this: Hash Join (cost=271409.60..2326739.51 rows=30000584 width=250) Project: lineitem.l_orderkey, lineitem.l_partkey, lineitem.l_suppkey, lineitem.l_linenumber, lineitem.l_quantity, lineitem.l_extendedprice,lineitem.l_discount, lineitem.l_tax, Inner Unique: true Hash Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey)); JIT-Expr: evalexpr_0_7,JIT-Deform-Outer: deform_0_9, JIT-Deform-Inner: Outer Hash Key: lineitem.l_partkey; JIT-Expr: evalexpr_0_10, JIT-Deform-Outer: deform_0_11 Outer Hash Key: lineitem.l_suppkey; JIT-Expr: evalexpr_0_12, JIT-Deform-Outer: deform_0_13 -> Seq Scan on public.lineitem (cost=0.00..819684.84 rows=30000584 width=106) Output: lineitem.l_orderkey, lineitem.l_partkey, lineitem.l_suppkey, lineitem.l_linenumber, lineitem.l_quantity,lineitem.l_extendedprice, lineitem.l_discount, lineitem.l -> Hash (cost=129384.24..129384.24 rows=3999824 width=144) Output: partsupp.ps_partkey, partsupp.ps_suppkey, partsupp.ps_availqty, partsupp.ps_supplycost, partsupp.ps_comment Hash Key: partsupp.ps_partkey; JIT-Expr: evalexpr_0_0, JIT-Deform-Outer: deform_0_1 Hash Key: partsupp.ps_suppkey; JIT-Expr: evalexpr_0_2, JIT-Deform-Outer: deform_0_3 -> Seq Scan on public.partsupp (cost=0.00..129384.24 rows=3999824 width=144) Output: partsupp.ps_partkey, partsupp.ps_suppkey, partsupp.ps_availqty, partsupp.ps_supplycost, partsupp.ps_comment JIT: Functions: 14 (6 for expression evaluation, 8 for tuple deforming) Options: Inlining true, Optimization true, Expressions true, Deforming true this also highlights the sad fact that we currently use a separate ExprState to compute each of the hash keys, and then "manually" invoke the hash function itself. That's bad both for interpreted execution, as we repeatedly pay executor startup overhead and don't even hit the fastpath, as well as for JITed execution, because we have more code to optimize (some of it pretty redundant, in particular the deforming). In both cases we suffer from the problem that we deform the tuple incrementally. A later patch in the series then uses the new explain output to add some tests for JIT, and then fixes two bugs, showing that the test output changes. Additionally I've also included a small improvement to the expression evaluation logic, which also changes output in the JIT test, as it should. Comments? Greetings, Andres Freund
Вложения
- v1-0001-jit-Instrument-function-purpose-separately-add-tr.patch
- v1-0002-Refactor-explain.c-to-pass-ExprState-down-to-show.patch
- v1-0003-Explain-Differentiate-between-a-node-projecting-o.patch
- v1-0004-Add-EXPLAIN-option-jit_details-showing-per-expres.patch
- v1-0005-jit-explain-remove-backend-lifetime-module-count-.patch
- v1-0006-WIP-explain-Show-per-phase-information-about-aggr.patch
- v1-0007-WIP-explain-Output-hash-keys-in-verbose-mode.patch
- v1-0008-jit-Add-tests.patch
- v1-0009-Fix-determination-when-tuple-deforming-can-be-JIT.patch
- v1-0010-jit-Fix-pessimization-of-execGrouping.c-compariso.patch
- v1-0011-Reduce-code-duplication-for-ExecJust-Var-operatio.patch
- v1-0012-Don-t-generate-EEOP_-_FETCHSOME-operations-for-sl.patch
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Michael PaquierДата:
Сообщение: Re: Refactoring of connection with password prompt loop for frontends
Следующее
От: Christoph BergДата:
Сообщение: Re: Unstable select_parallel regression output in 12rc1