JIT performance bug/regression & JIT EXPLAIN

Поиск

Список

Период

Сортировка

От	Andres Freund
Тема	JIT performance bug/regression & JIT EXPLAIN
Дата	27 сентября 2019 г. 10:20:53
Msg-id	20190927072053.njf6prdl3vb7y7qb@alap3.anarazel.de обсуждение исходный текст
Ответы	Re: JIT performance bug/regression & JIT EXPLAIN (Andres Freund <andres@anarazel.de>) Re: JIT performance bug/regression & JIT EXPLAIN (Maciek Sakrejda <m.sakrejda@gmail.com>) Re: JIT performance bug/regression & JIT EXPLAIN (Robert Haas <robertmhaas@gmail.com>)
Список	pgsql-hackers

Дерево обсуждения

Hi,

Unfortunately I found a performance regression for JITed query
compilation introduced in 12, compared to 11. Fixed in one of the
attached patches (v1-0009-Fix-determination-when-tuple-deforming-can-be-JIT.patch
- which needs a better commit message).

The first question is when to push that fix. I'm inclined to just do so
now - as we still do JITed tuple deforming in most cases, as well as
doing so in 11 in the places this patch fixes, the risk of that seems
low. But I can also see an arguments for waiting after 12.0.


For me the bigger question is about how to make sure we can write tests
determining which parts of the querytree are JIT compiled and which are
not. There's the above bug, and I'm also hunting a regression introduced
somewhere during 11's lifetime, which suggests to me that we need better
coverage. I also want to add new JIT logic, making this even more
important.


The reason that 11 didn't have tests verifying that certain parts of the
plan tree are JIT compiled is that EXPLAIN doesn't currently show the
relevant information, and it's not that trivial to do so.

What I'd like to do is to add additional, presumably optional, output to
EXPLAIN showing additional information about expressions.

There's two major parts to doing so:

1) Find a way to represent the additional information attached to
   expressions, and provide show_expression et al with the ExprState to
   be able to do so.  The additional information I think is necessary is
   a) is expression jit compiled
   b-d) is scan/outer/inner tuple deforming necessary, and if so, JIT
   compiled.

   We can't unconditionally JIT compile for tuple deforming, because
   there's a number of cases where the source slot doesn't have
   precisely the same tuple desc, and/or doesn't have the same type.

2) Expand EXPLAIN output to show expressions that currently aren't
   shown. Performance-wise the ones most critical that aren't currently
   visible, and that I know about, are:
   - Agg's combined transition function, we also currently don't display
     in any understandable way how many passes over the input we do (for
     grouping sets), nor how much memory is needed.
   - Agg's hash comparator (separate regression referenced above)
   - Hash/HashJoin's hashkeys/hjclauses


For 1) think we need to change show_expression()/show_qual() etc to also
pass down the corresponding ExprState if available (not available in
plenty of cases, most of which are not particularly important). That's
fairly mechanical.

Then we need to add information about JIT to individual expressions. In
the attached WIP patchset I've made that dependent on the new
"jit_details" EXPLAIN option. When specified, new per-expression
information is shown:
- JIT-Expr: whether the expression was JIT compiled (might e.g. not be
  the case because no parent was provided)
- JIT-Deform-{Scan,Outer,Inner}: wether necessary, and whether JIT accelerated.

I don't like these names much, but ...

For the deform cases I chose to display
a) the function name if JIT compiled
b) "false" if the expression is JIT compiled, deforming is
   necessary, but deforming is not JIT compiled (e.g. because the slot type
   wasn't fixed)
c) "null" if not necessary, with that being omitted in text mode.

So e.g in json format this looks like:

"Filter": {
  "Expr": "(lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone)",
  "JIT-Expr": "evalexpr_0_2",
  "JIT-Deform-Scan": "deform_0_3",
  "JIT-Deform-Outer": null,
  "JIT-Deform-Inner": null
}
and in text mode:

Filter: (lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone); JIT-Expr: evalexpr_0_2,
JIT-Deform-Scan:deform_0_3
 

For now I chose to make Filter a group when both, not in text mode and
jit_details on - otherwise it's unclear what the JIT fields would apply
to. But that's pretty crappy, because it means that the 'shape' of the
output depends on the jit_details option.  I think if we were starting
from scratch it'd make sense to alway have the Expression as it's own
sub-node, so interpreting code doesn't have to know all the places an
expression can be referenced from. But it's probably not too attractive
to change that today?

Somewhat independently the series also contains a patch that renames
verbose mode's "Output" to project if the node projects. I find it
pretty hard to interpret whether a node projects otherwise, and it's
confusing when jit_details shows details only for some node's Output,
but not for others. But the compat break due to that change is not small
- perhaps we could instead mark that in another way?


For 2) I've only started to improve the situation, but it's a pretty
number of pretty crucial pieces.

I first focussed adding information for Agg nodes, as a) those are
typically performance sensitive in cases where JIT is beneficial b) the
current instrumentation is really insufficient, especially in cases
where multiple grouping sets are computed at the same time - I think
it's effectilvey not interpretable.

In verbose mode explain now shows per-phase output about the transition
computation. E.g. for a grouping set query that can't be computed in one
pass, it now displays something like

MixedAggregate  (cost=6083420.07..14022888.98 rows=10011685 width=64)
  Project: avg((l_linenumber)::bigint), count((l_partkey)::bigint), sum(l_quantity), l_linenumber, l_partkey,
l_quantity
  Filter: (sum(lineitem.l_quantity) IS NOT NULL)
  Phase 2 using strategy "Sort":
    Sort Key: lineitem.l_partkey, lineitem.l_quantity
    Transition Function: 2 * int8_avg_accum(TRANS, (l_linenumber)::bigint), 2 * int8inc_any(TRANS,
(l_partkey)::bigint),2 * float8pl(TRANS, l_quantity)
 
    Sorted Group: lineitem.l_partkey, lineitem.l_quantity
    Sorted Group: lineitem.l_partkey
  Phase 1 using strategy "Sorted Input & All & Hash":
    Transition Function: 6 * int8_avg_accum(TRANS, (l_linenumber)::bigint), 6 * int8inc_any(TRANS,
(l_partkey)::bigint),6 * float8pl(TRANS, l_quantity)
 
    Sorted Input Group: lineitem.l_linenumber, lineitem.l_partkey, lineitem.l_quantity
    Sorted Input Group: lineitem.l_linenumber, lineitem.l_partkey
    Sorted Input Group: lineitem.l_linenumber
    All Group
    Hash Group: lineitem.l_quantity
    Hash Group: lineitem.l_quantity, lineitem.l_linenumber
  ->  Sort  (cost=6083420.07..6158418.50 rows=29999372 width=16)
        ...

The N * indicates how many of the same transition functions are computed
during that phase.

I'm not sure that 'TRANS' is the best placeholder for the transition
value here. Maybe $TRANS would be clearer?

For a parallel aggregate the upper level looks like:

Finalize HashAggregate  (cost=610681.93..610682.02 rows=9 width=16)
  Project: l_tax, sum(l_quantity)
  Phase 0 using strategy "Hash":
    Transition Function: float8pl(TRANS, (PARTIAL sum(l_quantity)))
    Hash Group: lineitem.l_tax
  ->  Gather  (cost=610677.11..610681.70 rows=45 width=16)
        Output: l_tax, (PARTIAL sum(l_quantity))
        Workers Planned: 5
        ->  Partial HashAggregate  (cost=609677.11..609677.20 rows=9 width=16)
              Project: l_tax, PARTIAL sum(l_quantity)

I've not done that yet, but I think it's way past time that we also add
memory usage information to Aggregate nodes (both for the hashtable(s),
and for internal sorts if those are performed for grouping sets). Which
would also be very hard in the "current" format, as there's no
representation of passes.

With jit_details enabled, we then can show information about the
aggregation function, and grouping functions:
  Phase 0 using strategy "Hash":
    Transition Function: float8pl(TRANS, (PARTIAL sum(l_quantity))); JIT-Expr: evalexpr_0_11, JIT-Deform-Outer: false
    Hash Group: lineitem.l_tax; JIT-Expr: evalexpr_0_8, JIT-Deform-Outer: deform_0_10, JIT-Deform-Inner: deform_0_9


Currently the "new" format is used when either grouping sets are in use
(as the previous explain output was not particularly useful, and
information about the passes is important), or if VERBOSE or JIT_DETAILS
are specified.


For HashJoin/Hash I've added 'Outer Hash Key' and 'Hash Key' for each
key, but only in verbose mode. That's somewhat important because for
HashJoins those currently are often the performance critical bit,
because they'll commonly be the expressions that deform the slots from
below.  That display is somewhat redundant with HashJoins "Hash Cond",
but they're evaluated separately. Under verbose that seems OK to me.

With jit_details enabled, this e.g. looks like this:

 Hash Join  (cost=271409.60..2326739.51 rows=30000584 width=250)
   Project: lineitem.l_orderkey, lineitem.l_partkey, lineitem.l_suppkey, lineitem.l_linenumber, lineitem.l_quantity,
lineitem.l_extendedprice,lineitem.l_discount, lineitem.l_tax,
 
   Inner Unique: true
   Hash Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey)); JIT-Expr:
evalexpr_0_7,JIT-Deform-Outer: deform_0_9, JIT-Deform-Inner:
 
   Outer Hash Key: lineitem.l_partkey; JIT-Expr: evalexpr_0_10, JIT-Deform-Outer: deform_0_11
   Outer Hash Key: lineitem.l_suppkey; JIT-Expr: evalexpr_0_12, JIT-Deform-Outer: deform_0_13
   ->  Seq Scan on public.lineitem  (cost=0.00..819684.84 rows=30000584 width=106)
         Output: lineitem.l_orderkey, lineitem.l_partkey, lineitem.l_suppkey, lineitem.l_linenumber,
lineitem.l_quantity,lineitem.l_extendedprice, lineitem.l_discount, lineitem.l
 
   ->  Hash  (cost=129384.24..129384.24 rows=3999824 width=144)
         Output: partsupp.ps_partkey, partsupp.ps_suppkey, partsupp.ps_availqty, partsupp.ps_supplycost,
partsupp.ps_comment
         Hash Key: partsupp.ps_partkey; JIT-Expr: evalexpr_0_0, JIT-Deform-Outer: deform_0_1
         Hash Key: partsupp.ps_suppkey; JIT-Expr: evalexpr_0_2, JIT-Deform-Outer: deform_0_3
         ->  Seq Scan on public.partsupp  (cost=0.00..129384.24 rows=3999824 width=144)
               Output: partsupp.ps_partkey, partsupp.ps_suppkey, partsupp.ps_availqty, partsupp.ps_supplycost,
partsupp.ps_comment
 JIT:
   Functions: 14 (6 for expression evaluation, 8 for tuple deforming)
   Options: Inlining true, Optimization true, Expressions true, Deforming true

this also highlights the sad fact that we currently use a separate
ExprState to compute each of the hash keys, and then "manually" invoke
the hash function itself. That's bad both for interpreted execution, as
we repeatedly pay executor startup overhead and don't even hit the
fastpath, as well as for JITed execution, because we have more code to
optimize (some of it pretty redundant, in particular the deforming). In
both cases we suffer from the problem that we deform the tuple
incrementally.


A later patch in the series then uses the new explain output to add some
tests for JIT, and then fixes two bugs, showing that the test output
changes.

Additionally I've also included a small improvement to the expression
evaluation logic, which also changes output in the JIT test, as it
should.

Comments?

Greetings,

Andres Freund

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Michael Paquier
Дата: 27 сентября 2019 г., 10:14:01
Сообщение: Re: Refactoring of connection with password prompt loop for frontends

Следующее

От: Christoph Berg
Дата: 27 сентября 2019 г., 10:20:55
Сообщение: Re: Unstable select_parallel regression output in 12rc1

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

JIT performance bug/regression & JIT EXPLAIN

Вложения

Предыдущее

Следующее