On Mon, Jul 1, 2024 at 5:07 PM David Rowley <dgrowleyml@gmail.com> wrote:
> cycles idle
> 8505168 stalled-cycles-backend:u # 0.02% backend cycles idle
> 165442142326 instructions:u # 3.35 insn per cycle
> # 0.00 stalled cycles per insn
> 39409877343 branches:u # 3.945 G/sec
> 146350275 branch-misses:u # 0.37% of all branches
> patched
> cycles idle
> 24259785 stalled-cycles-backend:u # 0.05% backend cycles idle
> 213688149862 instructions:u # 4.29 insn per cycle
> # 0.00 stalled cycles per insn
> 44147675129 branches:u # 4.420 G/sec
> 14282567 branch-misses:u # 0.03% of all branches
> You can see the branch predictor has done a *much* better job in the
> patched code vs master with about 10x fewer misses. This should have
> helped contribute to the "insn per cycle" increase.

Nice!

> 4.29 is quite good for postgres. I often see that around 0.5. According to [1]
> (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
> micro-op cache". I'm unsure how micro-ops translate to "insn per
> cycle" that's shown in perf stat. I thought 4-5 was about the maximum
> pipeline size from today's era of CPUs.
"ins per cycle" is micro-ops retired (i.e. excludes those executed
speculatively on a mispredicted branch).
That article mentions that 6 micro-ops per cycle can enter the backend
from the frontend, but that can happen only with ops served from the
micro-op cache, since only 4 instructions per cycle can be decoded. In specific
cases, CPUs can fuse multiple front-end instructions into a single
macro-op, which I think means a pair of micro-ops that can "travel
together" as one. The authors concluded further down that "Zen 4’s
reorder buffer is also special, because each entry can hold up to 4
NOPs. Pairs of NOPs are likely fused by the decoders, and pairs of
fused NOPs are fused again at the rename stage."