Re: Make tuple deformation faster

From: John Naylor
Subject: Re: Make tuple deformation faster
Date:
Message-ID: CANWCAZZe63DHpCEttKKf-sgj7726QtE0Vwm4jCX42a9x1oJ+=g@mail.gmail.com
In response to: Re: Make tuple deformation faster (David Rowley <dgrowleyml@gmail.com>)
List: pgsql-hackers
On Mon, Jul 1, 2024 at 5:07 PM David Rowley <dgrowleyml@gmail.com> wrote:

> cycles idle
>            8505168      stalled-cycles-backend:u  #    0.02% backend cycles idle
>       165442142326      instructions:u            #    3.35  insn per cycle
>                                                   #    0.00  stalled cycles per insn
>        39409877343      branches:u                #    3.945 G/sec
>          146350275      branch-misses:u           #    0.37% of all branches

> patched

> cycles idle
>           24259785      stalled-cycles-backend:u  #    0.05% backend cycles idle
>       213688149862      instructions:u            #    4.29  insn per cycle
>                                                   #    0.00  stalled cycles per insn
>        44147675129      branches:u                #    4.420 G/sec
>           14282567      branch-misses:u           #    0.03% of all branches

> You can see the branch predictor has done a *much* better job in the
> patched code vs master with about 10x fewer misses.  This should have

Nice!
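
For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the branch-miss rates and the reduction factor from the counters quoted above. The dictionary and function names are made up for illustration, and I am assuming the first (unlabeled) block of numbers is the master run, since the second one is marked "patched".

```python
# Counters copied from the quoted perf stat output.
# Assumption: the first block is master, the second is the patched build.
MASTER = {"branches": 39_409_877_343, "branch_misses": 146_350_275}
PATCHED = {"branches": 44_147_675_129, "branch_misses": 14_282_567}


def miss_rate(counters):
    """Fraction of executed branches that were mispredicted."""
    return counters["branch_misses"] / counters["branches"]


def miss_reduction(before, after):
    """How many times fewer mispredicted branches 'after' has than 'before'."""
    return before["branch_misses"] / after["branch_misses"]


if __name__ == "__main__":
    print(f"master  miss rate: {miss_rate(MASTER):.2%}")
    print(f"patched miss rate: {miss_rate(PATCHED):.2%}")
    print(f"absolute misses reduced by {miss_reduction(MASTER, PATCHED):.1f}x")
```

Note that the patched run executes more branches in total, so the drop in the miss rate is slightly larger than the drop in the absolute miss count.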

> helped contribute to the "insn per cycle" increase.  4.29 is quite
> good for postgres. I often see that around 0.5. According to [1]
> (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
> micro-op cache". I'm unsure how micro-ops translate to "insn per
> cycle" that's shown in perf stat. I thought 4-5 was about the maximum
> pipeline size from today's era of CPUs.

"insn per cycle" is micro-ops retired (i.e. excludes those executed
speculatively on a mispredicted branch).

That article mentions that 6 micro-ops per cycle can enter the backend
from the frontend, but that can happen only with internally cached
ops, since only 4 instructions per cycle can be decoded. In specific
cases, CPUs can fuse multiple front-end instructions into a single
macro-op, which I think means a pair of micro-ops that can "travel
together" as one. The authors concluded further down that "Zen 4’s
reorder buffer is also special, because each entry can hold up to 4
NOPs. Pairs of NOPs are likely fused by the decoders, and pairs of
fused NOPs are fused again at the rename stage."


