Thread: Make tuple deformation faster


Make tuple deformation faster

From: David Rowley
Date:
Currently, TupleDescData contains the descriptor's attributes in a
variable length array of FormData_pg_attribute allocated within the
same allocation as the TupleDescData. According to my IDE,
sizeof(FormData_pg_attribute) == 104 bytes. It's that large mainly due
to attname being 64 bytes. The TupleDescData.attrs[] array could end
up quite large on tables with many columns and that could result in
poor CPU cache hit ratios when deforming tuples.

Instead, we could make TupleDescData contain an out-of-line pointer to
the array of FormData_pg_attribute and have a much more compact
inlined array of some other struct that much more densely contains the
fields required for tuple deformation. attname and many of the other
fields are not required to deform a tuple.
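To make the idea more concrete, here's a minimal sketch of what such a compact per-attribute struct might look like. The field names and widths are my own illustration, not necessarily what the patch uses:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical 8-byte companion to FormData_pg_attribute that keeps
 * only what tuple deforming needs.  Field choice is illustrative.
 */
typedef struct TupleDescDeformAttr
{
	int32_t		attlen;			/* fixed length, or -1/-2 for varlena/cstring */
	uint8_t		attalignby;		/* alignment as a byte count: 1, 2, 4 or 8 */
	uint8_t		attflags;		/* packed booleans, e.g. attbyval, attnotnull */
	uint16_t	unused;			/* spare; keeps sizeof() at a power of 2 */
} TupleDescDeformAttr;
```

At 8 bytes per attribute, a 64-byte cacheline holds 8 of these, versus 104 bytes (mostly attname) per attribute in FormData_pg_attribute.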

I've attached a patch series which does this.

0001: Just fixes up some missing usages of TupleDescAttr(). (mostly
missed by me, apparently :-( )
0002: Adjusts the TupleDescData.attrs array to make it out of line. I
wanted to make sure nothing weird happened by doing this before doing
the bulk of the other changes to add the new struct.
0003: Adds a very compact 8-byte struct named TupleDescDeformAttr,
which can be used for tuple deformation. 8 columns fit in a single
64-byte cacheline rather than spanning 13 cachelines.
0004: Adjusts the attalign to change it from char to uint8.  See below.

The 0004 patch changes TupleDescDeformAttr.attalign to a uint8
rather than a char containing 'c', 's', 'i' or 'd'. This allows much
simpler code in the att_align_nominal() macro. What's in master is
quite a complex expression to evaluate every time we deform a column,
as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
just store that numeric value in the struct, that macro can become a
simple TYPEALIGN(), so the operation becomes simple bit masking rather
than a poorly branch-predictable series of compares and jumps.
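As a rough, simplified illustration of the difference (these helpers are mine, not the actual macros): with the char encoding, every alignment first needs a compare-and-jump translation, while a stored byte count turns the whole operation into one mask:

```c
#include <assert.h>
#include <stdint.h>

/* Roughly what master must do: translate the attalign char to a byte
 * count each time a column is deformed (a branchy chain). */
static uintptr_t
align_by_char(char attalign, uintptr_t off)
{
	uintptr_t	a = (attalign == 'd') ? 8 :
					(attalign == 'i') ? 4 :
					(attalign == 's') ? 2 : 1;

	return (off + a - 1) & ~(a - 1);
}

/* With the byte count stored directly in the struct, alignment is
 * TYPEALIGN-style bit masking with no data-dependent branches. */
static uintptr_t
align_by_bytes(uint8_t alignby, uintptr_t off)
{
	return (off + alignby - 1) & ~((uintptr_t) (alignby - 1));
}
```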

The state of this patch series is "proof of concept". I think the
current state should be enough to get an idea of the rough amount of
code churn this change would cause and also an idea of the expected
performance of the change. It certainly isn't in a finished state.
I've not put much effort into updating comments or looking at READMEs
to see what's now outdated.

I also went with packing a bunch of additional boolean columns from
pg_attribute into single bits of the attflags field in the new struct.
I've not tested the performance of expanding these out to use one bool
field each; that would make the struct bigger than 8 bytes.  Having
the struct be a power-of-2 size is also beneficial, as it allows fast
bit-shifting to be used to get the array element address rather than a
more complex (and slower) LEA instruction. I could try making the
struct 16 bytes and see if there are any further wins from avoiding
the bitwise AND on the TupleDescDeformAttr.attflags field.
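For illustration, a sketch of the two points above, with invented flag names: booleans packed into an attflags byte, and power-of-2 element addressing reducing to a shift:

```c
#include <assert.h>
#include <stdint.h>

/* Invented bit assignments for the packed booleans. */
#define DEFORM_ATTR_BYVAL	(1 << 0)
#define DEFORM_ATTR_NOTNULL	(1 << 1)
#define DEFORM_ATTR_HASDEF	(1 << 2)

/* Testing a packed flag costs one bitwise AND. */
static int
attr_is_byval(uint8_t attflags)
{
	return (attflags & DEFORM_ATTR_BYVAL) != 0;
}

/* With 8-byte elements, &array[i] is base + (i << 3); a size that
 * isn't a power of 2 would need a multiply or a more complex LEA. */
static uintptr_t
attr_addr(uintptr_t base, uintptr_t i)
{
	return base + (i << 3);
}
```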

To test the performance of this, I tried using the attached script
which creates a table where the first column is a variable length
column and the final column is an int.  The query I ran to test the
performance inserted 1 million rows into this table and performed a
sum() on the final column.  The attached graph shows that the query is
30% faster than master with 15 columns between the first and last
column. For fewer columns, the speedup is less. This is quite a
deform-heavy query so it's not like it speeds up every table with that
column arrangement by 30%, but certainly, some queries could see that
much gain and even more seems possible. I didn't go to a great deal of
trouble to find the most deform-heavy workload.

I'll stick this in the July CF. It would be good to get some feedback
on the idea and feedback on whether more work on this is worthwhile.

As mentioned, the 0001 patch just fixes up the missing usages of the
TupleDescAttr() macro. I see no reason not to commit this now.

Thanks

David

Attachments

Re: Make tuple deformation faster

From: Andy Fan
Date:
David Rowley <dgrowleyml@gmail.com> writes:

> Currently, TupleDescData contains the descriptor's attributes in a
> variable length array of FormData_pg_attribute allocated within the
> same allocation as the TupleDescData. According to my IDE,
> sizeof(FormData_pg_attribute) == 104 bytes. It's that large mainly due
> to attname being 64 bytes. The TupleDescData.attrs[] array could end
> up quite large on tables with many columns and that could result in
> poor CPU cache hit ratios when deforming tuples.
...
>
> To test the performance of this, I tried using the attached script
> which creates a table where the first column is a variable length
> column and the final column is an int.  The query I ran to test the
> performance inserted 1 million rows into this table and performed a
> sum() on the final column.  The attached graph shows that the query is
> 30% faster than master with 15 columns between the first and last
> column.

Yet another wonderful optimization! I just want to know how you
found this optimization (CPU cache hits) and decided it was worth some
time, because before we invest our time in optimizing something, it is
better to know that we can get some measurable improvement once our
time is spent. As for this case, 30% is really a huge number, even if
it is an artificial case.

Another case: Andrew introduced NullableDatum 5 years ago and said using
it in TupleTableSlot could be CPU-cache friendly. I can follow that, but
how much could it improve in an ideal case? Is it possible to forecast
that somehow? I ask here because both cases are optimizing for the CPU
cache.


-- 
Best Regards
Andy Fan




Re: Make tuple deformation faster

From: David Rowley
Date:
On Mon, 1 Jul 2024 at 21:17, Andy Fan <zhihuifan1213@163.com> wrote:
> Yet another wonderful optimization! I just want to know how you
> found this optimization (CPU cache hits) and decided it was worth some
> time, because before we invest our time in optimizing something, it is
> better to know that we can get some measurable improvement once our
> time is spent. As for this case, 30% is really a huge number, even if
> it is an artificial case.
>
> Another case: Andrew introduced NullableDatum 5 years ago and said using
> it in TupleTableSlot could be CPU-cache friendly. I can follow that, but
> how much could it improve in an ideal case? Is it possible to forecast
> that somehow? I ask here because both cases are optimizing for the CPU
> cache.

Have a look at:

perf stat --pid=<backend pid>

On my AMD Zen4 machine running the 16 extra column test from the
script in my last email, I see:

$ echo master && perf stat --pid=389510 sleep 10
master

 Performance counter stats for process id '389510':

           9990.65 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                 0      page-faults:u             #    0.000 /sec
       49407204156      cycles:u                  #    4.945 GHz
          18529494      stalled-cycles-frontend:u #    0.04% frontend cycles idle
           8505168      stalled-cycles-backend:u  #    0.02% backend cycles idle
      165442142326      instructions:u            #    3.35  insn per cycle
                                                  #    0.00  stalled cycles per insn
       39409877343      branches:u                #    3.945 G/sec
         146350275      branch-misses:u           #    0.37% of all branches

      10.001012132 seconds time elapsed

$ echo patched && perf stat --pid=380216 sleep 10
patched

 Performance counter stats for process id '380216':

           9989.14 msec task-clock:u              #    0.998 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                 0      page-faults:u             #    0.000 /sec
       49781280456      cycles:u                  #    4.984 GHz
          22922276      stalled-cycles-frontend:u #    0.05% frontend cycles idle
          24259785      stalled-cycles-backend:u  #    0.05% backend cycles idle
      213688149862      instructions:u            #    4.29  insn per cycle
                                                  #    0.00  stalled cycles per insn
       44147675129      branches:u                #    4.420 G/sec
          14282567      branch-misses:u           #    0.03% of all branches

      10.005034271 seconds time elapsed

You can see the branch predictor has done a *much* better job in the
patched code vs master, with about 10x fewer misses.  This should have
helped contribute to the "insn per cycle" increase.  4.29 is quite
good for postgres; I often see that around 0.5. According to [1]
(relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
micro-op cache". I'm unsure how micro-ops translate to the "insn per
cycle" shown in perf stat. I thought 4-5 was about the maximum
pipeline width of today's CPUs. Maybe someone else can explain
better than I can. In simpler terms: generally, the higher the
"insn per cycle" the better, and lower idle and branch-miss
percentages are generally better too. However, you'll notice that
the patched version has more frontend and backend stalls. I assume
this is because executing more instructions per cycle, thanks to
improved branch prediction, causes memory and instruction stalls to
occur more frequently; effectively (I think) it's just hitting the
next bottleneck(s): memory and instruction decoding. At least, modern
CPUs should be able to out-pace RAM in many workloads, so perhaps it's
not that surprising that "backend cycles idle" has gone up given such
a large increase in instructions per cycle from improved branch
prediction.

It would be nice to see this tested on some modern Intel CPU. A 13th
or 14th series, for example, or even any Intel from the past 5 years
would be better than nothing.

David

[1] https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/



Re: Make tuple deformation faster

From: Matthias van de Meent
Date:
On Mon, 1 Jul 2024 at 10:56, David Rowley <dgrowleyml@gmail.com> wrote:
>
> Currently, TupleDescData contains the descriptor's attributes in a
> variable length array of FormData_pg_attribute allocated within the
> same allocation as the TupleDescData. According to my IDE,
> sizeof(FormData_pg_attribute) == 104 bytes. It's that large mainly due
> to attname being 64 bytes. The TupleDescData.attrs[] array could end
> up quite large on tables with many columns and that could result in
> poor CPU cache hit ratios when deforming tuples.
>
> Instead, we could make TupleDescData contain an out-of-line pointer to
> the array of FormData_pg_attribute and have a much more compact
> inlined array of some other struct that much more densely contains the
> fields required for tuple deformation. attname and many of the other
> fields are not required to deform a tuple.

+1

> I've attached a patch series which does this.
>
> 0001: Just fixes up some missing usages of TupleDescAttr(). (mostly
> missed by me, apparently :-( )
> 0002: Adjusts the TupleDescData.attrs array to make it out of line. I
> wanted to make sure nothing weird happened by doing this before doing
> the bulk of the other changes to add the new struct.
> 0003: Adds a very compact 8-byte struct named TupleDescDeformAttr,
> which can be used for tuple deformation. 8 columns fit in a single
> 64-byte cacheline rather than spanning 13 cachelines.

Cool, that's similar to, but even better than, my patch from 2021 over at [0].

One thing I'm slightly concerned about is that this allocates another
8 bytes for each attribute in the tuple descriptor. While that's not a
lot when compared with the ->attrs array, it's still quite a lot when
we might not care at all about this data; e.g. in temporary tuple
descriptors during execution, in intermediate planner nodes.

Did you test for performance gains (or losses) with an out-of-line
TupleDescDeformAttr array? One benefit from this would be that we
could reuse the deform array for suffix truncated TupleDescs, reuse of
which currently would require temporarily updating TupleDesc->natts
with a smaller value; but with out-of-line ->attrs and ->deform_attrs,
we could reuse these arrays between TupleDescs if one is shorter than
the other, but has otherwise fully matching attributes. I know that
btree split code would benefit from this, as it wouldn't have to
construct a full new TupleDesc when it creates a suffix-truncated
tuple during page splits.

> 0004: Adjusts the attalign to change it from char to uint8.  See below.
>
> The 0004 patch changes TupleDescDeformAttr.attalign to a uint8
> rather than a char containing 'c', 's', 'i' or 'd'. This allows much
> simpler code in the att_align_nominal() macro. What's in master is
> quite a complex expression to evaluate every time we deform a column,
> as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
> just store that numeric value in the struct, that macro can become a
> simple TYPEALIGN(), so the operation becomes simple bit masking rather
> than a poorly branch-predictable series of compares and jumps.

+1, that's something I'd missed in my patches, and is probably the
largest contributor to the speedup.

> I'll stick this in the July CF. It would be good to get some feedback
> on the idea and feedback on whether more work on this is worthwhile.

Do you plan to remove the ->attcacheoff catalog field from the
FormData_pg_attribute, now that (with your patch) it isn't used
anymore as a placeholder field for fast (de)forming of tuples?

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/CAEze2Wh8-metSryZX_Ubj-uv6kb%2B2YnzHAejmEdubjhmGusBAg%40mail.gmail.com



Re: Make tuple deformation faster

From: David Rowley
Date:
On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Cool, that's similar to, but even better than, my patch from 2021 over at [0].

I'll have a read of that. Thanks for pointing it out.

> One thing I'm slightly concerned about is that this allocates another
> 8 bytes for each attribute in the tuple descriptor. While that's not a
> lot when compared with the ->attrs array, it's still quite a lot when
> we might not care at all about this data; e.g. in temporary tuple
> descriptors during execution, in intermediate planner nodes.

I've not done it in the patch, but one way to get some of that back is
to ditch pg_attribute.attcacheoff. There's no need for it after this
patch.  That's only 4 out of 8 bytes, however. I think in most cases
due to FormData_pg_attribute being so huge, the aset.c power-of-2
roundup behaviour will be unlikely to cross a power-of-2 boundary.

The following demonstrates for which column counts that actually makes
a difference:

select c as n_cols, old_bytes, new_bytes
from (select c, 24 + 104 * c as old_bytes, 32 + 100 * c + 8 * c as new_bytes
      from generate_series(1, 1024) c)
where position('1' in old_bytes::bit(32)::text) !=
      position('1' in new_bytes::bit(32)::text);

That returns just 46 column counts out of 1024 where we cross a
power-of-2 boundary with the patched code that we didn't cross in master.
Of course, larger pallocs will result in a malloc() directly, so
perhaps that's not a good measure.  At least for smaller column counts
it should be mainly the same amount of memory used. There are only 6
rows in there for column counts below 100. I think if we were worried
about memory there are likely 100 other things we could do to reclaim
some. It would only take some shuffling of fields in RelationData. I
count 50 bytes of holes in that struct out of the 488 bytes. There are
probably a few that could be moved without upsetting the
struct-field-order-lords too much.

> Did you test for performance gains (or losses) with an out-of-line
> TupleDescDeformAttr array? One benefit from this would be that we
> could reuse the deform array for suffix truncated TupleDescs, reuse of
> which currently would require temporarily updating TupleDesc->natts
> with a smaller value; but with out-of-line ->attrs and ->deform_attrs,
> we could reuse these arrays between TupleDescs if one is shorter than
> the other, but has otherwise fully matching attributes. I know that
> btree split code would benefit from this, as it wouldn't have to
> construct a full new TupleDesc when it creates a suffix-truncated
> tuple during page splits.

No, but it sounds easy to test as patch 0002 moves that out of line
and does nothing else.

> > 0004: Adjusts the attalign to change it from char to uint8.  See below.
> >
> > The 0004 patch changes TupleDescDeformAttr.attalign to a uint8
> > rather than a char containing 'c', 's', 'i' or 'd'. This allows much
> > simpler code in the att_align_nominal() macro. What's in master is
> > quite a complex expression to evaluate every time we deform a column,
> > as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
> > just store that numeric value in the struct, that macro can become a
> > simple TYPEALIGN(), so the operation becomes simple bit masking rather
> > than a poorly branch-predictable series of compares and jumps.
>
> +1, that's something I'd missed in my patches, and is probably the
> largest contributor to the speedup.

I think so too, and I did consider whether we should try to do that to
pg_attribute, renaming the column to attalignby. I started but didn't
finish a patch for that.

> > I'll stick this in the July CF. It would be good to get some feedback
> > on the idea and feedback on whether more work on this is worthwhile.
>
> Do you plan to remove the ->attcacheoff catalog field from the
> FormData_pg_attribute, now that (with your patch) it isn't used
> anymore as a placeholder field for fast (de)forming of tuples?

Yes, I plan to do that once I get more confidence I'm on to a winner here.

Thanks for having a look at this.

David



Re: Make tuple deformation faster

From: Matthias van de Meent
Date:
On Mon, 1 Jul 2024 at 12:49, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > One thing I'm slightly concerned about is that this allocates another
> > 8 bytes for each attribute in the tuple descriptor. While that's not a
> > lot when compared with the ->attrs array, it's still quite a lot when
> > we might not care at all about this data; e.g. in temporary tuple
> > descriptors during execution, in intermediate planner nodes.
>
> I've not done it in the patch, but one way to get some of that back is
> to ditch pg_attribute.attcacheoff. There's no need for it after this
> patch.  That's only 4 out of 8 bytes, however.

FormData_pg_attribute as a C struct has 4-byte alignment; AFAIK it
doesn't have any fields that require 8-byte alignment? Only on 64-bit
systems do we align the tuples on pages with 8-byte alignment, but
in-memory arrays of the struct wouldn't have to deal with that, AFAIK.

> I think in most cases
> due to FormData_pg_attribute being so huge, the aset.c power-of-2
> roundup behaviour will be unlikely to cross a power-of-2 boundary.

ASet isn't the only allocator, but it's the default, which is enough
for this to make sense, yes.

> The following demonstrates for which column counts that actually makes
> a difference:
>
> select c as n_cols, old_bytes, new_bytes
> from (select c, 24 + 104 * c as old_bytes, 32 + 100 * c + 8 * c as new_bytes
>       from generate_series(1, 1024) c)
> where position('1' in old_bytes::bit(32)::text) !=
>       position('1' in new_bytes::bit(32)::text);
>
> That returns just 46 column counts out of 1024 where we cross a
> power-of-2 boundary with the patched code that we didn't cross in master.
> Of course, larger pallocs will result in a malloc() directly, so
> perhaps that's not a good measure.  At least for smaller column counts
> it should be mainly the same amount of memory used. There are only 6
> rows in there for column counts below 100. I think if we were worried
> about memory there are likely 100 other things we could do to reclaim
> some. It would only take some shuffling of fields in RelationData. I
> count 50 bytes of holes in that struct out of the 488 bytes. There are
> probably a few that could be moved without upsetting the
> struct-field-order-lords too much.

I'd love for RelationData to be split into IndexRelation,
TableRelation, ForeignTableRelation, etc., as there's a lot of wastage
caused by exclusive fields, too.

> > > 0004: Adjusts the attalign to change it from char to uint8.  See below.
> > >
> > > The 0004 patch changes TupleDescDeformAttr.attalign to a uint8
> > > rather than a char containing 'c', 's', 'i' or 'd'. This allows much
> > > simpler code in the att_align_nominal() macro. What's in master is
> > > quite a complex expression to evaluate every time we deform a column,
> > > as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
> > > just store that numeric value in the struct, that macro can become a
> > > simple TYPEALIGN(), so the operation becomes simple bit masking rather
> > > than a poorly branch-predictable series of compares and jumps.
> >
> > +1, that's something I'd missed in my patches, and is probably the
> > largest contributor to the speedup.
>
> I think so too and I did consider if we should try and do that to
> pg_attribute, renaming the column to attalignby. I started but didn't
> finish a patch for that.

I'm not sure we have a pg_type entry that supports numeric
attalign values without increasing the size of the field, as the one
single-byte integer-like type (char) is always used as a printable
character, and implied to always be printable through its usage in
e.g. nodeToString infrastructure.

I'd love to have a better option, but we don't seem to have one yet.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Make tuple deformation faster

From: David Rowley
Date:
On Mon, 1 Jul 2024 at 23:42, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Mon, 1 Jul 2024 at 12:49, David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > > One thing I'm slightly concerned about is that this allocates another
> > > 8 bytes for each attribute in the tuple descriptor. While that's not a
> > > lot when compared with the ->attrs array, it's still quite a lot when
> > > we might not care at all about this data; e.g. in temporary tuple
> > > descriptors during execution, in intermediate planner nodes.
> >
> > I've not done it in the patch, but one way to get some of that back is
> > to ditch pg_attribute.attcacheoff. There's no need for it after this
> > patch.  That's only 4 out of 8 bytes, however.
>
> FormData_pg_attribute as a C struct has 4-byte alignment; AFAIK it
> doesn't have any fields that require 8-byte alignment? Only on 64-bit
> systems do we align the tuples on pages with 8-byte alignment, but
> in-memory arrays of the struct wouldn't have to deal with that, AFAIK.

Yeah, 4-byte alignment.  "out of 8 bytes" I was talking about is
sizeof(TupleDescDeformAttr), which I believe is the same "another 8
bytes" you had mentioned. What I meant was that deleting attcacheoff
only reduces FormData_pg_attribute by 4 bytes per column and adding
TupleDescDeformAttr adds 8 per column, so we still use 4 more bytes
per column with the patch.

I really doubt the 4 bytes of extra memory is a big concern here. It
would be more concerning for a patch that wanted to do something like
change NAMEDATALEN to 128, but I think the main concern with that
would be even slower tuple deforming. Additional memory would also be
concerning, but I doubt that's more important than the issue of making
all queries slower due to slower tuple deformation, which is what such
a patch would result in.

> > I think in most cases
> > due to FormData_pg_attribute being so huge, the aset.c power-of-2
> > roundup behaviour will be unlikely to cross a power-of-2 boundary.
>
> ASet isn't the only allocator, but it's the default, which is enough
> for this to make sense, yes.

It's the only allocator we use for allocating TupleDescs, so other
types and their behaviour are not relevant here.

> I'm not sure we have a pg_type entry that supports numeric
> attalign values without increasing the size of the field, as the one
> single-byte integer-like type (char) is always used as a printable
> character, and implied to always be printable through its usage in
> e.g. nodeToString infrastructure.
>
> I'd love to have a better option, but we don't seem to have one yet.

Yeah, select typname from pg_type where typalign = 'c' and typlen = 1;
returns just bool and char.

I'm happy to keep going with this version of the patch unless someone
points out some good reason that we should go with the alternative
instead.

David



Re: Make tuple deformation faster

From: Andy Fan
Date:
David Rowley <dgrowleyml@gmail.com> writes:

> You can see the branch predictor has done a *much* better job in the
> patched code vs master, with about 10x fewer misses.  This should have
> helped contribute to the "insn per cycle" increase.  4.29 is quite
> good for postgres; I often see that around 0.5. According to [1]
> (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
> micro-op cache". I'm unsure how micro-ops translate to the "insn per
> cycle" shown in perf stat. I thought 4-5 was about the maximum
> pipeline width of today's CPUs. Maybe someone else can explain
> better than I can. In simpler terms: generally, the higher the
> "insn per cycle" the better, and lower idle and branch-miss
> percentages are generally better too. However, you'll notice that
> the patched version has more frontend and backend stalls. I assume
> this is because executing more instructions per cycle, thanks to
> improved branch prediction, causes memory and instruction stalls to
> occur more frequently; effectively (I think) it's just hitting the
> next bottleneck(s): memory and instruction decoding. At least, modern
> CPUs should be able to out-pace RAM in many workloads, so perhaps it's
> not that surprising that "backend cycles idle" has gone up given such
> a large increase in instructions per cycle from improved branch
> prediction.

Thanks for the answer; just another area that deserves exploring.

> It would be nice to see this tested on some modern Intel CPU. A 13th
> or 14th series, for example, or even any Intel from the past 5 years
> would be better than nothing.

I have two kinds of CPUs:

a). Intel Xeon Processor (Icelake) for my ECS
b). Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz on my Mac.

My ECS reports "<not supported> branch-misses", probably because it
runs in virtualization software, and the Mac doesn't support perf yet :(

-- 
Best Regards
Andy Fan




Re: Make tuple deformation faster

From: Matthias van de Meent
Date:
On Tue, 2 Jul 2024 at 02:23, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Mon, 1 Jul 2024 at 23:42, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> >
> > On Mon, 1 Jul 2024 at 12:49, David Rowley <dgrowleyml@gmail.com> wrote:
> > >
> > > On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
> > > <boekewurm+postgres@gmail.com> wrote:
> > > > One thing I'm slightly concerned about is that this allocates another
> > > > 8 bytes for each attribute in the tuple descriptor. While that's not a
> > > > lot when compared with the ->attrs array, it's still quite a lot when
> > > > we might not care at all about this data; e.g. in temporary tuple
> > > > descriptors during execution, in intermediate planner nodes.
> > >
> > > I've not done it in the patch, but one way to get some of that back is
> > > to ditch pg_attribute.attcacheoff. There's no need for it after this
> > > patch.  That's only 4 out of 8 bytes, however.
> >
> > FormData_pg_attribute as a C struct has 4-byte alignment; AFAIK it
> > doesn't have any fields that require 8-byte alignment? Only on 64-bit
> > systems do we align the tuples on pages with 8-byte alignment, but
> > in-memory arrays of the struct wouldn't have to deal with that, AFAIK.
>
> Yeah, 4-byte alignment.  "out of 8 bytes" I was talking about is
> sizeof(TupleDescDeformAttr), which I believe is the same "another 8
> bytes" you had mentioned. What I meant was that deleting attcacheoff
> only reduces FormData_pg_attribute by 4 bytes per column and adding
> TupleDescDeformAttr adds 8 per column, so we still use 4 more bytes
> per column with the patch.

I see I was confused, thank you for clarifying. As I said, the
concerns were only small; 4 more bytes only in memory shouldn't matter
much in the grand scheme of things.

> I'm happy to keep going with this version of the patch

+1, go for it.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Make tuple deformation faster

From: John Naylor
Date:
On Mon, Jul 1, 2024 at 5:07 PM David Rowley <dgrowleyml@gmail.com> wrote:

>            8505168      stalled-cycles-backend:u  #    0.02% backend cycles idle
>       165442142326      instructions:u            #    3.35  insn per cycle
>                                                   #    0.00  stalled cycles per insn
>        39409877343      branches:u                #    3.945 G/sec
>          146350275      branch-misses:u           #    0.37% of all branches

> patched

>           24259785      stalled-cycles-backend:u  #    0.05% backend cycles idle
>       213688149862      instructions:u            #    4.29  insn per cycle
>                                                   #    0.00  stalled cycles per insn
>        44147675129      branches:u                #    4.420 G/sec
>           14282567      branch-misses:u           #    0.03% of all branches

> You can see the branch predictor has done a *much* better job in the
> patched code vs master with about 10x fewer misses.  This should have

Nice!

> helped contribute to the "insn per cycle" increase.  4.29 is quite
> good for postgres; I often see that around 0.5. According to [1]
> (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
> micro-op cache". I'm unsure how micro-ops translate to the "insn per
> cycle" shown in perf stat. I thought 4-5 was about the maximum
> pipeline width of today's CPUs.

"insn per cycle" is micro-ops retired (i.e. it excludes those executed
speculatively on a mispredicted branch).

That article mentions that 6 micro-ops per cycle can enter the backend
from the frontend, but that can happen only with internally cached
ops, since only 4 instructions per cycle can be decoded. In specific
cases, CPUs can fuse multiple front-end instructions into a single
macro-op, which I think means a pair of micro-ops that can "travel
together" as one. The authors concluded further down that "Zen 4’s
reorder buffer is also special, because each entry can hold up to 4
NOPs. Pairs of NOPs are likely fused by the decoders, and pairs of
fused NOPs are fused again at the rename stage."



Re: Make tuple deformation faster

From: David Rowley
Date:
On Tue, 16 Jul 2024 at 00:13, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Tue, 2 Jul 2024 at 02:23, David Rowley <dgrowleyml@gmail.com> wrote:
> > I'm happy to keep going with this version of the patch
>
> +1, go for it.

I've attached an updated patch series that is a bit more polished
than the last set. I've attempted to document the distinction between
FormData_pg_attribute and the abbreviated struct, and tried to give an
indication of which one should be used.

Apart from that, changes include:

* I pushed the v1-0001 patch, so that's removed from the patch series.
* Rename TupleDescDeformAttr struct. It's now called CompactAttribute.
* Rename TupleDescDeformAttr() macro. It's now called TupleDescCompactAttr()
* Other macro renaming, e.g. ATT_IS_PACKABLE_FAST to COMPACT_ATTR_IS_PACKABLE
* In 0003, renamed CompactAttribute.attalign to attalignby to make it
easier to understand the distinction between the align char and the
number of bytes.
* Added 0004 patch to remove pg_attribute.attcacheoff.

There are a few more things that could be done to optimise a few more
things. For example, a bunch of places still use att_align_nominal().
With a bit more work, these could use att_nominal_alignby(). I'm not
yet sure of the cleanest way to do this. Adding alignby to the
typecache might be one way, or maybe just a function that converts the
attalign to the number of bytes. This would be useful in all places
where att_align_nominal() is used in loops, as converting the char to
the number of bytes would only be done once rather than once per loop.
I feel like this patch series is probably big enough for now, so I'd
like to opt for those improvements to take place as follow-on work.
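One possible shape for that follow-on work, sketched here with an invented helper name: translate the attalign char to its byte count once, before the loop, and use only the numeric form inside it:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map an attalign char ('c', 's', 'i', 'd') to
 * the number of bytes it aligns to.  Called once, outside hot loops. */
static uint8_t
att_char_to_alignby(char attalign)
{
	switch (attalign)
	{
		case 'd':
			return 8;
		case 'i':
			return 4;
		case 's':
			return 2;
		default:
			return 1;	/* 'c' */
	}
}

/* Inside the loop, only branch-free masking remains. */
static uintptr_t
align_offset(uintptr_t off, uint8_t alignby)
{
	return (off + alignby - 1) & ~((uintptr_t) (alignby - 1));
}
```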

I'll put this up for the CF bot to run with for a bit as the patch has
needed a rebase since I pushed the v1-0001 patch.

David

Attachments