Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode

From: Hans Buschmann
INTENTION

Inspired by the effort to integrate JIT for executor acceleration, I thought that selected simple algorithms working with array-oriented data could be drastically accelerated by using SIMD instructions on modern hardware.

I want to introduce this style of programming with the example of hex_encode, which:
- operates on arrays (bytea)
- is a simple algorithm
- in some situations partially limits performance (e.g. pg_dump)
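
For reference, the established scalar implementation in src/backend/utils/adt/encode.c is essentially the following byte-at-a-time loop; this is the code the vectorized versions are meant to replace for sufficiently large inputs:

/* scalar baseline, essentially as in encode.c */
static const char hextbl[] = "0123456789abcdef";

uint64
hex_encode(const char *src, size_t len, char *dst)
{
    const char *end = src + len;

    while (src < end)
    {
        *dst++ = hextbl[(*src >> 4) & 0xF];   /* high nibble first */
        *dst++ = hextbl[*src & 0xF];          /* then low nibble */
        src++;
    }
    return (uint64) len * 2;                  /* two hex digits per byte */
}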

IMPLEMENTATION GUIDELINES

The main goal is to accelerate common cases on the most common hardware by exploiting all the resources the hardware delivers.
The following guidelines took me to a first implementation:

- restrict to 64-bit architectures
    These are the dominant server architectures and have the necessary data formats, registers, and operating instructions
- start with Intel x86-64 SIMD instructions:
    This is by far the most widely used platform, available for development and in practical use
- don't restrict the concept to Intel x86-64 only, so that later people with more experience on other architectures can jump in and implement comparable algorithms
- fall back to the established Postgres implementation in unsuitable cases or on user request (GUC)
- implementation of leaf functions/procedures in assembly language
    These consist mostly of a central loop without calling subroutines or doing additional branching

- coding for maximum hardware usage instead of elegant programming
    Once tested, the simple algorithm works as advertised and is used to replace most execution parts of the standard implementation in C

- isolated footprint by integrating it only in the specific subroutine (here hex_encode)
    This ensures that the requirements for fast execution are met (e.g. buffer sizes) and no repeated checks are needed like in a library use case.

- trying to keep both vector execution ports always doing useful work by avoiding latency stalls

- trying to access memory in linear fashion (reading from the input buffer, writing to the output buffer) to avoid internal cache problems
- focus optimization on the most advanced SIMD instruction set: AVX512
    This provides the most advanced instructions and quite a lot of large registers to aid in latency hiding

- if possible, provide fallback implementations for older SIMD standards (e.g. AVX2 or SSE2; see the sketch after this list)
    This is useful on many older servers and client processors, but due to their smaller number of registers, latency hiding or full execution queues cannot be fully achieved.
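
For illustration, here is a minimal sketch of what an SSE2-only fallback for hex_encode could look like, written with intrinsics rather than assembly (all names are mine and hypothetical; the actual patch implements the loops in NASM):

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical SSE2 sketch: 16 input bytes -> 32 lowercase hex digits
 * per iteration; the tail is left to the scalar implementation. */
static size_t
hex_encode_sse2(const uint8_t *src, size_t len, char *dst)
{
    const __m128i mask0f = _mm_set1_epi8(0x0f);
    const __m128i nine   = _mm_set1_epi8(9);
    const __m128i ascii0 = _mm_set1_epi8('0');
    const __m128i gap    = _mm_set1_epi8('a' - '0' - 10);
    size_t  i;

    for (i = 0; i + 16 <= len; i += 16)
    {
        __m128i v  = _mm_loadu_si128((const __m128i *) (src + i));
        __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask0f);
        __m128i lo = _mm_and_si128(v, mask0f);

        /* interleave so each byte becomes "high digit, low digit" */
        __m128i n0 = _mm_unpacklo_epi8(hi, lo);
        __m128i n1 = _mm_unpackhi_epi8(hi, lo);

        /* nibble -> ASCII: add '0', plus 'a'-'0'-10 where nibble > 9 */
        n0 = _mm_add_epi8(_mm_add_epi8(n0, ascii0),
                          _mm_and_si128(_mm_cmpgt_epi8(n0, nine), gap));
        n1 = _mm_add_epi8(_mm_add_epi8(n1, ascii0),
                          _mm_and_si128(_mm_cmpgt_epi8(n1, nine), gap));

        _mm_storeu_si128((__m128i *) (dst + 2 * i), n0);
        _mm_storeu_si128((__m128i *) (dst + 2 * i + 16), n1);
    }
    return i;       /* number of input bytes processed */
}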


IMPLEMENTATION DETAILS

- The loops implementing the algorithm are written in NASM assembler:
    NASM is actively maintained, has many output formats, follows the Intel style, has all current instructions implemented, and is fast.

- The loops are mostly independent of the operating system, so every OS for which NASM has an object output format is supported:
    This includes Linux and Windows as the most important ones

- The algorithms use advanced techniques (constant and temporary registers) to avoid most unnecessary memory accesses:
    The assembly implementation gives you full control over the registers (unlike intrinsics)

- Multiple dependency chains work interleaved to minimize latencies:
    The code is often interspersed and uses almost all available registers.

- Some instructions (moves, zeroing) are executed outside the processor execution ports:
    These don't consume execution cycles on a port, but their latency has to be considered.

- Some vector instructions (multiply-add, for example) have latencies of 5 cycles:
    This means that after the instruction is issued, the processor has to wait 5 cycles until the result can be used in the same dependency chain. To avoid this and keep both vector execution ports (p0 and p5) busy, you have to have 9 other instructions in between doing work on other streams of the algorithm to maximize hardware usage and overall performance.

- All loops are implemented as separate C-callable functions (according to the OS calling convention):
    They are all leaf functions, calling no other subroutines.

- The decision which implementation is chosen is made at the caller side by a special dispatcher routine (a sketch follows this list):
    The caller handles the architectural capabilities (instruction sets available) and knows the required work: there is often a suitable minimum amount of work required for efficiently calling a provided implementation.

- Loops should run at least 2-4 times to compensate for the initialization overhead:
    This implies a certain minimum work count depending on the specific SIMD implementation

- The loops terminate after detecting an error (e.g. wrong input data) and return the successfully completed amount of work:
    The standard linear implementation takes over with the already established error handling.

- The loops work optimally with some extra output buffer space at the end, to be able to overshoot in the last round:
    Nonetheless the correct amount of work is returned to the caller, and one vector size of output buffer following the real result is zeroed out (currently disabled!)

- the loop may preload some data after the input buffer but ensures that the following page boundary is never crossed, to avoid any access violation:
    This does no harm to the memory system because the output buffer has a supplemental buffer at the end, but this could be changed to leave the tail handling to the standard implementation if deemed unsupportable (as for now).
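
As an illustration of the dispatcher pattern, minimum work count, and fallback behavior described above, the caller side could look roughly like this (the constants and the exact signature of the assembly routine are assumptions, not the actual patch):

/* hypothetical caller-side dispatcher; constants are illustrative */
uint64
hex_encode_dispatch(const char *src, size_t len, char *dst)
{
    size_t  done = 0;

    /* the vector path only pays off above a minimum work count */
    if (len >= HEX_ENCODE_MIN_SIMD_LEN &&
        valid_impl_id_arr[ALGORITHM_ID_HEX_ENCODE] >= IMPL_ID_AVX512BW)
    {
        /* returns the amount successfully processed; may stop early */
        done = hex_encode_avx512bw(src, len, dst);
    }

    /* the scalar implementation finishes the tail (and any error) */
    return (uint64) done * 2 +
        hex_encode(src + done, len - done, dst + done * 2);
}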


(to be continued...)


(continued)

PERFORMANCE

First Level: execution units (focused on AVX512)

Every modern processor has at least 2 vector execution units (p0 and p5 on Intel) which execute different sets of instructions in a pipelined fashion. Some simple classes of instructions (logical, arithmetic) can be executed on both ports. The results of a short operation are available in the next cycle for another instruction, which in total forms a dependency chain.
Longer operations provide their results only after some more clock cycles, so the latency increases from at least 1 to higher numbers.

This constellation implies that a single dependency chain can never exhaust the full processor capabilities. To fight these latencies, multiple interleaved dependency chains should be used.
Instructions with long latencies (e.g. memory accesses) should be issued long in advance of using their results.
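
The effect is easiest to see in plain C: one accumulator forms a single dependency chain and stalls on the add latency every iteration, while several independent accumulators let the processor issue the adds back to back (illustrative only; the vector loops apply the same principle with vector registers):

#include <stdint.h>
#include <stddef.h>

uint64_t
sum_interleaved(const uint64_t *a, size_t n)
{
    uint64_t    s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t      i;

    for (i = 0; i + 4 <= n; i += 4)
    {
        s0 += a[i];         /* chain 0 */
        s1 += a[i + 1];     /* chain 1, independent of chain 0 */
        s2 += a[i + 2];     /* chain 2 */
        s3 += a[i + 3];     /* chain 3 */
    }
    for (; i < n; i++)      /* scalar tail */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}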

In most cases only the two vector execution ports are the ultimate bottleneck, since the processor can execute memory reads, memory writes, scalar instructions, and branches on other specialized units, or avoid them totally (register zeroing).

The hex_encode algorithm executes 5 instructions (uops, to be correct) on p5, 3 on the other vector port (or on either), plus 1 load and 2 store uops per vector.

Assuming a 2.5 GHz processor (for simplicity), the 5 p5-uops per vector give 0.5 billion vectors processed per second, which at 64 bytes per vector is a maximum of 32 GB processed per second.
In normal database units this is really a HUGE number (and it uses only ONE core!).
But in this case twice the amount of data, compared to the source, has to be written to memory as results (64 GB/sec), which exceeds the possibilities of normal desktop processors.

As another example I may present the checksum algorithm, which is only read-intensive and uses 3 uops on p5 as the bottleneck path per vector.

On a 3 GHz processor, checksum can thus process 64 GB per second per core (3 GHz / 3 uops x 64 bytes).

It will be interesting to check the performance levels on the upcoming generation of Xeon (Sapphire Rapids in 2022), which will have much increased memory bandwidth (8 channels of DDR5, up to 12 on AMD Genoa), which will have some special models with HBM2 memory stacks, and which has 2 execution units for stores to match the read capacity of also 2 instructions/cycle.


Second Level: alignment, caches and memory

Older processor generations had a VERY big penalty for page-split memory accesses, which occur on vector data when it is not naturally aligned.

Given future developments, I would consider even 128-byte or 256-byte alignment, since 1024- or 2048-bit vectors may become possible (already specified in the ARM architecture).

On the level of caches one must consider "cache thrashing" when the accesses to the caches exceed the associativity limit of a cache. In some algorithms (very highly parallel checksum calculations with a copy function) you could overload a single cacheline address in the case of too many parallel accesses. In these cases you can start the algorithm (on the fixed-size blocks) a little bit delayed, so that some algorithm chains access vector n and some others vector n+1, interleaved in the execution loop.

Memory should be accessed in natural order to maximize the use of processor cache prefetching.

All accesses should be optimized to use registers where possible: the long latencies of memory accesses and some initial instructions can be hidden by issuing those instructions early and using their results only much later.

The memory latencies lead to data preloading, where the data for the next round of the loop is loaded at the first possible moment when target registers are available. This is crucial for fighting latency in many algorithms.
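
A minimal sketch of this preloading pattern (SSE2 for brevity; names are mine): the load for iteration i+1 is issued before the result of iteration i is stored, so the load latency overlaps with useful work:

#include <emmintrin.h>
#include <stddef.h>

void
xor_blocks(const __m128i *src, __m128i *dst, size_t nvec, __m128i key)
{
    __m128i cur, nxt;
    size_t  i;

    if (nvec == 0)
        return;
    nxt = _mm_loadu_si128(src);                  /* first preload */
    for (i = 0; i < nvec; i++)
    {
        cur = nxt;
        if (i + 1 < nvec)
            nxt = _mm_loadu_si128(src + i + 1);  /* issue load early */
        _mm_storeu_si128(dst + i, _mm_xor_si128(cur, key));
    }
}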


Third level: Internal data structures

Vector operations work best with array-oriented structures (here a bytea datatype, or a shared buffer block for checksum calculation).
Gathering individual scalar data (32/64-bit scalars) into vectors is much slower and really stresses the memory subsystem.

This implies a focus on "struct of arrays" rather than "array of structures" in Postgres, which seems difficult due to its established structure and long heritage.
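
A generic illustration of the difference (not Postgres code):

#define N 1024

/* "array of structures": the fields of one element are adjacent, but
 * the same field of consecutive elements is strided -- bad for vector
 * loads */
typedef struct { float x, y, z; } PointAoS;
PointAoS points_aos[N];

/* "struct of arrays": each field is contiguous, so one vector load
 * picks up many consecutive x values at once */
typedef struct
{
    float x[N];
    float y[N];
    float z[N];
} PointsSoA;
PointsSoA points_soa;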

By exploring the code more deeply (than my knowledge reaches so far) it should be easy to identify many more places with simple algorithms working on array structures.


Fourth Level: Vertical integration

The basis of most algorithms is loading the data into registers, doing some algorithmic calculations, and writing the results out.
Subsequent steps are coded in another layer (e.g. copying to storage, trimming the data for output, etc.). This often requires reading the data again and doing some other transformations.

Vertical integration combines some simple steps for better memory utilization.

As an example, consider using pg_dump to dump a huge amount of bytea data (not uncommon in real applications). Most of these data are in toast tables, often uncompressed due to their inherent structure. The dump must read the toast pages into memory, decompose the page, hexdump the content, put the result in an output buffer, and trigger the I/O. By integrating all these steps into one, big performance improvements can be achieved (but naturally not here in my first implementation!).


Fifth Level: Pooling

Some algorithms are so fast that they need to work on multiple data streams at once to fully utilize a processor core. One example is checksum calculation.
To saturate the processor capabilities with large vectors you have to do the checksum on multiple pages in parallel (e.g. 2, 4 or 8).
This occurs often in real life (loading shared buffers into memory, flushing shared buffers to disk, precaching the shared buffers, etc.).

Some pooling (collecting up to 16 shared buffer blocks in a pool) allows fast checksumming of blocks which are currently processed in a serial fashion.
This requires some adaptation in some isolated parts of the Postgres code base and turns a serial procedure into parallel processing for objects treated in the same fashion at (nearly) the same time.
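
A rough sketch of such a pool (all names are hypothetical; this is not part of the patch):

#include <stdint.h>

#define CHECKSUM_POOL_SIZE 16

/* callers enqueue shared buffer pages; the checksums are then computed
 * in one interleaved vector loop over all pooled pages */
typedef struct ChecksumPool
{
    const char *pages[CHECKSUM_POOL_SIZE];      /* 8 kB buffer pages */
    uint16_t    checksums[CHECKSUM_POOL_SIZE];
    int         npages;
} ChecksumPool;

/* computes all pooled checksums at once, interleaving the per-page
 * dependency chains to saturate the vector ports */
extern void checksum_pool_flush(ChecksumPool *pool);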


BENCHMARKS:

I have included a little benchmark program. It is not very sophisticated or fancy, but it allows one to estimate the performance on commonly used processors.

It requires nasm to be installed/downloaded (on Linux or Windows).

It executes the hexdump algorithm one million times on the binary of nasm (2.15.05, the current version).

The benchmark simply runs (for about 1000 seconds); the user has to measure the time himself.
The binary of nasm (used as the benchmark source data) is included as the source data in

HEX_BENCH_DATA_1300KB.asm

(please adjust the location where you downloaded nasm.exe on Windows).

The binary (for each architecture) has a size of 1356 KB on Windows and 1718 KB on Linux.

The commands to build the binary are (also found in hex_bench.asm):

on Windows:

:: commands to build on Windows (nasm and golink in the path)
nasm -f WIN64 -g hex_bench.asm     -l hex_bench.lis
nasm -f WIN64 -g hex_x86_64.asm    -l hex_x86_64.lis
nasm -f WIN64 -g HEX_BENCH_DATA_1300KB.asm
golink /console hex_bench.obj hex_x86_64.obj HEX_BENCH_DATA_1300KB.obj

Golink is a small utility linker for Windows, found at:

http://www.godevtool.com/

on Linux:

# commands to build on LINUX
nasm -f elf64 -g hex_bench.asm     -l hex_bench.lis
nasm -f elf64 -g hex_x86_64.asm    -l hex_x86_64.lis
nasm -f elf64 -g HEX_BENCH_DATA_1300KB.asm
ld -o hex_bench hex_bench.o hex_x86_64.o HEX_BENCH_DATA_1300KB.o

The selected hex_encode routine is hardcoded to hex_encode_avx512bw (please choose another implementation on processors not supporting AVX512 by changing the comments in hex_bench.asm).

The best result I could achieve was roughly 95 seconds for 1 million dumps of 1718 KB on a Tiger Lake laptop using AVX512. This gives about an 18 GB/s source-hexdumping rate on a single core!

In another run with Postgres, hexdumping about half a million tuples with a bytea column, yielding about 6 GB of output, reduced the time from about 68 seconds to 60 seconds, which clearly shows the Postgres overhead for executing the COPY command on such a data set.

SQL> COPY my_table (col_bytearray) TO 'N:/ZZ_SAV/my_hexdump.sql';

(This was on a customer's dataset, not reproduced here.)




POSTGRES INTEGRATION (HELP NEEDED)

The architecture-dependent introduction of vector routines requires some integration effort into Postgres.

I have designed a concept for easy integration and extensibility, but some concrete steps need support from others due to my restricted knowledge of the whole system.

(For now this global configuration is at the top of encode.c, but it certainly must be moved to a more adequate place for initialization.)

The main concept tries to match the CPU capabilities with the requirements of a certain implementation. This is not only for hex_encode but for an arbitrary number of algorithms implemented in an accelerated version (here SIMD vectors, but others may be possible too).

We have a global array called valid_impl_id_arr indicating, for each algorithm, the implementations capable of running on the current CPU.

An implementor defines an algorithm and gets an invariant ID (here ALGORITHM_ID_HEX_ENCODE, which should be kept in a global header).

These IDs are valid for all architectures, even if no accelerated version exists yet.

In internal arrays (see hex_x86_64.asm) all possible implementations are stored along with their requirements (CPU features, minimum length, etc.).

In the initialization phase of the running executable (backend, or frontend in the future) the current cpu_capabilities are checked once, and the maximum valid implementation index is stored in the globally visible valid_impl_id_arr.

The highest requirements have the highest index, so the capabilities are checked in decreasing index order.

For example (hex_encode): we have 4 implementations, but on an AVX2-only machine valid_impl_id_arr[ALGORITHM_ID_HEX_ENCODE] is only set to 3, because the requirements of AVX512BW are not met. Index zero always indicates that the algorithm has no valid implementation or that the CPU has insufficient capabilities.

To disable an algorithm from being accelerated at all, masking by an algorithm_disable_mask is provided, which is normally all zero but can be set to disable a certain set of algorithms by ORing in (1 << ALGORITHM_ID_xxx) constants. This emergency disablement should be kept in a GUC and applied only at image initialization time.

The CPU capabilities are determined by cpuid instructions (on x86-64) and defined in cpu_capabilities_x86_64.asm.

But this scheme is not restricted to the Intel ISA. Other hardware architectures (most probably ARM, POWER or RISC-V) are identified by different CPU_IS_ARCH_xxx constants (numbers from 1-7), and implementers get the specific CPU capabilities in their own fashion, which may be totally different from the Intel ISA.

So every CPU has its cpu_capabilities_unmasked value as a unique int64 value.
This value is normally copied 1:1 to the global cpu_capabilities, but for testing or in an emergency it is masked by a configuration mask simulating a certain CPU. This allows a developer to test the implementations for lower-class CPUs without needing the specific hardware.
This cpu_capabilities_mask defaults to -1 (all bits 1) and should also be derived from a GUC.
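
Put together, the image initialization could look roughly like this (a sketch following the scheme above; the CPU_HAS_* constants and the concrete index values are assumptions):

/* sketch of image initialization; constants are illustrative */
void
pgva_init(void)
{
    /* mask the raw cpuid result, e.g. to simulate a lesser CPU */
    int64   caps = cpu_capabilities_unmasked & cpu_capabilities_mask;

    cpu_capabilities = caps;

    /* candidate implementations are checked from highest index down */
    if (algorithm_disable_mask & ((int64) 1 << ALGORITHM_ID_HEX_ENCODE))
        valid_impl_id_arr[ALGORITHM_ID_HEX_ENCODE] = 0;     /* disabled */
    else if (caps & CPU_HAS_AVX512BW)
        valid_impl_id_arr[ALGORITHM_ID_HEX_ENCODE] = 4;
    else if (caps & CPU_HAS_AVX2)
        valid_impl_id_arr[ALGORITHM_ID_HEX_ENCODE] = 3;
    else
        valid_impl_id_arr[ALGORITHM_ID_HEX_ENCODE] = 0;     /* scalar only */
}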

For up to 63 algorithms, the two int64 GUC values (algorithm_disable_mask and cpu_capabilities_mask) suffice to selectively disable certain parts of the accelerated implementations.

Help is greatly appreciated to code these concepts with GUCs and put the globals and their initialization in the right place.


TOOL CHAIN (HELP NEEDED)

On x86-64 I use nasm (the Netwide Assembler) because it is well maintained, fast, instruction-complete, and covers multiple object formats.

The assembler routines should work on most x86-64 operating systems, but for the moment only the elf64 and WIN64 output formats are supported.

The standard calling convention is followed, mostly in the Linux style; on Windows the parameters are moved around accordingly. The same assembler source code can be used on both platforms.

Website for downloading the Windows binary / RPM repo:

https://nasm.us/

I have updated the makefile to include the nasm command and the nasm flags, but I need help to make these driven by configure.

I also have no knowledge of other operating systems (macOS etc.).

The calling conventions can easily be adapted if they differ, but somebody else should jump in for testing.

If absolutely needed, nasm allows cross-assembling for a different platform, so the objects could be provided in a library for these cases.

For Windows the nasm support must be integrated into the generation of the *.vcxproj for Visual Studio.

I found the VSNASM project on GitHub, which explains how to integrate NASM into Visual Studio.

https://github.com/ShiftMediaProject/VSNASM

But I really need help from an expert to integrate it into the Perl building process.

My internal development on Windows uses manual assembly/linking so far.


I would much appreciate it if someone else could jump in with a patch for configure integration and another patch for .vcxproj integration.


OUTLOOK

Once the toolchain and global Postgres integration are done (these are totally new for me), this kind of vector (or perhaps matrix) acceleration is quite easy.

By identifying simple algorithms and using some architecture knowledge of the chosen platform, a new implementation is easily coded and debugged because the complexity is often limited (performance optimization may be a challenge).

The integration into Postgres remains quite local and is not very invasive.

The acceleration of the specific algorithm is really huge, even though it shifts the focus to other bottlenecks in the current code base. It makes the base algorithms almost disappear in CPU usage and extends the scale to the dimension of terabytes.

The whole architecture is thereby not limited to the Intel ISA (even if this is certainly the most common real-world use case) and can easily be adapted to other hardware architectures.

I have some other algorithms already in the pipeline, foremost hex_decode (which must still be debugged and checked for error handling), but during implementation I stumbled over base64_encode/decode, which also has its implementations coded.

I only want to start with a first project (hex_encode/hex_decode), targeting PG15 if possible and approved by the community. Then I'll try to polish/debug/document the whole project to bring it to a committable state.

There is much room for other implementations (checksum verification/setting, aggregation, the numeric datatype, merging, generate_series, integer and floating point output, ...) which could be addressed later on.

Due to my different background (not really a C hacker) I need some help from experts in specific areas. For coding Intel vector assembly for the project I can provide help with tips and reviews.

I have CC'd some people on the project who offered help or were already involved in this coding area.

Thank you all very much for your patience with this new project.

Hans Buschmann
On Fri, Dec 31, 2021 at 9:32 AM Hans Buschmann <buschmann@nidsa.net> wrote:

> Inspired by the effort to integrate JIT for executor acceleration, I thought that selected simple algorithms working with array-oriented data could be drastically accelerated by using SIMD instructions on modern hardware.

Hi Hans,

I have experimented with SIMD within Postgres last year, so I have
some idea of the benefits and difficulties. I do think we can profit
from SIMD more, but we must be very careful to manage complexity and
maximize usefulness. Hopefully I can offer some advice.

> - restrict to 64-bit architectures
>         These are the dominant server architectures and have the necessary data formats, registers, and operating instructions
> - start with Intel x86-64 SIMD instructions:
>         This is by far the most widely used platform, available for development and in practical use
> - don't restrict the concept to Intel x86-64 only, so that later people with more experience on other architectures can jump in and implement comparable algorithms
> - fall back to the established Postgres implementation in unsuitable cases or on user request (GUC)

These are all reasonable goals, except GUCs are the wrong place to
choose hardware implementations -- it should Just Work.

> - coding for maximum hardware usage instead of elegant programming
>         Once tested, the simple algorithm works as advertised and is used to replace most execution parts of the standard implementation in C

-1

Maintaining good programming style is a key goal of the project. There
are certainly non-elegant parts in the code, but that has a cost and
we must consider tradeoffs carefully. I have read some of the
optimized code in glibc and it is not fun. They at least know they are
targeting one OS and one compiler -- we don't have that luxury.

> - focus optimization on the most advanced SIMD instruction set: AVX512
>         This provides the most advanced instructions and quite a lot of large registers to aid in latency hiding

-1

AVX512 is a hodge-podge of different instruction subsets and is entirely lacking on some recent Intel server hardware. It is also only available from a single chipmaker thus far.

> - The loops implementing the algorithm are written in NASM assembler:
>         NASM is actively maintained, has many output formats, follows the Intel style, has all current instructions implemented, and is fast.

> - The loops are mostly independent of the operating system, so every OS for which NASM has an object output format is supported:
>         This includes Linux and Windows as the most important ones

> - The algorithms use advanced techniques (constant and temporary registers) to avoid most unnecessary memory accesses:
>         The assembly implementation gives you full control over the registers (unlike intrinsics)

On the other hand, intrinsics are easy to integrate into a C codebase
and relieve us from thinking about object formats. A performance
feature that happens to work only on common OS's is probably fine from
the user point of view, but if we have to add a lot of extra stuff to
make it work at all, that's not a good trade off. "Mostly independent"
of the OS is not acceptable -- we shouldn't have to think about the OS
at all when the coding does not involve OS facilities (I/O, processes,
etc).

> As an example, consider using pg_dump to dump a huge amount of bytea data (not uncommon in real applications). Most of these data are in toast tables, often uncompressed due to their inherent structure. The dump must read the toast pages into memory, decompose the page, hexdump the content, put the result in an output buffer, and trigger the I/O. By integrating all these steps into one, big performance improvements can be achieved (but naturally not here in my first implementation!).

Seems like a reasonable area to work on, but I've never measured.

> The best result I could achieve was roughly 95 seconds for 1 million dumps of 1718 KB on a Tiger Lake laptop using AVX512. This gives about an 18 GB/s source-hexdumping rate on a single core!
>
> In another run with Postgres, hexdumping about half a million tuples with a bytea column, yielding about 6 GB of output, reduced the time from about 68 seconds to 60 seconds, which clearly shows the Postgres overhead for executing the COPY command on such a data set.

I don't quite follow -- is this patched vs. unpatched Postgres? I'm
not sure what's been demonstrated.

> The assembler routines should work on most x86-64 operating systems, but for the moment only the elf64 and WIN64 output formats are supported.
>
> The standard calling convention is followed, mostly in the Linux style; on Windows the parameters are moved around accordingly. The same assembler source code can be used on both platforms.

> I have updated the makefile to include the nasm command and the nasm flags, but I need help to make these driven by configure.
>
> I also have no knowledge of other operating systems (macOS etc.).
>
> The calling conventions can easily be adapted if they differ, but somebody else should jump in for testing.

As I implied earlier, this is way too low-level. If we have to worry
about obj formats and calling conventions, we'd better be getting
something *really* amazing in return.

> But I really need help by an expert to integrate it in the perl building process.

> I would much appreciate it if someone else could jump in with a patch for configure integration and another patch for .vcxproj integration.

It's a bit presumptuous to enlist others for specific help without
general agreement on the design, especially on the most tedious parts.
Also, here's a general engineering tip: If the non-fun part is too
complex for you to figure out, that might indicate the fun part is too
ambitious. I suggest starting with a simple patch with SSE2 (always
present on x86-64) intrinsics, one that anyone can apply and test
without any additional work. Then we can evaluate if the speed-up in
the hex encoding case is worth some additional complexity. As part of
that work, it might be good to see if some portable improved algorithm
is already available somewhere.

> There is much room for other implementations (checksum verification/setting, aggregation, the numeric datatype, merging, generate_series, integer and floating point output, ...) which could be addressed later on.

Float output has already been pretty well optimized. CRC checksums
already have a hardware implementation on x86 and Arm. I don't know of
any practical workload where generate_series() is too slow.
Aggregation is an interesting case, but I'm not sure what the current
bottlenecks are.

--
John Naylor
EDB: http://www.enterprisedb.com