Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode

Поиск
Список
Период
Сортировка
От Hans Buschmann
Тема Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode
Дата
Msg-id 3be446a75a8145d4b558d1f4e6fff22f@W2012-02.nidsa.loc
обсуждение исходный текст
Ответы AW: Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode  (Hans Buschmann <buschmann@nidsa.net>)
Список pgsql-hackers
INTENTION

Inspired by the effort to integrate JIT for executor acceleration I thought selected simple algorithms working with
array-orienteddata should be drastically accelerated by using SIMD instructions on modern hardware. 

I want to introduce this style of programming with the example of hex_encode:
- operates on arrays (bytea)
- simple algorithm
- in some situations partially limiting the performance (e.g pg_dump)

IMPLEMENTATION GUIDELINES

The main goal ist to accelerate common cases on the most common hardware by exploiting all the resources the hardware
delivers.
The following guidelines took me to a first implementation:

- restrict on 64 -bit architectures
    These are the dominant server architectures, have the necessary data formats and corresponding registers and
operatinginstructions 
- start with Intel x86-64 SIMD instructions:
    This is the vastly most used platform, available for development and in practical use
- don’t restrict the concept to only Intel x86-64, so that later people with more experience on other architectures can
jumpin and implement comparable algorithms 
- fallback to the established implementation in postgres in non appropriate cases or on user request (GUC)
- implementation of leaf function/procedures in assembly language
    These consist mostly of a central loop without calling subroutines or doing additionally branching

- coding for maximum hardware usage instead of elegant programming
    Once tested, the simple algorithm works as advertised and is used to replace most execution parts of the standard
implementaionin C 

- isolated footprint by integrating it only in the specific subroutine (here hex-encode)
    This ensures that the requirements for fast execution are met (e.g. buffer sizes) and no repeated checks are needed
likein a library use case. 

- trying to keep both vector execution ports always doing useful work by avoiding waits for latencies

- trying to access memory in linear fashion (reading from input buffer, writing to output buffer) to avaoid internal
cacheproblems 
- focus optimization for the most advanced SIMD instruction set: AVX512
    This provides the most advanced instructions and  quite a lot of large registers to aid in latency avoiding

- if possible provide fallback implementations on older SIMD standards (e.g. AVX2 or SSE2)
    This is usefull on many older servers and client processors, but due to their too little number of registers
latencyavoiding or full execution queues cannot be fully achieved. 


IMPLEMENTATION DETAILS

- The loops implementing the algorithm are written in NASM assembler:
    NASM is actively maintained, has many output formats, follows the Intel style, has all current instrucions
implementedand is fast. 

- The loops are mostly independent of operating systems, so all OS’s basing on a NASM obj output format are supported:
    This includes Linux and Windows as the most important ones

- The algorithms use advanced techniques (constant and temporary registers) to avoid most unnessary memory accesses:
    The assembly implementation gives you the full control over the registers (unlike intrinsics)

- Multiple dependency chains work interleaved to minimize latencies:
    Coding is often interspersed and using almost all registers available.

- Some instructions (Moves, zeroing) are executed outside the processor execution ports:
    These don’t consume execution cyles on a port but their latency has to be considered.

- Some vector instructions (multiply add) have latencies of 5 for example:
    This means that after the instruction is issued, the processor has to wait 5 cycles until the result can be used in
thesame dependency chain. To avoid this and keep all vector execution ports (p0 and p5) busy you have to have 9 other
instructionsin between doing work on other streams of the algorithm to maximize hardware usage and overall performance. 

- All loops are implemented as separate C-callable functions (according to the OS calling convention):
    They are all leaf functions by calling no other subroutines.

- The decision which implementation is choosen is done at the caller side by a special dispatcher routine:
    The caller handles the architectural capabilites (instruction sets available) and knows the required work: There is
oftena suitable minimum amount of work required for efficently calling a provided implementation. 

- Loops should be run at least 2-4 times to compensate for initializing overhead:
    This implicits a certain amount of minimum work count based on the specific SIMD implementations

- The loops terminate after detecting an error (e.g. wrong input data) and return the succesfull completed amount of
work:
    The standard linear implementation takes over with the already established error-handling.

- The loops work optimally with some extra output buffer space at the end to be able to overshoot in the last round:
    Nonethless the correct amount of work is returned to the caller and a vector size of output buffer following the
realresult is zeroed out (Currently disabled!) 

- the loop may preload some data after the input buffer but assures that the following page boundary is never crossed
toavoid any access violation: 
    This makes no harm to the memory system because the output buffer has a supplemental buffer at the end, but this
couldbe changed to leaving the tail handling to the standard implementaion if deemed unsupportable (as for now). 


(to be continued...)


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: Adding CI to our tree
Следующее
От: Hans Buschmann
Дата:
Сообщение: AW: Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode