Proposal for Updating CRC32C with AVX-512 Algorithm.

From: Amonson, Paul D
Subject: Proposal for Updating CRC32C with AVX-512 Algorithm.
Msg-id BL1PR11MB530401FA7E9B1CA432CF9DC3DC192@BL1PR11MB5304.namprd11.prod.outlook.com
Replies: RE: Proposal for Updating CRC32C with AVX-512 Algorithm.
List: pgsql-hackers
Hi,

Comparing the current SSE4.2 implementation of the CRC32C algorithm in Postgres to an optimized AVX-512 algorithm [0],
we observed significant gains: throughput on buffers of 256 to 8192 bytes increased by ~6.6X on average, measured on 3
different Intel products. Details below. The AVX-512 algorithm in C is a port of the ISA-L library [1] assembler code.

Workload call size distribution details (write heavy):
   * Average was approximately 1,010 bytes per call
   * ~80% of the calls were under 256 bytes
   * ~20% of the calls were greater than or equal to 256 bytes up to the max buffer size of 8192

The 256-byte threshold is important because the AVX-512 algorithm needs a minimum of 256 bytes to operate, so for
smaller buffers it makes sense to fall back to the existing implementation.
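
The size-based fallback described above could be sketched roughly as follows. This is only an illustration, not the proposed patch: the names crc32c_dispatch, crc32c_small, crc32c_large and AVX512_MIN_LEN are hypothetical, and both code paths are stood in for by the same portable bitwise CRC32C routine so that the sketch compiles on any machine. In a real build the two paths would be the existing SSE4.2 code and the AVX-512 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: threshold taken from the 256-byte minimum stated above. */
#define AVX512_MIN_LEN 256

/*
 * Portable bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
 * Takes and returns the running CRC; the caller applies the initial and
 * final inversion.
 */
static uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Hypothetical stand-in for the existing SSE4.2 path. */
static uint32_t
crc32c_small(uint32_t crc, const void *data, size_t len)
{
    return crc32c_bitwise(crc, data, len);
}

/* Hypothetical stand-in for the AVX-512 path. */
static uint32_t
crc32c_large(uint32_t crc, const void *data, size_t len)
{
    return crc32c_bitwise(crc, data, len);
}

/* Returns 1 if the large-buffer path would be chosen for this length. */
static int
crc32c_use_large_path(size_t len)
{
    return len >= AVX512_MIN_LEN;
}

/* Route short buffers to the existing code, long ones to the new code. */
static uint32_t
crc32c_dispatch(uint32_t crc, const void *data, size_t len)
{
    assert(data != NULL || len == 0);
    if (crc32c_use_large_path(len))
        return crc32c_large(crc, data, len);
    return crc32c_small(crc, data, len);
}
```

A caller would start from 0xFFFFFFFF, pass the buffer through crc32c_dispatch, and invert the result at the end. Keeping the length check in one small helper makes the threshold easy to tune if benchmarks suggest a different cutoff.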

Using the above workload and varying the share of calls under 256 bytes:
at 0%    calls < 256 bytes, an 841% improvement on average for crc32c functionality was observed.
at 50%   calls < 256 bytes, a 758% improvement on average for crc32c functionality was observed.
at 90%   calls < 256 bytes, a 44% improvement on average for crc32c functionality was observed.
at 97.6% calls < 256 bytes, the workload's crc32c performance breaks even.
at 100%  calls < 256 bytes, a 14% regression is seen when using the AVX-512 implementation.

The results above are averages over 3 machines, measured on Intel Sapphire Rapids bare metal and on two AWS EC2
instances: Intel Sapphire Rapids (m7i.2xlarge) and Intel Ice Lake (m6i.2xlarge).

Summary Data (Sapphire Rapids bare metal, AWS m6i-2xl, and AWS m7i-2xl):
+---------------------+-------------------+-------------------+-------------------+--------------------+
| Rates in Bytes/us   |     Bare Metal    |    AWS m6i-2xl    |   AWS m7i-2xl     |                    |
| (Larger is Better)  +---------+---------+---------+---------+---------+---------+ Overall Multiplier |
|                     | SSE 4.2 | AVX-512 | SSE 4.2 | AVX-512 | SSE 4.2 | AVX-512 |                    |
+---------------------+---------+---------+---------+---------+---------+---------+--------------------+
| Buffers 256 - 8192  |  12,046 |  83,196 |   7,471 |  39,965 |  11,867 |  84,589 |        6.62        |
+---------------------+---------+---------+---------+---------+---------+---------+--------------------+
| Buffers 64 - 255    |  16,865 |  15,909 |   9,209 |   7,363 |  12,496 |  10,046 |        0.86        |
+---------------------+---------+---------+---------+---------+---------+---------+--------------------+
                                                    |  Weighted Multiplier [*]    |        1.44        |
                                                    +-----------------------------+--------------------+

Perf data showed no evidence of AVX-512 frequency throttling; the clock frequency stayed steady during the test.
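
As a cross-check, the multipliers in the table can be reproduced from the per-machine rates. The sketch below assumes (this is an inference from the numbers, not something stated in this mail) that each Overall Multiplier is the sum of the AVX-512 rates divided by the sum of the SSE 4.2 rates, and that the weighted figure is a linear blend using the weights given in [*]. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the new rates divided by the sum of the old rates. */
static double
ratio_of_sums(const double *old_rates, const double *new_rates, size_t n)
{
    double old_sum = 0.0;
    double new_sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
    {
        old_sum += old_rates[i];
        new_sum += new_rates[i];
    }
    return new_sum / old_sum;
}

/* Linear blend: frac_small of calls see small_mult, the rest large_mult. */
static double
blended_multiplier(double frac_small, double small_mult, double large_mult)
{
    return frac_small * small_mult + (1.0 - frac_small) * large_mult;
}

/* Fraction of small calls at which the blend equals 1.0 (break-even). */
static double
break_even_fraction(double small_mult, double large_mult)
{
    return (large_mult - 1.0) / (large_mult - small_mult);
}

/* Rates in bytes/us from the table: bare metal, m6i-2xl, m7i-2xl. */
static const double sse_large[] = {12046, 7471, 11867};
static const double avx_large[] = {83196, 39965, 84589};
static const double sse_small[] = {16865, 9209, 12496};
static const double avx_small[] = {15909, 7363, 10046};
```

Under these assumptions, ratio_of_sums(sse_large, avx_large, 3) comes out at about 6.62 and the small-buffer ratio at about 0.86. Blending the two with 90% small calls gives about 1.44, and the break-even fraction is about 97.6%, matching the table and the break-even point quoted earlier.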

Feedback on this proposed improvement is appreciated. Some questions:
1) This ISA-L-derived AVX-512 code uses the BSD-3-Clause license [2]. Is this compatible with the PostgreSQL License [3]?
Both appear to be very permissive licenses, but I am not an expert on licenses.
2) Is there a preferred benchmark I should run to test this change?

If licensing is a non-issue, I can post the initial patch along with my Postgres benchmark function patch for further
review.

Thanks,
Paul

[0] https://www.researchgate.net/publication/263424619_Fast_CRC_computation#full-text
[1] https://github.com/intel/isa-l
[2] https://opensource.org/license/bsd-3-clause
[3] https://opensource.org/license/postgresql

[*] Weights used were 90% of requests less than 256 bytes, 10% greater than or equal to 256 bytes.


