At 2014-11-11 16:56:00 +0530, ams@2ndQuadrant.com wrote:
>
> I'm working on this (first speeding up the default calculation using
> slice-by-N, then adding support for the SSE4.2 CRC instruction on
> top).
I've done the first part in the attached patch, and I'm working on the
second (especially the bits to issue CPUID at startup and decide which
implementation to use).
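For illustration, here is a minimal sketch of what the startup check
could look like. It is not taken from the patch. It assumes GCC or clang
on x86 (for <cpuid.h> and __get_cpuid), and every sketch_* name is
hypothetical. The one hard fact it relies on is that CPUID leaf 1 reports
SSE4.2 support in bit 20 of ECX.

```c
#include <assert.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>				/* GCC/clang helper header for CPUID */
#define SKETCH_HAVE_CPUID 1
#endif

/* CPUID leaf 1 reports SSE4.2 support in bit 20 of ECX. */
#define SKETCH_CPUID_ECX_SSE42 (1u << 20)

/* Hypothetical names for the two candidate implementations. */
enum sketch_crc_impl
{
	SKETCH_CRC_SLICE8,
	SKETCH_CRC_SSE42
};

/* Returns 1 if the CPU advertises SSE4.2, 0 otherwise (or if unknown). */
static int
sketch_have_sse42(void)
{
#ifdef SKETCH_HAVE_CPUID
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid returns 0 if the requested leaf is not supported. */
	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return (ecx & SKETCH_CPUID_ECX_SSE42) != 0;
#endif
	return 0;
}

/* Decide once, on first call, and remember the answer. */
enum sketch_crc_impl
sketch_choose_crc_impl(void)
{
	static int chosen = -1;

	if (chosen < 0)
		chosen = sketch_have_sse42() ? SKETCH_CRC_SSE42 : SKETCH_CRC_SLICE8;
	return (enum sketch_crc_impl) chosen;
}

/* Human-readable name, e.g. for a log line at startup. */
const char *
sketch_crc_impl_name(enum sketch_crc_impl impl)
{
	assert(impl == SKETCH_CRC_SLICE8 || impl == SKETCH_CRC_SSE42);
	return impl == SKETCH_CRC_SSE42 ? "sse4.2" : "slice-by-8";
}
```

On anything that is not x86, or with a compiler that lacks <cpuid.h>,
the sketch simply falls back to the software implementation.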
As a benchmark, I ran pg_xlogdump --stats against 11 GB of WAL data (674
segments) generated by running a total of 2M pgbench transactions on a
database initialised with scale factor 25. The tests were run on my
i5-3230 CPU, and the code in each case was compiled with "-O3 -msse4.2"
(and without --enable-debug). The profile was dominated by the CRC
calculation in ValidXLogRecord.
With HEAD's CRC code:
bin/pg_xlogdump --stats wal/000000010000000000000001 29.81s user 3.56s system 77% cpu 43.274 total
bin/pg_xlogdump --stats wal/000000010000000000000001 29.59s user 3.85s system 75% cpu 44.227 total
With slice-by-4 (a minor variant of the attached patch; the results are
included only for curiosity's sake, but I can post the code if needed):
bin/pg_xlogdump --stats wal/000000010000000000000001 13.52s user 3.82s system 48% cpu 35.808 total
bin/pg_xlogdump --stats wal/000000010000000000000001 13.34s user 3.96s system 48% cpu 35.834 total
With slice-by-8 (i.e. the attached patch):
bin/pg_xlogdump --stats wal/000000010000000000000001 7.88s user 3.96s system 34% cpu 34.414 total
bin/pg_xlogdump --stats wal/000000010000000000000001 7.85s user 4.10s system 34% cpu 35.001 total
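(Note the progressive reduction in user time, from ~29.7s with HEAD to
~13.4s with slice-by-4 and ~7.9s with slice-by-8. Elapsed time improves
less, from ~44s to ~35s.)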
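For readers unfamiliar with the technique, here is a minimal sketch of
slice-by-8, written independently of the attached patch. It assumes the
reflected CRC-32C polynomial (0x82F63B78) purely so that the result can
be checked against a well-known test value. The patch's polynomial and
table layout may differ, and all sketch_* names are hypothetical. The
idea is to consume eight input bytes per loop iteration using eight
256-entry tables, instead of one byte per iteration with one table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_POLY 0x82F63B78u	/* reflected CRC-32C; an assumption */

static uint32_t sketch_table[8][256];
static int	sketch_table_ready = 0;

/* table[k][i] is the CRC of byte i followed by k zero bytes. */
static void
sketch_init_tables(void)
{
	for (uint32_t i = 0; i < 256; i++)
	{
		uint32_t	crc = i;

		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? SKETCH_POLY : 0);
		sketch_table[0][i] = crc;
	}
	for (int k = 1; k < 8; k++)
		for (uint32_t i = 0; i < 256; i++)
		{
			uint32_t	prev = sketch_table[k - 1][i];

			sketch_table[k][i] = (prev >> 8) ^ sketch_table[0][prev & 0xFF];
		}
	sketch_table_ready = 1;
}

/* Classic one-byte-per-iteration loop, also used for the tail. */
static uint32_t
sketch_update_bytewise(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len-- > 0)
		crc = (crc >> 8) ^ sketch_table[0][(crc ^ *p++) & 0xFF];
	return crc;
}

/* Eight bytes per iteration; bytes are assembled endian-independently. */
static uint32_t
sketch_update_slice8(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len >= 8)
	{
		uint32_t	lo = crc ^ ((uint32_t) p[0] | (uint32_t) p[1] << 8 |
								(uint32_t) p[2] << 16 | (uint32_t) p[3] << 24);
		uint32_t	hi = (uint32_t) p[4] | (uint32_t) p[5] << 8 |
						 (uint32_t) p[6] << 16 | (uint32_t) p[7] << 24;

		crc = sketch_table[7][lo & 0xFF] ^
			sketch_table[6][(lo >> 8) & 0xFF] ^
			sketch_table[5][(lo >> 16) & 0xFF] ^
			sketch_table[4][lo >> 24] ^
			sketch_table[3][hi & 0xFF] ^
			sketch_table[2][(hi >> 8) & 0xFF] ^
			sketch_table[1][(hi >> 16) & 0xFF] ^
			sketch_table[0][hi >> 24];
		p += 8;
		len -= 8;
	}
	return sketch_update_bytewise(crc, p, len);
}

uint32_t
sketch_crc32c_bytewise(const void *data, size_t len)
{
	assert(data != NULL || len == 0);
	if (!sketch_table_ready)
		sketch_init_tables();
	return sketch_update_bytewise(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}

uint32_t
sketch_crc32c_slice8(const void *data, size_t len)
{
	assert(data != NULL || len == 0);
	if (!sketch_table_ready)
		sketch_init_tables();
	return sketch_update_slice8(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}
```

Both entry points should return 0xE3069283 for the nine ASCII bytes
"123456789", which is the standard check value for CRC-32C. The eight
table lookups in one iteration do not depend on each other, so the CPU
can overlap them; that is the usual explanation for the speedup.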
Finally, just for comparison, here's what happens when we use the
hardware instruction via gcc's __builtin_ia32_crc32xx intrinsics
(i.e. the additional patch I'm working on):
bin/pg_xlogdump --stats wal/000000010000000000000001 3.33s user 4.79s system 23% cpu 34.832 total
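As a rough sketch of how those intrinsics are typically used (again, not
the patch itself): the loop below feeds eight bytes at a time to
__builtin_ia32_crc32di and the remaining bytes to __builtin_ia32_crc32qi.
It assumes GCC or clang on x86-64, uses a target attribute so it builds
without -msse4.2, and guards the call with __builtin_cpu_supports. All
sketch_* names are hypothetical. Note that the instruction computes
CRC-32C (the Castagnoli polynomial), so its output matches software code
only if that code uses the same polynomial.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit-at-a-time CRC-32C: portable reference and fallback. */
static uint32_t
sketch_hw_bitwise(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len-- > 0)
	{
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78u : 0);
	}
	return crc;
}

#if defined(__x86_64__) && defined(__GNUC__)
#define SKETCH_HAVE_SSE42_PATH 1

/* Compiled for SSE4.2 regardless of the command-line -m flags. */
static __attribute__((target("sse4.2"))) uint32_t
sketch_hw_sse42(uint32_t crc, const unsigned char *p, size_t len)
{
	uint64_t	crc64 = crc;

	while (len >= 8)
	{
		uint64_t	chunk;

		memcpy(&chunk, p, sizeof(chunk));	/* unaligned-safe load */
		crc64 = __builtin_ia32_crc32di(crc64, chunk);
		p += 8;
		len -= 8;
	}
	crc = (uint32_t) crc64;
	while (len-- > 0)
		crc = __builtin_ia32_crc32qi(crc, *p++);
	return crc;
}
#endif

/* Uses the instruction when the CPU has it, the bitwise loop otherwise. */
uint32_t
sketch_hw_crc32c(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t	crc = 0xFFFFFFFFu;

	assert(data != NULL || len == 0);
#ifdef SKETCH_HAVE_SSE42_PATH
	if (__builtin_cpu_supports("sse4.2"))
		return sketch_hw_sse42(crc, p, len) ^ 0xFFFFFFFFu;
#endif
	return sketch_hw_bitwise(crc, p, len) ^ 0xFFFFFFFFu;
}
```

Either path should give 0xE3069283 for the ASCII string "123456789",
the standard CRC-32C check value.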
There are a number of potential micro-optimisations, but I wanted to
submit the obvious thing first and explore more possibilities later.
-- Abhijit