Improve compression speeds in pg_lzcompress.c

From Takeshi Yamamuro
Subject Improve compression speeds in pg_lzcompress.c
Date
Msg-id 50EA7976.5060809@lab.ntt.co.jp
Replies Re: Improve compression speeds in pg_lzcompress.c  (Simon Riggs <simon@2ndQuadrant.com>)
Re: Improve compression speeds in pg_lzcompress.c  (Andres Freund <andres@2ndquadrant.com>)
Re: Improve compression speeds in pg_lzcompress.c  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi, hackers,

The attached is a patch that improves compression speed, at the cost of some
compression ratio, in backend/utils/adt/pg_lzcompress.c. Recent modern
compression techniques like LZ4 and Google's Snappy inspired me to write
this patch. There are two points to my patch:

1. Skip at most 255 literals that might be incompressible
   during pattern matching for LZ compression.

2. Update a hash table every PGLZ_HASH_GAP literals.

A sequence of literals is typically a mix of compressible parts and
incompressible ones, so IMHO it is reasonable to skip PGLZ_SKIP_SIZE literals
each time a match is not found. The skipped literals are simply copied to the
output buffer, so pglz_out_literal() is rewritten (and renamed to
pglz_out_literals) so that it copies multiple bytes, not a single byte.
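
For illustration only, here is a minimal sketch of that idea (my own
simplification, not the patch itself): when no match is found, up to
PGLZ_SKIP_SIZE literals are copied to the output with one memcpy instead of
one byte per loop iteration. The skip length and the stubbed-out match search
are placeholders; in the patch a run is capped at 255 literals (presumably so
the run length fits in a single byte).

#include <stddef.h>
#include <string.h>

#define PGLZ_SKIP_SIZE  8   /* placeholder value; the real one is defined in the patch */

/* Copy a run of 'len' uncompressed bytes to the output in a single memcpy. */
static size_t
out_literals(unsigned char *dst, const unsigned char *src, size_t len)
{
    memcpy(dst, src, len);
    return len;
}

/* Simplified main loop: on a miss, consume PGLZ_SKIP_SIZE literals at once. */
static size_t
compress_sketch(const unsigned char *in, size_t inlen, unsigned char *out)
{
    size_t  ip = 0;
    size_t  op = 0;

    while (ip < inlen)
    {
        size_t  match_len = 0;  /* a real implementation searches the history table here */

        if (match_len == 0)
        {
            size_t  run = (inlen - ip < PGLZ_SKIP_SIZE) ? inlen - ip : PGLZ_SKIP_SIZE;

            op += out_literals(out + op, in + ip, run);
            ip += run;
        }
        else
        {
            /* emit a tag (offset, length) that references the earlier occurrence */
            ip += match_len;
        }
    }
    return op;
}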

Also, the current implementation updates the hash table for every single
literal. These updates obviously eat much processor time, so skipping some of
them dynamically improves compression speed.
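
Again only as a rough sketch under assumed names and values (the gap, the
table size, and the toy hash are placeholders; pg_lzcompress.c has its own
pglz_hist_idx()), the idea is to insert a history entry only every
PGLZ_HASH_GAP positions instead of at every byte:

#include <stddef.h>

#define PGLZ_HASH_GAP   4       /* placeholder value; the real one is defined in the patch */
#define HIST_SIZE       8192    /* placeholder table size */

static int  hist_table[HIST_SIZE];  /* most recent input position per bucket */

static unsigned int
hist_idx(const unsigned char *p)
{
    /* toy 3-byte hash, only for illustration */
    return ((p[0] << 6) ^ (p[1] << 3) ^ p[2]) % HIST_SIZE;
}

/* Register input positions [from, to), stepping by PGLZ_HASH_GAP rather than by 1. */
static void
update_history(const unsigned char *in, size_t from, size_t to)
{
    size_t  pos;

    for (pos = from; pos + 2 < to; pos += PGLZ_HASH_GAP)
        hist_table[hist_idx(in + pos)] = (int) pos;
}

Fewer table writes means less bookkeeping per input byte, which is where the
speedup comes from; the cost is that some match candidates are never
registered, which is the compression-ratio loss mentioned above.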

I've done quick comparison tests on a Xeon 5670 processor. Log sequences from
Apache Hadoop and TREC GOV2 web data were used as test sets. The former is
highly compressible (low entropy) and the latter is difficult to compress
(high entropy).

*******************
                         Compression Speed (Ratio)
Apache hadoop logs:
gzip                            78.22MiB/s ( 5.31%)
bzip2                            3.34MiB/s ( 3.04%)
lz4                            939.45MiB/s ( 9.17%)
pg_lzcompress(original)         37.80MiB/s (11.76%)
pg_lzcompress(patch applied)    99.42MiB/s (14.19%)

TREC GOV2 web data:
gzip                            21.22MiB/s (32.66%)
bzip2                            8.61MiB/s (27.86%)
lz4                            250.98MiB/s (49.82%)
pg_lzcompress(original)         20.44MiB/s (50.09%)
pg_lzcompress(patch applied)    48.67MiB/s (61.87%)

*******************

Obviously, both the compression ratio and the speed of the current
implementation are inferior to gzip's. My patch loses to gzip and bzip2 in
terms of compression ratio, but its compression speed beats both of them.

Anyway, lz4's compression speed is very fast, so in my opinion there is
room to improve the current implementation of pg_lzcompress.

regards,
--
----
Takeshi Yamamuro
NTT Cyber Communications Laboratory Group
Software Innovation Center
(Open Source Software Center)
Tel: +81-3-5860-5057 Fax: +81-3-5463-5490
Mail:yamamuro.takeshi@lab.ntt.co.jp

Attachments
