Proposal: custom compression methods

From Alexander Korotkov
Subject Proposal: custom compression methods
Date
Msg-id CAPpHfdsdTA5uZeq6MNXL5ZRuNx+Sig4ykWzWEAfkC6ZKMDy6=Q@mail.gmail.com
Replies Re: Proposal: custom compression methods  (Craig Ringer <craig@2ndquadrant.com>)
Re: Proposal: custom compression methods  (Simon Riggs <simon@2ndQuadrant.com>)
Re: Proposal: custom compression methods  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
Hackers,

I'd like to propose a new feature: "Custom compression methods".

Motivation

Currently, when a datum doesn't fit on a page, PostgreSQL tries to compress it using the PGLZ algorithm. Compression of particular attributes can be turned on or off by tuning the storage parameter of the column. There is also a heuristic that treats a datum as incompressible when its first KB is not compressible. I see the following reasons for improving this situation.

 * The heuristic used to detect compressible data may not be optimal. We have already run into this with jsonb.
 * For some data distributions there could be more effective compression methods than PGLZ. For example:
     * For natural languages we could use predefined dictionaries, which would allow us to compress even relatively short strings (ones not long enough for PGLZ to train its dictionary).
     * For jsonb/hstore we could implement compression methods that keep a dictionary of keys. This could be either a static predefined dictionary or a dynamically appended dictionary with some storage.
     * For jsonb and other container types we could implement compression methods that allow extraction of particular fields without decompressing the full value.
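
To make the predefined-dictionary idea concrete, here is a minimal sketch in C. It assumes 7-bit ASCII input and a tiny hard-coded word list; the names dict_words, dict_compress and dict_decompress are hypothetical and are not part of PostgreSQL or of this proposal. Each dictionary hit is replaced by a single byte with the high bit set, so even a very short string shrinks.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical predefined dictionary; a real one would be built per language. */
static const char *const dict_words[] = { "the ", "and ", "compression ", "method " };
#define DICT_SIZE (sizeof(dict_words) / sizeof(dict_words[0]))

/*
 * Encode 7-bit ASCII text: a dictionary hit becomes one byte (0x80 + index),
 * any other byte is copied verbatim. Returns the output length, or -1 if the
 * input contains a byte >= 0x80 or dest (capacity dcap) is too small.
 */
int dict_compress(const char *src, int slen, char *dest, int dcap)
{
    int si = 0, di = 0;

    while (si < slen) {
        size_t w;
        int matched = 0;

        if ((unsigned char) src[si] >= 0x80 || di >= dcap)
            return -1;
        for (w = 0; w < DICT_SIZE; w++) {
            int wlen = (int) strlen(dict_words[w]);

            if (wlen <= slen - si && memcmp(src + si, dict_words[w], (size_t) wlen) == 0) {
                dest[di++] = (char) (0x80 + w);
                si += wlen;
                matched = 1;
                break;
            }
        }
        if (!matched)
            dest[di++] = src[si++];
    }
    return di;
}

/* Reverse of dict_compress. Returns the output length, or -1 on bad input. */
int dict_decompress(const char *src, int slen, char *dest, int dcap)
{
    int si, di = 0;

    for (si = 0; si < slen; si++) {
        unsigned char c = (unsigned char) src[si];

        if (c >= 0x80) {
            int wlen;

            if (c - 0x80 >= (int) DICT_SIZE)
                return -1;              /* unknown dictionary index */
            wlen = (int) strlen(dict_words[c - 0x80]);
            if (di + wlen > dcap)
                return -1;
            memcpy(dest + di, dict_words[c - 0x80], (size_t) wlen);
            di += wlen;
        } else {
            if (di >= dcap)
                return -1;
            dest[di++] = (char) c;
        }
    }
    return di;
}
```

With this word list, the 31-byte string "the method and the compression " encodes to 5 bytes, which a general-purpose compressor would have little chance of matching on such a short input.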

Therefore, it would be nice to make compression methods pluggable.

Design

Compression methods would be stored in the pg_compress system catalog table, which would have the following structure:

compname        name
comptype        oid
compcompfunc    regproc
compdecompfunc  regproc
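
As a rough illustration of how one row of this catalog might look in memory, here is a sketch in C. The struct name CompressionMethodRow, the sample rows, the function find_compression_method and every OID value are assumptions made for this example; regproc columns are represented as function OIDs.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NAMEDATALEN 64          /* length of a "name" column, as in a default build */
typedef uint32_t Oid;           /* stand-in for PostgreSQL's Oid type */

/* Hypothetical in-memory form of one pg_compress row */
typedef struct CompressionMethodRow {
    char compname[NAMEDATALEN]; /* method name */
    Oid  comptype;              /* 0 = type-agnostic, else the type it applies to */
    Oid  compcompfunc;          /* regproc: OID of the compress function */
    Oid  compdecompfunc;        /* regproc: OID of the decompress function */
} CompressionMethodRow;

/* Sample rows; all OIDs here are placeholders invented for illustration */
static const CompressionMethodRow sample_catalog[] = {
    { "pglz",          0,    9001, 9002 },
    { "jsonb_keydict", 3802, 9003, 9004 },
};

/* Look a method up by name; returns NULL when no such method exists. */
const CompressionMethodRow *find_compression_method(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(sample_catalog) / sizeof(sample_catalog[0]); i++)
        if (strncmp(sample_catalog[i].compname, name, NAMEDATALEN) == 0)
            return &sample_catalog[i];
    return NULL;
}
```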

Compression methods could be created with the "CREATE COMPRESSION METHOD" command and deleted with the "DROP COMPRESSION METHOD" command.

CREATE COMPRESSION METHOD compname [FOR TYPE comptype_name]
    WITH COMPRESS FUNCTION compcompfunc_name
         DECOMPRESS FUNCTION compdecompfunc_name;
DROP COMPRESSION METHOD compname;

The signatures of compcompfunc and compdecompfunc would be similar to those of pglz_compress and pglz_decompress, except for the compression strategy argument. There is only one compression strategy in use for pglz (PGLZ_strategy_default). Thus, I'm not sure it would be useful to provide multiple strategies for compression methods.

extern int32 compcompfunc(const char *source, int32 slen, char *dest);
extern int32 compdecompfunc(const char *source, int32 slen, char *dest, int32 rawsize);
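
To show what a method with these two signatures could look like, here is a toy run-length encoder in C. It is only a sketch: the names rle_compress and rle_decompress are hypothetical, and the conventions that -1 means failure and that dest must hold at least slen bytes are assumptions of this example rather than something the proposal fixes.

```c
#include <stdint.h>
#include <string.h>

typedef int32_t int32;

/*
 * Encode the input as (run length, byte) pairs. dest must hold slen bytes.
 * Returns the compressed length, or -1 when the input is empty or the
 * output would not be smaller than the input.
 */
int32 rle_compress(const char *source, int32 slen, char *dest)
{
    int32 si = 0, di = 0;

    if (slen <= 0)
        return -1;
    while (si < slen) {
        char  c = source[si];
        int32 run = 1;

        while (si + run < slen && source[si + run] == c && run < 255)
            run++;
        if (di + 2 >= slen)
            return -1;          /* no gain: caller keeps the raw datum */
        dest[di++] = (char) run;
        dest[di++] = c;
        si += run;
    }
    return di;
}

/*
 * Expand (run length, byte) pairs. dest must hold rawsize bytes.
 * Returns rawsize on success, or -1 if the input is malformed or does not
 * expand to exactly rawsize bytes.
 */
int32 rle_decompress(const char *source, int32 slen, char *dest, int32 rawsize)
{
    int32 si, di = 0;

    if (slen % 2 != 0)
        return -1;
    for (si = 0; si < slen; si += 2) {
        int32 run = (unsigned char) source[si];

        if (run == 0 || run > rawsize - di)
            return -1;
        memset(dest + di, source[si + 1], (size_t) run);
        di += run;
    }
    return di == rawsize ? di : -1;
}
```

Because the compress function receives no output capacity, the sketch gives up as soon as the output would reach the input size, so the caller never needs a buffer larger than the source.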

A compression method could be type-agnostic (comptype = 0) or type-specific (comptype != 0). The default compression method is PGLZ.
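
The applicability rule implied here can be sketched as a single check in C. The function name compression_method_applies and the OIDs used to exercise it are hypothetical.

```c
#include <stdint.h>

typedef uint32_t Oid;           /* stand-in for PostgreSQL's Oid type */
#define InvalidOid ((Oid) 0)

/*
 * Returns 1 if a method whose catalog entry has the given comptype may be
 * attached to a column of type coltype, and 0 otherwise. A comptype of 0
 * marks a type-agnostic method, which fits any column.
 */
int compression_method_applies(Oid comptype, Oid coltype)
{
    return comptype == InvalidOid || comptype == coltype;
}
```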

The compression method of a column would be stored in the pg_attribute table. Dependencies between columns and compression methods would be tracked in pg_depend, preventing a compression method that is currently in use from being dropped. The compression method of an attribute could be altered with the ALTER TABLE command.

ALTER TABLE table_name ALTER COLUMN column_name SET COMPRESSION METHOD compname;

Since mixing different compression methods in the same attribute would be hard to manage (especially dependency tracking), altering an attribute's compression method would require a table rewrite.

Implementation details

Catalog changes, new commands, dependency tracking, etc. are mostly mechanical work with no fundamental problems. The hardest part seems to be integrating custom compression methods seamlessly into the existing code.

It doesn't seem hard to add an extra parameter with the compression method to toast_compress_datum. However, PG_DETOAST_DATUM has to call the custom decompress function knowing only the datum. That means we would somehow have to embed knowledge of the compression method into the datum itself. One solution could be to put the compression method oid right after the varlena header. Putting this on disk would cause storage overhead and break backward compatibility. Thus, we could add this oid right after reading the datum from the page. This could be the weakest point of the whole proposal, and I'd be very glad to hear better ideas.
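
The "oid right after the header" idea can be sketched in C as follows. This is a simplified model, not the real varlena layout: the header here is assumed to be just a 4-byte raw size, and tag_datum, detoast_tagged, copy_decompress, the registry and the OID 9001 are all hypothetical names and values made up for illustration.

```c
#include <stdint.h>
#include <string.h>

typedef int32_t  int32;
typedef uint32_t Oid;

#define HDRSZ 4   /* simplified header: only the raw (uncompressed) size */

typedef int32 (*decompress_fn)(const char *source, int32 slen,
                               char *dest, int32 rawsize);

/* Trivial stand-in method: the payload is stored as-is */
static int32 copy_decompress(const char *source, int32 slen,
                             char *dest, int32 rawsize)
{
    if (slen != rawsize)
        return -1;
    memcpy(dest, source, (size_t) slen);
    return rawsize;
}

/* Hypothetical registry mapping a method OID to its decompress function */
static const struct { Oid oid; decompress_fn fn; } methods[] = {
    { 9001, copy_decompress },
};

/*
 * Build the in-memory copy of a datum just read from a page: header, then
 * the method OID, then the payload. out must hold len + sizeof(Oid) bytes.
 * Returns the tagged length, or -1 if the datum is shorter than its header.
 */
int32 tag_datum(const char *ondisk, int32 len, Oid method, char *out)
{
    if (len < HDRSZ)
        return -1;
    memcpy(out, ondisk, HDRSZ);
    memcpy(out + HDRSZ, &method, sizeof(Oid));
    memcpy(out + HDRSZ + sizeof(Oid), ondisk + HDRSZ, (size_t) (len - HDRSZ));
    return len + (int32) sizeof(Oid);
}

/*
 * Decompress knowing only the tagged datum: read the OID that follows the
 * header and dispatch to the matching function. Returns the raw size, or -1
 * for a malformed datum, an unknown method, or a dest (capacity dcap) that
 * is too small.
 */
int32 detoast_tagged(const char *tagged, int32 len, char *dest, int32 dcap)
{
    int32  rawsize;
    Oid    method;
    size_t i;

    if (len < HDRSZ + (int32) sizeof(Oid))
        return -1;
    memcpy(&rawsize, tagged, HDRSZ);
    memcpy(&method, tagged + HDRSZ, sizeof(Oid));
    if (rawsize < 0 || rawsize > dcap)
        return -1;
    for (i = 0; i < sizeof(methods) / sizeof(methods[0]); i++)
        if (methods[i].oid == method)
            return methods[i].fn(tagged + HDRSZ + sizeof(Oid),
                                 len - HDRSZ - (int32) sizeof(Oid),
                                 dest, rawsize);
    return -1;
}
```

The on-disk bytes are never modified in this model; the extra 4 bytes exist only in the copy made after the read, which is what keeps the storage format backward compatible.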

P.S. I'd like to thank Petr Korobeinikov <pkorobeinikov@gmail.com>, who started work on this patch and sent me a draft of the proposal in Russian.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
