Compression of tables

From: Thomas Munro
Subject: Compression of tables
Date:
Msg-id: CADLWmXWgpXayQQpL6XDj3i2KvUoPN2sjPbf1xUodzJ4t-+WnVA@mail.gmail.com
Replies: Re: Compression of tables
List: pgsql-hackers
Hi

I have been wondering what the minimum useful heap table compression
system would be for Postgres, in order to reduce the disk footprint of
large, mostly static datasets.  Do you think an approach similar to the
static row-level compression of DB2 could make sense?  I imagine
something like this:

1.  You have a table which already has data in it.

2.  You run a COMPRESS operation, which builds a static dictionary,
and rewrites the whole table with compressed frozen tuples.  Frozen
tuples have CTIDs just like regular tuples, and can be pointed to by
indexes.  They are decompressed on the fly when needed.
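
To make the dictionary idea concrete, here is a toy standalone sketch
(plain C, nothing from the PostgreSQL tree; the dictionary contents,
the token format and all the names are invented for the example) of the
kind of static dictionary coding a COMPRESS pass could apply to each
tuple's data area:

/*
 * Toy sketch of static dictionary row compression -- illustration only,
 * not PostgreSQL code.
 *
 * The idea: a COMPRESS pass scans the existing rows, picks frequently
 * occurring byte strings as dictionary entries, and rewrites each
 * tuple's data area as a stream of two-byte tokens: {0x01, index} for
 * a dictionary hit, {0x00, byte} for a literal.  Decompression just
 * replays the stream against the same static, table-level dictionary.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define DICT_SIZE 4     /* a real dictionary would hold many more entries */

static const char *dictionary[DICT_SIZE] = {
    "application/json",     /* entry 0 */
    "2013-12-",             /* entry 1 */
    "ACTIVE",               /* entry 2 */
    "warehouse-",           /* entry 3 */
};

/* Encode src as a token stream in dst; returns the compressed length. */
static size_t
toy_compress(const char *src, uint8_t *dst)
{
    size_t  srclen = strlen(src);
    size_t  in = 0, out = 0;

    while (in < srclen)
    {
        int matched = -1;

        for (int d = 0; d < DICT_SIZE; d++)
        {
            size_t elen = strlen(dictionary[d]);

            if (strncmp(src + in, dictionary[d], elen) == 0)
            {
                matched = d;
                in += elen;
                break;
            }
        }

        if (matched >= 0)
        {
            dst[out++] = 0x01;              /* dictionary reference */
            dst[out++] = (uint8_t) matched;
        }
        else
        {
            dst[out++] = 0x00;              /* literal byte */
            dst[out++] = (uint8_t) src[in++];
        }
    }
    return out;
}

/* Replay the token stream to recover the original bytes. */
static size_t
toy_decompress(const uint8_t *src, size_t srclen, char *dst)
{
    size_t  in = 0, out = 0;

    while (in < srclen)
    {
        if (src[in] == 0x01)
        {
            const char *entry = dictionary[src[in + 1]];
            size_t      elen = strlen(entry);

            memcpy(dst + out, entry, elen);
            out += elen;
        }
        else
            dst[out++] = (char) src[in + 1];
        in += 2;
    }
    dst[out] = '\0';
    return out;
}

int
main(void)
{
    const char *row = "2013-12-06|ACTIVE|warehouse-17|application/json";
    uint8_t     compressed[256];
    char        roundtrip[256];
    size_t      clen;

    clen = toy_compress(row, compressed);
    toy_decompress(compressed, clen, roundtrip);

    printf("original:   %zu bytes\n", strlen(row));
    printf("compressed: %zu bytes\n", clen);
    printf("round trip: %s\n", roundtrip);
    return 0;
}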

Clearly things get tricky once you need to update rows.  Assume for
simplicity that future UPDATEs and INSERTs produce normal,
non-compressed tuples that would only be compressed by a subsequent
COMPRESS operation.  The question is how to deal with the existing
compressed rows when they are UPDATEd or DELETEd.  Some approaches:

1.  Just don't allow updates of compressed rows (!).

2.  Exclusively lock the whole page when trying to update any
compressed row, while you explode it into regular uncompressed tuples
on new pages which you can work on (!).

3.  Pull the minimum header fields out of the compressed tuples so
that the MVCC machinery can still work, allowing compressed tuples to
be updated.  Perhaps just the t_xmax and t_ctid values (t_xmin ==
frozen is implied), so that a writer can update them.  This means an
overhead of at least 10 bytes per tuple over the compressed size (plus
the item offsets in the page header); see the layout sketch after this
list.

4.  Something far cleverer.
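
For option 3, the uncompressed part of each tuple might look roughly
like this.  It is only a sketch: the typedefs below are simplified
stand-ins for PostgreSQL's real TransactionId and ItemPointerData, and
the compressed tuple struct itself is hypothetical.

#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's TransactionId and ItemPointerData. */
typedef uint32_t TransactionId;

typedef struct ItemPointerData      /* 6 bytes */
{
    uint16_t    ip_blkid_hi;        /* block number, high 16 bits */
    uint16_t    ip_blkid_lo;        /* block number, low 16 bits */
    uint16_t    ip_posid;           /* line pointer number within the block */
} ItemPointerData;

/*
 * Hypothetical on-page form of a compressed frozen tuple.  Only the
 * fields a later writer must be able to set stay uncompressed: t_xmax
 * to mark deletion, t_ctid to chain to a newer (uncompressed) version.
 * t_xmin is implied because every compressed tuple is frozen, and the
 * stored length is already recorded in the page's line pointer, so the
 * fixed overhead is the ~10 bytes mentioned above (plus any alignment
 * padding), compared with the regular 23-byte heap tuple header.
 */
typedef struct CompressedTupleData
{
    TransactionId   t_xmax;         /* deleting/locking xid, 0 while live */
    ItemPointerData t_ctid;         /* replacement tuple after an UPDATE */
    uint8_t         t_compdata[];   /* dictionary-coded user data */
} CompressedTupleData;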

Well, these are straw-man suggestions really and I probably don't
understand enough about PG internals (MVCC and implications for
VACUUM) to be making them.  But I'm curious to know if anyone has
researched something like this.  

For example, I have a system that occupies a couple of TB on disk, but
only a few MB to a few hundred MB change per day, mostly new data
added to an active partition.  I periodically run CLUSTER on any
partition that has pg_stats.correlation < 0.9 (this effectively just
re-CLUSTERs the active one).  At the same time I would COMPRESS, and
the DB could potentially fit on much smaller SSDs.
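
Something like the following is the maintenance pass I have in mind,
written against libpq just to make it concrete.  The connection string,
the partition naming pattern, the recorded_at clustering column and of
course the COMPRESS step itself are all assumptions, not anything that
exists today.

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=bigdata");   /* assumed database */
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /*
     * Find partitions whose clustering column has drifted out of
     * physical order; 0.9 is the threshold from the text, and the
     * table pattern and column name are made-up examples.
     */
    res = PQexec(conn,
                 "SELECT quote_ident(schemaname) || '.' || quote_ident(tablename) "
                 "FROM pg_stats "
                 "WHERE tablename LIKE 'measurements\\_%' "
                 "  AND attname = 'recorded_at' "
                 "  AND correlation < 0.9");

    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        return 1;
    }

    for (int i = 0; i < PQntuples(res); i++)
    {
        char        sql[512];
        PGresult   *cres;

        /* CLUSTER reuses the index set earlier with ALTER TABLE ... CLUSTER ON. */
        snprintf(sql, sizeof(sql), "CLUSTER %s", PQgetvalue(res, i, 0));
        cres = PQexec(conn, sql);
        if (PQresultStatus(cres) != PGRES_COMMAND_OK)
            fprintf(stderr, "CLUSTER failed for %s: %s",
                    PQgetvalue(res, i, 0), PQerrorMessage(conn));
        PQclear(cres);

        /*
         * The hypothetical "COMPRESS <partition>" command would run
         * here, once the partition has gone quiet.
         */
    }

    PQclear(res);
    PQfinish(conn);
    return 0;
}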

Most commercial database systems I encounter these days use
compression of some sort (more sophisticated than the above, with
dynamic dictionaries and sometimes column-based storage, etc.).

Thanks

Thomas
