Thread: Force re-compression with lz4

Force re-compression with lz4

From:
Florents Tselai
Date:
Hello,

I have a table storing mostly text data (40M+ rows) that has pg_total_relation_size ~670GB.
I’ve just upgraded to postgres 14 and I’m now eager to try the new LZ4 compression.

I’ve altered the column to use the new lz4 compression, but that only applies to new rows.

What’s the recommended way of triggering the re-evaluation for pre-existing rows? 

I tried wrapping a function like the following, but apparently each old record retains the compression applied.
text_corpus=(SELECT t.text from ...);

delete from t where id=;

insert into t(id, text) values (id, text_corpus);

For the time being I resorted to preparing an external shell script to execute against the DB, but that's too slow as it moves data in and out of the DB.

Is there a smarter way to do this?

Thanks,
Flo

Re: Force re-compression with lz4

From:
Ron
Date:
On 10/17/21 10:12 AM, Florents Tselai wrote:
Hello,

I have a table storing mostly text data (40M+ rows) that has pg_total_relation_size ~670GB.
I’ve just upgraded to postgres 14 and I’m now eager to try the new LZ4 compression.

I’ve altered the column to use the new lz4 compression, but that only applies to new rows.

What’s the recommended way of triggering the re-evaluation for pre-existing rows? 

I tried wrapping a function like the following, but apparently each old record retains the compression applied.
text_corpus=(SELECT t.text from ...);

delete from t where id=;

insert into t(id, text) values (id, text_corpus);

Because it's all in one transaction?

For the time being I resorted to preparing an external shell script to execute against the DB, but that's too slow as it moves data in and out of the DB.

Is there a smarter way to do this?

Even with in-place compression, you've got to read the uncompressed data.

Does your shell script process one record at a time?  Maybe do ranges:
COPY (SELECT * FROM t WHERE id BETWEEN x AND y) TO '/some/file.csv';
DELETE FROM t WHERE id BETWEEN x AND y;
COPY t FROM '/some/file.csv';
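
For example, one batch as a psql script using client-side \copy (table, column, range and path here are only placeholders; each batch runs in its own transaction, and COPY ... FROM rebuilds the values so they get stored with the column's current compression setting):

-- one batch; names, range and path are examples only
BEGIN;
\copy (SELECT * FROM t WHERE id BETWEEN 1 AND 100000) TO '/tmp/t_batch_1.csv' WITH (FORMAT csv)
DELETE FROM t WHERE id BETWEEN 1 AND 100000;
\copy t FROM '/tmp/t_batch_1.csv' WITH (FORMAT csv)
COMMIT;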

--
Angular momentum makes the world go 'round.

Re: Force re-compression with lz4

From:
Ron
Date:
On 10/17/21 11:36 AM, Ron wrote:
On 10/17/21 10:12 AM, Florents Tselai wrote:
Hello,

I have a table storing mostly text data (40M+ rows) that has pg_total_relation_size ~670GB.
I’ve just upgraded to postgres 14 and I’m now eager to try the new LZ4 compression.

I’ve altered the column to use the new lz4 compression, but that only applies to new rows.

What’s the recommended way of triggering the re-evaluation for pre-existing rows? 

I tried wrapping a function like the following, but apparently each old record retains the compression applied.
text_corpus=(SELECT t.text from ...);

delete from t where id=;

insert into t(id, text) values (id, text_corpus);

Because it's all in one transaction?

For the time being I resorted to preparing an external shell script to execute against the DB, but that's too slow as it moves data in and out of the DB.

Is there a smarter way to do this?

Even with in-place compression, you've got to read the uncompressed data.

Does your shell script process one record at a time?  Maybe do ranges:
COPY (SELECT * FROM t WHERE id BETWEEN x AND y) TO '/some/file.csv';
DELETE FROM t WHERE id BETWEEN x AND y;
I forgot to mention:
VACUUM t;
COPY t FROM '/some/file.csv';

--
Angular momentum makes the world go 'round.

Re: Force re-compression with lz4

From:
Magnus Hagander
Date:
On Sun, Oct 17, 2021 at 5:12 PM Florents Tselai <florents.tselai@gmail.com> wrote:
Hello,

I have a table storing mostly text data (40M+ rows) that has pg_total_relation_size ~670GB.
I’ve just upgraded to postgres 14 and I’m now eager to try the new LZ4 compression.

I’ve altered the column to use the new lz4 compression, but that only applies to new rows.

What’s the recommended way of triggering the re-evaluation for pre-existing rows? 

I tried wrapping a function like the following, but apparently each old record retains the compression applied.
text_corpus=(SELECT t.text from ...);

delete from t where id=;

insert into t(id, text) values (id, text_corpus);

For the time being I resorted to preparing an external shell script to execute against the DB, but that's too slow as it moves data in and out of the DB.

Is there a smarter way to do this?


It should be enough to VACUUM FULL the table (but it has to be VACUUM FULL, not a regular vacuum). Or CLUSTER.

--

Re: Force re-compression with lz4

From:
Adrian Klaver
Date:
On 10/17/21 10:17, Magnus Hagander wrote:
> On Sun, Oct 17, 2021 at 5:12 PM Florents Tselai 
> <florents.tselai@gmail.com> wrote:

> 
>     Is there a smarter way to do this ?
> 
> 
> It should be enough to VACUUM FULL the table. (but it has to be VACUUM 
> FULL, not a regular vacuum). Or CLUSTER.

With the proviso that this will require double the existing space, 
~670GB, until the operation is completed.

> 
> -- 
>   Magnus Hagander
>   Me: https://www.hagander.net/
>   Work: https://www.redpill-linpro.com/


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Force re-compression with lz4

From:
Florents Tselai
Date:
Yes, that COPY-delete-COPY sequence is what I ended up doing.
Unfortunately I can't use ranges, as the PK is a text string.

On 17 Oct 2021, at 7:36 PM, Ron <ronljohnsonjr@gmail.com> wrote:

On 10/17/21 10:12 AM, Florents Tselai wrote:
Hello,

I have a table storing mostly text data (40M+ rows) that has pg_total_relation_size ~670GB.
I’ve just upgraded to postgres 14 and I’m now eager to try the new LZ4 compression.

I’ve altered the column to use the new lz4 compression, but that only applies to new rows.

What’s the recommended way of triggering the re-evaluation for pre-existing rows? 

I tried wrapping a function like the following, but apparently each old record retains the compression applied.
text_corpus=(SELECT t.text from ...);

delete from t where id=;

insert into t(id, text) values (id, text_corpus);

Because it's all in one transaction?

For the time being I resorted to preparing an external shell script to execute against the DB, but that's too slow as it moves data in and out of the DB.

Is there a smarter way to do this?

Even with in-place compression, you've got to read the uncompressed data.

Does your shell script process one record at a time?  Maybe do ranges:
COPY (SELECT * FROM t WHERE id BETWEEN x AND y) TO '/some/file.csv';
DELETE FROM t WHERE id BETWEEN x AND y;
COPY t FROM '/some/file.csv';

--
Angular momentum makes the world go 'round.

Re: Force re-compression with lz4

From:
Florents Tselai
Date:
I did look into VACUUM (FULL) for its PROCESS_TOAST option, which makes sense, but the thing is I already had a cron-ed
VACUUM (FULL) which I ended up disabling a while back, exactly because of the double-space requirement.
The DB is already around 1TB in size and occupying another ~600GB would require some hassle. Thus, the external script approach
makes more sense.


> On 17 Oct 2021, at 8:28 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>
> On 10/17/21 10:17, Magnus Hagander wrote:
>> On Sun, Oct 17, 2021 at 5:12 PM Florents Tselai <florents.tselai@gmail.com> wrote:
>
>>    Is there a smarter way to do this ?
>> It should be enough to VACUUM FULL the table. (but it has to be VACUUM FULL, not a regular vacuum). Or CLUSTER.
>
> With the proviso that this will require double the existing space, ~670GB, until the operation is completed.
>
>> --
>>  Magnus Hagander
>>  Me: https://www.hagander.net/
>>  Work: https://www.redpill-linpro.com/
>
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com




Re: Force re-compression with lz4

From:
"Daniel Verite"
Date:
    Florents Tselai wrote:

> I have a table storing mostly text data (40M+ rows) that has
> pg_total_relation_size ~670GB.
> I’ve just upgraded to postgres 14 and I’m now eager to try the new LZ4
> compression.

You could start experimenting with data samples rather than the
full contents.
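
For instance, something along these lines as a rough, untested sketch (all object names are made up; the || '' is meant to force the sampled values to be rebuilt so they get stored with each sample table's own compression setting, rather than carried over as-is with their old compression):

-- compare compression on a ~1% sample of the real data; names are hypothetical
CREATE TABLE sample_pglz (body text COMPRESSION pglz);
CREATE TABLE sample_lz4  (body text COMPRESSION lz4);
INSERT INTO sample_pglz SELECT t.text || '' FROM t TABLESAMPLE SYSTEM (1);
INSERT INTO sample_lz4  SELECT t.text || '' FROM t TABLESAMPLE SYSTEM (1);
-- sanity check: which method was actually applied to the stored values
SELECT pg_column_compression(body), count(*) FROM sample_lz4 GROUP BY 1;
-- then compare the resulting on-disk sizes
SELECT pg_size_pretty(pg_total_relation_size('sample_pglz')) AS pglz_size,
       pg_size_pretty(pg_total_relation_size('sample_lz4'))  AS lz4_size;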

FWIW, in my case I've found that the compression ratio of lz4 was only
marginally better than pglz (like 2% on text).
As for decompression time, it doesn't seem to differ significantly
from pglz, so overall, recompressing existing data did not seem
worth the trouble.

However lz4 appears to be much faster to compress than pglz, so its
benefit is clear in terms of CPU usage for future insertions.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite



Re: Force re-compression with lz4

From:
Michael Paquier
Date:
On Sun, Oct 17, 2021 at 10:13:48PM +0300, Florents Tselai wrote:
> I did look into VACUUM(full) for it’s PROCESS_TOAST option which
> makes sense, but the thing is I already had a cron-ed VACUUM (full)
> which I ended up disabling a while back; exactly because of the
> double-space requirement.

Please note that VACUUM FULL does not enforce a recompression on
existing values.  See commit dbab0c0, that disabled this choice as it
introduced a noticeable performance penalty in some cases when
looking at the compression type of the vacuumed table attributes:
=# CREATE TABLE cmdata(f1 text COMPRESSION pglz);
CREATE TABLE
=# INSERT INTO cmdata VALUES(repeat('1234567890', 1000));
INSERT 0 1
=# SELECT pg_column_compression(f1) FROM cmdata;
 pg_column_compression
-----------------------
  pglz
(1 row)
=# ALTER TABLE cmdata ALTER COLUMN f1 SET COMPRESSION lz4;
ALTER TABLE
=# VACUUM FULL cmdata;
VACUUM
=# SELECT pg_column_compression(f1) FROM cmdata;
 pg_column_compression
-----------------------
  pglz
(1 row)
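
If a manual rewrite is acceptable, an untested sketch of forcing it, continuing the example above (the idea is that || '' builds a brand-new datum, which should then be compressed with the column's current setting; verify with pg_column_compression before relying on this):

-- untested sketch: rebuild each value so it is re-toasted with the
-- column's current compression setting
UPDATE cmdata SET f1 = f1 || '';
VACUUM cmdata;                                  -- reclaim the old TOAST entries
SELECT pg_column_compression(f1) FROM cmdata;   -- verify the result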
--
Michael


Re: Force re-compression with lz4

From:
Francisco Olarte
Date:
On Sun, 17 Oct 2021 at 21:04, Florents Tselai <florents.tselai@gmail.com> wrote:
> Yes, That COPY-delete-COPY sequence is what I ended up doing.
> Unfortunately can’t use ranges as the PK its a text string.

Unless you have a really weird PK and have trouble calculating bounds,
text strings are sortable and fine to use as ranges. If you use
half-open intervals you do not even need to know them all in advance (i.e.,
do a batch for key < 'aa', then one for >= 'aa' and < 'ab', repeat, and
terminate with a >= 'zz' batch; substitute letter pairs for whatever you
think will partition your keys into adequate chunks). You can find adequate
bounds by scanning the PK index and skipping ahead; just remember to use
half-open intervals and cover the whole key domain.

Francisco Olarte.



Re: Force re-compression with lz4

From:
Florents Tselai
Date:
Oh, that's good to know then. So besides ALTER COMPRESSION for future inserts there's not much one can do for
pre-existing values.

I think it makes sense to update/add more info to the docs on this as well, since other people in the thread expected
this to work that way too.
Maybe at some point, even allow an explicit option to be defined during VACUUM?


> On 18 Oct 2021, at 8:18 AM, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sun, Oct 17, 2021 at 10:13:48PM +0300, Florents Tselai wrote:
>> I did look into VACUUM(full) for it’s PROCESS_TOAST option which
>> makes sense, but the thing is I already had a cron-ed VACUUM (full)
>> which I ended up disabling a while back; exactly because of the
>> double-space requirement.
>
> Please note that VACUUM FULL does not enforce a recompression on
> existing values.  See commit dbab0c0, that disabled this choice as it
> introduced a noticeable performance penalty in some cases when
> looking at the compression type of the vacuumed table attributes:
> =# CREATE TABLE cmdata(f1 text COMPRESSION pglz);
> CREATE TABLE
> =# INSERT INTO cmdata VALUES(repeat('1234567890', 1000));
> INSERT 0 1
> =# SELECT pg_column_compression(f1) FROM cmdata;
> pg_column_compression
> -----------------------
>  pglz
> (1 row)
> =# ALTER TABLE cmdata ALTER COLUMN f1 SET COMPRESSION lz4;
> ALTER TABLE
> =# VACUUM FULL cmdata;
> VACUUM
> =# SELECT pg_column_compression(f1) FROM cmdata;
> pg_column_compression
> -----------------------
>  pglz
> (1 row)
> --
> Michael




Re: Force re-compression with lz4

From:
Michael Paquier
Date:
On Mon, Oct 18, 2021 at 09:57:11AM +0300, Florents Tselai wrote:
> Oh, that’s good to know then. So besides ALTER COMPRESSION for
> future inserts there’s not much one can do for pre-existing values

The posting style of the mailing list is to not top-post, so if you
could avoid breaking the logic of the thread, that would be nice :)

> I think it makes sense to update/ add more info to the docs on this
> as well, since other people in the thread expected this to work that
> way too.

There is some documentation, as changing the compression for an
existing table is part of ALTER TABLE:
https://www.postgresql.org/docs/current/sql-altertable.html

"This does not cause the table to be rewritten, so existing data may
still be compressed with other compression methods. If the table is
restored with pg_restore, then all values are rewritten with the
configured compression method."
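
So, if some downtime and an extra copy of the data are acceptable, a dump and reload of just that table is another way to rewrite everything; a rough, untested sketch (database and file names are examples, and any FKs, triggers or dependent objects would need extra care):

# dump only the big table, then reload it so every value is re-toasted
# with the column's current compression setting
pg_dump -Fc -t t -f t.dump mydb
psql -d mydb -c 'TRUNCATE TABLE t;'
pg_restore -d mydb --data-only -t t t.dump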

> Maybe at some point, even allow an explicit option to be defined during VACUUM ?

That's a part where we disagreed, as it should not be VACUUM's work to
do that.  The option would have limited impact, as it concerns users
doing a one-time operation, most likely as part of an upgrade, so
I don't think this would be worth adding anyway.
--
Michael


Re: Force re-compression with lz4

From:
Magnus Hagander
Date:


On Mon, Oct 18, 2021 at 7:18 AM Michael Paquier <michael@paquier.xyz> wrote:
On Sun, Oct 17, 2021 at 10:13:48PM +0300, Florents Tselai wrote:
> I did look into VACUUM(full) for it’s PROCESS_TOAST option which
> makes sense, but the thing is I already had a cron-ed VACUUM (full)
> which I ended up disabling a while back; exactly because of the
> double-space requirement.

Please note that VACUUM FULL does not enforce a recompression on
existing values.  See commit dbab0c0, that disabled this choice as it
introduced a noticeable performance penalty in some cases when
looking at the compression type of the vacuumed table attributes:

Oh dang, I missed that this was reverted. Thanks for pointing that out!

--

Re: Force re-compression with lz4

From:
Mladen Gogala
Date:
On 10/18/21 01:07, Michael Paquier wrote:
> CPU-speaking, LZ4 is *much* faster than pglz when it comes to
> compression or decompression with its default options.  The
> compression ratio is comparable between both, still LZ4 compresses in
> average less than PGLZ.
> --
> Michael

LZ4 works much better with deduplication tools like Data Domain or Data 
Domain Boost (client side deduplication). With zip or gzip compression, 
deduplication ratios are much lower than with LZ4. Most of the modern 
backup tools (DD, Veeam, Rubrik, Commvault) support deduplication. LZ4 
algorithm uses less CPU than zip, gzip or bzip2 and works much better 
with deduplication algorithms employed by the backup tools. This is 
actually a very big and positive change.

Disclosure:

I used to work for Commvault as a senior PS engineer. Commvault was the 
first tool on the market to combine LZ4 and deduplication.

Regards


-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com




Re: Force re-compression with lz4

From:
Adrian Klaver
Date:
On 10/18/21 06:41, Mladen Gogala wrote:
> 
> On 10/18/21 01:07, Michael Paquier wrote:
>> CPU-speaking, LZ4 is *much* faster than pglz when it comes to
>> compression or decompression with its default options.  The
>> compression ratio is comparable between both, still LZ4 compresses in
>> average less than PGLZ.
>> -- 
>> Michael
> 
> LZ4 works much better with deduplication tools like Data Domain or Data 
> Domain Boost (client side deduplication). With zip or gzip compression, 
> deduplication ratios are much lower than with LZ4. Most of the modern 
> backup tools (DD, Veeam, Rubrik, Commvault) support deduplication. LZ4 
> algorithm uses less CPU than zip, gzip or bzip2 and works much better 
> with deduplication algorithms employed by the backup tools. This is 
> actually a very big and positive change.

Not sure how much this applies to the Postgres usage of lz4. As I 
understand it, this is only used internally for table compression. When 
using pg_dump compression, gzip is used. Unless you pipe plain text 
output through some other program.
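
For example, something like this, assuming the lz4 command-line tool is installed (database and file names are examples):

# compress a plain-format dump with the lz4 CLI instead of gzip
pg_dump mydb | lz4 > mydb.sql.lz4
# and restore it later
lz4 -dc mydb.sql.lz4 | psql -d mydb_restored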

> 
> Disclosure:
> 
> I used to work for Commvault as a senior PS engineer. Commvault was the 
> first tool on the market to combine LZ4 and deduplication.
> 
> Regards
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Force re-compression with lz4

From:
Mladen Gogala
Date:
On 10/18/21 11:01, Adrian Klaver wrote:
> Not sure how much this applies to the Postgres usage of lz4. As I 
> understand it, this is only used internally for table compression. 
> When using pg_dump compression, gzip is used. Unless you pipe plain 
> text output through some other program. 


This applies when using a 3rd party commercial backup tool with 
deduplication. You'd be surprised how many people do that. One tool to 
backup them all, one tool to find them and on the LTO cartridge backup 
them. I apologize for this cheesy paraphrase of Tolkien, but I couldn't 
resist. Long story short, for the 3rd party backup tools LZ4 
compression will yield better deduplication ratios than other forms of 
compression, thereby saving you space.

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com