Re: [PATCH] Compression dictionaries for JSONB

From: Nikita Malakhov
Subject: Re: [PATCH] Compression dictionaries for JSONB
Msg-id CAN-LCVMg6ntnrjWFbHnuWEAMiJa_07+3bgHyaLApJu_igw9Y4w@mail.gmail.com
In reply to: Re: [PATCH] Compression dictionaries for JSONB  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

On updating a dictionary:

>You cannot "just" refresh a dictionary used once to compress an
>object, because you need it to decompress the object too.

And when you have many such objects, updating an existing dictionary
requires going through every object compressed with it in the whole
database. How to implement this feature correctly is a very tricky question.
Also, there are some thoughts on using JSON schema to optimize
storage for JSON objects.
(The same applies to TOAST, so as a first approach we've decided
to forbid dropping or changing TOAST implementations already
registered in a particular database.)
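
To make the refresh problem concrete, here is a minimal sketch using Python's
zlib preset dictionaries as a stand-in for whatever dictionary format the patch
ends up with. The dictionary contents and the helper names are made up for
illustration. An object compressed with one dictionary cannot be read back once
that dictionary has been replaced by a different one:

```python
import zlib

def compress_with_dict(payload: bytes, dictionary: bytes) -> bytes:
    # zdict is zlib's preset dictionary; it is needed again on decompression.
    c = zlib.compressobj(zdict=dictionary)
    return c.compress(payload) + c.flush()

def decompress_with_dict(blob: bytes, dictionary: bytes) -> bytes:
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(blob) + d.flush()

# Hypothetical dictionaries: v2 stands for a "refreshed" dictionary.
dict_v1 = b'{"user_id": "status": "created_at": '
dict_v2 = b'{"order_id": "amount": "currency": '

doc = b'{"user_id": 42, "status": "active", "created_at": "2023-02-06"}'
stored = compress_with_dict(doc, dict_v1)

# Reading with the original dictionary works.
assert decompress_with_dict(stored, dict_v1) == doc

# Reading with the refreshed dictionary fails.
try:
    decompress_with_dict(stored, dict_v2)
    refreshed_ok = True
except zlib.error:
    refreshed_ok = False

print("readable after refresh:", refreshed_ok)
```

So either the old dictionary is kept around for as long as any object refers to
it, or every such object has to be found and recompressed.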

In my experience, even today with fast SSD storage arrays, on a
large database (about 40-50 TB) disk access was the bottleneck
more often than CPU, except in cases with a lot of parallel
execution threads for a single query (Oracle).

On Mon, Feb 6, 2023 at 10:33 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2023-02-06 16:16:41 +0100, Matthias van de Meent wrote:
> On Mon, 6 Feb 2023 at 15:03, Aleksander Alekseev
> <aleksander@timescale.com> wrote:
> >
> > Hi,
> >
> > I see your point regarding the fact that creating dictionaries on a
> > training set is too beneficial to neglect it. Can't argue with this.
> >
> > What puzzles me though is: what prevents us from doing this on a page
> > level as suggested previously?
>
> The complexity of page-level compression is significant, as pages are
> currently a base primitive of our persistency and consistency scheme.

+many

It's also not all a panacea performance-wise, datum-level decompression can
often be deferred much longer than page level decompression. For things like
json[b], you'd hopefully normally have some "pre-filtering" based on proper
columns, before you need to dig into the json datum.

It's also not necessarily that good, compression ratio wise. Particularly for
wider datums you're not going to be able to remove much duplication, because
there's only a handful of tuples. Consider the case of json keys - the
dictionary will often do better than page level compression, because it'll
have the common keys in the dictionary, which means the "full" keys never will
have to appear on a page, whereas page-level compression will have the keys on
it, at least once.
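
A rough way to try the json-keys argument is to compress a handful of documents
three ways: each one alone, each one with a shared key dictionary, and all of
them together as a stand-in for a page. This is only a sketch with zlib and
made-up keys and values; real results depend heavily on the data and on the
compressor, so no particular numbers are claimed here.

```python
import json
import zlib

# Hypothetical keys shared by every document in the column.
KEYS = ["customer_id", "shipping_address", "billing_address",
        "order_status", "created_at"]

# Hypothetical dictionary: the keys exactly as json.dumps serializes them.
shared_dict = "".join(f'"{k}": ' for k in KEYS).encode()

def make_doc(i: int) -> bytes:
    values = [i, f"{i} Main Street", f"{i * 7} Side Road",
              "shipped" if i % 2 else "pending", f"2023-02-{i + 1:02d}"]
    return json.dumps(dict(zip(KEYS, values))).encode()

def deflate(payload: bytes, zdict: bytes = b"") -> bytes:
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return c.compress(payload) + c.flush()

docs = [make_doc(i) for i in range(1, 5)]

raw_total = sum(len(d) for d in docs)
plain_total = sum(len(deflate(d)) for d in docs)              # per datum, no dictionary
dict_total = sum(len(deflate(d, shared_dict)) for d in docs)  # per datum, shared dictionary
page_total = len(deflate(b"".join(docs)))                     # all datums compressed together

print("raw:", raw_total)
print("per-datum, no dictionary:", plain_total)
print("per-datum, shared dictionary:", dict_total)
print("page-level, no dictionary:", page_total)
```

With the dictionary, the keys never have to be spelled out in the compressed
datum. In the page-level case they have to appear in full at least once.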

Of course you can use a dictionary for page-level compression too, but the
gains when it works well will often be limited, because in most OLTP usable
page-compression schemes I'm aware of, you can't compress a page all that far
down, because you need a small number of possible "compressed page sizes".
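
As a rough illustration of the "small number of compressed page sizes"
constraint: if a compressed page has to be stored in one of a few fixed slot
sizes, anything that lands between two slots rounds up to the larger one. The
8 kB page size and the slot sizes below are assumptions for illustration, not
taken from any particular implementation.

```python
PAGE_SIZE = 8192
# Hypothetical set of sizes a compressed page may occupy on disk.
ALLOWED_SIZES = (2048, 4096, 8192)

def stored_size(compressed_len: int) -> int:
    """Return the on-disk size for a page that compressed to compressed_len bytes."""
    for size in ALLOWED_SIZES:
        if compressed_len <= size:
            return size
    # Compression did not help; the page is stored uncompressed.
    return PAGE_SIZE

for n in (500, 2049, 3000, 4097):
    print(n, "->", stored_size(n))
```

A page that compresses to 2049 bytes occupies the same slot as one that
compresses to 4096 bytes, so a better dictionary only pays off when it pushes
the page under the next slot boundary.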


> > More similar data you compress the more space and disk I/O you save.
> > Additionally you don't have to compress/decompress the data every time
> > you access it. Everything that's in shared buffers is uncompressed.
> > Not to mention the fact that you don't care what's in pg_attribute,
> > the fact that schema may change, etc. There is a table and a
> > dictionary for this table that you refresh from time to time. Very
> > simple.
>
> You cannot "just" refresh a dictionary used once to compress an
> object, because you need it to decompress the object too.

Right. That's what I was trying to refer to when mentioning that we might need
to add a bit of additional information to the varlena header for datums
compressed with a dictionary.
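
One way to picture the "additional information in the header" idea is to tag
each compressed datum with the identifier of the dictionary it was compressed
with, and to treat registered dictionaries as immutable. The sketch below uses
zlib and a 4-byte prefix purely for illustration. The actual varlena layout is
not specified in this thread, and all names here are hypothetical.

```python
import struct
import zlib

# Hypothetical registry: dictionary id -> dictionary bytes, never modified.
dictionaries: dict[int, bytes] = {}

def register_dictionary(dict_id: int, data: bytes) -> None:
    if dict_id in dictionaries:
        raise ValueError("registered dictionaries are immutable")
    dictionaries[dict_id] = data

def compress_datum(payload: bytes, dict_id: int) -> bytes:
    c = zlib.compressobj(zdict=dictionaries[dict_id])
    # Assumed layout: 4-byte little-endian dictionary id, then compressed data.
    return struct.pack("<I", dict_id) + c.compress(payload) + c.flush()

def decompress_datum(datum: bytes) -> bytes:
    (dict_id,) = struct.unpack_from("<I", datum)
    d = zlib.decompressobj(zdict=dictionaries[dict_id])
    return d.decompress(datum[4:]) + d.flush()

register_dictionary(1, b'"user_id": "status": ')
old = compress_datum(b'{"user_id": 1, "status": "active"}', 1)

# "Refreshing" means registering a new dictionary, not replacing the old one.
register_dictionary(2, b'"user_id": "status": "last_login": ')
new = compress_datum(b'{"user_id": 2, "status": "idle", "last_login": null}', 2)

print(decompress_datum(old))
print(decompress_datum(new))
```

Old datums keep pointing at dictionary 1 and stay readable. Only newly written
datums use dictionary 2.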

Greetings,

Andres Freund


--
Regards,
Nikita Malakhov
Postgres Professional
