Re: [PATCH] Compression dictionaries for JSONB

From: Jacob Champion
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date:
Msg-id: CAAWbhmgtHaVdh4cDHcDWOd7m4KL6RfBcdtYgy5WDSKN=Sucb5w@mail.gmail.com
In reply to: [PATCH] Compression dictionaries for JSONB  (Aleksander Alekseev <aleksander@timescale.com>)
Responses: Re: [PATCH] Compression dictionaries for JSONB  (Aleksander Alekseev <aleksander@timescale.com>)
List: pgsql-hackers
On Wed, Jun 1, 2022 at 1:44 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
> This is a follow-up thread to `RFC: compression dictionaries for JSONB` [1]. I would like to share my current
> progress in order to get early feedback. The patch is currently in a draft state but implements the basic
> functionality. I did my best to account for all the great feedback I previously got from Alvaro and Matthias.

I'm coming up to speed with this set of threads -- the following is
not a complete review by any means, and please let me know if I've
missed some of the history.

> SELECT * FROM pg_dict;
>   oid  | dicttypid | dictentry
> -------+-----------+-----------
>  39476 |     39475 | aaa
>  39477 |     39475 | bbb
> (2 rows)

I saw there was some previous discussion about dictionary size. It
looks like this approach would put all dictionaries into a shared OID
pool. Since I don't know what a "standard" use case is, is there any
risk of OID exhaustion for larger deployments with many dictionaries?
Or is 2**32 so comparatively large that it's not really a serious
concern?
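
(Back of the envelope, with numbers I'm inventing purely for scale: ten
thousand dictionaries of a hundred thousand entries each would consume
10^9 OIDs, roughly a quarter of the 2^32 space -- large, but maybe not
immediately alarming. I may well be missing other costs of a crowded
OID space, though.)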

> When `mydict` sees 'aaa' in the document, it replaces it with the corresponding code, in this case - 39476. For
> more details regarding the compression algorithm and chosen compromises please see the comments in the patch.

I see the algorithm description, but I'm curious to know whether it's
based on some other existing compression scheme, for the sake of
comparison. It seems like it shares similarities with the Snappy
scheme?

Could you talk more about what the expected ratios and runtime
characteristics are? Best I can see is that compression runtime is
something like O(n * e * log d) where n is the length of the input, e
is the maximum length of a dictionary entry, and d is the number of
entries in the dictionary. Since e and d are constant for a given
static dictionary, how well the dictionary is constructed is
presumably important.
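
To check my understanding, here is a standalone toy version of how I
imagine the inner loop works -- the sorted entry array, the binary
search, and the longest-match-first ordering are my assumptions, not
necessarily what the patch does -- and it is where the O(n * e * log d)
figure above comes from:

    #include <stdio.h>
    #include <string.h>

    /* toy dictionary, sorted; an entry's index doubles as its code */
    static const char *dict[] = {"aaa", "bbb"};
    #define NDICT   (sizeof(dict) / sizeof(dict[0]))
    #define MAXLEN  3               /* length of the longest entry */

    /* binary search for an exact entry of length `len`; -1 if absent */
    static int
    dict_lookup(const char *s, size_t len)
    {
        int     lo = 0,
                hi = (int) NDICT - 1;

        while (lo <= hi)
        {
            int     mid = (lo + hi) / 2;
            int     cmp = strncmp(s, dict[mid], len);

            if (cmp == 0 && dict[mid][len] == '\0')
                return mid;
            if (cmp > 0)
                lo = mid + 1;
            else
                hi = mid - 1;       /* smaller, or entry is longer */
        }
        return -1;
    }

    static void
    compress(const char *in)
    {
        size_t  n = strlen(in),
                i = 0;

        while (i < n)
        {
            int     code = -1;
            size_t  len;

            /* longest candidate first: O(e * log d) per position */
            for (len = MAXLEN; len >= 1 && code < 0; len--)
                if (i + len <= n)
                    code = dict_lookup(in + i, len);

            if (code >= 0)
            {
                printf("[%d]", code);   /* emit the dictionary code */
                i += strlen(dict[code]);
            }
            else
                putchar(in[i++]);       /* emit a literal byte */
        }
        putchar('\n');
    }

    int
    main(void)
    {
        compress("xxaaayybbb");         /* prints: xx[0]yy[1] */
        return 0;
    }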

> In pg_type `mydict` has typtype = TYPTYPE_DICT. It works the same way as TYPTYPE_BASE with one difference: the
> corresponding `<type>_in` (pg_type.typinput) and `<another-type>_<type>` (pg_cast.castfunc) procedures receive
> the dictionary Oid as a `typmod` argument. This way the procedures can distinguish `mydict1` from `mydict2` and
> use the proper compression dictionary.
>
> The approach with the alternative `typmod` role is arguably a bit hacky, but it was the least invasive way to
> implement the feature I've found. I'm open to alternative suggestions.

Haven't looked at this closely enough to develop an opinion yet.
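
(For anyone else catching up, the convention being overloaded looks
roughly like the sketch below, as I read it. `mydict_in` and its body
are made up by me, but the (cstring, Oid, int32 typmod) input-function
signature is the standard one.)

    #include "postgres.h"
    #include "fmgr.h"

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(mydict_in);

    /* Sketch of an input function under the proposed convention: the
     * third argument, which ordinarily carries the typmod, carries the
     * dictionary's OID instead. */
    Datum
    mydict_in(PG_FUNCTION_ARGS)
    {
        char   *str = PG_GETARG_CSTRING(0);
        Oid     typioparam = PG_GETARG_OID(1);
        int32   dict_oid = PG_GETARG_INT32(2);  /* overloaded typmod slot */

        /* parse `str` as jsonb, look up the dictionary identified by
         * dict_oid, substitute codes for known entries, and return the
         * compressed datum -- elided here */
        elog(ERROR, "sketch only: str=%s typioparam=%u dict=%d",
             str, typioparam, dict_oid);
        PG_RETURN_NULL();           /* not reached */
    }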

> Current limitations (todo):
> - ALTER TYPE is not implemented

That reminds me. How do people expect to generate a "good" dictionary
in practice? Would they somehow get the JSONB representations out of
Postgres and run a training program over the blobs? I see some
reference to training functions in the prior threads but don't see any
breadcrumbs in the code.

> - Alternative compression algorithms. Note that this will not require any further changes in the catalog, only
> the values we write to pg_type and pg_cast will differ.

Could you expand on this? I.e. why would alternative algorithms not
need catalog changes? It seems like the only schemes that could be
used with pg_catalog.pg_dict are those that expect to map a byte
string to a number. Is that general enough to cover other standard
compression algorithms?

> Open questions:
> - Dictionary entries are currently stored as NameData, the same type that is used for enums. Are we OK with the
> accompanying limitations? Any alternative suggestions?

It does feel a little weird to have a hard limit on the entry length,
since that also limits the compression ratio. But it also limits the
compression runtime, so maybe it's a worthwhile tradeoff.

It also seems strange to use a dictionary of C strings to compress
binary data; wouldn't we want to be able to compress zero bytes too?
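
For reference, this is the shape of the limitation as I understand it
(NameData copied from my memory of the core headers, with the default
NAMEDATALEN):

    /* NameData is a fixed-size, NUL-terminated buffer, so a dictionary
     * entry stored this way can hold at most NAMEDATALEN - 1 = 63 bytes
     * and can never contain an embedded zero byte. */
    #define NAMEDATALEN 64

    typedef struct nameData
    {
        char        data[NAMEDATALEN];
    } NameData;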

Hope this helps,
--Jacob


