Re: RFC: compression dictionaries for JSONB

From: Matthias van de Meent
Subject: Re: RFC: compression dictionaries for JSONB
Date:
Msg-id: CAEze2WjkH=s-2b1R=GqTvfe5caO9ds0UwE9Jf5W-hbsm-kMmCA@mail.gmail.com
In reply to: RFC: compression dictionaries for JSONB  (Aleksander Alekseev <aleksander@timescale.com>)
Responses: Re: RFC: compression dictionaries for JSONB  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List: pgsql-hackers
On Fri, 8 Oct 2021 at 11:47, Aleksander Alekseev
<aleksander@timescale.com> wrote:
> This is a follow-up thread to `Add ZSON extension to /contrib/` [1].
> The ZSON extension introduces a new type called ZSON, which is 100%
> compatible with JSONB but uses a shared dictionary of strings most
> frequently used by given JSONB documents for compression. See the
> thread for more details.

Great to see that you're still working on this! It would be nice if
we could get this into Postgres. As such, I hope you can provide some
clarifications on my questions and comments.

> According to the feedback I got, the community generally liked the
> idea of adding an across-rows and across-tables compression capability
> to JSONB. What the community didn't like was:
>
> 1. Introducing a new data type in order to achieve this;
> 2. Updating compression dictionaries manually;

Well, I for one would like the ability to manually add entries to the
dictionary. What I'm not interested in is being required to manually
update the dictionary; the ability to manually insert into the
dictionary, however, is much appreciated.
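
To make that concrete, here is a minimal sketch of such manual
maintenance, using the hypothetical ALTER TYPE spelling I suggest
further below:

```
-- Hypothetical syntax: explicitly register a frequently-used key so
-- it is compressed from the start, without waiting for an automatic
-- dictionary update.
ALTER TYPE mydict ADD ENTRY '"customer_id"';
```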

> 3. Some implementation details of ZSON, such as limited dictionary
> size (2 ** 16 entries) and an extensive usage of gettimeofday() system
> call;
>
> There was also a request for proof of the usefulness of this feature
> in practice.

More compact JSONB is never a bad idea: one reason to stick with JSON
over JSONB is that JSON, if stored properly, can use significantly
less space than JSONB. So, improving the disk usage of JSONB is
definitely worthwhile.

>
> == Proposal ==
>
> The proposal is to add the support of compression dictionaries to JSONB.
>
> In order to do this, the SQL syntax should be modified. The proposed
> syntax is based on Matthias van de Meent's idea [6]:

Seems fine.

> ```
> CREATE TYPE <type-name> AS JSONB_DICTIONARY (
>   learn_by = { {"table_a", "column_a"}, {"table_b", "column_b"}, ... },

I'm having trouble understanding how this learn_by field would be used:

- If stored as strings, the table and column names would go out of
  date when tables or columns are renamed or dropped.
- Similarly, you'd want to update the dictionary with common values
  from columns of that type; generally not from columns of arbitrary
  other types.
- You can't know the names of tables and columns in advance, so this
  would put a maintenance burden on the user whenever they add /
  change / remove a column of the dictionary type.

Instead of storing 'update using data from table X, column Y' in the
type, I think that adding it as a column option would be the better
choice; see the sketch below.
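
As a rough illustration of that alternative, a column option could
look something like this (the option name dictionary_learn is made up
for the sake of the example):

```
-- Hypothetical column option: mark this column as a source of
-- dictionary training data, instead of listing table/column pairs
-- inside the type definition itself.
ALTER TABLE table_a
    ALTER COLUMN column_a SET (dictionary_learn = true);
```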

I agree with an option for auto-update, though I don't think we have
enough information to determine the default value (I'd err on the side
of caution, going with 'off').

>   autoupdate = false, -- true by default
>   -- optional: other arguments, min/max string lengths, etc
> );
> ```

For dump/restore I think it would be very useful to allow export &
import of these dictionaries, so that restored databases don't have
the problem of starting cold.

As such, `ALTER TYPE jsondict ADD ENTRY entry_value` would probably
be useful, and maybe even `CREATE TYPE dict AS JSONB_DICTIONARY
('"entry_1"', '"entry_two"', '"entry_three"') WITH (option =
optional)`.
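
Combined, those two pieces of syntax would let pg_dump emit something
like the following, so that the restored database starts with a warm
dictionary (all syntax here is the hypothetical spelling from above):

```
-- Hypothetical dump output: recreate the type, then replay its
-- dictionary entries so the restored database starts 'warm'.
CREATE TYPE mydict AS JSONB_DICTIONARY (autoupdate = false);
ALTER TYPE mydict ADD ENTRY '"entry_1"';
ALTER TYPE mydict ADD ENTRY '"entry_two"';
ALTER TYPE mydict ADD ENTRY '"entry_three"';
```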

> Basically, this is an equivalent of zson_learn [7]. It will create an
> id -> string dictionary in the PostgreSQL catalog. When the user
> chooses `autoupdate = true`, the dictionary will be updated
> automatically by PostgreSQL (e.g. during the VACUUM). This is the
> default value. The dictionary can also be updated manually:
>
> ```
> SELECT jsonb_update_dictionary("type-name");
> ```

I'm a bit on the fence about this. We do use function-style
interfaces for sequences (setval() and friends), but alternatively we
might want to use ALTER TYPE jsondict for this.
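
Side by side, the two styles would look roughly like this (the ALTER
TYPE spelling below is hypothetical):

```
-- Function-style, as proposed; similar to setval() for sequences:
SELECT jsonb_update_dictionary('mydict');

-- Hypothetical DDL-style alternative:
ALTER TYPE mydict UPDATE DICTIONARY;
```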

> Other than that, the type works like a regular one. All the usual
> ALTER TYPE / DROP TYPE semantics are applicable. All the operators
> available to JSONB are also available to <type-name>.
>
> Internally <type-name> is represented similar to JSONB. However, the
> strings from the dictionary are replaced with varints.

How do you propose to differentiate actual integers from these keyed
strings, and / or actual strings from varints? Replacing _all_ strings
doesn't seem like such a great idea.

Related comment below.

> This idea was
> borrowed from Tomas Vondra [8]. The dictionary size is limited to
> 2**28 entries. The limit can be easily extended in the future if
> necessary. Also <type-name> stores the version of the dictionary used
> to compress the data. All in all, this is similar to how ZSON works.

I appreciate this idea, but using that varint implementation is not a
choice I'd make. In the jsonb storage format, we already encode the
length of each value, so varint shouldn't be necessary here. Next, as
everything in jsonb storage is 4-byte aligned, a uint32 should
suffice, or if we're being adventurous, we might even fit a uint29
identifier in the length field instead (at the cost of full backwards
incompatibility).

Lastly, we don't have a good format for varint now (numeric is close,
but has significant overhead), so I'd say we should go with a
fixed-size integer and accept that limitation.

My own suggestion would be updating JSONB on-disk format with the following:

```
 /* values stored in the type bits */
 #define JENTRY_ISSTRING 0x00000000
 #define JENTRY_ISNUMERIC 0x10000000
 #define JENTRY_ISBOOL_FALSE 0x20000000
 #define JENTRY_ISBOOL_TRUE 0x30000000
 #define JENTRY_ISNULL 0x40000000
 #define JENTRY_ISCONTAINER 0x50000000 /* array or object */
+#define JENTRY_ISSYMBOL 0x60000000 /* Lookup in dictionary */
```

And then store the symbol in the JEntry (either in the
JENTRY_OFFLENMASK bits or in the actual referred content), whilst
maybe using some of those bits for e.g. type hints (whether the item
in the dictionary is an array, object, string or numeric).

I really would like this to support non-string types, because jsonb
structures can grow quite large, even with only small strings: e.g.
`{..., "tags": {"customer": "blabla"}}` could be dictionaried to
`{..., "tags": {'1: '2}}`, but potentially also to `{..., "tags": '1}`.
Of these, the second would be more efficient overall for storage and
retrieval.
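
Under the hypothetical ALTER TYPE syntax from above, supporting that
could mean allowing dictionary entries that are whole jsonb values
rather than just strings:

```
-- Hypothetical: a dictionary entry that is a complete object, so the
-- recurring substructure collapses to a single symbol ('1) at the
-- "tags" key.
ALTER TYPE mydict ADD ENTRY '{"customer": "blabla"}';
```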

> The first implementation always decompresses <type-name> entirely.
> Partial compression and decompression can always be added
> transparently to the user.

Are you talking about the TOAST compression and decompression, or are
you talking about a different compression scheme? If a different
scheme, is it only replacing the strings in the jsonb-tree with their
directory identifiers, and replacing the symbols in the jsonb-tree
with text (all through the JSONB internals), or are you proposing an
actual compression scheme over the stored jsonb bytes (effectively
wrapping the jsonb IO functions)?

Overall, I'm glad to see this take off, but I do want some
clarifications regarding the direction this is heading in.


Kind regards,

Matthias


