Re: [Proposal] Page Compression for OLTP

From: chenhj
Subject: Re: [Proposal] Page Compression for OLTP
Msg-id: 3836fc7f.2e0d.1723b06228f.Coremail.chjischj@163.com
In reply to: Re: [Proposal] Page Compression for OLTP  (chenhj <chjischj@163.com>)
Responses: Re: [Proposal] Page Compression for OLTP  (chenhj <chjischj@163.com>)
List: pgsql-hackers
Sorry, there may have been a problem with the display format of the previous mail, so I am resending it.
----------------------------------------------------------------------------------------------------

At 2020-05-21 15:04:55, "Fabien COELHO" <coelho@cri.ensmp.fr> wrote:

>
>Hello,
>
>My 0.02, some of which may just show some misunderstanding on my part:
>
>  - Could this be proposed as some kind of extension, provided that enough
>    hooks are available? ISTM that foreign tables and/or alternative
>    storage engine (aka ACCESS METHOD) provide convenient APIs which could
>    fit the need for these? Or are they not appropriate? You seem to
>    suggest that there are not.
>
>    If not, what could be done to improve API to allow what you are seeking
>    to do? Maybe you need a somehow lower-level programmable API which does
>    not exist already, or at least is not exported already, but could be
>    specified and implemented with limited effort? Basically you would like
>    to read/write pg pages to somewhere, and then there is the syncing
>    issue to consider. Maybe such a "page storage" API could provide
>    benefit for some specialized hardware, eg persistent memory stores,
>    so there would be more reason to define it anyway? I think it might
>    be valuable to give it some thoughts.

Thank you for all the comments.
In my opinion, implementing this as a foreign table or a new storage engine would require a lot of extra work beyond the compression itself.
A similar explanation was given in Nikolay P's email.

The "page storage" API may be a good choice, and I will consider it, but I have not yet figured out how to implement it.

>  - Could you maybe elaborate on how your plan differs from [4] and [5]?

My solution is similar to CFS: it is also embedded in the file access layer (fd.c, md.c) and implements the mapping from a block number to the file and offset where its compressed data is stored.

However, the most important difference is that I hope to avoid the need for GC through the design of the page layout.

https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net

>> The most difficult thing in CFS development is certainly
>> defragmentation. In CFS it is done using background garbage collection,
>> by one or more
>> GC worker processes. The main challenges were to minimize its
>> interaction with normal work of the system, make it fault tolerant and
>> prevent unlimited growth of data segments.

>> CFS is not introducing its own storage manager, it is mostly embedded in
>> the existing Postgres file access layer (fd.c, md.c). It allows reusing the
>> code responsible for mapping relations and the file descriptor cache. As
>> was recently discussed in hackers, it may be a good idea to separate the
>> questions "how to map blocks to filenames and offsets" and "how to
>> actually perform IO". Then it will be easier to implement a compressed
>> storage manager.


>  - Have you consider keeping page headers and compressing tuple data
>    only?

In that case, we would have to add some additional information to the page header to identify whether the page is compressed or uncompressed.
When a compressed page becomes an uncompressed page, or vice versa, the original page header would have to be modified.
This is unacceptable because it requires modifying the shared buffer and recalculating the checksum.

However, it should be feasible to put this flag in the compress address file instead.
The problem with that is that even if a page occupies only one compressed block, the address file still needs to be read, that is, going from 1 IO to 2 IOs.
Since the address file is very small and reading it is essentially always a memory access, this cost may not be as large as I had imagined.
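
To make the cost concrete, the read path would then look roughly like this (a sketch only, assuming a 4KB compressed block size; error handling is omitted and decompress_page() is a hypothetical helper wrapping e.g. lz4):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ          8192    /* original page size (example) */
#define COMPRESS_BLCKSZ 4096    /* compressed block size (example) */

/* hypothetical per-page entry in the compress address file */
typedef struct CompressAddrEntry
{
    uint16_t    size;       /* compressed size; 0 = stored uncompressed */
    uint32_t    blockno;    /* first block occupied in the compress data file */
} CompressAddrEntry;

/* hypothetical decompression helper, e.g. wrapping lz4/zstd */
extern void decompress_page(const char *src, int srclen, char *dst);

static void
read_compressed_page(int addr_fd, int data_fd, uint32_t blkno, char *page)
{
    CompressAddrEntry entry;

    /* access 1: the address entry (tiny file, normally cached in memory) */
    pread(addr_fd, &entry, sizeof(entry), (off_t) blkno * sizeof(entry));

    if (entry.size == 0)
    {
        /* page was not compressible and is stored as-is */
        pread(data_fd, page, BLCKSZ, (off_t) entry.blockno * COMPRESS_BLCKSZ);
    }
    else
    {
        /* access 2: read the compressed data, then decompress */
        char        cbuf[BLCKSZ];

        pread(data_fd, cbuf, entry.size, (off_t) entry.blockno * COMPRESS_BLCKSZ);
        decompress_page(cbuf, entry.size, page);
    }
}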

>  - I'm not sure there is a point in going below the underlying file
>    system blocksize, quite often 4 KiB? Or maybe yes? Or is there
>    a benefit to aim at 1/4 even if most pages overflow?

My solution is mainly optimized for the scenario where the original page can be compressed into a single compressed block.
Storing the original page across multiple compressed blocks is suitable for scenarios that are not particularly performance-sensitive but care more about the compression ratio, such as cold data.

In addition, users can also choose to compile PostgreSQL with a 16KB or 32KB BLCKSZ.
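(For reference, the block size is chosen at build time via configure, e.g. "./configure --with-blocksize=16"; the value is in kilobytes, the default being 8.)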

>  - ISTM that your approach entails 3 "files". Could it be done with 2?
>    I'd suggest that the possible overflow pointers (coa) could be part of
>    the headers so that when reading the 3.1 page, then the header would
>    tell where to find the overflow 3.2, without requiring an additional
>    independent structure with very small data in it, most of it zeros.
>    Possibly this is not possible, because it would require some available
>    space in standard headers when the page is not compressible, and
>    there is not enough. Maybe creating a little room for that in
>    existing headers (4 bytes could be enough?) would be a good compromise.
>    Hmmm. Maybe the approach I suggest would only work for 1/2 compression,
>    but not for other target ratios, but I think it could be made to work
>    if the pointer can entail several blocks in the overflow table.

My solution is optimized for the scenario where the original page can be compressed into a single compressed block.
In that scenario, only 1 IO is needed for a read or write, and there is no need to access an additional overflow address file or overflow data file.

Your suggestion got me thinking: the performance difference may not be as big as I thought (testing and comparison are required). If I give up the pursuit of "only one IO", the file layout can be simplified.

For example, it can be simplified to the following form with only two files (the examples below use a compressed block size of 4KB).

# Page storage (Plan B)

The compress address file stores the pointers to the compressed blocks, and the compress data file stores the compressed block data.

compress address file:
 
        0       1       2       3
+=======+=======+=======+=======+=======+
| head  |  1    |    2  | 3,4   |   5   |
+=======+=======+=======+=======+=======+

The compress address file saves the following information for each page (see the sketch after the snapfs link below):

- Compressed size (when the size is 0, the page is stored uncompressed)
- Block number(s) occupied in the compress data file

By the way, I want to access the compress address file through mmap, just like snapfs does:
https://github.com/postgrespro/snapfs/blob/pg_snap/src/backend/storage/file/snapfs.c
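
A minimal sketch of that idea, reusing the hypothetical CompressAddrEntry from the sketch above (error handling mostly omitted; HEAD_SIZE is a made-up header size):

#include <sys/mman.h>

#define HEAD_SIZE 512           /* hypothetical size of the file head */

/*
 * Map the whole compress address file once and treat it as an ordinary
 * array in memory, similar to how snapfs maps its snapshot map.
 */
static CompressAddrEntry *
map_address_file(int addr_fd, size_t file_size)
{
    char       *base = mmap(NULL, file_size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, addr_fd, 0);

    if (base == MAP_FAILED)
        return NULL;

    /* page entries start right after the file head */
    return (CompressAddrEntry *) (base + HEAD_SIZE);
}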

Compress data file:

0         1         2          3         4
+=========+=========+==========+=========+=========+
| data1   | data2   | data3_1  | data3_2 | data4   |
+=========+=========+==========+=========+=========+
|    4K   |


# Page storage (Plan C)

Further, since the size of the compress address file is fixed, the above address file and data file can also be combined into one file:

        0       1       2     123071    0         1         2
+=======+=======+=======+     +=======+=========+=========+
| head  |  1    |    2  | ... |       | data1   | data2   | ...  
+=======+=======+=======+     +=======+=========+=========+
  head  |              address        |          data          |

If the difference in performance is negligible, maybe Plan C is the better solution. (Are there any other problems?)
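
Continuing the sketch from above, all offsets inside the single Plan C file are fixed arithmetic, which is part of its appeal (sizes are illustrative: 8KB pages, 4KB compressed blocks, and a 1GB segment, i.e. 131072 pages per segment):

#define PAGES_PER_SEGMENT 131072    /* 1GB segment / 8KB pages */

/* offset of page blkno's address entry, right after the file head */
static inline off_t
addr_entry_offset(uint32_t blkno)
{
    return HEAD_SIZE + (off_t) blkno * sizeof(CompressAddrEntry);
}

/* offset of a data block, after the head plus the fixed address region */
static inline off_t
data_block_offset(uint32_t blockno)
{
    off_t       addr_region = (off_t) PAGES_PER_SEGMENT * sizeof(CompressAddrEntry);

    return HEAD_SIZE + addr_region + (off_t) blockno * COMPRESS_BLCKSZ;
}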

>
>  - Maybe the compressed and overflow table could become bloated somehow,
>    which would require a vacuuming implementation and add to the
>    complexity of the implementation?
>

Vacuuming is what I am trying to avoid.

As I explained in the first email, even without vacuum, bloat should not become a serious problem.

>>However, fragmentation will only appear in scenarios where the compressed size of the same block frequently changes by a large amount.
>>...
>>And no matter how severe the fragmentation is, the total space occupied by the compressed table cannot be larger than that of the original uncompressed table.

Best Regards
Chen Huajun
