Re: Column store in Greenplum

From: Haribabu Kommi
Subject: Re: Column store in Greenplum
Msg-id: CAJrrPGcQkoCD=zSBAndHP0dAyAgFEc2bU1K7erZKP0Hz4MuiKg@mail.gmail.com
In reply to: Column store in Greenplum  (Asim R P <apraveen@pivotal.io>)
Replies: Re: Column store in Greenplum  (AJG <ayden@gera.co.nz>)
List: pgsql-hackers

On Fri, Jun 8, 2018 at 7:34 AM, Asim R P <apraveen@pivotal.io> wrote:
Hi,

In the pluggable storage unconference session in Ottawa,
column oriented storage was a key point of discussion.  We would like
to share an overview of Greenplum's column store.  We hope that this
will aid a discussion to carve out a pluggable storage interface.
This is just an overview, source code is where the truth is:
https://github.com/greenplum-db/gpdb

Greenplum introduced "appendonly" as a table type to support row as
well as column orientation.  Design goal for appendonly tables is to
reduce I/O cost so that workloads involving bulk I/O (OLAP) run
faster.  As the name suggests, tuples stored in the underlying files
are immutable (no in-place updates / deletes).  The files can only be
extended with new data.  We will focus only on column oriented
appendonly tables in what follows.

Thanks for sharing the details of how Greenplum has implemented
columnar storage and its use cases. This will be helpful in designing
a better pluggable table access method interface.

A CREATE TABLE command to create a column oriented table:

    CREATE TABLE col_table (c1 int, c2 varchar) WITH (appendonly=true,
orientation=column);

The current pluggable table access method work provides the following
syntax to create a table with a specific access method [1]:

CREATE TABLE ... [USING ACCESSMETHOD] ...

The access method must be created by the columnar storage extension.
 
Storage format:

Each column of a column oriented table gets a dedicated file in the
data directory.  A file contains a series of variable length blocks
("varblocks").  Each concurrent transaction writing to an appendonly
table is allocated a dedicated set of files, one per column.  At most
128 concurrent transactions are allowed to write to the same
appendonly table, leading to at most 128 groups of files (one per
column).  Let's assume two concurrent transactions inserted into the
"col_table" in the above example.  The table will have two groups of
files, each group containing two files (one per column).  A group of
files is called "segment".  Segment 1 for "col_table" contains files
"<relfilenode>.1" (c1) and "<relfilenode>.129" (c2).  Segment 2
contains files "<relfilenode>.2" (c1) and "<relfilenode>.130" (c2).
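The file numbering described above can be sketched as follows (a
sketch inferred purely from the example in this mail; the actual
Greenplum naming logic lives in the appendonly storage code and may
differ):

```c
#include <assert.h>

/* Sketch: each column owns a band of 128 file numbers, one per
 * possible concurrent segment, so the suffix in
 * "<relfilenode>.<suffix>" is column * 128 + segment number.
 * Column numbers are 0-based here, segment numbers 1-based,
 * matching the example in the text. */
static int aocs_file_suffix(int column, int segno)
{
    return column * 128 + segno;
}
```

Per the example, c1/segment 1 maps to ".1", c2/segment 1 to ".129",
and c2/segment 2 to ".130".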

Appendonly tuples do not contain MVCC information (xmin/xmax).  An
auxiliary table, pg_aoseg.pg_aoseg_<oid>, is created for each
appendonly table.  It keeps track of the length of each segment file
for the appendonly table.  The aoseg table is a heap table, normal
MVCC rules apply.  This setup isolates transactions concurrently
reading or writing to an appendonly table from each other.

Similar to heap TID, appendonly TID is a 6-byte struct
(appendonlytid.h).  It comprises a segment number and a row number.
Appendonly TID is recorded in indexes without much distinction from
heap TID.  Unlike heap TID, an appendonly TID is not sufficient to
locate a row number within a column's file.  Another auxiliary table,
block directory (pg_aoseg.pg_aoblckdir_<oid>), maps a row number to
offsets in the segment.  The offset points to start of the varblock
containing the row number.  Index lookups first obtain the appendonly
TID from the index, then map the row number to its offset using block
directory and then read the varblock from the file.  The aoblckdir
table is heap, normal MVCC rules apply.
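The 6-byte TID idea can be sketched like this (the 7-bit/40-bit split
used below is an illustrative assumption, not the actual layout; see
appendonlytid.h in the Greenplum sources for the real definition):

```c
#include <stdint.h>

/* Sketch of an appendonly TID packed into the low 48 bits (6 bytes)
 * of a uint64: a segment number plus a row number.  The bit split is
 * assumed for illustration only. */
#define AOTID_ROWNUM_BITS 40
#define AOTID_ROWNUM_MASK ((1ULL << AOTID_ROWNUM_BITS) - 1)

static uint64_t aotid_pack(uint32_t segno, uint64_t rownum)
{
    return ((uint64_t) segno << AOTID_ROWNUM_BITS)
           | (rownum & AOTID_ROWNUM_MASK);
}

static uint32_t aotid_segno(uint64_t tid)
{
    return (uint32_t) (tid >> AOTID_ROWNUM_BITS);
}

static uint64_t aotid_rownum(uint64_t tid)
{
    return tid & AOTID_ROWNUM_MASK;
}
```

An index lookup would unpack the TID this way, then consult the block
directory to turn the row number into a file offset.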

UPDATE and DELETE:

Updates and deletes are supported by maintaining a bitmap representing
visibility of appendonly row numbers.  Note that visibility is an
attribute of a row.  E.g. if row number 10 is invisible in
"col_table", the column values corresponding to row number 10 in both
"<relfilenode>.1" as well as "<relfilenode>.129" files should be
considered invisible.  The bitmap is maintained in another auxiliary
table - pg_aoseg.aovisimap, which is also a normal heap table.
Deleting a row results in no change to the appendonly files.  The bit
representing the deleted row number is set in the aovisimap table.
UPDATE is delete followed by insert.  No update chains are maintained.
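The bitmap semantics can be sketched as follows (a flat in-memory
bitmap purely for illustration; the real aovisimap is a heap table
storing bitmap chunks keyed by segment and row range):

```c
#include <stdint.h>

/* Sketch of the aovisimap idea: one bit per row number, where a set
 * bit means "invisible" (deleted).  Deleting a row only sets a bit;
 * the appendonly data files are never rewritten. */
#define VISIMAP_ROWS 4096

typedef struct
{
    uint8_t bits[VISIMAP_ROWS / 8];
} VisimapSketch;

static void visimap_mark_deleted(VisimapSketch *m, uint32_t rownum)
{
    m->bits[rownum / 8] |= (uint8_t) (1u << (rownum % 8));
}

static int visimap_is_visible(const VisimapSketch *m, uint32_t rownum)
{
    return (m->bits[rownum / 8] & (1u << (rownum % 8))) == 0;
}

/* Demo: delete row 10 and check that only it becomes invisible. */
static int visimap_demo(void)
{
    VisimapSketch m = { {0} };
    visimap_mark_deleted(&m, 10);
    return !visimap_is_visible(&m, 10) && visimap_is_visible(&m, 11);
}
```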

Catalog changes:

* pg_appendonly table contains OIDs of auxiliary tables for each
appendonly table.
* For each appendonly table, the auxiliary tables created are: aoseg,
aovisimap and block directory.  Similar to toast tables, these have
their own entries in pg_class with their own relfilenodes.
* New pg_class.relstorage values 'a' for row oriented AO and 'c' for
column oriented.

The current pluggable table access method interface provides a way to
create additional tables in pg_class with relkind 'E' (external table)
that can be accessed only by the storage access method, not directly
by the user.

Fujitsu has also developed its own columnar storage [2]; currently it
is tightly coupled with a modified version of PostgreSQL, and we also
want to port our version to the pluggable table access method
interface.

In our columnar storage, we also need to create additional relations
to provide columnar storage support. To generalize this for other
storage engines, I thought of the above external table approach.
 
Compression:

Each column can be compressed.  Supported compression algorithms are
zlib, zstd, RLE (run length encoding), delta compression on top of
RLE.
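As a rough illustration of the run length encoding mentioned above (a
simplified sketch over int32 values; the real per-column compression
operates on serialized datum streams and may layer delta encoding on
top):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of run length encoding: collapse runs of equal values into
 * (value, run length) pairs.  Returns the number of pairs written. */
static size_t rle_encode(const int32_t *in, size_t n,
                         int32_t *vals, uint32_t *runs)
{
    size_t out = 0;
    for (size_t i = 0; i < n;)
    {
        size_t j = i;
        while (j < n && in[j] == in[i])
            j++;
        vals[out] = in[i];
        runs[out] = (uint32_t) (j - i);
        out++;
        i = j;
    }
    return out;
}

/* Demo: {7,7,7,3,3,9} compresses to three pairs: (7,3) (3,2) (9,1). */
static size_t rle_demo(void)
{
    int32_t in[6] = {7, 7, 7, 3, 3, 9};
    int32_t vals[6];
    uint32_t runs[6];
    return rle_encode(in, 6, vals, runs);
}
```

RLE pays off on sorted or low-cardinality columns, which is why it is
offered per column rather than per table.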

Buffer management:

Backend private memory is used to buffer contents of appendonly files.
Block size (a multiple of 8K) can be specified per column.  The amount
of memory buffered per file is equal to the block size configured for
that column.

Executor touch points:

Column store offers access methods such as
aocs_beginscan(), aocs_getnext(), aocs_fetch(), aocs_rescan(),
aocs_insert().  One may come across a bunch of conditionals sprinkled
around executor code to invoke either heap, appendonly row or column
store access methods.  A better abstraction layer would have made the
code much cleaner but we don't have it today.

Scan of column oriented tables is performed by reading the projected
columns' files in a loop.  Note that the values stored in the files
contain only data; no MVCC information is stored.  The scan node
returns a tuple table slot, just as in the case of heap tables.  The
slot contains
virtual tuples.  The values and nulls arrays contain the values read
from respective files.  If needed during execution, the virtual tuples
are converted to what we call "memtuples".  Memtuples are similar to
minimal tuples but supposedly more optimized for accessing individual
datums.
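The per-column scan loop described above can be sketched as follows
(all names are illustrative, not the real aocs_* API; in-memory arrays
stand in for the segment files):

```c
#include <stdint.h>
#include <stddef.h>

#define NCOLS 2

/* Sketch of a column oriented scan: each projected column is read
 * from its own stream, and each output row is assembled as a
 * "virtual tuple" -- a parallel values[] array. */
typedef struct
{
    const int32_t *col_data[NCOLS]; /* one stream per column file */
    size_t nrows;
    size_t next;                    /* next row number to return */
} ScanSketch;

/* Fill values[] for the next row from the projected columns only;
 * return 0 when the scan is exhausted. */
static int scan_getnext(ScanSketch *scan, const int *proj, int nproj,
                        int32_t *values)
{
    if (scan->next >= scan->nrows)
        return 0;
    for (int i = 0; i < nproj; i++)
        values[proj[i]] = scan->col_data[proj[i]][scan->next];
    scan->next++;
    return 1;
}

/* Demo: project only the second column and sum it; the first
 * column's stream is never touched. */
static int32_t scan_demo(void)
{
    int32_t c1[3] = {1, 2, 3};
    int32_t c2[3] = {10, 20, 30};
    ScanSketch s = { {c1, c2}, 3, 0 };
    int proj[1] = {1};
    int32_t values[NCOLS];
    int32_t sum = 0;

    while (scan_getnext(&s, proj, 1, values))
        sum += values[1];
    return sum;
}
```

The I/O saving of a column store comes from exactly this: files for
non-projected columns are never opened.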


Vacuum:

Vacuum operates on one appendonly segment at a time.  The appendonly
segment is locked such that concurrent insert/update/delete are
blocked from operating on this segment.  The segment is scanned using
aocs_getnext() and aovisimap is used to filter out invisible row
numbers.  Only the visible row numbers are written to a new segment.
After the scan is finished, the old segment is marked for deletion in
pg_aocsseg_<oid> table.  Transactions reading the table do not read
files belonging to segments that are marked for deletion.  After all
segments have been scanned, all auxiliary tables (aocsseg, aovisimap,
block directory and toast) are vacuumed.
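The per-segment compaction step can be sketched like this (arrays
stand in for segment files and the invisible[] flags stand in for
aovisimap lookups; the real code streams varblocks via aocs_getnext()):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of appendonly vacuum compaction: copy only the rows the
 * visibility map considers visible into a new segment, dropping
 * deleted ones.  The old segment is then marked for deletion. */
static size_t vacuum_compact(const int32_t *old_seg, size_t n,
                             const uint8_t *invisible, /* 1 = deleted */
                             int32_t *new_seg)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (!invisible[i])
            new_seg[kept++] = old_seg[i];
    return kept; /* rows written to the new segment */
}

/* Demo: of four rows, two were deleted, so two survive compaction. */
static size_t vacuum_demo(void)
{
    int32_t old_seg[4] = {5, 6, 7, 8};
    uint8_t invisible[4] = {0, 1, 0, 1};
    int32_t new_seg[4];

    return vacuum_compact(old_seg, 4, invisible, new_seg);
}
```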

Problem worth highlighting:

The design of aovisimap table poses hurdles for concurrent update and
delete operations on the same appendonly table.  One aovisimap entry
covers multiple row numbers in the appendonly table.  If two
concurrent transactions delete or update different row numbers such
that they are covered by the same aovisimap tuple, one of them must
abort.

Because there are no chained updated versions of the same tuple, the
aocs_update() interface differs from the heap counterpart, leading to
anomalies such as
https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/73LOuQFkIps/1TNHW0lGAwAJ

Unique indexes are not supported on column oriented tables.
Identifying a concurrent transaction that might be inserting duplicate
tuples becomes hard because of (1) the lack of MVCC information in
tuples and (2) the fact that one block directory tuple covers multiple
row numbers.

Thanks for sharing your design details. I will go through them
carefully to understand whether the currently proposed pluggable table
access method interfaces can handle the other storage methods that are
implemented outside PostgreSQL.


[1] - https://www.postgresql.org/message-id/CAJrrPGfRGtRz_-Qyt_hK9KCQYdN%3DpbpjPooAHM9dZhJmqwN8fg%40mail.gmail.com


Regards,
Haribabu Kommi
Fujitsu Australia
