Re: Zedstore - compressed in-core columnar storage

From	Ajin Cherian
Subject	Re: Zedstore - compressed in-core columnar storage
Date	24 May 2019, 05:30:19
Msg-id	CAFPTHDa93qjCWMqJ6-pJj1RSU5uUg9EKFim9OX1nSmMp7e08aw@mail.gmail.com
In reply to	Re: Zedstore - compressed in-core columnar storage (Ashwin Agrawal <aagrawal@pivotal.io>)
Responses	Re: Zedstore - compressed in-core columnar storage
List	pgsql-hackers

Hi Ashwin,

- how to pass the "column projection list" to table AM? (as stated in
initial email, currently we have modified table am API to pass the
projection to AM)

We have been working on a similar columnar store using the pluggable table AM APIs. One idea we considered was to modify the scan slot based on the targetlist, so that its descriptor contains only the relevant columns. That way, the table AM is passed a slot whose descriptor has only the columns it needs. Today we do something similar for the result slot using ExecInitResultTypeTL(); the proposal is to do the same for the scan tuple slot. Somewhere after the scan slot is created with ExecInitScanTupleSlot(), the executor would call a new table AM handler API that modifies the scan tuple slot based on the targetlist. A possible name for the new handler would be: exec_init_scan_slot_tl(PlanState *planstate, TupleTableSlot *slot).

This way, scan AM handlers such as getnextslot are passed a slot that has only the relevant columns in its scan descriptor. One issue, though, is that beginscan is not passed the slot, so any memory allocation that depends on the column list can't be done in beginscan. Let me know what you think.
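To make the idea concrete, here is a minimal standalone sketch of narrowing a scan descriptor to the columns named in a target list. It does not use PostgreSQL headers: SketchTupleDesc, SketchSlot and sketch_init_scan_slot_tl() are hypothetical stand-ins invented for illustration, and the real handler would take a PlanState and a TupleTableSlot as proposed above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ATTRS 32

/* Hypothetical, simplified stand-in for a tuple descriptor. */
typedef struct SketchTupleDesc
{
    int natts;                  /* number of attributes described */
    int attnums[MAX_ATTRS];     /* 1-based attribute numbers of the table */
} SketchTupleDesc;

/* Hypothetical stand-in for a scan tuple slot. */
typedef struct SketchSlot
{
    SketchTupleDesc desc;
} SketchSlot;

/*
 * Narrow the slot's descriptor to the attributes listed in "needed"
 * (the attribute numbers referenced by the target list). Table column
 * order is preserved and duplicates in "needed" are ignored.
 * Returns the number of columns kept.
 */
static int
sketch_init_scan_slot_tl(SketchSlot *slot, const int *needed, int nneeded)
{
    SketchTupleDesc out = { 0 };

    for (int i = 0; i < slot->desc.natts; i++)
    {
        int attnum = slot->desc.attnums[i];

        for (int j = 0; j < nneeded; j++)
        {
            if (needed[j] == attnum)
            {
                out.attnums[out.natts++] = attnum;
                break;          /* keep each table column at most once */
            }
        }
    }

    slot->desc = out;
    return out.natts;
}
```

With a four-column table and a target list that references columns 3 and 1, the slot ends up describing columns 1 and 3 only, so a scan callback that receives the slot can tell which columns to fetch without a separate projection argument.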

regards,
Ajin Cherian
Fujitsu Australia

On Thu, May 23, 2019 at 3:56 PM Ashwin Agrawal <aagrawal@pivotal.io> wrote:

We (Heikki, Melanie, and I) are continuing to build Zedstore and wish
to share the recent additions and modifications. Attaching a patch
with the latest code. Link to the github branch [1] to follow
along. The approach we have been leaning towards is to build the
required functionality, get the tests passing, and then continue to
iterate to optimize it. It's still work-in-progress.

Sharing the details now, as we have reached our next milestone for
Zedstore. All table AM APIs are implemented for Zedstore (except
compute_xid_horizon_for_tuples; it seems we need a test for it first).

Current State:

- A new type of item, the "array item", has been added to Zedstore to
boost compression and performance. It was added based on
Konstantin's performance experiments [2] and input from Tomas
Vondra [3]. An array item holds multiple datums with consecutive
TIDs and the same visibility information. An array item saves space
compared to multiple single items by leaving out the repetitive UNDO
and TID fields. An array item cannot mix NULLs and non-NULLs. So,
those experiments should show improved performance now. Currently,
inserting data via COPY creates array items. The code for insert has
not been modified since last time. Making singleton inserts, or
INSERT ... SELECT, performant is still on the todo list.

- We now have a separate, dedicated meta-column btree alongside the
data column btrees. This special first btree for the meta-column is
used to assign TIDs to tuples and to track the UNDO location, which
provides visibility information. This special btree always exists,
so it also helps to support zero-column tables (which can result
from ADD COLUMN / DROP COLUMN actions as well). Plus, storing the
meta-data separately from the data helps to get better compression
ratios. It also further simplifies the overall design and
implementation: deletes just need to edit the meta-column and avoid
touching the actual data btrees. Index scans can perform visibility
checks based on this meta-column alone and fetch the required datums
only for visible tuples. Tuple locks also need to access only this
meta-column. Previously, every column btree carried the same undo
pointer, so with the past layout a visibility check could
potentially be performed using any column. But considering the
overall simplification the new layout provides, it's fine to give up
on that aspect. Having a dedicated meta-column greatly simplifies
handling of ADD COLUMN with default and null values, as this column
deterministically provides all the TIDs present in the table, which
can't be said for any other data column due to default or null
values during ADD COLUMN.

- Free Page Map implemented. The Free Page Map keeps track of unused
pages in the relation. The FPM is also a b-tree, indexed by physical
block number. To be more compact, it stores "extents", i.e. block
ranges, rather than just blocks, when possible. An interesting paper [4] on
how modern filesystems manage space acted as a good source for ideas.

- Tuple locks implemented

- Serializable isolation handled

- With "default_table_access_method=zedstore"
  - 31 out of 194 regress tests failing
  - 10 out of 86 isolation tests failing
  Many of the currently failing tests are due to plan differences,
  such as Index scans being selected for zedstore over IndexOnly
  scans, as zedstore doesn't yet have a visibility map. I have yet to
  give thought to index-only scans. Others are plan diffs due to
  table size differences between heap and zedstore.
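
As an illustration of the extent idea behind the Free Page Map item above, the following is a minimal standalone sketch that coalesces a sorted list of free block numbers into (start, length) extents. FreeExtent and blocks_to_extents() are hypothetical names chosen for this example and are not taken from the Zedstore patch; the actual FPM keeps its entries in a b-tree rather than a flat array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record: blocks [start, start + len) are free. */
typedef struct FreeExtent
{
    uint32_t start;
    uint32_t len;
} FreeExtent;

/*
 * Coalesce free block numbers into extents. "blocks" must be sorted in
 * ascending order without duplicates. At most max_out extents are
 * written to "out"; the number of extents written is returned.
 */
static size_t
blocks_to_extents(const uint32_t *blocks, size_t nblocks,
                  FreeExtent *out, size_t max_out)
{
    size_t n = 0;

    for (size_t i = 0; i < nblocks; i++)
    {
        if (n > 0 && out[n - 1].start + out[n - 1].len == blocks[i])
        {
            /* Block is adjacent to the current extent: extend it. */
            out[n - 1].len++;
        }
        else
        {
            /* Gap found: start a new extent, if there is room. */
            if (n == max_out)
                break;
            out[n].start = blocks[i];
            out[n].len = 1;
            n++;
        }
    }

    return n;
}
```

For the free blocks 10, 11, 12, 20, 30, 31 this yields three extents, (10, 3), (20, 1) and (30, 2), instead of six single-block entries, which is where the compactness comes from.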

Next few milestones we wish to hit for Zedstore:
- Make check regress green
- Make check isolation green
- Make Zedstore crash safe (which also means replication safe).
Implement WAL logging
- Performance profiling and optimizations for Inserts, Selects, Index
Scans, etc.
- Once the UNDO framework lands upstream, have Zedstore leverage it
instead of its own version of UNDO

Open questions / discussion items:

- how best to get the "column projection list" from the planner?
(currently, we walk the plan in the executor and find the columns
required for the query; see GetNeededColumnsForNode())

- how to pass the "column projection list" to the table AM? (as stated
in the initial email, currently we have modified the table AM API to
pass the projection to the AM)

- TID treated as (block, offset) in current indexing code

- Physical tlist optimization? (currently, we disabled it for
zedstore)
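
On the TID item above: the existing index code carries a TID as a 32-bit block number plus a 16-bit offset. The sketch below shows one way a logical row number could be packed into, and recovered from, such a pair. It is an illustration only, not Zedstore's actual scheme: SketchItemPointer, OFFSETS_PER_BLOCK and both functions are hypothetical, and the value 128 is an arbitrary assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed number of usable offsets per "block"; illustrative only. */
#define OFFSETS_PER_BLOCK 128u

/* Hypothetical stand-in for a (block, offset) item pointer. */
typedef struct SketchItemPointer
{
    uint32_t block;
    uint16_t offset;            /* 1-based; 0 is treated as invalid */
} SketchItemPointer;

/* Map a logical TID to a (block, offset) pair. */
static SketchItemPointer
logical_tid_to_pointer(uint64_t tid)
{
    SketchItemPointer p;

    p.block = (uint32_t) (tid / OFFSETS_PER_BLOCK);
    p.offset = (uint16_t) (tid % OFFSETS_PER_BLOCK + 1);
    return p;
}

/* Inverse mapping: recover the logical TID from a (block, offset) pair. */
static uint64_t
pointer_to_logical_tid(SketchItemPointer p)
{
    return (uint64_t) p.block * OFFSETS_PER_BLOCK + (uint64_t) (p.offset - 1);
}
```

The mapping round-trips, so index entries can keep storing (block, offset) pairs while the table AM interprets them as logical positions; the open question is whether the indexing code should keep assuming the pair has a physical meaning.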

Team:
Melanie joined Heikki and me to write code for zedstore. The majority
of the code continues to be contributed by Heikki. We are continuing
to have fun building a column store implementation and iterating
aggressively.

References:
[1] https://github.com/greenplum-db/postgres/tree/zedstore
[2] https://www.postgresql.org/message-id/3978b57e-fe25-ca6b-f56c-48084417e115%40postgrespro.ru
[3] https://www.postgresql.org/message-id/20190415173254.nlnk2xqhgt7c5pta%40development
[4] https://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf
