Re: Zedstore - compressed in-core columnar storage

From: Ajin Cherian
Subject: Re: Zedstore - compressed in-core columnar storage
Msg-id: CAFPTHDa93qjCWMqJ6-pJj1RSU5uUg9EKFim9OX1nSmMp7e08aw@mail.gmail.com
In response to: Re: Zedstore - compressed in-core columnar storage  (Ashwin Agrawal <aagrawal@pivotal.io>)
Responses: Re: Zedstore - compressed in-core columnar storage  (Ashwin Agrawal <aagrawal@pivotal.io>)
List: pgsql-hackers
Hi Ashwin,

- how to pass the "column projection list" to table AM? (as stated in
  initial email, currently we have modified table am API to pass the
  projection to AM)
 
We have been working on a similar columnar storage using the pluggable table AM APIs. One idea we considered is to modify the scan slot based on the targetlist, so that its descriptor contains only the relevant columns. The table AM would then be passed a slot with only the relevant columns in its descriptor. Today we do something similar for the result slot using ExecInitResultTypeTL(); the proposal is to do the same for the scan tuple slot. Somewhere after the scan slot is created with ExecInitScanTupleSlot(), the executor would call a new table AM handler API to modify the scan tuple slot based on the targetlist. A possible name for the new handler is exec_init_scan_slot_tl(PlanState *planstate, TupleTableSlot *slot).

This way, scan AM handlers such as getnextslot are passed a slot that has only the relevant columns in its descriptor. One issue, though, is that beginscan is not passed the slot, so any memory allocation that depends on the column list cannot be done in beginscan. Let me know what you think.
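
A minimal sketch of the narrowing step, assuming nothing about the real executor structures: it does not use PostgreSQL headers, and `SketchDesc`, `sketch_project_desc` and the plain array of attribute numbers are hypothetical stand-ins for a tuple descriptor and a targetlist. It only shows what "a descriptor with just the relevant columns" could look like.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_ATTS 32

/* Hypothetical stand-in for a tuple descriptor: a list of 1-based
 * attribute numbers of the underlying table. */
typedef struct SketchDesc
{
    int natts;
    int attnums[SKETCH_MAX_ATTS];
} SketchDesc;

/*
 * Build a narrowed descriptor holding only the attributes that appear in
 * the target list, preserving table order and keeping each column once.
 * Returns the number of attributes kept, or -1 on invalid input.
 */
int
sketch_project_desc(const SketchDesc *full, const int *tlist, int ntlist,
                    SketchDesc *out)
{
    if (full == NULL || out == NULL || (tlist == NULL && ntlist > 0))
        return -1;
    if (full->natts < 0 || full->natts > SKETCH_MAX_ATTS)
        return -1;

    out->natts = 0;
    for (int i = 0; i < full->natts; i++)
    {
        for (int j = 0; j < ntlist; j++)
        {
            if (tlist[j] == full->attnums[i])
            {
                /* at most one entry per source column, so this holds */
                assert(out->natts < SKETCH_MAX_ATTS);
                out->attnums[out->natts++] = full->attnums[i];
                break;
            }
        }
    }
    return out->natts;
}
```

For a four-column table and a target list referencing columns 3, 1 and 3 again, the narrowed descriptor holds columns 1 and 3. In the actual proposal this work would happen inside the new table AM callback, after ExecInitScanTupleSlot().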


regards,
Ajin Cherian
Fujitsu Australia

On Thu, May 23, 2019 at 3:56 PM Ashwin Agrawal <aagrawal@pivotal.io> wrote:

We (Heikki, Melanie and I) are continuing to build Zedstore and wish
to share the recent additions and modifications. Attaching a patch
with the latest code. Link to github branch [1] to follow
along. The approach we have been leaning towards is to build the
required functionality, get the tests passing, and then continue to
iterate to optimize. It's still work-in-progress.

Sharing the details now, as we have reached our next milestone for
Zedstore. All table AM APIs are implemented for Zedstore (except
compute_xid_horizon_for_tuples, which seems to need a test first).

Current State:

- A new type of item added to Zedstore "Array item", to boost
  compression and performance. Based on Konstantin's performance
  experiments [2] and inputs from Tomas Vondra [3], this is
  added. Array item holds multiple datums, with consecutive TIDs and
  the same visibility information. An array item saves space compared
  to multiple single items, by leaving out repetitive UNDO and TID
  fields. An array item cannot mix NULLs and non-NULLs. So, those
  experiments should result in improved performance now. Inserting
  data via COPY creates array items currently. Code for insert has not
  been modified from last time. Making singleton inserts and
  INSERT INTO ... SELECT performant is still on the todo list.

- Now we have a separate and dedicated meta-column btree alongside
  the rest of the data column btrees. This special, first btree for
  the meta-column is used to assign TIDs for tuples and to track the
  UNDO location, which provides visibility information. Also, this
  special btree, which always exists, helps to support zero-column
  tables (which can also be a result of ADD COLUMN / DROP COLUMN
  actions). Plus, having meta-data stored separately from data helps
  to get better compression ratios. It also further simplifies the
  overall design/implementation: deletes just need to edit the
  meta-column and avoid touching the actual data btrees. Index scans
  can perform visibility checks based on this meta-column alone and
  fetch the required datums only for visible tuples. Tuple locks
  likewise need to access only this meta-column. Previously, every
  column btree used to carry the same undo pointer, so with the past
  layout a visibility check could potentially be performed using any
  column. But considering the overall simplification the new layout
  provides, it's fine to give up on that aspect. Having a dedicated
  meta-column also greatly simplified handling of ADD COLUMN with
  default and null values, as this column deterministically provides
  all the TIDs present in the table, which can't be said for any
  other data column because of default or null values during ADD
  COLUMN.

- Free Page Map implemented. The Free Page Map keeps track of unused
  pages in the relation. The FPM is also a b-tree, indexed by physical
  block number. To be more compact, it stores "extents", i.e. block
  ranges, rather than just blocks, when possible. An interesting paper [4] on
  how modern filesystems manage space acted as a good source for ideas.

- Tuple locks implemented

- Serializable isolation handled

- With "default_table_access_method=zedstore"
  - 31 out of 194 failing regress tests
  - 10 out of 86 failing isolation tests
Many of the currently failing tests are due to plan differences, such
as Index scans being selected for zedstore over IndexOnly scans,
because zedstore doesn't yet have a visibility map (I have yet to give
thought to index-only scans), or plan diffs due to table size
differences between heap and zedstore.
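
To make the array item rule from the Current State list concrete (consecutive TIDs, the same visibility information, no mixing of NULLs and non-NULLs), here is a minimal sketch of grouping single items into array items. The struct layouts and the names `SketchItem`, `SketchArrayItem` and `sketch_pack_items` are hypothetical and are not taken from the Zedstore patch; the datums themselves are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single item: one datum with its own TID and UNDO pointer. */
typedef struct SketchItem
{
    uint64_t tid;
    uint64_t undo_ptr;
    bool     isnull;
} SketchItem;

/* Hypothetical array item header: one TID and UNDO pointer for N datums. */
typedef struct SketchArrayItem
{
    uint64_t first_tid;
    uint64_t undo_ptr;
    bool     isnull;
    size_t   nelements;
} SketchArrayItem;

/* Can 'next' join the run that currently ends at 'prev'? */
static bool
sketch_can_extend(const SketchItem *prev, const SketchItem *next)
{
    return next->tid == prev->tid + 1 &&
           next->undo_ptr == prev->undo_ptr &&
           next->isnull == prev->isnull;
}

/*
 * Group 'nitems' single items (assumed sorted by TID) into array items.
 * Writes at most 'maxout' headers to 'out'; returns how many were written.
 */
size_t
sketch_pack_items(const SketchItem *items, size_t nitems,
                  SketchArrayItem *out, size_t maxout)
{
    size_t nout = 0;

    assert(items != NULL || nitems == 0);
    for (size_t i = 0; i < nitems; i++)
    {
        /* nout > 0 implies i > 0, so items[i - 1] is valid here */
        if (nout > 0 && sketch_can_extend(&items[i - 1], &items[i]))
        {
            out[nout - 1].nelements++;
            continue;
        }
        if (nout == maxout)
            break;          /* output buffer full */
        out[nout].first_tid = items[i].tid;
        out[nout].undo_ptr = items[i].undo_ptr;
        out[nout].isnull = items[i].isnull;
        out[nout].nelements = 1;
        nout++;
    }
    return nout;
}
```

Each run stores its TID and UNDO pointer once instead of once per datum, which is where the space saving described above comes from.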
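
The Free Page Map item above says that extents (block ranges) are stored instead of individual blocks when possible. A minimal sketch of that compaction, assuming the free block numbers arrive sorted and without duplicates; `SketchExtent` and `sketch_blocks_to_extents` are hypothetical names, and the b-tree that would index the extents is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent: 'nblocks' free blocks starting at 'start'. */
typedef struct SketchExtent
{
    uint32_t start;
    uint32_t nblocks;
} SketchExtent;

/*
 * Collapse a sorted, duplicate-free list of free block numbers into
 * extents. Writes at most 'maxout' extents; returns how many were written.
 */
size_t
sketch_blocks_to_extents(const uint32_t *blocks, size_t nblocks,
                         SketchExtent *out, size_t maxout)
{
    size_t nout = 0;

    assert(blocks != NULL || nblocks == 0);
    for (size_t i = 0; i < nblocks; i++)
    {
        /* Extend the last extent if this block directly follows it. */
        if (nout > 0 &&
            blocks[i] == out[nout - 1].start + out[nout - 1].nblocks)
        {
            out[nout - 1].nblocks++;
            continue;
        }
        if (nout == maxout)
            break;          /* output buffer full */
        out[nout].start = blocks[i];
        out[nout].nblocks = 1;
        nout++;
    }
    return nout;
}
```

With free blocks 3, 4, 5, 9, 10 and 42, this yields three extents instead of six single-block entries.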

Next few milestones we wish to hit for Zedstore:
- Make check regress green
- Make check isolation green
- Make Zedstore crash safe (which also means replication safe) by
  implementing WAL logging
- Performance profiling and optimizations for Insert, Selects, Index
  Scans, etc...
- Once the UNDO framework lands upstream, have Zedstore leverage it
  instead of its own version of UNDO

Open questions / discussion items:

- how best to get the "column projection list" from the planner?
  (currently, we walk the plan in the executor and find the columns
  required for the query, see GetNeededColumnsForNode())

- how to pass the "column projection list" to table AM? (as stated in
  initial email, currently we have modified table am API to pass the
  projection to AM)

- TID treated as (block, offset) in current indexing code

- Physical tlist optimization? (currently, we disabled it for
  zedstore)
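
For the first open question, the idea of walking a tree to find the required columns can be sketched as follows. This is not the code of GetNeededColumnsForNode(); the node type, the 64-column limit and the bitmask result are assumptions made only to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SKETCH_VAR, SKETCH_OP } SketchNodeKind;

/* Hypothetical expression node: a column reference or a binary operator. */
typedef struct SketchNode
{
    SketchNodeKind kind;
    int attno;                          /* 1-based column, for SKETCH_VAR */
    const struct SketchNode *left;      /* for SKETCH_OP; may be NULL */
    const struct SketchNode *right;     /* for SKETCH_OP; may be NULL */
} SketchNode;

/*
 * Walk the tree and return a bitmask of referenced columns:
 * bit (attno - 1) is set for every column that appears anywhere.
 */
uint64_t
sketch_needed_columns(const SketchNode *node)
{
    if (node == NULL)
        return 0;
    if (node->kind == SKETCH_VAR)
    {
        assert(node->attno >= 1 && node->attno <= 64);
        return UINT64_C(1) << (node->attno - 1);
    }
    return sketch_needed_columns(node->left) |
           sketch_needed_columns(node->right);
}
```

For an expression like (col1 + col3) > col3 the result has bits 0 and 2 set, so only columns 1 and 3 would need to be fetched from the column store.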

Team:
Melanie joined Heikki and me to write code for zedstore. The majority
of the code continues to be contributed by Heikki. We are continuing
to have fun building a column store implementation and iterating
aggressively.

References:
[1] https://github.com/greenplum-db/postgres/tree/zedstore
[2] https://www.postgresql.org/message-id/3978b57e-fe25-ca6b-f56c-48084417e115%40postgrespro.ru
[3] https://www.postgresql.org/message-id/20190415173254.nlnk2xqhgt7c5pta%40development
