Thread: [GSoC] artbufmgr

[GSoC] artbufmgr

From: pantilimonov misha
Date:
Hello everyone,

I am participating in GSoC and would like to share the current status of my project,
which is devoted to an alternative data structure for buffer management.

To get things going, I started with a basic implementation of the ART tree for a single
backend process, using the open-source implementation of the algorithm. That was
a pretty straightforward task that helped me better understand the
data structure and PostgreSQL's convenient mechanism for local memory allocation.
For validation purposes, I also ported a couple of tests as a simple extension with a function
that reads a file with test data and runs basic operations on the tree: insert/search/delete.
The described changes are present in the first two commits [0].
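
For illustration, here is a minimal sketch of what such a validation function might look like.
The function name, file format, and header location are my assumptions, and the art_* calls
follow the open-source libart API the port is based on, not necessarily the ported code itself:

#include "postgres.h"
#include "fmgr.h"
#include "storage/fd.h"         /* AllocateFile / FreeFile */
#include "utils/builtins.h"     /* text_to_cstring */
#include "lib/artree.h"         /* assumed header for the ported tree */

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(artree_selftest);

Datum
artree_selftest(PG_FUNCTION_ARGS)
{
    char       *path = text_to_cstring(PG_GETARG_TEXT_PP(0));
    FILE       *file = AllocateFile(path, "r");
    char        line[256];
    uintptr_t   lineno = 1;
    art_tree    tree;

    if (file == NULL)
        ereport(ERROR, (errmsg("could not open test file \"%s\"", path)));

    art_tree_init(&tree);

    /* insert every line, keyed by its text, with the line number as value */
    while (fgets(line, sizeof(line), file) != NULL)
    {
        int         len = (int) strlen(line);

        art_insert(&tree, (unsigned char *) line, len, (void *) lineno);
        lineno++;
    }

    /* ...rewind the file and verify art_search()/art_delete() round-trips... */

    FreeFile(file);
    art_tree_destroy(&tree);

    PG_RETURN_BOOL(true);
}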

It is worth noting that this "src/backend/lib/artree.c" implementation will not go any further
and will be thrown away for the final patch. The reason is the example set by the current dynahash
implementation (which I examined later, while trying to move the tree into shared memory),
which supports both shared- and local-memory access patterns.
Thus, I think the proper place for the tree data structure is next to dynahash, i.e. in
"src/backend/utils/tree/...",
and eventually, with some restrictions for the shared variant (fixed key length), it could be used
in other places in the system.
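
For context, this is roughly how dynahash already serves both patterns (paraphrased from
buf_table.c and hsearch.h; the backend-local entry struct here is only an illustrative placeholder):

#include "postgres.h"
#include "storage/buf_internals.h"      /* BufferTag, NUM_BUFFER_PARTITIONS */
#include "storage/shmem.h"              /* ShmemInitHash */
#include "utils/hsearch.h"              /* hash_create, HASHCTL */

/* placeholder entry type for the backend-local example */
typedef struct LocalBufEntry
{
    BufferTag   key;
    int         id;
} LocalBufEntry;

static void
dynahash_both_flavors(int size)
{
    HASHCTL     info;
    HTAB       *local;
    HTAB       *shared;

    /* backend-local table: memory comes from a private memory context */
    memset(&info, 0, sizeof(info));
    info.keysize = sizeof(BufferTag);
    info.entrysize = sizeof(LocalBufEntry);
    local = hash_create("local buffer lookup", 1024, &info,
                        HASH_ELEM | HASH_BLOBS);

    /*
     * Shared table: the same dynahash code, but placed in shared memory and
     * partitioned -- this mirrors InitBufTable() in buf_table.c.
     */
    memset(&info, 0, sizeof(info));
    info.keysize = sizeof(BufferTag);
    info.entrysize = sizeof(LocalBufEntry);     /* BufferLookupEnt in the real code */
    info.num_partitions = NUM_BUFFER_PARTITIONS;
    shared = ShmemInitHash("Shared Buffer Lookup Table", size, size, &info,
                           HASH_ELEM | HASH_BLOBS | HASH_PARTITION);

    (void) local;
    (void) shared;
}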

The last two commits implement the simplest form of the buf-tree in shared memory, alongside the
existing hashtable.
SharedBufTree simply repeats every action performed on the hashtable and compares the results.
For now, it relies entirely on the synchronization performed for the hashtable at the layer above,
i.e. BufferAlloc(...). Also, the amount of memory used for the tree's nodes/leaves is currently
based on arbitrary numbers (working with the default 128MB shared memory size).
Operations that work with buffers in bulk (drop/truncate/checkpoint/...) are not touched.
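
In code, the cross-check amounts to roughly the following; BufTreeLookup() is a placeholder name
for the tree-side lookup, while the rest is the existing hashtable machinery from buf_internals.h:

#include "postgres.h"
#include "storage/buf_internals.h"

/* placeholder for the tree-side lookup */
extern int BufTreeLookup(BufferTag *tag);

/*
 * Sketch of the shadow lookup done while the buffer partition lock is already
 * held for the hashtable.
 */
static int
lookup_and_crosscheck(BufferTag *tag)
{
    uint32      hash = BufTableHashCode(tag);
    LWLock     *partitionLock = BufMappingPartitionLock(hash);
    int         buf_id;

    LWLockAcquire(partitionLock, LW_SHARED);
    buf_id = BufTableLookup(tag, hash);
    Assert(BufTreeLookup(tag) == buf_id);   /* tree must mirror the hashtable */
    LWLockRelease(partitionLock);

    return buf_id;
}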

So, this is the plan that I would like to follow next:

1) Work on synchronization.
2) Examine bulk operations on buffers and replace them with a tree (prefix) iterator.
3) Find a reasonable way to size memory for tree maintenance.

Ideas, suggestions, and comments are welcome.

A dynamic data structure like a tree brings problems that are not relevant to the current hashtable
implementation. The number of LWLocks in the hashtable is fixed and equal to the number of
partitions (128), and the memory footprint for buckets and hash entries can easily be precomputed.
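
For reference, the hashtable's locking is completely static -- the lock is picked by hashing the
BufferTag -- roughly as defined in src/include/storage/lwlock.h and buf_internals.h:

/* number of partitions of the shared buffer mapping hashtable */
#define NUM_BUFFER_PARTITIONS  128

/* the partition (and hence the LWLock) is derived from the tag's hash code */
#define BufTableHashPartition(hashcode) \
    ((hashcode) % NUM_BUFFER_PARTITIONS)
#define BufMappingPartitionLock(hashcode) \
    (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
        BufTableHashPartition(hashcode)].lock)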

Currently, I am thinking about the idea of tree decomposition, so that it is really a
tree of trees, where RelFileNode [12 bytes] (maybe + ForkNum) keys the top tree and the
remaining fields key the per-relation trees of blocks. Each bottom tree can then hold a single
private LWLock that protects its content. Wouldn't that be overkill? If we take a hypothetical
example with 2 tablespaces (hdd + ssd), each with 500+ objects, the number of LWLocks would have
to scale accordingly, or we would be forced to reuse them by unloading block trees, which should
be done in some intelligent way.
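
A minimal sketch of that decomposition (type and field names are illustrative only, not code from
the patch):

#include "storage/lwlock.h"
#include "lib/artree.h"         /* assumed header for the ported art_tree */

/* bottom tree: one per relation (fork), protected by its own private lock */
typedef struct BlockSubTree
{
    LWLock      lock;           /* protects only this subtree */
    art_tree    blocks;         /* BlockNumber -> buffer id */
} BlockSubTree;

/*
 * Top tree ('MainTree'): keyed by RelFileNode (+ ForkNum); its leaves point to
 * BlockSubTree instances. Once the working set of relations is cached, it is a
 * read-mostly structure.
 */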

Another approach is to use lock coupling... with the penalty of cache invalidation.

-- 
[0] https://github.com/mvpant/postgres/commits/artbufmgr




Re: [GSoC] artbufmgr

From: pantilimonov michael
Date:
Greetings all,

> So, this is the plan that I would like to follow next:
>
> 1) Work on synchronization.
> 2) Examine bulk operations on buffers and replace them with a tree (prefix) iterator.
> 3) Find a reasonable way to size memory for tree maintenance.
>

Previously I described the plan that I wanted to follow; the results achieved so far are summarized
below. I should note beforehand that the third item hasn't been touched yet.

>
> A dynamic data structure like a tree brings problems that are not relevant to the current hashtable
> implementation. The number of LWLocks in the hashtable is fixed and equal to the number of
> partitions (128), and the memory footprint for buckets and hash entries can easily be precomputed.
>
> Currently, I am thinking about the idea of tree decomposition, so that it is really a
> tree of trees, where RelFileNode [12 bytes] (maybe + ForkNum) keys the top tree and the
> remaining fields key the per-relation trees of blocks. Each bottom tree can then hold a single
> private LWLock that protects its content. Wouldn't that be overkill? If we take a hypothetical
> example with 2 tablespaces (hdd + ssd), each with 500+ objects, the number of LWLocks would have
> to scale accordingly, or we would be forced to reuse them by unloading block trees, which should
> be done in some intelligent way.
>

Initially I started with a single-lock ART tree, using the same locking strategy as the existing
hashtable. Both structures were guarded by "defines", so I could run them independently or
simultaneously. The latter option was mainly used as a validation check that the tree behaves the
same way as the hashtable. I tested the single-lock tree version using pgbench and 'installcheck',
in order to exercise it under activity where multiple parallel processes are involved and fresh
relation tags arrive/leave due to the creation/dropping of tables.
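
In effect, the single-lock variant boils down to the following (SharedBufTreeLock and
BufTreeLookup are placeholder names of mine, not identifiers from the patch):

#include "postgres.h"
#include "storage/buf_internals.h"

/* placeholders: one tree-wide lock and the tree-side lookup */
extern LWLock *SharedBufTreeLock;
extern int BufTreeLookup(BufferTag *tag);

/* one tree-wide lock stands in for all 128 partition locks */
static int
buftree_lookup_single_lock(BufferTag *tag)
{
    int         buf_id;

    LWLockAcquire(SharedBufTreeLock, LW_SHARED);
    buf_id = BufTreeLookup(tag);
    LWLockRelease(SharedBufTreeLock);

    return buf_id;
}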

It was obvious that a single-lock tree, by definition, can't keep up with the 128-lock hashtable
(each lock protects a specific region/partition) in all concurrent cases, so the idea was to split
the current 'MainTree' into a tree of trees. Such a separation has additional benefits besides the
throughput improvement:

a) Most of the time (relatively) 'MainTree' is not modified; as time goes by it gets saturated with
    the 'most used' relation keys. After that point, it is used as a mostly read-only structure.
b) The 'path' to a specific subtree of 'MainTree' can be cached in SMgrRelation by saving pointers
    to the corresponding ForkNums (just an array of pointers[max_forknum + 1]); see the sketch after
    the BufferTag definition below. In the current implementation, this optimization can reduce the
    key length from 20 to 4 bytes.

typedef struct buftag
{
    Oid         spcNode;        /* tablespace */
    Oid         dbNode;         /* database */
    Oid         relNode;        /* relation */
    ForkNumber  forkNum;
    BlockNumber blockNum;       /* blknum relative to begin of reln */
} BufferTag;
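
Here is the sketch of the caching idea from (b); the added field and the BlockSubTree type are
illustrative assumptions of mine, not the patch's actual code:

#include "postgres.h"
#include "common/relpath.h"         /* MAX_FORKNUM */
#include "storage/relfilenode.h"    /* RelFileNodeBackend */

typedef struct SMgrRelationData
{
    RelFileNodeBackend smgr_rnode;      /* relation physical identifier */

    /* ... existing smgr fields ... */

    /*
     * Cached pointers into 'MainTree', one per fork: once a fork's subtree is
     * located, later lookups only need the 4-byte BlockNumber key.
     */
    struct BlockSubTree *cached_fork_tree[MAX_FORKNUM + 1];
} SMgrRelationData;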

To split 'MainTree', I have injected an LWLock into the art_tree structure and created an
additional, separate freelist of the subtrees' nodes, which is then used for dynamic allocation
and deallocation.
The key length in 'MainTree' is 16 bytes (spcNode, dbNode, relNode, forkNum) and can likely
be reduced to 13 bytes by shrinking the forkNum value, which ranges only from -1 to 3 but occupies
4 bytes.
The key length of each subtree is only 4 bytes - the BlockNumber.
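
For concreteness, a sketch of how the two key layouts could be derived from the BufferTag shown
above (helper names and the exact byte layout are my assumptions, not the patch's code):

/* assumes the buftag layout shown above */

/* 16-byte 'MainTree' key: spcNode | dbNode | relNode | forkNum */
static inline void
main_tree_key(const BufferTag *tag, unsigned char key[16])
{
    memcpy(key,      &tag->spcNode, sizeof(Oid));
    memcpy(key + 4,  &tag->dbNode,  sizeof(Oid));
    memcpy(key + 8,  &tag->relNode, sizeof(Oid));
    memcpy(key + 12, &tag->forkNum, sizeof(tag->forkNum));  /* 4 bytes today; the
                                                             * -1..3 range would
                                                             * fit in 1 byte */
}

/* 4-byte subtree key: just the block number */
static inline void
sub_tree_key(const BufferTag *tag, unsigned char key[4])
{
    memcpy(key, &tag->blockNum, sizeof(BlockNumber));
}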

Below are the results of tests performed on databases initialized with
pgbench -i -s 50 (shared_buffers=128MB)
and pgbench -i -s 100 (shared_buffers=1GB).
It should be noted that such a workload does not really represent 'real life' results,
as all contention goes to certain partitions (in the case of the hashtable) or subtrees (in the case of the tree).
The next goal is to compare the data structures on a more realistic benchmark, or simply create
multiple databases inside the cluster and run a corresponding number of pgbench instances.

Tested on a PC with an i7 CPU and an SSD.
Each test was performed 5 times; the read-only tests were run back to back, while the full tests
were run from a fresh start (otherwise some problems in the tree remain to be fixed). The best result is taken.

readonly test: pgbench --random-seed=2 -t 100000 -S -c 6 -j 6 -n
full test: pgbench --random-seed=2 -t 10000 -c 6 -j 6
drop test:
       create table test_drop2 as select a, md5(a::text) from generate_series(1, 20000) as a; -- ~167 blocks
       drop table test_drop2;

128MB:
    readonly HASH: latency average = 0.102 ms, tps = 58934.229170 (excluding connections establishing)
    full     HASH: latency average = 1.233 ms, tps = 4868.332127 (excluding connections establishing)

    readonly TREE: latency average = 0.106 ms, tps = 56818.893538 (excluding connections establishing)
    full     TREE: latency average = 1.227 ms, tps = 4890.546989 (excluding connections establishing)

1GB:
    readonly HASH: latency average = 0.100 ms, tps = 60120.841307 (excluding connections establishing)
    full     HASH: latency average = 1.325 ms, tps = 4529.205977 (excluding connections establishing)
    drop     HASH: min ~4.9 ms, max ~54 ms

    readonly TREE: latency average = 0.100 ms, tps = 60247.345044 (excluding connections establishing)
    full     TREE: latency average = 1.286 ms, tps = 4665.538565 (excluding connections establishing)
    drop     TREE: min ~4.3 ms, max ~52 ms

These tests do not show a significant advantage for either data structure, but it should be noted
that the maintenance complexity of the tree (due to its dynamic nature) is much higher than that of
the existing hashtable implementation for buffer management.

Before the start of this project, we assumed that the tree might not be faster than the hash table,
but would at least be on a par with it.
 
The tree's additional properties may allow other operations to be performed more efficiently,
making it more attractive as a future structure for buffer management.

So, I feel it is better for now _not_ to concentrate on those 'additional properties' and instead
pursue the third item stated at the beginning -- maintenance complexity. Then, at the end of the
project, the tree can be used as a full alternative to the existing hashtable and serve as the base
for a future implementation of those 'additional properties', like drop/truncate/checkpoint/relation
extension/etc. My attempt to implement a tree version of DropRelFileNodesAllBuffers showed that
there are many places in the system that would have to cope with these changes.
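
For illustration, here is a sketch of what a prefix-iteration-based drop could look like; the
tree-side API (SharedBufTree, BufTreePrefixIterate) is a placeholder of my own, not the patch's code:

#include "postgres.h"
#include "storage/buf_internals.h"

/* placeholders for the tree API assumed in this sketch */
typedef struct SharedBufTreeData SharedBufTreeData;
extern SharedBufTreeData *SharedBufTree;
typedef bool (*buftree_iter_cb) (BlockNumber blkno, int buf_id, void *arg);
extern void BufTreePrefixIterate(SharedBufTreeData *tree, RelFileNode *rnode,
                                 ForkNumber forkNum, buftree_iter_cb cb, void *arg);

/* callback: invalidate one cached block of the dropped relation */
static bool
invalidate_buffer_cb(BlockNumber blkno, int buf_id, void *arg)
{
    BufferDesc *buf = GetBufferDescriptor(buf_id);

    /*
     * ...lock the buffer header and call InvalidateBuffer(buf), as the
     * existing code in bufmgr.c does for each matching buffer...
     */
    (void) buf;
    return true;                /* continue iteration */
}

/*
 * Tree-based counterpart of DropRelFileNodesAllBuffers() for one fork:
 * visit only the (blockNum -> buf_id) pairs under the relation's prefix
 * instead of scanning the whole buffer pool.
 */
static void
drop_relation_buffers_tree(RelFileNode rnode, ForkNumber forkNum)
{
    BufTreePrefixIterate(SharedBufTree, &rnode, forkNum,
                         invalidate_buffer_cb, NULL);
}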




Re: [GSoC] artbufmgr

From: pantilimonov michael
Date:
Hello everyone,

> It should be noted that such a workload does not really represent 'real life' results,
> as all contention goes to certain partitions (in the case of the hashtable) or subtrees (in the case of the tree).
> The next goal is to compare the data structures on a more realistic benchmark, or simply create
> multiple databases inside the cluster and run a corresponding number of pgbench instances.
>

I am still working on a better way to allocate and recycle the different parts of the tree (nodes,
subtrees, etc.), but would like to share the latest benchmark results.

Here is a link to the google sheet:
https://docs.google.com/spreadsheets/d/1VfVY0NUnPQYqgxMEXkpxhHvspbT9uZPRV9mflu8UhLQ/edit?usp=sharing

(Excuse the link; it is convenient to accumulate and check results in a spreadsheet.)

The comparison is done against PG 11.3 (0616aed243).
Each TPC-H query was run 12 times; server restarts weren't performed between queries.
The average is calculated over 10 runs, with the first 2 skipped.

Generally speaking, the current shared tree performs worse than the hashtable in the majority of the
TPC-H test queries, especially in the case of 4GB shared buffers. With a larger buffer cache of 16GB
the situation looks better, but there is still a 1-6% performance drop in most of the queries.
I haven't profiled any query yet, but suppose that there are a couple of places worth optimizing.

It is interesting to note that the pgbench results show the same pattern as with the 128MB and 1GB
buffer cache sizes: the hashtable performs slightly better on the select-only workload, while the
tree has better tps throughput on the tpcb-like workload.