Thread: [Proposal] Improvement of GiST page layout

[Proposal] Improvement of GiST page layout

From:
Andrew Borodin
Date:
Hi, hackers!

I want to propose an improvement to the GiST page layout.

GiST is optimized to reduce disk operations on search queries, for
example, window search queries in the case of an R-tree.

I expect that a more complicated page layout will help trade some
disk efficiency for CPU efficiency.

A GiST tree is a balanced tree structure with two kinds of pages:
internal pages and leaf pages. Each tree page contains a bunch of
tuples with the same structure. Tuples of an internal page reference
other pages of the same tree, while tuples of a leaf page hold heap
TIDs.

During execution of a search query, GiST invokes the key comparison
algorithm for each tuple on a page, with two possible outcomes: 'no'
and 'maybe'. A 'maybe' answer makes the search algorithm descend
recursively to the referenced page (in the case of an internal page)
or yield the tuple (in the case of a leaf page).
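
To illustrate, here is a simplified sketch of that per-page scan loop
(a fragment, not the actual gistget.c code: page, query and maxoff
come from the surrounding scan context, and key_maybe_matches,
report_match and push_to_stack are hypothetical stand-ins; the page
accessors are the usual PostgreSQL ones):

/* Sketch: every tuple on the page is examined, one comparison each. */
for (OffsetNumber i = FirstOffsetNumber; i <= maxoff; i++)
{
    IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, i));

    if (!key_maybe_matches(itup, query))    /* 'no' */
        continue;

    if (GistPageIsLeaf(page))
        report_match(itup);                 /* 'maybe' on a leaf page */
    else                                    /* 'maybe' on an internal page */
        push_to_stack(ItemPointerGetBlockNumber(&itup->t_tid));
}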

The expected tuple count on a page is on the order of tens to
hundreds. This count is big enough to make it worth trying to save
some cache lines from being loaded into the CPU during enumeration.

For B-trees, during inspection of a page we effectively apply a
binary search algorithm; that is not possible in a GiST tree.

Let's consider an R-tree with arbitrary fan-out f. If a given query
finds exactly one data tuple, it is easy to show that the key
comparison count is minimal when f -> e ≈ 2.718, which rounds to 3:
the tree has to examine f*h keys, where h = log_f(N) is the tree
height, and f*log_f(N) = f*ln(N)/ln(f) is minimized when f -> e. A
smaller key comparison count means fewer cache lines are touched. So
fan-out reduction means cache pressure reduction (except an average
fan-out of 2, which seems to be too small) and less time waiting for
RAM. I suppose all that reasoning holds true in cases when more than
just one tuple is found.
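
A quick numeric check of that claim (a self-contained toy program,
not part of any patch):

#include <math.h>
#include <stdio.h>

/*
 * Keys examined on a root-to-leaf path: f * log_f(N) = f * ln(N) / ln(f).
 * The continuous minimum of f / ln(f) is at f = e, since its derivative
 * vanishes when ln(f) = 1.
 */
int
main(void)
{
    const double N = 1e6;       /* arbitrary tree size */

    for (int f = 2; f <= 8; f++)
        printf("f=%d  comparisons=%.1f\n", f, f * log(N) / log(f));
    return 0;                   /* the minimum is printed at f = 3 */
}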

How do we reduce tree fan-out? Obviously, we can't fill a page with
just 3 tuples. But we can install a small tree-like structure inside
one page. A general GiST index has a root page; a page tree should
likewise have a "root" layer of tuples. Private (or internal,
intermediate, auxiliary; I can't pick a precise word) tuples have
only keys and a fixed-size (F) array of offsets of underlying
records. Each layer is a linked list. Right after a page has been
allocated, there is only the "ground" level of regular tuples.
Eventually the record count reaches F-1 and we create a new root
layer with two tuples, each referencing half of the preexisting
records. Placement of new "ground" tuples on the page will eventually
cause an internal tuple to split. If there is not enough space to
split an internal tuple, we mark the page for a whole page split
during the next iteration of the owning tree's insertion algorithm.
That is why a tuple split happens at F-1 tuples, not at F: if we have
no space for splitting, we just add a reference to the last slot. In
this algorithm, a page split will cause major page defragmentation:
we take the root layer, halve it, and place the halves on different
pages. When half of the data has gone to the other page,
restructuring should tend to place records so that tuples accessed
together lie together. I think placing a whole level together is a
good strategy.
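
To make the layout concrete, here is a rough sketch of what such an
internal (auxiliary) tuple might look like; the struct name and all
fields are hypothetical, just to fix ideas:

#include <stdint.h>

typedef uint16_t OffsetNumber;  /* stand-in for the PostgreSQL type */

#define F 5                     /* intra-page fan-out factor */

/*
 * An intra-page internal tuple: a key plus a fixed-size array of
 * offsets of the tuples one layer below on the same page. One slot
 * is effectively kept in reserve, so an overflowing reference can be
 * parked there until a split is performed (hence splits happen at
 * F-1 children, not at F).
 */
typedef struct IntraPageTuple
{
    OffsetNumber next;          /* next tuple of the same layer (linked list) */
    uint16_t     nchildren;     /* how many of the F slots are used */
    OffsetNumber children[F];   /* offsets of the underlying records */
    /* ... variable-size opclass key (e.g. a bounding box) follows ... */
} IntraPageTuple;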

Let's look at how a page grows with fan-out factor F=5. RLS is the
root layer start, G a ground tuple, Ix an internal tuple of level x.

After we add 3 ground tuples, there is just the ground layer:
RLS=0|G G G
Then we place one more tuple and the layer splits:
RLS=4|G G G G I0 I0
Each I0 tuple now references two G tuples. We keep placing G tuples:
RLS=4|G G G G I0 I0 G G
Then one of the I0 tuples splits:
RLS=4|G G G G I0 I0 G G G I0
And right after that, one more I0 split creates a new layer:
RLS=12|G G G G I0 I0 G G G I0 G I0 I1 I1

And so on, until we run out of space on the page. In regular GiST we
run out of space on a page before we insert a tuple into it; now we
can run out of space during an insertion. But this is not fatal: we
will still be able to place the ground-level tuple. The inner index
structure will use the one extra slot for the reference, and the
next insertion into the page will deal with it. On pages marked for
a split, we have to find which index tuple exactly has run out of
its extra slot and fix it during the split.
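
Here is a sketch of the insertion path under this scheme, building on
the IntraPageTuple sketch above (all helper names are hypothetical;
WAL logging and error handling omitted):

/*
 * Hypothetical insertion into a page carrying an intra-page tree.
 * Returns true if the owning tree must split this page afterwards.
 */
static bool
intra_page_insert(Page page, IndexTuple itup)
{
    OffsetNumber    off    = place_ground_tuple(page, itup);
    IntraPageTuple *parent = choose_parent(page, itup);  /* penalty-based */

    if (parent->nchildren < F - 1)
    {
        parent->children[parent->nchildren++] = off;     /* normal case */
        return false;
    }

    if (split_internal_tuple(page, parent))              /* room to split? */
        return false;

    /* No room to split: park the reference in the reserved last slot
     * and ask the owning tree for a whole-page split. */
    parent->children[parent->nchildren++] = off;
    return true;
}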

Several years ago I made an unsuccessful attempt to implement a
similar algorithm in the database engine of a proprietary system. I
got stuck debugging the deletion algorithm and the tree condensation
that system uses. I suppose it was a mistake to defragment the page
with a creeping heuristic; eventually I dropped the idea and moved
on to actually important tasks (there is always a deadline rush in
business). I'm a newbie in Postgres internals, but as far as I can
see there is no deletion on a GiST page, so I feel the itch to try
this feature one more time.

I expect that an implementation of this proposal could speed up
insertions into the index by 20% and query performance by 30% when
the whole index fits in shared buffers. In case of buffer
starvation, when the index is accessed through disk, this feature
will cause a 15% performance degradation, since the effective page
capacity will be smaller. Should this feature be configurable? Or
maybe this should be a separate access method?

I need help to assess the amount of work TBD. WAL redo, vacuum, bulk
index construction: what other parts are touched and have to be
reconstructed?

Please tell me what you think about this proposal.

Best regards, Andrey Borodin, Octonica & Ural Federal University.



Re: [Proposal] Improvement of GiST page layout

From:
Heikki Linnakangas
Date:
On 09/02/16 07:15, Andrew Borodin wrote:
> Hi, hackers!
>
> I want to propose an improvement to the GiST page layout.
>
> GiST is optimized to reduce disk operations on search queries, for
> example, window search queries in the case of an R-tree.
>
> I expect that a more complicated page layout will help trade some
> disk efficiency for CPU efficiency.
>
> A GiST tree is a balanced tree structure with two kinds of pages:
> internal pages and leaf pages. Each tree page contains a bunch of
> tuples with the same structure. Tuples of an internal page reference
> other pages of the same tree, while tuples of a leaf page hold heap
> TIDs.
>
> During execution of a search query, GiST invokes the key comparison
> algorithm for each tuple on a page, with two possible outcomes: 'no'
> and 'maybe'. A 'maybe' answer makes the search algorithm descend
> recursively to the referenced page (in the case of an internal page)
> or yield the tuple (in the case of a leaf page).
>
> The expected tuple count on a page is on the order of tens to
> hundreds. This count is big enough to make it worth trying to save
> some cache lines from being loaded into the CPU during enumeration.
>
> For B-trees, during inspection of a page we effectively apply a
> binary search algorithm; that is not possible in a GiST tree.
>
> Let's consider an R-tree with arbitrary fan-out f. If a given query
> finds exactly one data tuple, it is easy to show that the key
> comparison count is minimal when f -> e ≈ 2.718, which rounds to 3:
> the tree has to examine f*h keys, where h = log_f(N) is the tree
> height, and f*log_f(N) = f*ln(N)/ln(f) is minimized when f -> e. A
> smaller key comparison count means fewer cache lines are touched. So
> fan-out reduction means cache pressure reduction (except an average
> fan-out of 2, which seems to be too small) and less time waiting for
> RAM. I suppose all that reasoning holds true in cases when more than
> just one tuple is found.

GiST comparison operators tend to be quite expensive, so I suspect much 
of the gain is simply from reducing the number of comparisons, rather 
than cache efficiency. The conclusion is the same though, so it doesn't 
matter.

> How do we reduce tree fan-out? Obviously, we can't fill a page with
> just 3 tuples. But we can install a small tree-like structure inside
> one page. A general GiST index has a root page; a page tree should
> likewise have a "root" layer of tuples. Private (or internal,
> intermediate, auxiliary; I can't pick a precise word) tuples have
> only keys and a fixed-size (F) array of offsets of underlying
> records. Each layer is a linked list. Right after a page has been
> allocated, there is only the "ground" level of regular tuples.
> Eventually the record count reaches F-1 and we create a new root
> layer with two tuples, each referencing half of the preexisting
> records. Placement of new "ground" tuples on the page will eventually
> cause an internal tuple to split. If there is not enough space to
> split an internal tuple, we mark the page for a whole page split
> during the next iteration of the owning tree's insertion algorithm.
> That is why a tuple split happens at F-1 tuples, not at F: if we have
> no space for splitting, we just add a reference to the last slot. In
> this algorithm, a page split will cause major page defragmentation:
> we take the root layer, halve it, and place the halves on different
> pages. When half of the data has gone to the other page,
> restructuring should tend to place records so that tuples accessed
> together lie together. I think placing a whole level together is a
> good strategy.

Yeah, something like that would make a lot of sense. I've actually been 
thinking of this same idea myself over the years, but never got around 
to implementing it.

I'm not wedded to that particular layout, it seems a bit complicated. A 
two-level setup within each page might be good enough in practice: there 
would be regular tuples, like today, and also "container" tuples, which 
represent a group of regular tuples. Also note that you can freely move 
tuples around within a page, so instead of storing an actual array of 
offsets in the container tuples, you could store just a range of line 
pointers, and keep all the child tuples packed together.
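
As a rough illustration of that alternative (the struct and its
fields are hypothetical, just to show the idea):

#include <stdint.h>

typedef uint16_t OffsetNumber;  /* stand-in for the PostgreSQL type */

/*
 * A "container" tuple for a two-level page layout: instead of a
 * fixed-size array of child offsets, it stores a contiguous range of
 * line pointers. The child tuples are kept packed together, which
 * works because tuples can be freely moved around within a page.
 */
typedef struct ContainerTuple
{
    OffsetNumber firstchild;    /* first line pointer of the group */
    OffsetNumber lastchild;     /* last line pointer of the group */
    /* ... union key covering all children (e.g. a bounding box) ... */
} ContainerTuple;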

I wonder if it's really a good idea to perform page split by simply 
using the "container" tuples. It certainly saves CPU time when 
splitting, but I'm worried that the split might not be as good as today. 
The mini-groups within the page are grown "organically" as tuples are 
inserted, and the page split algorithm might come up with a better 
grouping if it sees all the tuples at once.

> I expect that an implementation of this proposal could speed up
> insertions into the index by 20% and query performance by 30% when
> the whole index fits in shared buffers. In case of buffer
> starvation, when the index is accessed through disk, this feature
> will cause a 15% performance degradation, since the effective page
> capacity will be smaller.

Not sure how you got those numbers, but the general direction sounds right.

> Should this feature be configurable? Or maybe this should be a
> separate access method?

I think this should be included in GiST, and always enabled. Or at least 
enabled by default. While the index will be slightly larger because of 
the extra tuples, it'd almost always be a good tradeoff.

> I need help to assess the amount of work TBD. WAL redo, vacuum, bulk
> index construction: what other parts are touched and have to be
> reconstructed?

Something to keep in mind is pg_upgrade. The new code needs to still be 
able to read the old disk format. That means that it's easier to keep 
the page format as close to the current format as possible, and there 
needs to be a version number or something that you can use to 
distinguish the old and new format pages.
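
One way that could look (F_INTRA_TREE and both scan helpers are
made-up names for illustration; GiST pages already keep a flags field
in their special area):

/*
 * Hypothetical format check: a new flag bit marks pages written in
 * the new intra-tree layout, so old-format pages (e.g. carried over
 * by pg_upgrade) still take the current linear-scan code path.
 */
#define F_INTRA_TREE (1 << 5)   /* hypothetical new page flag */

static void
scan_page(Page page)
{
    if (GistPageGetOpaque(page)->flags & F_INTRA_TREE)
        scan_with_intra_tree(page);  /* new layout: descend the page tree */
    else
        scan_all_tuples(page);       /* old layout: plain linear scan */
}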

Overall, +1 for pursuing this idea!

- Heikki