Discussion: Faster inserts with mostly-monotonically increasing values


Faster inserts with mostly-monotonically increasing values

From
Pavan Deolasee
Date:
Hello All,

On a per-session basis, we cache the last heap block used for inserts and try to use the same block for subsequent inserts. We don't do that for indexes, because the target block in the index is determined by the overall structure of the index and the index key being inserted, so caching is usually not very useful. But for certain typical, yet not-so-uncommon cases we can make a similar optimisation. For example, consider a btree index on a "timestamp" column or on a "serial" column. In such cases, each new index entry typically goes beyond the rightmost entry in the index. If we cache the rightmost block and check it first, we can avoid the cost of descending the index and searching for the target block.

Here is a patch that implements the idea. If the last insert happens to be in the rightmost block of an index, then we cache the block and check it first for the next insert. For the cached block to be usable for the insert, it must still be the rightmost block, the to-be-inserted key must go into the cached block, and there must be enough free space in the block. If any of these conditions is not met, we fall back to the regular code path, i.e. descending the index from the top.

I benchmarked the patch by creating a simple table with just one bigint column and a btree index on it. Here are the results:

Monotonically Increasing 10M Inserts (time in ms)
======================================
Run Patched  Master
1 17656.222 25737.593
2 18765.002 26843.314
3 20629.567 27058.358
4 21174.998 26680.003
5 21118.865 26660.557

Avg 19868.9308 26595.965 (33.86% improvement)


Monotonically Increasing 100M Inserts (time in ms)
======================================
Run Patched  Master
1 159861.58 248885.763
2 160138.432 256438.625
3 160426.135 250379.589
4 163218.405 249474.695
5 161125.523 247805.675

Avg 160954.015 250596.8694 (55.69% improvement)


So as the size of the index increases, the benefits of the patch also tend to increase. This is most likely because as the index becomes taller, the costs associated with index descent become higher.

Patch credit: this work is based on Simon Riggs's original ideas and research.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments

Re: Faster inserts with mostly-monotonically increasing values

From
Peter Geoghegan
Date:
On Sun, Dec 31, 2017 at 6:44 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> Here is a patch that implements the idea. If the last insert happens to be
> in the rightmost block of an index, then we cache the block and check that
> first for the next insert. For the cached block to be usable for the insert,
> it should still be the rightmost, the to-be-inserted key should go into the
> cached block and there is enough free space in the block. If any of these
> conditions do not meet, we fall back to the regular code path, i.e. descent
> down the index from the top.

I'd be particularly concerned about page deletion/page recycling still
being correct with the patch, especially because it's hard to test
anything there. The P_ISLEAF() part of your fastpath's test hints at
this -- why should it not *always* be a leaf page (surely you should
be sure that the page cannot be recycled anyway)? I also have my
doubts about unique index enforcement remaining correct with the patch
when there are many physical duplicates, to the extent that more than
a single leaf page is needed for a single value.

Maybe you should do something with page LSN here -- cache LSN
alongside block number, and have a non-matching LSN invalidate the
test.

How many clients are involved with your benchmark?

> So as the size of the index increases, the benefits of the patch also tend
> to increase. This is most likely because as the index becomes taller and
> taller, the costs associated with index descent becomes higher.

FWIW, I think that int4 indexes have to be extremely large before they
exceed 4 levels (where the leaf level counts as a level). My
conservative back-of-an-envelope estimate is 9 billion index tuples.

-- 
Peter Geoghegan


Re: Faster inserts with mostly-monotonically increasing values

From
Andrey Borodin
Date:
Hi!
On 31 Dec 2017, at 11:44, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
On a per-session basis, we cache the last heap block used for inserts and try to use the same block for subsequent inserts.
+1, from my POV this is a good idea, and it's cool that it already has an implementation.
I'd suggest not doing a 50/50 split for these pages, to keep the tree shorter; something like a fillfactor/(1-fillfactor) split instead. That would also make splits less frequent.

Best regards, Andrey Borodin.

Re: Faster inserts with mostly-monotonically increasing values

From
Pavan Deolasee
Date:


On Sun, Dec 31, 2017 at 4:36 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Sun, Dec 31, 2017 at 6:44 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> Here is a patch that implements the idea. If the last insert happens to be
> in the rightmost block of an index, then we cache the block and check that
> first for the next insert. For the cached block to be usable for the insert,
> it should still be the rightmost, the to-be-inserted key should go into the
> cached block and there is enough free space in the block. If any of these
> conditions do not meet, we fall back to the regular code path, i.e. descent
> down the index from the top.

I'd be particularly concerned about page deletion/page recycling still
being correct with the patch, especially because it's hard to test
anything there. The P_ISLEAF() part of your fastpath's test hints at
this -- why should it not *always* be a leaf page (surely you should
be sure that the page cannot be recycled anyway)?

I thought we could throw in a bunch of checks there to ensure that we are still looking at the correct rightmost page, or else fall back to the regular path. Do you see something missing there? I don't claim that I've got everything correct, but since we're holding an exclusive lock on the page at that point, wouldn't we be able to guard against any concurrent splits/deletions? I will reread the documentation/code, but I must admit it's not particularly easy to understand all the concurrency issues.

 
I also have my
doubts about unique index enforcement remaining correct with the patch
when there are many physical duplicates, to the extent that more than
a single leaf page is needed for a single value.

Why would that happen? We are checking if the to-be-inserted key is strictly greater than the last key in the rightmost block and only then proceeding with the fast path. And we do this while holding an exclusive lock on the page. Are you suggesting that someone can concurrently still split the page?
 

Maybe you should do something with page LSN here -- cache LSN
alongside block number, and have a non-matching LSN invalidate the
test.

This is a good idea, but we might need to track the block/LSN in shared memory. Otherwise, with multiple backends doing inserts, we might never be able to match LSNs (they will mostly vary, so we would fall back to the slow path for the majority of inserts).
 

How many clients are involved with your benchmark?

Just a single client for these benchmarks.
 

> So as the size of the index increases, the benefits of the patch also tend
> to increase. This is most likely because as the index becomes taller and
> taller, the costs associated with index descent becomes higher.

FWIW, I think that int4 indexes have to be extremely large before they
exceed 4 levels (where the leaf level counts as a level). My
conservative back-of-an-envelope estimate is 9 billion index tuples.


Hmm, OK. I am not entirely sure whether it's the depth, or simply binary-searching through 3-4 index pages and/or pinning/unpinning buffers, that results in most of the overhead. I will run some more tests and collect more evidence.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Faster inserts with mostly-monotonically increasing values

From
"Tels"
Date:
Moin,

On Tue, January 2, 2018 7:51 am, Pavan Deolasee wrote:
> On Sun, Dec 31, 2017 at 4:36 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>
>> On Sun, Dec 31, 2017 at 6:44 AM, Pavan Deolasee
>> <pavan.deolasee@gmail.com> wrote:
>> > Here is a patch that implements the idea. If the last insert happens
>> > to be in the rightmost block of an index, then we cache the block and
>> > check that first for the next insert. For the cached block to be
>> > usable for the insert, it should still be the rightmost, the
>> > to-be-inserted key should go into the cached block and there is
>> > enough free space in the block. If any of these conditions do not
>> > meet, we fall back to the regular code path, i.e. descent down the
>> > index from the top.
[snip]

>> > So as the size of the index increases, the benefits of the patch
>> > also tend to increase. This is most likely because as the index
>> > becomes taller and taller, the costs associated with index descent
>> > becomes higher.
>>
>> FWIW, I think that int4 indexes have to be extremely large before they
>> exceed 4 levels (where the leaf level counts as a level). My
>> conservative back-of-an-envelope estimate is 9 billion index tuples.
>>
> Hmm Ok. I am not entirely sure whether it's the depth or just purely
> binary searching through 3-4 index pages and/or pinning/unpinning
> buffers result in much overhead. I will run some more tests and collect
> evidences.

Just a question trying to understand how btree indexes work:

If one inserts ever-increasing value, is the tree a balanced tree with a
minimum (or at least not-as-high) number of levels, or does it increase in
height every insert and creates a "tall stack"?

@Peter: Could you please share your back-of-the-envelope calculation, I'd
love to get some insights into the innards.

All the best,

Tels



Re: Faster inserts with mostly-monotonically increasing values

From
Pavan Deolasee
Date:


On Tue, Jan 2, 2018 at 7:15 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:
Moin,


> Hmm Ok. I am not entirely sure whether it's the depth or just purely
> binary searching through 3-4 index pages and/or pinning/unpinning
> buffers result in much overhead. I will run some more tests and collect
> evidences.

Just a question trying to understand how btree indexes work:

If one inserts ever-increasing value, is the tree a balanced tree with a
minimum (or at least not-as-high) number of levels, or does it increase in
height every insert and creates a "tall stack"?

AFAIK all pages will be half-filled in this case. Imagine that one index page can accommodate N entries; the page will be split when the (N+1)th entry is added. The left page will keep half the entries and the right page will get the remaining half. The next split will happen when the right page is full, i.e. when another N/2 entries are added. Thus there will be a split every N/2 inserts, creating an index with half-filled pages.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Faster inserts with mostly-monotonically increasing values

From
Alvaro Herrera
Date:
Pavan Deolasee wrote:
> On Tue, Jan 2, 2018 at 7:15 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:
> 
> > Just a question trying to understand how btree indexes work:
> >
> > If one inserts ever-increasing value, is the tree a balanced tree with a
> > minimum (or at least not-as-high) number of levels, or does it increase in
> > height every insert and creates a "tall stack"?
> 
> AFAIK all pages will be half-filled in this case. Imagine if one index page
> can accommodate N entries, the page will be split when (N+1)th entry is
> added. The left page will have half the entries and the right page will get
> the remaining. The next split will happen when the right page is full  i.e.
> when another N/2 entries are added. Thus there will be a split at every N/2
> inserts, creating an index with half filled pages.

Not really -- quoth _bt_findsplitloc():

 * If the page is the rightmost page on its level, we instead try to arrange
 * to leave the left split page fillfactor% full.  In this way, when we are
 * inserting successively increasing keys (consider sequences, timestamps,
 * etc) we will end up with a tree whose pages are about fillfactor% full,
 * instead of the 50% full result that we'd get without this special case.


To answer Tels's question: the tree is balanced by splitting pages and
propagating the splits upwards to the parent pages, all the way up to
the root.  When the root page gets full, it is also split and a new root
is created on top of the old root plus its new sibling page, which is
the only point at which the tree grows in depth.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

From
"Tels"
Date:
Hello Alvaro,

On Thu, January 4, 2018 7:35 am, Alvaro Herrera wrote:
> Pavan Deolasee wrote:
>> On Tue, Jan 2, 2018 at 7:15 PM, Tels <nospam-pg-abuse@bloodgate.com>
>> wrote:
>>
>> > Just a question trying to understand how btree indexes work:
>> >
>> > If one inserts ever-increasing value, is the tree a balanced tree with
>> a
>> > minimum (or at least not-as-high) number of levels, or does it
>> increase in
>> > height every insert and creates a "tall stack"?
>>
>> AFAIK all pages will be half-filled in this case. Imagine if one index
>> page
>> can accommodate N entries, the page will be split when (N+1)th entry is
>> added. The left page will have half the entries and the right page will
>> get
>> the remaining. The next split will happen when the right page is full
>> i.e.
>> when another N/2 entries are added. Thus there will be a split at every
>> N/2
>> inserts, creating an index with half filled pages.
>
> Not really -- quoth _bt_findsplitloc():
>
>  * If the page is the rightmost page on its level, we instead try to
>  * arrange to leave the left split page fillfactor% full.  In this way,
>  * when we are inserting successively increasing keys (consider
>  * sequences, timestamps, etc) we will end up with a tree whose pages
>  * are about fillfactor% full, instead of the 50% full result that we'd
>  * get without this special case.
>
>
> To answer Tels question: the tree is balanced by splitting pages and
> propagating the splits upwards to the parent pages, all the way up to
> the root.  When the root page gets full, it is also split and a new root
> is created on top of the old root plus its new sibling page, which is
> the only point at which the tree grows in depth.

Thank you for the explanation! You learn something new every time :)

All the best,

Tels



Re: Faster inserts with mostly-monotonically increasing values

From
Pavan Deolasee
Date:


On Thu, Jan 4, 2018 at 6:05 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Pavan Deolasee wrote:
> On Tue, Jan 2, 2018 at 7:15 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:
>
> > Just a question trying to understand how btree indexes work:
> >
> > If one inserts ever-increasing value, is the tree a balanced tree with a
> > minimum (or at least not-as-high) number of levels, or does it increase in
> > height every insert and creates a "tall stack"?
>
> AFAIK all pages will be half-filled in this case. Imagine if one index page
> can accommodate N entries, the page will be split when (N+1)th entry is
> added. The left page will have half the entries and the right page will get
> the remaining. The next split will happen when the right page is full  i.e.
> when another N/2 entries are added. Thus there will be a split at every N/2
> inserts, creating an index with half filled pages.

Not really -- quoth _bt_findsplitloc():

 * If the page is the rightmost page on its level, we instead try to arrange
 * to leave the left split page fillfactor% full.  In this way, when we are
 * inserting successively increasing keys (consider sequences, timestamps,
 * etc) we will end up with a tree whose pages are about fillfactor% full,
 * instead of the 50% full result that we'd get without this special case.


Ah ok. Thanks for pointing that out. Makes a lot more sense.

Thanks,
Pavan 


--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Faster inserts with mostly-monotonically increasing values

From
Simon Riggs
Date:
On 31 December 2017 at 11:06, Peter Geoghegan <pg@bowt.ie> wrote:
> On Sun, Dec 31, 2017 at 6:44 AM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
>> Here is a patch that implements the idea. If the last insert happens to be
>> in the rightmost block of an index, then we cache the block and check that
>> first for the next insert. For the cached block to be usable for the insert,
>> it should still be the rightmost, the to-be-inserted key should go into the
>> cached block and there is enough free space in the block. If any of these
>> conditions do not meet, we fall back to the regular code path, i.e. descent
>> down the index from the top.
>
> I'd be particularly concerned about page deletion/page recycling still
> being correct with the patch, especially because it's hard to test
> anything there. The P_ISLEAF() part of your fastpath's test hints at
> this -- why should it not *always* be a leaf page (surely you should
> be sure that the page cannot be recycled anyway)?

I don't see any issues related to these points.

We wouldn't be inserting into a deleted/recycled page, so no issues to
discuss. We're not allocating a new page, so no recycling issues.

The block is tested for whether it is an incomplete split and also
locked, so it cannot split while locked.

> I also have my
> doubts about unique index enforcement remaining correct with the patch
> when there are many physical duplicates, to the extent that more than
> a single leaf page is needed for a single value.

This isn't a problem either.

The patch only uses the cache when the value is higher than the
rightmost value. So by definition there are no duplicates in that
case.

Page splitting preserves the ordering property, so we are certain that
the rightmost block has the highest value.

We do need to check that the index is sorted ASC rather than DESC,
possibly with special action for NULL sorting.

> Maybe you should do something with page LSN here -- cache LSN
> alongside block number, and have a non-matching LSN invalidate the
> test.

Don't think we need that.

> How many clients are involved with your benchmark?

Just one, since it is demonstrating the reduced path length of doing that.

>> So as the size of the index increases, the benefits of the patch also tend
>> to increase. This is most likely because as the index becomes taller and
>> taller, the costs associated with index descent becomes higher.
>
> FWIW, I think that int4 indexes have to be extremely large before they
> exceed 4 levels (where the leaf level counts as a level). My
> conservative back-of-an-envelope estimate is 9 billion index tuples.

If they did, the gains would be even bigger.

But as you say, a unique index with many duplicates could be a
problem, and index fragmentation would make the situation worse.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

From
Claudio Freire
Date:
On Sun, Dec 31, 2017 at 8:06 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> I also have my
> doubts about unique index enforcement remaining correct with the patch
> when there are many physical duplicates, to the extent that more than
> a single leaf page is needed for a single value.

given...

+        if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
+            !P_INCOMPLETE_SPLIT(lpageop) &&
+            !P_IGNORE(lpageop) &&
+            (PageGetFreeSpace(page) > itemsz) &&
+            _bt_compare(rel, natts, itup_scankey, page,
+                        PageGetMaxOffsetNumber(page)) > 0)

The key there is strictly greater than the rightmost key. As such, it
must precede the first page with an equal key, and since it's the
rightmost page... there's no such key. But if there was, the check for
uniqueness only needs the insert point to point to the first such
page, and this guarantees it.

You could allow for some slack, by comparing against the leftmost key
instead. You just need to know whether the key fits in the page. Then,
the insert can proceed as usual, doing the binsearch to find the
proper insertion point, etc.

That would allow this case to be applied not only to perfectly
monotonic sequences, but for nearly monotonic ones as well (think
timestamps).


Re: Faster inserts with mostly-monotonically increasing values

From
Peter Geoghegan
Date:
On Thu, Mar 1, 2018 at 3:18 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> given...
>
> +        if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
> +            !P_INCOMPLETE_SPLIT(lpageop) &&
> +            !P_IGNORE(lpageop) &&
> +            (PageGetFreeSpace(page) > itemsz) &&
> +            _bt_compare(rel, natts, itup_scankey, page,
> +                        PageGetMaxOffsetNumber(page)) > 0)

One simple problem with this code is that it assumes that there must
be at least one item on a leaf page, which isn't guaranteed. There has
to be a highkey on non-rightmost pages, but rightmost leaf pages can be
totally empty. While btvacuumpage() is very unlikely to not be able to
begin deleting the page there and then, it can happen (see remarks at
the end of btvacuumpage() about recursing).

Other issues spotted:

* The variable name "lpageop" seems like it was lifted from somewhere
dealing with a page to the left of some other page, which is not the
case here.

* P_INCOMPLETE_SPLIT() checks a bit that can only be set on a
non-rightmost page -- a page that has a sibling to its right that
doesn't have a downlink in parent. The bit is only ever set on the
"target" of a (still unfinished) page split. This is why
_bt_moveright() doesn't bother with completing a page split when it
reaches a rightmost page. I don't see why you need this part of the
test at all.

> The key there is strictly greater than the rightmost key. As such, it
> must precede the first page with an equal key, and since it's the
> rightmost page... there's no such key. But if there was, the check for
> uniqueness only needs the insert point to point to the first such
> page, and this guarantees it.

This code assumes that it can use block number as a stable way of
finding the same point in the B-Tree over time, without any interlock.
That may actually work out in practice, since even if the page is
recycled there can only ever be one rightmost leaf page, but it seems
like a brittle approach to me. The page image may be recycled by the
FSM already, and we really shouldn't be depending on the page not
looking a particular way once that has happened. I guess that you can
say the same thing about using page LSN to determine cached block
staleness instead, but that still seems a lot safer.

Basically, I'm worried that we could do something entirely
unpredictable when a page has actually been recycled, since we're
entirely opting out of the RecentGlobalXmin interlock business on the
assumption that we can be sure that recycling hasn't occurred in some
other way. Imagine what could happen if we ever teach the FSM to share
freespace among small relations, which seems quite possible to me.

> You could allow for some slack, by comparing against the leftmost key
> instead. You just need to know whether the key fits in the page. Then,
> the insert can proceed as usual, doing the binsearch to find the
> proper insertion point, etc.

The whole way that this patch opts out of using an exclusive buffer
lock on the "first page the value could be on" (the _bt_check_unique()
+ _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't
terribly concrete.

-- 
Peter Geoghegan


Re: Faster inserts with mostly-monotonically increasing values

From
Claudio Freire
Date:
On Mon, Mar 5, 2018 at 9:12 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Mar 1, 2018 at 3:18 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> The key there is strictly greater than the rightmost key. As such, it
>> must precede the first page with an equal key, and since it's the
>> rightmost page... there's no such key. But if there was, the check for
>> uniqueness only needs the insert point to point to the first such
>> page, and this guarantees it.
>
> This code assumes that it can use block number as a stable way of
> finding the same point in the B-Tree over time, without any interlock.
> That may actually work out in practice, since even if the page is
> recycled there can only ever be one rightmost leaf page, but it seems
> like a brittle approach to me. The page image may be recycled by the
> FSM already, and we really shouldn't be depending on the page not
> looking a particular way once that has happened. I guess that you can
> say the same thing about using page LSN to determine cached block
> staleness instead, but that still seems a lot safer.
>
> Basically, I'm worried that we could do something entirely
> unpredictable when a page has actually been recycled, since we're
> entirely opting out of the RecentGlobalXmin interlock business on the
> assumption that we can be sure that recycling hasn't occurred in some
> other way. Imagine what could happen if we ever teach the FSM to share
> freespace among small relations, which seems quite possible to me.

Correct me if I'm wrong, but there's lots of code in nbtree already
that risks reading recycled pages for various reasons. Presumably,
checking P_ISDELETED and P_ISHALFDEAD should be enough, which is what
!P_IGNORE implies.

>> You could allow for some slack, by comparing against the leftmost key
>> instead. You just need to know whether the key fits in the page. Then,
>> the insert can proceed as usual, doing the binsearch to find the
>> proper insertion point, etc.
>
> The whole way that this patch opts out of using an exclusive buffer
> lock on the "first page the value could be on" (the _bt_check_unique()
> + _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't
> terribly concrete.

Assuming the rightmost page is the first page the value could be on,
it already does get an exclusive buffer lock.

And it seems that assumption is correct in the patch as-is. In fact,
the current patch checks for a much stronger situation than needed to
apply this optimization, so it can even skip checking for conflicting
concurrent transactions. With an exclusive lock on the buffer, and
being the rightmost page, it means there can be no conflicting key
since the check is that the scan key is strictly greater than the
rightmost key (let's forget for a moment empty rightmost pages). Unless
there can be a new rightmost page in construction somehow (which I
don't believe it can happen, new rightmost pages would be created by
splitting the rightmost page, and that would conflict with the
exclusive lock), I don't see how this can fail.

If one wanted to relax the condition as I proposed, that uniqueness
check is necessary; that's why I propose short-circuiting the search
only, and not the actual insertion on the page. I believe it's more
important to try to maximize the conditions under which the
optimization can be applied.


Re: Faster inserts with mostly-monotonically increasing values

From
Claudio Freire
Date:
On Mon, Mar 5, 2018 at 9:37 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Assuming the rightmost page is the first page the value could be on,
> it already does get an exclusive buffer lock.

That made me check, and:

+        /*
+         * Acquire exclusive lock on the buffer before doing any checks. This
+         * ensures that the index state cannot change, as far as the rightmost
+         * part of the index is concerned.
+         */
+        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

BTree code uses BT_READ and BT_WRITE instead, so that should be:

LockBuffer(buf, BT_WRITE)


Re: Faster inserts with mostly-monotonically increasing values

From
Peter Geoghegan
Date:
On Mon, Mar 5, 2018 at 4:37 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Correct me if I'm wrong, but there's lots of code in nbtree already
> that risks reading recycled pages for various reasons. Presumably,
> checking P_ISDELETED and P_ISHALFDEAD should be enough, which is what
> !P_IGNORE implies.

You're mistaken. Read the nbtree README on page deletion, and the
RecentGlobalXmin interlock. Note also that there are 3 distinct page
states in play as far as deletion/recycling is concerned:

1. The first phase of deletion
2. The second phase of deletion.
3. Post-recycling, when in general the page could look like just about anything.

It's page deletion's job to make sure that an index scan cannot land
on a page after it was recycled. The corollary is that it is not
everyone else's job to make sure that that didn't happen -- how could
that possibly work?

Quite a few places know a bit and must reason about the first
phase of deletion, and a few know about the second, but that's it.

>>> You could allow for some slack, by comparing against the leftmost key
>>> instead. You just need to know whether the key fits in the page. Then,
>>> the insert can proceed as usual, doing the binsearch to find the
>>> proper insertion point, etc.
>>
>> The whole way that this patch opts out of using an exclusive buffer
>> lock on the "first page the value could be on" (the _bt_check_unique()
>> + _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't
>> terribly concrete.
>
> Assuming the rightmost page is the first page the value could be on,
> it already does get an exclusive buffer lock.

This can be quite fiddly. The downlink in the parent can be equal to
the value in the target page, meaning that we actually lock the
target's left sibling (see _bt_binsrch() special case for internal
pages).

I didn't say that the patch is necessarily wrong here. Just that it
makes me nervous.

> If one wanted to relax the condition as I proposed, that uniqueness
> check is necessary, so that's why I propose shortcircuiting the search
> only, and not the actual insertion on the page. I believe IMO it's
> more important to try and maximize the conditions under which the
> optimization can be applied.

I didn't get what the point of checking the first item on the page
(your proposal) was.

-- 
Peter Geoghegan


Re: Faster inserts with mostly-monotonically increasing values

From
Claudio Freire
Date:
On Mon, Mar 5, 2018 at 9:52 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Mar 5, 2018 at 4:37 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Correct me if I'm wrong, but there's lots of code in nbtree already
>> that risks reading recycled pages for various reasons. Presumably,
>> checking P_ISDELETED and P_ISHALFDEAD should be enough, which is what
>> !P_IGNORE implies.
>
> You're mistaken. Read the nbtree README on page deletion, and the
> RecentGlobalXmin interlock. Note also that there are 3 distinct page
> states in play as far as deletion/recycling is concerned:
>
> 1. The first phase of deletion
> 2. The second phase of deletion.
> 3. Post-recycling, when in general the page could look like just about anything.
>
> It's page deletion's job to make sure that an index scan cannot land
> on a page after it was recycled. The corollary is that it is not
> everyone else's job to make sure that that didn't happen -- how could
> that possibly work?

From what I read, both phase 1 & 2 are served by the !P_IGNORE check.

For the third phase:

> A deleted page can only be reclaimed once there is no scan or search that
> has a reference to it; until then, it must stay in place with its
> right-link undisturbed.

That's because:

> A deleted page cannot be reclaimed immediately, since there may be other
> processes waiting to reference it (ie, search processes that just left the
> parent, or scans moving right or left from one of the siblings).  These
> processes must observe that the page is marked dead and recover
> accordingly.  Searches and forward scans simply follow the right-link
> until they find a non-dead page --- this will be where the deleted page's
> key-space moved to.

But in this case there's no right link to follow, so it's a non-issue.

BTree doesn't truncate AFAIK, so the cached block number can't be
nonexistent (beyond the end of the relation). It's probably fragile to
assume BTree will never truncate, so maybe it would be wise to check
for that. But if the block exists, I believe it contains a page that
can be safely interpreted by that condition. As long as there can be
no other pages with the same flags on the index while the cached page
is write-locked, that code will be safe.

>>>> You could allow for some slack, by comparing against the leftmost key
>>>> instead. You just need to know whether the key fits in the page. Then,
>>>> the insert can proceed as usual, doing the binsearch to find the
>>>> proper insertion point, etc.
>>>
>>> The whole way that this patch opts out of using an exclusive buffer
>>> lock on the "first page the value could be on" (the _bt_check_unique()
>>> + _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't
>>> terribly concrete.
>>
>> Assuming the rightmost page is the first page the value could be on,
>> it already does get an exclusive buffer lock.
>
> This can be quite fiddly. The downlink in the parent can be equal to
> the value in the target page, meaning that we actually lock the
> target's left sibling (see _bt_binsrch() special case for internal
> pages).

I'm not sure which special case you're referring to; I don't see
anything relevant in _bt_binsrch.

Yes, if the scankey is equal to the leftmost key, the first occurrence
of that key could be on the left, so the check would have to be for
strictly greater. But that's about as complex as it would get.

> I didn't say that the patch is necessarily wrong here. Just that it
> makes me nervous.

Any change to btree would make anyone nervous ;-)

>> If one wanted to relax the condition as I proposed, that uniqueness
>> check is necessary, so that's why I propose short-circuiting the search
>> only, and not the actual insertion on the page. I believe it's
>> more important to try and maximize the conditions under which the
>> optimization can be applied.
>
> I didn't get what the point of checking the first item on the page
> (your proposal) was.

Right now, if you've got concurrent transactions, there's absolutely
no chance this optimization would kick in. For it to happen,
insertions on the index need to happen in order, each insertion a key
greater than any other key (visible or not) on the index.

Concurrent inserts on PKs are unlikely to arrive in order; some slight
disorder is to be expected, since serial sequence numbers are generated a
significant amount of time before the insert actually takes place.
Different backends can get to the index insertion at different times,
causing unordered inserts.

I believe PKs are a prime candidate for this optimization, and
expecting it to apply only when no concurrency is involved is severely
dumbing down the optimization.

Consider 3 transactions on a table like:

CREATE TABLE test ( id serial, d integer );

T1: INSERT INTO test (d) VALUES (3), (4), (5), (6);
T2: INSERT INTO test (d) SELECT generate_series(1,10);
T3: INSERT INTO test (d) VALUES (1);

Each transaction will take values from the sequence at some point during
evaluation, well before the actual insertion. This could cause the
following insertions:

  -- (id, d)
T1: (5, 3)
T2: (10, 1)
T2: (11, 2)
T1: (6, 4)
T3: (1, 1)
... etc

No insert would be optimized in this case, and I don't think it's an
uncommon case.

So what I propose is that the optimization only needs to check whether
"the rightmost page" is the right page to insert on. And for that
check, all you have to assert is that:

* The page is actually the currently valid rightmost leaf
* The **first** (data) key on that page is strictly less than the scan key

If both are true, the insertion point will be somewhere on that page.
Maybe not at the end, maybe at the middle, but somewhere in that page,
no need to go looking any further in the b-tree.

If the page is empty, the optimization can't be applied, simple as
that. That check is missing from the patch, true enough.
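For concreteness, the two checks can be mocked up as follows; this is plain Python, not the nbtree C code, and the page is reduced to two header flags plus a sorted key list (the caller is assumed to hold a write lock on the page):

```python
class Page:
    """Toy stand-in for a btree leaf page: header flags plus sorted keys."""
    def __init__(self, keys, is_leaf=True, is_rightmost=True):
        self.keys = sorted(keys)
        self.is_leaf = is_leaf
        self.is_rightmost = is_rightmost

def fastpath_usable(page, scankey):
    """The proposed check: the cached page is still the valid rightmost
    leaf, and the scan key is strictly greater than the page's first key,
    so the insertion point must lie somewhere on this page."""
    return (page.is_leaf
            and page.is_rightmost
            and bool(page.keys)          # an empty page proves nothing
            and scankey > page.keys[0])  # strictly greater, not >=
```

A key merely equal to the first key fails the check, since its first occurrence could be on a page to the left.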


Re: Faster inserts with mostly-monotonically increasing values

От
Peter Geoghegan
Дата:
On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> From what I read, both phases 1 & 2 are served by the !P_IGNORE check.

True.

> For the third phase:
>
>> A deleted page can only be reclaimed once there is no scan or search that
>> has a reference to it; until then, it must stay in place with its
>> right-link undisturbed.
>
> That's because:
>
>> A deleted page cannot be reclaimed immediately, since there may be other
>> processes waiting to reference it (ie, search processes that just left the
>> parent, or scans moving right or left from one of the siblings).  These
>> processes must observe that the page is marked dead and recover
>> accordingly.  Searches and forward scans simply follow the right-link
>> until they find a non-dead page --- this will be where the deleted page's
>> key-space moved to.
>
> But in this case there's no right link to follow, so it's a non-issue.
>
> BTree doesn't truncate AFAIK, so the cached block number can't be
> nonexistent (beyond the end of the relation). It's probably fragile to
> assume BTree will never truncate, so maybe it would be wise to check
> for that. But if the block exists, I believe it contains a page that
> can be safely interpreted by that condition. As long as there can be
> no other pages with the same flags on the index while the cached page
> is write-locked, that code will be safe.

Introducing any case that allows us to land on a recycled page, and
reason that it must at least not be the page we were looking for based
on *any* criteria about the page itself seems very brittle. Yeah, it
probably won't break in practice, but it's a bad design.

> Yes, if the scankey is equal to the leftmost key, the first occurrence
> of that key could be on the left, so the check would have to be for
> strictly greater. But that's about as complex as it would get.

That's pretty complicated.

>> I didn't say that the patch is necessarily wrong here. Just that it
>> makes me nervous.
>
> Any change to btree would make anyone nervous ;-)

True. Anyway, this isn't the thing that I'm most concerned about right now.

>> I didn't get what the point of checking the first item on the page
>> (your proposal) was.
>
> Right now, if you've got concurrent transactions, there's absolutely
> no chance this optimization would kick in. For it to happen,
> insertions on the index need to happen in order, each insertion a key
> greater than any other key (visible or not) on the index.
>
> Concurrent inserts on PKs are unlikely to arrive in order; some slight
> disorder is to be expected, since serial sequence numbers are generated a
> significant amount of time before the insert actually takes place.
> Different backends can get to the index insertion at different times,
> causing unordered inserts.
>
> I believe PKs are a prime candidate for this optimization, and
> expecting it to apply only when no concurrency is involved is severely
> dumbing down the optimization.

Pavan justified the patch using a benchmark that only involved a
single client -- hardly typical for a patch that changes the B-Tree
code. If the benefits with many clients can be shown to matter, that
will make this much more interesting to me.

-- 
Peter Geoghegan


Re: Faster inserts with mostly-monotonically increasing values

От
Claudio Freire
Дата:
On Mon, Mar 5, 2018 at 10:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> But in this case there's no right link to follow, so it's a non-issue.
>>
>> BTree doesn't truncate AFAIK, so the cached block number can't be
>> nonexistent (beyond the end of the relation). It's probably fragile to
>> assume BTree will never truncate, so maybe it would be wise to check
>> for that. But if the block exists, I believe it contains a page that
>> can be safely interpreted by that condition. As long as there can be
>> no other pages with the same flags on the index while the cached page
>> is write-locked, that code will be safe.
>
> Introducing any case that allows us to land on a recycled page, and
> reason that it must at least not be the page we were looking for based
> on *any* criteria about the page itself seems very brittle. Yeah, it
> probably won't break in practice, but it's a bad design.

How is this any different from btvacuumscan?

I don't think it can be argued that somehow btvacuumscan has
permission to be brittle and the rest of the code doesn't.

Point is, it's not brittle. The on-disk format of the tree is such
that any block can be catalogued by inspecting its header;
btvacuumscan depends on that. Questions that can be answered by
looking at the page header alone are:

* Is it deleted or new?
* Is it a leaf?
* Is it rightmost?

Only question remaining is whether it's the *only* rightmost leaf, and
that can be guaranteed by locking.

So, can a flag check result in a random outcome? No. That would also
cause a random outcome for btvacuumscan and then vacuum would corrupt
the index just as well.

And now that we mention this... why is the code not using _bt_getbuf?
It already does quite a bit of sanity checking that is IMO quite
desirable here.
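For reference, those header-only questions can be sketched like so. This is a Python mock, not PostgreSQL code; the flag names echo nbtree.h and P_RIGHTMOST is simply "no right sibling", but treat the exact bit values as illustrative:

```python
# Flag bits and sentinel echoing nbtree.h (bit positions illustrative)
BTP_LEAF      = 1 << 0
BTP_DELETED   = 1 << 2
BTP_HALF_DEAD = 1 << 4
P_NONE        = 0          # block-number sentinel meaning "no sibling"

def p_ignore(flags):
    """P_IGNORE: the page is in phase 1 or phase 2 of deletion."""
    return bool(flags & (BTP_DELETED | BTP_HALF_DEAD))

def classify(flags, btpo_next):
    """The three questions a page header alone can answer."""
    return {
        "deleted_or_dying": p_ignore(flags),
        "leaf":             bool(flags & BTP_LEAF),
        "rightmost":        btpo_next == P_NONE,  # P_RIGHTMOST
    }
```

The remaining question -- whether it is the *only* rightmost leaf -- is the one that needs the locking argument, not the header.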


Re: Faster inserts with mostly-monotonically increasing values

От
Pavan Deolasee
Дата:


On Tue, Mar 6, 2018 at 7:29 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

> I believe PKs are a prime candidate for this optimization, and
> expecting it to apply only when no concurrency is involved is severely
> dumbing down the optimization.

Pavan justified the patch using a benchmark that only involved a
single client -- hardly typical for a patch that changes the B-Tree
code. If the benefits with many clients can be shown to matter, that
will make this much more interesting to me.

Ok. I will repeat those tests with a larger number of clients and report back.

Regarding your suggestion about using the page LSN to detect intermediate activity, my concern is that unless we store that value in shared memory, concurrent backends, even if inserting values in order, will invalidate the backend-local cached page LSN and the optimisation will not kick in.

I am yet to digest the entire conversation between you and Claudio; you guys clearly understand b-tree internals better than I do. It seems that while you're worried about missing out on something, Claudio feels that we can find a safe way just by looking at the information available in the current page. I feel the same way, but will need to re-read the discussion carefully again.

Simon had raised concerns about DESC indexes and whether we need to do the checks for leftmost page in that case. I haven't yet figured out if DESC indexes are actually stored in the reverse order. I am gonna look at that too.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Faster inserts with mostly-monotonically increasing values

От
Peter Geoghegan
Дата:
On Mon, Mar 5, 2018 at 7:10 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Introducing any case that allows us to land on a recycled page, and
>> reason that it must at least not be the page we were looking for based
>> on *any* criteria about the page itself seems very brittle. Yeah, it
>> probably won't break in practice, but it's a bad design.
>
> How is this any different from btvacuumscan?
>
> I don't think it can be argued that somehow btvacuumscan has
> permission to be brittle and the rest of the code doesn't.

VACUUM doesn't have to worry about concurrent page recycling because
it is already the only thing that performs page deletion. It's already
the process that has the exclusive right to give index pages back to
the FSM.

-- 
Peter Geoghegan


Re: Faster inserts with mostly-monotonically increasing values

От
Claudio Freire
Дата:
On Tue, Mar 6, 2018 at 1:45 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Mar 5, 2018 at 7:10 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> Introducing any case that allows us to land on a recycled page, and
>>> reason that it must at least not be the page we were looking for based
>>> on *any* criteria about the page itself seems very brittle. Yeah, it
>>> probably won't break in practice, but it's a bad design.
>>
>> How is this any different from btvacuumscan?
>>
>> I don't think it can be argued that somehow btvacuumscan has
>> permission to be brittle and the rest of the code doesn't.
>
> VACUUM doesn't have to worry about concurrent page recycling because
> it is already the only thing that performs page deletion. It's already
> the process that has the exclusive right to give index pages back to
> the FSM.

Not *concurrent* recycling, but it does have to handle *recycled*
pages correctly.

With the write lock, I don't think there's an issue with *concurrent*
recycling. According to the readme, leaf pages aren't deleted at all,
so only concurrent splitting is a concern.

The only issue is that we may land on a *recycled* page. Now, it
depends on what you mean by recycled here... if you mean "deleted",
!P_IGNORE will skip that page. If you mean it was deleted and freed
but not reused, again, !P_IGNORE will skip it (much as it happens for
vacuum). If you mean that someone reused it, then it will be a valid
page with valid headers, so we need not worry about it not being
consistent. It can be some other page than the one we expect it to be,
but the checks ought to be sufficient to quickly verify whether that's
the case.

Unless you see a way in which a write-locked page could say it is a
rightmost leaf when it is not?


Re: Faster inserts with mostly-monotonically increasing values

От
Simon Riggs
Дата:
On 6 March 2018 at 04:40, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>
>
> On Tue, Mar 6, 2018 at 7:29 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>>
>> On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire <klaussfreire@gmail.com>
>> wrote:
>>
>> > I believe PKs are a prime candidate for this optimization, and
>> > expecting it to apply only when no concurrency is involved is severely
>> > dumbing down the optimization.
>>
>> Pavan justified the patch using a benchmark that only involved a
>> single client -- hardly typical for a patch that changes the B-Tree
>> code. If the benefits with many clients can be shown to matter, that
>> will make this much more interesting to me.
>
>
> Ok. I will repeat those tests with a larger number of clients and report back.

Even if the optimization is only valid with a single backend, it's
still very useful - think COPY, and other similar cases.

> Regarding your suggestion about using the page LSN to detect intermediate
> activity, my concern is that unless we store that value in shared memory,
> concurrent backends, even if inserting values in order, will invalidate the
> backend-local cached page LSN and the optimisation will not kick in.
>
> I am yet to digest the entire conversation between you and Claudio; you guys
> clearly understand b-tree internals better than me. It seems while you're
> worried about missing out on something, Claudio feels that we can find a
> safe way just looking at the information available in the current page. I
> feel the same way, but will need to re-read the discussion carefully again.
>
> Simon had raised concerns about DESC indexes and whether we need to do the
> checks for leftmost page in that case. I haven't yet figured out if DESC
> indexes are actually stored in the reverse order. I am gonna look at that
> too.

No, I meant that you were testing whether the value was higher (> 0),
whereas it should be lower (< 0) on DESC indexes.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

От
Claudio Freire
Дата:
On Tue, Mar 6, 2018 at 9:06 AM, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
>> Simon had raised concerns about DESC indexes and whether we need to do the
>> checks for leftmost page in that case. I haven't yet figured out if DESC
>> indexes are actually stored in the reverse order. I am gonna look at that
>> too.
>
> No, I meant that you were testing whether the value was higher (> 0),
> whereas it should be lower (< 0) on DESC indexes.

Isn't that already handled by _bt_compare?


Re: Faster inserts with mostly-monotonically increasing values

От
Pavan Deolasee
Дата:


On Tue, Mar 6, 2018 at 10:10 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:


On Tue, Mar 6, 2018 at 7:29 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

> I believe PKs are a prime candidate for this optimization, and
> expecting it to apply only when no concurrency is involved is severely
> dumbing down the optimization.

Pavan justified the patch using a benchmark that only involved a
single client -- hardly typical for a patch that changes the B-Tree
code. If the benefits with many clients can be shown to matter, that
will make this much more interesting to me.

Ok. I will repeat those tests with a larger number of clients and report back.


So I repeated the tests with 1, 2, 4 and 8 clients, each running the following statement, for a total of 1024 transactions. So roughly 100M rows are inserted.

INSERT INTO testtab(b) SELECT generate_series(1,100000); 

The table definition is:
postgres=# \d+ testtab
                                               Table "public.testtab"
 Column |  Type  | Collation | Nullable |              Default               | Storage | Stats target | Description 
--------+--------+-----------+----------+------------------------------------+---------+--------------+-------------
 a      | bigint |           | not null | nextval('testtab_a_seq'::regclass) | plain   |              | 
 b      | bigint |           |          |                                    | plain   |              | 
Indexes:
    "testtab_a_key" UNIQUE CONSTRAINT, btree (a)


After taking average of 3-runs:

+---------+--------------------------------+-------------------------------+
| clients | Patched - time in sec          | Master - time in sec          |
+---------+--------------------------------+-------------------------------+
| 1       | 311.8643602                    | 411.832757                    |
+---------+--------------------------------+-------------------------------+
| 2       | 252.5433                       | 300.7875613                   |
+---------+--------------------------------+-------------------------------+
| 4       | 337.0414279                    | 350.9636766                   |
+---------+--------------------------------+-------------------------------+
| 8       | 444.2035582                    | 477.1903417                   |
+---------+--------------------------------+-------------------------------+

So yes, the benefits of the patch go down with a higher number of clients, but they do not entirely vanish.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Faster inserts with mostly-monotonically increasing values

От
Claudio Freire
Дата:
On Fri, Mar 9, 2018 at 2:54 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>
>
> On Tue, Mar 6, 2018 at 10:10 AM, Pavan Deolasee <pavan.deolasee@gmail.com>
> wrote:
>>
>>
>>
>> On Tue, Mar 6, 2018 at 7:29 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>>>
>>> On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire <klaussfreire@gmail.com>
>>> wrote:
>>>
>>> > I believe PKs are a prime candidate for this optimization, and
>>> > expecting it to apply only when no concurrency is involved is severely
>>> > dumbing down the optimization.
>>>
>>> Pavan justified the patch using a benchmark that only involved a
>>> single client -- hardly typical for a patch that changes the B-Tree
>>> code. If the benefits with many clients can be shown to matter, that
>>> will make this much more interesting to me.
>>
>>
>> Ok. I will repeat those tests with a larger number of clients and report back.
>>
>
> So I repeated the tests with 1,2,4 and 8 clients, each running the following
> statement and a total of 1024 transactions. So roughly 100M rows are
> inserted.
>
> INSERT INTO testtab(b) SELECT generate_series(1,100000);
>
> The table definition is:
> postgres=# \d+ testtab
>                                                Table "public.testtab"
>  Column |  Type  | Collation | Nullable |              Default
> | Storage | Stats target | Description
> --------+--------+-----------+----------+------------------------------------+---------+--------------+-------------
>  a      | bigint |           | not null | nextval('testtab_a_seq'::regclass)
> | plain   |              |
>  b      | bigint |           |          |
> | plain   |              |
> Indexes:
>     "testtab_a_key" UNIQUE CONSTRAINT, btree (a)
>
>
> After taking average of 3-runs:
>
> +---------+--------------------------------+-------------------------------+
> | clients | Patched - time in sec     | Master - time in sec |
> +---------+--------------------------------+-------------------------------+
> | 1       | 311.8643602                    | 411.832757                    |
> +---------+--------------------------------+-------------------------------+
> | 2       | 252.5433                       | 300.7875613                   |
> +---------+--------------------------------+-------------------------------+
> | 4       | 337.0414279                    | 350.9636766                   |
> +---------+--------------------------------+-------------------------------+
> | 8       | 444.2035582                    | 477.1903417                   |
> +---------+--------------------------------+-------------------------------+
>
> So yes, the benefits of the patch go down with higher number of clients, but
> it does not entirely vanish.

What if you implement my suggestion?

That should improve the multi-client case considerably.


Re: Faster inserts with mostly-monotonically increasing values

От
Pavan Deolasee
Дата:


On Sat, Mar 10, 2018 at 12:11 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Fri, Mar 9, 2018 at 2:54 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>
>

>
> So yes, the benefits of the patch go down with higher number of clients, but
> it does not entirely vanish.

What if you implement my suggestion?

That should improve the multi-client case considerably.


Yes, I will try that next - it seems like a good idea. So the idea would be: check if the block is still the rightmost block and the insertion key is greater than the first key in the page. If those conditions are satisfied, then we do a regular binary search within the page to find the correct location. This might add the overhead of a binary search when keys are strictly ordered and a single client is inserting the data. If that becomes a concern, we might be able to detect that special case and optimise for it as well.
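That per-page step might look roughly like this (a hypothetical Python stand-in, not the actual patch: `bisect_right` plays the role of the in-page binary search, and `free_slots` stands in for the free-space check):

```python
import bisect

def insert_on_cached_page(page_keys, free_slots, key):
    """Try to insert `key` on the (write-locked) cached rightmost page.

    `page_keys` is the page's sorted key list.  Returns the offset used,
    or None to tell the caller to fall back to a full descent from the
    root (page full, page empty, or key not strictly above the first key).
    """
    if free_slots <= 0 or not page_keys or key <= page_keys[0]:
        return None
    off = bisect.bisect_right(page_keys, key)   # binary search within the page
    page_keys.insert(off, key)
    return off
```

With strictly ordered single-client inserts the binary search always lands at the end, so the extra cost is a handful of comparisons per insert.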

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Faster inserts with mostly-monotonically increasing values

От
Claudio Freire
Дата:
On Sun, Mar 11, 2018 at 2:27 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
>
> On Sat, Mar 10, 2018 at 12:11 AM, Claudio Freire <klaussfreire@gmail.com>
> wrote:
>>
>> On Fri, Mar 9, 2018 at 2:54 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
>> wrote:
>> >
>> >
>>
>> >
>> > So yes, the benefits of the patch go down with higher number of clients,
>> > but
>> > it does not entirely vanish.
>>
>> What if you implement my suggestion?
>>
>> That should improve the multi-client case considerably.
>
>
>
> Yes, I will try that next - it seems like a good idea. So the idea would be:
> check if the block is still the rightmost block and the insertion-key is
> greater than the first key in the page. If those conditions are satisfied,
> then we do a regular binary search within the page to find the correct
> location. This might add an overhead of binary search when keys are strictly
> ordered and a single client is inserting the data. If that becomes a
> concern, we might be able to look for that special case too and optimise for
> it too.

Yeah, pretty much that's the idea. Beware, if the new item doesn't
fall in the rightmost place, you still need to check for serialization
conflicts.


Re: Faster inserts with mostly-monotonically increasing values

От
Pavan Deolasee
Дата:


On Sun, Mar 11, 2018 at 9:18 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Sun, Mar 11, 2018 at 2:27 AM, Pavan Deolasee

>
> Yes, I will try that next - it seems like a good idea. So the idea would be:
> check if the block is still the rightmost block and the insertion-key is
> greater than the first key in the page. If those conditions are satisfied,
> then we do a regular binary search within the page to find the correct
> location. This might add an overhead of binary search when keys are strictly
> ordered and a single client is inserting the data. If that becomes a
> concern, we might be able to look for that special case too and optimise for
> it too.

Yeah, pretty much that's the idea. Beware, if the new item doesn't
fall in the rightmost place, you still need to check for serialization
conflicts.

So I've been toying with this idea since yesterday and I am quite puzzled by the results. See the attached patch, which compares the insertion key with the last key inserted by this backend, if the cached block is still the rightmost block in the tree. I initially only compared with the first key in the page, but I tried this version because of the strange performance regression, for which I still have no answer.

For a small number of clients, the patched version does better. But as the number of clients goes up, the patched version significantly underperforms master. I roughly counted the number of times the fastpath is taken and noticed that almost 98% of inserts take the fastpath. I first thought that the "firstkey" location in the page might be becoming a hot-spot for concurrent processes, and hence changed that to track the per-backend last offset and compare against that the next time. But that did not help much.

+---------+--------------------------------+-------------------------------+
| clients | Master - Avg load time in sec  | Patched - Avg load time in sec|
+---------+--------------------------------+-------------------------------+
| 1       | 500.0725203                    | 347.632079                    |
+---------+--------------------------------+-------------------------------+
| 2       | 308.4580771                    | 263.9120163                   |
+---------+--------------------------------+-------------------------------+
| 4       | 359.4851779                    | 514.7187444                   |
+---------+--------------------------------+-------------------------------+
| 8       | 476.4062592                    | 780.2540855                   |
+---------+--------------------------------+-------------------------------+

The perf data does not show anything interesting either. I mean, there is a reduction in CPU time spent in btree-related code in the patched version, but the overall execution time to insert the same number of records goes up significantly.

Perf (master):
===========

+   72.59%     1.81%  postgres  postgres            [.] ExecInsert                      
+   47.55%     1.27%  postgres  postgres            [.] ExecInsertIndexTuples           
+   44.24%     0.48%  postgres  postgres            [.] btinsert                        
-   42.40%     0.87%  postgres  postgres            [.] _bt_doinsert                    
   - 41.52% _bt_doinsert                                                                
      + 21.14% _bt_search                                                               
      + 12.57% _bt_insertonpg                                                           
      + 2.03% _bt_binsrch                                                               
        1.60% _bt_mkscankey                                                             
        1.20% LWLockAcquire                                                             
      + 1.03% _bt_freestack                                                             
        0.67% LWLockRelease                                                             
        0.57% _bt_check_unique                                                          
   + 0.87% _start                                                                       
+   26.03%     0.95%  postgres  postgres            [.] ExecScan                        
+   21.14%     0.82%  postgres  postgres            [.] _bt_search                      
+   20.70%     1.31%  postgres  postgres            [.] ExecInterpExpr                  
+   19.05%     1.14%  postgres  postgres            [.] heap_insert                     
+   18.84%     1.16%  postgres  postgres            [.] nextval_internal                
+   18.08%     0.84%  postgres  postgres            [.] ReadBufferExtended              
+   17.24%     2.03%  postgres  postgres            [.] ReadBuffer_common               
+   12.57%     0.59%  postgres  postgres            [.] _bt_insertonpg                  
+   11.12%     1.63%  postgres  postgres            [.] XLogInsert                      
+    9.90%     0.10%  postgres  postgres            [.] _bt_relandgetbuf                
+    8.97%     1.16%  postgres  postgres            [.] LWLockAcquire                   
+    8.42%     2.03%  postgres  postgres            [.] XLogInsertRecord                
+    7.26%     1.01%  postgres  postgres            [.] _bt_binsrch                     
+    7.07%     1.20%  postgres  postgres            [.] RelationGetBufferForTuple       
+    6.27%     4.92%  postgres  postgres            [.] _bt_compare                     
+    5.97%     0.63%  postgres  postgres            [.] read_seq_tuple.isra.3           
+    5.70%     4.89%  postgres  postgres            [.] hash_search_with_hash_value     
+    5.44%     5.44%  postgres  postgres            [.] LWLockAttemptLock


Perf (Patched):
============

+   69.33%     2.36%  postgres  postgres            [.] ExecInsert                  
+   35.21%     0.64%  postgres  postgres            [.] ExecInsertIndexTuples       
-   32.14%     0.45%  postgres  postgres            [.] btinsert                    
   - 31.69% btinsert                                                                
      - 30.35% _bt_doinsert                                                         
         + 13.10% _bt_insertonpg                                                    
         + 5.11% _bt_getbuf                                                         
         + 2.75% _bt_binsrch                                                        
         + 2.49% _bt_mkscankey                                                      
         + 2.43% _bt_search                                                         
         + 0.96% _bt_compare                                                        
           0.70% CheckForSerializableConflictIn                                     
      + 1.34% index_form_tuple                                                      
+   30.35%     1.53%  postgres  postgres            [.] _bt_doinsert                
+   28.31%     1.53%  postgres  postgres            [.] ExecScan                    
+   26.52%     0.77%  postgres  postgres            [.] heap_insert                 
+   22.11%     1.21%  postgres  postgres            [.] ExecInterpExpr              
+   20.51%     0.51%  postgres  postgres            [.] nextval_internal            
+   16.55%     1.15%  postgres  postgres            [.] ReadBufferExtended          
+   15.40%     3.77%  postgres  postgres            [.] ReadBuffer_common           
+   14.25%     2.36%  postgres  postgres            [.] XLogInsert                  
+   13.10%     0.77%  postgres  postgres            [.] _bt_insertonpg              
+   10.93%     3.07%  postgres  postgres            [.] XLogInsertRecord            
+   10.80%     0.96%  postgres  postgres            [.] RelationGetBufferForTuple   
+    7.41%     0.38%  postgres  postgres            [.] _bt_getbuf                  
+    5.18%     0.58%  postgres  postgres            [.] read_seq_tuple.isra.3       
+    5.18%     0.19%  postgres  postgres            [.] pg_class_aclcheck

I am testing this on an r3.2xlarge AWS instance.

Is there something wrong with the patch? What can cause a sudden slowdown as the number of clients increases, given that the patch is still doing what it's supposed to do? I wonder if the shortened code path actually leads to heavier contention for EXCLUSIVE lock on the rightmost page, which in turn causes the slowdown. But that's just a theory. Any ideas how to prove or disprove that theory, or any other pointers?

Thanks,
Pavan


--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments

Re: Faster inserts with mostly-monotonically increasing values

From
Claudio Freire
Date:
On Wed, Mar 14, 2018 at 1:36 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
>
> On Sun, Mar 11, 2018 at 9:18 PM, Claudio Freire <klaussfreire@gmail.com>
> wrote:
>>
>> On Sun, Mar 11, 2018 at 2:27 AM, Pavan Deolasee
>>
>> >
>> > Yes, I will try that next - it seems like a good idea. So the idea would
>> > be:
>> > check if the block is still the rightmost block and the insertion-key is
>> > greater than the first key in the page. If those conditions are
>> > satisfied,
>> > then we do a regular binary search within the page to find the correct
>> > location. This might add an overhead of binary search when keys are
>> > strictly
>> > ordered and a single client is inserting the data. If that becomes a
>> > concern, we might be able to look for that special case too and optimise
>> > for
>> > it too.
>>
>> Yeah, pretty much that's the idea. Beware, if the new item doesn't
>> fall in the rightmost place, you still need to check for serialization
>> conflicts.
>
>
> So I've been toying with this idea since yesterday and I am quite puzzled
> with the results. See the attached patch which compares the insertion key
> with the last key inserted by this backend, if the cached block is still the
> rightmost block in the tree. I initially only compared with the first key in
> the page, but I tried this version because of the strange performance
> regression which I still have no answers.
>
> For a small number of clients, the patched version does better. But as the
> number of clients go up, the patched version significantly underperforms
> master. I roughly counted the number of times the fastpath is taken and I
> noticed that almost 98% inserts take the fastpath. I first thought that the
> "firstkey" location in the page might be becoming a hot-spot for concurrent
> processes and hence changed that to track the per-backend last offset and
> compare against that the next time. But that did not help much.
>
> +---------+--------------------------------+-------------------------------+
> | clients | Master - Avg load time in sec | Patched - Avg load time in sec |
> +---------+--------------------------------+-------------------------------+
> | 1       | 500.0725203                    | 347.632079                    |
> +---------+--------------------------------+-------------------------------+
> | 2       | 308.4580771                    | 263.9120163                   |
> +---------+--------------------------------+-------------------------------+
> | 4       | 359.4851779                    | 514.7187444                   |
> +---------+--------------------------------+-------------------------------+
> | 8       | 476.4062592                    | 780.2540855                   |
> +---------+--------------------------------+-------------------------------+
>
> The perf data does not show anything interesting either. I mean there is a
> reduction in CPU time spent in btree related code in the patched version,
> but the overall execution time to insert the same number of records go up
> significantly.
>
> Perf (master):
> ===========
>
> +   72.59%     1.81%  postgres  postgres            [.] ExecInsert
> +   47.55%     1.27%  postgres  postgres            [.] ExecInsertIndexTuples
> +   44.24%     0.48%  postgres  postgres            [.] btinsert
> -   42.40%     0.87%  postgres  postgres            [.] _bt_doinsert
>    - 41.52% _bt_doinsert
>       + 21.14% _bt_search
>       + 12.57% _bt_insertonpg
>       + 2.03% _bt_binsrch
>         1.60% _bt_mkscankey
>         1.20% LWLockAcquire
>       + 1.03% _bt_freestack
>         0.67% LWLockRelease
>         0.57% _bt_check_unique
>    + 0.87% _start
> +   26.03%     0.95%  postgres  postgres            [.] ExecScan
> +   21.14%     0.82%  postgres  postgres            [.] _bt_search
> +   20.70%     1.31%  postgres  postgres            [.] ExecInterpExpr
> +   19.05%     1.14%  postgres  postgres            [.] heap_insert
> +   18.84%     1.16%  postgres  postgres            [.] nextval_internal
> +   18.08%     0.84%  postgres  postgres            [.] ReadBufferExtended
> +   17.24%     2.03%  postgres  postgres            [.] ReadBuffer_common
> +   12.57%     0.59%  postgres  postgres            [.] _bt_insertonpg
> +   11.12%     1.63%  postgres  postgres            [.] XLogInsert
> +    9.90%     0.10%  postgres  postgres            [.] _bt_relandgetbuf
> +    8.97%     1.16%  postgres  postgres            [.] LWLockAcquire
> +    8.42%     2.03%  postgres  postgres            [.] XLogInsertRecord
> +    7.26%     1.01%  postgres  postgres            [.] _bt_binsrch
> +    7.07%     1.20%  postgres  postgres            [.] RelationGetBufferForTuple
> +    6.27%     4.92%  postgres  postgres            [.] _bt_compare
> +    5.97%     0.63%  postgres  postgres            [.] read_seq_tuple.isra.3
> +    5.70%     4.89%  postgres  postgres            [.] hash_search_with_hash_value
> +    5.44%     5.44%  postgres  postgres            [.] LWLockAttemptLock
>
>
> Perf (Patched):
> ============
>
> +   69.33%     2.36%  postgres  postgres            [.] ExecInsert
> +   35.21%     0.64%  postgres  postgres            [.] ExecInsertIndexTuples
> -   32.14%     0.45%  postgres  postgres            [.] btinsert
>    - 31.69% btinsert
>       - 30.35% _bt_doinsert
>          + 13.10% _bt_insertonpg
>          + 5.11% _bt_getbuf
>          + 2.75% _bt_binsrch
>          + 2.49% _bt_mkscankey
>          + 2.43% _bt_search
>          + 0.96% _bt_compare
>            0.70% CheckForSerializableConflictIn
>       + 1.34% index_form_tuple

_bt_getbuf doesn't even show up in master, and neither does
CheckForSerializableConflictIn. WAL stuff also went up quite a bit.

I'm thinking there could be contention on some lock somewhere.

Can you attach the benchmark script you're using so I can try to reproduce it?


Re: Faster inserts with mostly-monotonically increasing values

From
Pavan Deolasee
Date:


On Wed, Mar 14, 2018 at 10:58 AM, Claudio Freire <klaussfreire@gmail.com> wrote:


I'm thinking there could be contention on some lock somewhere.

Can you attach the benchmark script you're using so I can try to reproduce it?


I am using a very ad-hoc script, but here it is. It assumes the presence of a branch named "btree_rightmost" with the patched code.

You will need to make necessary adjustments of course.

Thanks,
Pavan


--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments

Re: Faster inserts with mostly-monotonically increasing values

From
Simon Riggs
Date:
On 14 March 2018 at 04:36, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> I wonder if the shortened code path actually leads to
> heavier contention for EXCLUSIVE lock on the rightmost page, which in turn
> causes the slowdown. But that's just a theory. Any ideas how to prove or
> disprove that theory or any other pointers?

Certainly: dropping performance with higher sessions means increased
contention is responsible. Your guess is likely correct.

Suggest making the patch attempt a ConditionalLock on the target
block, so if its contended we try from top of index. So basically only
attempt the optimization if uncontended. This makes sense because in
many cases if the block is locked it is filling up and the RHS block
is more likely to change anyway.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

From
Pavan Deolasee
Date:


On Wed, Mar 14, 2018 at 4:53 PM, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
On 14 March 2018 at 04:36, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> I wonder if the shortened code path actually leads to
> heavier contention for EXCLUSIVE lock on the rightmost page, which in turn
> causes the slowdown. But that's just a theory. Any ideas how to prove or
> disprove that theory or any other pointers?

Certainly: dropping performance with higher sessions means increased
contention is responsible. Your guess is likely correct.

Suggest making the patch attempt a ConditionalLock on the target
block, so if its contended we try from top of index. So basically only
attempt the optimization if uncontended. This makes sense because in
many cases if the block is locked it is filling up and the RHS block
is more likely to change anyway.

Hmm. I can try that. It's quite puzzling though that slowing things down actually makes them better.

I can also try to reduce the amount of time the EX lock is held by doing some checks with a SHARE lock and then trading it for the EX lock if we conclude that the fast path can be taken. We can do a page LSN check to confirm that nothing changed while the lock was traded. I will check both approaches.

Thanks,
Pavan


--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Faster inserts with mostly-monotonically increasing values

From
Simon Riggs
Date:
On 14 March 2018 at 13:33, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>
>
> On Wed, Mar 14, 2018 at 4:53 PM, Simon Riggs <simon.riggs@2ndquadrant.com>
> wrote:
>>
>> On 14 March 2018 at 04:36, Pavan Deolasee <pavan.deolasee@gmail.com>
>> wrote:
>> > I wonder if the shortened code path actually leads to
>> > heavier contention for EXCLUSIVE lock on the rightmost page, which in
>> > turn
>> > causes the slowdown. But that's just a theory. Any ideas how to prove or
>> > disprove that theory or any other pointers?
>>
>> Certainly: dropping performance with higher sessions means increased
>> contention is responsible. Your guess is likely correct.
>>
>> Suggest making the patch attempt a ConditionalLock on the target
>> block, so if its contended we try from top of index. So basically only
>> attempt the optimization if uncontended. This makes sense because in
>> many cases if the block is locked it is filling up and the RHS block
>> is more likely to change anyway.
>
>
> Hmm. I can try that. It's quite puzzling though that slowing down things
> actually make them better.

That's not what is happening though.

The cache path is 1) try to lock cached block, 2) when got lock check
relevance, if stale 3) recheck from top

The non-cached path is just 3) recheck from top

The overall path length is longer in the cached case but provides
benefit if we can skip step 3 in high % of cases. The non-cached path
is more flexible because it locates the correct RHS block, even if it
changes dynamically between starting the recheck from top.

If there is enough delay in step 1 then any benefit is lost anyway, so
there is no point taking the cached path even when successful; it's
better to ignore it in that case.

> I can also try to reduce the amount of time EX lock is held by doing some
> checks with a SHARE lock and then trade it for EX lock if we conclude that
> fast path can be taken. We can do page lsn check to confirm nothing changed
> when the lock was traded. Will check both approaches.

That will still be slow.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

From
Claudio Freire
Date:
On Wed, Mar 14, 2018 at 10:51 AM, Simon Riggs
<simon.riggs@2ndquadrant.com> wrote:
> If there is enough delay in step 1 then any benefit is lost anyway, so
> there is no point taking the cached path even when successful, so its
> better to ignore in that case.

I find myself agreeing with that.


Re: Faster inserts with mostly-monotonically increasing values

From
Pavan Deolasee
Date:


On Wed, Mar 14, 2018 at 7:21 PM, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
On 14 March 2018 at 13:33, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>
>
> On Wed, Mar 14, 2018 at 4:53 PM, Simon Riggs <simon.riggs@2ndquadrant.com>
> wrote:
>>
>> On 14 March 2018 at 04:36, Pavan Deolasee <pavan.deolasee@gmail.com>
>> wrote:
>> > I wonder if the shortened code path actually leads to
>> > heavier contention for EXCLUSIVE lock on the rightmost page, which in
>> > turn
>> > causes the slowdown. But that's just a theory. Any ideas how to prove or
>> > disprove that theory or any other pointers?
>>
>> Certainly: dropping performance with higher sessions means increased
>> contention is responsible. Your guess is likely correct.
>>
>> Suggest making the patch attempt a ConditionalLock on the target
>> block, so if its contended we try from top of index. So basically only
>> attempt the optimization if uncontended. This makes sense because in
>> many cases if the block is locked it is filling up and the RHS block
>> is more likely to change anyway.
>
>
> Hmm. I can try that. It's quite puzzling though that slowing down things
> actually make them better.

That's not what is happening though.

The cache path is 1) try to lock cached block, 2) when got lock check
relevance, if stale 3) recheck from top

The non-cached path is just 3) recheck from top

The overall path length is longer in the cached case but provides
benefit if we can skip step 3 in high % of cases. The non-cached path
is more flexible because it locates the correct RHS block, even if it
changes dynamically between starting the recheck from top.


So as I noted in one of the previous emails, the revised patch still takes the fast path in 98% of cases. So it's not clear why taking steps 1, 2 and 3 in just 2% of cases should cause such a dramatic slowdown.

If there is enough delay in step 1 then any benefit is lost anyway, so
there is no point taking the cached path even when successful, so its
better to ignore in that case.

The non-cached path is also going to land on the same page; it's just that _bt_search() ensures it doesn't get there quickly. So my understanding is that the additional time spent in _bt_search() accidentally reduces contention on the EX lock and ends up improving net performance. I know it sounds weird.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Faster inserts with mostly-monotonically increasing values

From
Claudio Freire
Date:
On Wed, Mar 14, 2018 at 1:36 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
>
> On Sun, Mar 11, 2018 at 9:18 PM, Claudio Freire <klaussfreire@gmail.com>
> wrote:
>>
>> On Sun, Mar 11, 2018 at 2:27 AM, Pavan Deolasee
>>
>> >
>> > Yes, I will try that next - it seems like a good idea. So the idea would
>> > be:
>> > check if the block is still the rightmost block and the insertion-key is
>> > greater than the first key in the page. If those conditions are
>> > satisfied,
>> > then we do a regular binary search within the page to find the correct
>> > location. This might add an overhead of binary search when keys are
>> > strictly
>> > ordered and a single client is inserting the data. If that becomes a
>> > concern, we might be able to look for that special case too and optimise
>> > for
>> > it too.
>>
>> Yeah, pretty much that's the idea. Beware, if the new item doesn't
>> fall in the rightmost place, you still need to check for serialization
>> conflicts.
>
>
> So I've been toying with this idea since yesterday and I am quite puzzled
> with the results. See the attached patch which compares the insertion key
> with the last key inserted by this backend, if the cached block is still the
> rightmost block in the tree. I initially only compared with the first key in
> the page, but I tried this version because of the strange performance
> regression which I still have no answers.
>
> For a small number of clients, the patched version does better. But as the
> number of clients go up, the patched version significantly underperforms
> master. I roughly counted the number of times the fastpath is taken and I
> noticed that almost 98% inserts take the fastpath. I first thought that the
> "firstkey" location in the page might be becoming a hot-spot for concurrent
> processes and hence changed that to track the per-backend last offset and
> compare against that the next time. But that did not help much.

+            _bt_compare(rel, natts, itup_scankey, page,
+                        RelationGetLastOffset(rel)) >= 0)

Won't this access garbage if the last offset is stale and beyond the
current rightmost page's last offset?

I'd suggest simply using P_FIRSTDATAKEY after checking that the page
isn't empty (see _bt_binsrch).

On Wed, Mar 14, 2018 at 11:12 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>> > Hmm. I can try that. It's quite puzzling though that slowing down things
>> > actually make them better.
>>
>> That's not what is happening though.
>>
>> The cache path is 1) try to lock cached block, 2) when got lock check
>> relevance, if stale 3) recheck from top
>>
>> The non-cached path is just 3) recheck from top
>>
>> The overall path length is longer in the cached case but provides
>> benefit if we can skip step 3 in high % of cases. The non-cached path
>> is more flexible because it locates the correct RHS block, even if it
>> changes dynamically between starting the recheck from top.
>>
>
> So as I noted in one of the previous emails, the revised patch still takes
> fast path in 98% cases. So it's not clear why the taking steps 1, 2 and 3 in
> just 2% cases should cause such dramatic slowdown.

Real-world workloads will probably take the slow path more often, so
it's probably worth keeping the failure path as contention-free as
possible.

Besides, even though it may land there just 2% of the time, it could
still block for a considerable amount of time for no benefit.

So I guess a conditional lock is not a bad idea.


Re: Faster inserts with mostly-monotonically increasing values

From
Claudio Freire
Date:
On Wed, Mar 14, 2018 at 12:05 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Mar 14, 2018 at 1:36 AM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
>>
>>
>> On Sun, Mar 11, 2018 at 9:18 PM, Claudio Freire <klaussfreire@gmail.com>
>> wrote:
>>>
>>> On Sun, Mar 11, 2018 at 2:27 AM, Pavan Deolasee
>>>
>>> >
>>> > Yes, I will try that next - it seems like a good idea. So the idea would
>>> > be:
>>> > check if the block is still the rightmost block and the insertion-key is
>>> > greater than the first key in the page. If those conditions are
>>> > satisfied,
>>> > then we do a regular binary search within the page to find the correct
>>> > location. This might add an overhead of binary search when keys are
>>> > strictly
>>> > ordered and a single client is inserting the data. If that becomes a
>>> > concern, we might be able to look for that special case too and optimise
>>> > for
>>> > it too.
>>>
>>> Yeah, pretty much that's the idea. Beware, if the new item doesn't
>>> fall in the rightmost place, you still need to check for serialization
>>> conflicts.
>>
>>
>> So I've been toying with this idea since yesterday and I am quite puzzled
>> with the results. See the attached patch which compares the insertion key
>> with the last key inserted by this backend, if the cached block is still the
>> rightmost block in the tree. I initially only compared with the first key in
>> the page, but I tried this version because of the strange performance
>> regression which I still have no answers.
>>
>> For a small number of clients, the patched version does better. But as the
>> number of clients go up, the patched version significantly underperforms
>> master. I roughly counted the number of times the fastpath is taken and I
>> noticed that almost 98% inserts take the fastpath. I first thought that the
>> "firstkey" location in the page might be becoming a hot-spot for concurrent
>> processes and hence changed that to track the per-backend last offset and
>> compare against that the next time. But that did not help much.
>
> +            _bt_compare(rel, natts, itup_scankey, page,
> +                        RelationGetLastOffset(rel)) >= 0)
>
> Won't this access garbage if the last offset is stale and beyond the
> current rightmost page's last offset?
>
> I'd suggest simply using P_FIRSTDATAKEY after checking that the page
> isn't empty (see _bt_binsrch).
>
> On Wed, Mar 14, 2018 at 11:12 AM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
>>> > Hmm. I can try that. It's quite puzzling though that slowing down things
>>> > actually make them better.
>>>
>>> That's not what is happening though.
>>>
>>> The cache path is 1) try to lock cached block, 2) when got lock check
>>> relevance, if stale 3) recheck from top
>>>
>>> The non-cached path is just 3) recheck from top
>>>
>>> The overall path length is longer in the cached case but provides
>>> benefit if we can skip step 3 in high % of cases. The non-cached path
>>> is more flexible because it locates the correct RHS block, even if it
>>> changes dynamically between starting the recheck from top.
>>>
>>
>> So as I noted in one of the previous emails, the revised patch still takes
>> fast path in 98% cases. So it's not clear why the taking steps 1, 2 and 3 in
>> just 2% cases should cause such dramatic slowdown.
>
> Real-world workloads will probably take the slow path more often, so
> it's probably worth keeping the failure path as contention-free as
> possible.
>
> Besides, even though it may be just 2% the times it lands there, it
> could still block for a considerable amount of time for no benefit.
>
> So I guess a conditional lock is not a bad idea.

Re all of the above, I did some tests.

I couldn't reproduce the regression you showed so clearly (in fact, not
at all), but I was able to get consistently lock-bound with 8 clients
by setting fsync=off.

Something noteworthy is that both master and patched are lock-bound.
It just must be that when that happens, the extra check for the
rightmost page really hurts.

So I tried the conditional locks, and indeed (at least on my meager
hardware) they turn the lock-bound test into an I/O-bound one.

I applied the attached patches on top of your patch, so it would be
nice to see if you can give it a try in your test hardware to see
whether it helps.

Attachments

Re: Faster inserts with mostly-monotonically increasing values

From
Pavan Deolasee
Date:


On Thu, Mar 15, 2018 at 7:51 AM, Claudio Freire <klaussfreire@gmail.com> wrote:


I applied the attached patches on top of your patch, so it would be
nice to see if you can give it a try in your test hardware to see
whether it helps.

I can confirm that I no longer see any regression at 8 or even 16 clients. In fact, the patched version consistently does better than master, even at higher client counts.

The only thing I am a bit puzzled about is that I am no longer able to reproduce the 40-50% gains that I'd earlier observed with a single client. I now get 20-25% with a smaller number of clients and 10-15% with a larger number of clients. I haven't been able to establish why it's happening, but since it's a different AWS instance (though of the same type), I am inclined to blame it on that for now.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Faster inserts with mostly-monotonically increasing values

From
Claudio Freire
Date:
On Mon, Mar 19, 2018 at 11:57 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
>
> On Thu, Mar 15, 2018 at 7:51 AM, Claudio Freire <klaussfreire@gmail.com>
> wrote:
>>
>>
>>
>> I applied the attached patches on top of your patch, so it would be
>> nice to see if you can give it a try in your test hardware to see
>> whether it helps.
>
>
> I can confirm that I no longer see any regression at 8 or even 16 clients.
> In fact, the patched version consistent do better than master even at higher
> number of clients.
>
> The only thing I am a bit puzzled is that I am no longer able to reproduce
> the 40-50% gains that I'd earlier observed with a single client. I now get
> 20-25% with smaller number of client and 10-15% with larger number of
> clients. I haven't been able to establish why it's happening, but since it's
> a different AWS instance (though of the same type), I am inclined to blame
> it on that for now.

Your original patch also skipped the check for serializable conflicts.

*IF* that was correct, it probably further reduced contention. I'm not
entirely sure whether skipping that check is correct, but if it is, you
can still accomplish this by checking whether the new key is beyond the
current rightmost key, and setting a flag so that the check is later
skipped.

But there are lots of complex interactions in that code and I'm no
expert there, I'd prefer if someone more knowledgeable could confirm
whether it's safe to skip that check or not. Or leave it just in case.


Re: Faster inserts with mostly-monotonically increasing values

From
Claudio Freire
Date:
On Mon, Mar 19, 2018 at 12:06 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Mon, Mar 19, 2018 at 11:57 AM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
>>
>>
>> On Thu, Mar 15, 2018 at 7:51 AM, Claudio Freire <klaussfreire@gmail.com>
>> wrote:
>>>
>>>
>>>
>>> I applied the attached patches on top of your patch, so it would be
>>> nice to see if you can give it a try in your test hardware to see
>>> whether it helps.
>>
>>
>> I can confirm that I no longer see any regression at 8 or even 16 clients.
>> In fact, the patched version consistent do better than master even at higher
>> number of clients.
>>
>> The only thing I am a bit puzzled is that I am no longer able to reproduce
>> the 40-50% gains that I'd earlier observed with a single client. I now get
>> 20-25% with smaller number of client and 10-15% with larger number of
>> clients. I haven't been able to establish why it's happening, but since it's
>> a different AWS instance (though of the same type), I am inclined to blame
>> it on that for now.
>
> Your original patch also skipped the check for serializable conflicts.
>
> *IF* that was correct, it probably further reduced contention. I'm not
> entirely sure if skipping that check is correct, but if it is, you can
> still accomplish this checking whether the new key is beyond the
> current rightmost key, and setting a flag so that check is later
> skipped.
>
> But there are lots of complex interactions in that code and I'm no
> expert there, I'd prefer if someone more knowledgeable could confirm
> whether it's safe to skip that check or not. Or leave it just in case.

If you're not planning on making any further changes, would you mind
posting a coalesced patch?

Notice that I removed the last offset thing only halfway, so it would
need some cleanup.


Re: Faster inserts with mostly-monotonically increasing values

From
Pavan Deolasee
Date:

On Thu, Mar 22, 2018 at 7:22 AM, Claudio Freire <klaussfreire@gmail.com> wrote:

If you're not planning on making any further changes, would you mind
posting a coalesced patch?

Notice that I removed the last offset thing only halfway, so it would
need some cleanup.

Here is an updated patch. I removed the last-offset caching completely and integrated your changes for conditional lock access. There are some other improvements to the test cases too. I realised that we must truncate and re-insert to test the index fastpath correctly.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments

Re: Faster inserts with mostly-monotonically increasing values

From
Claudio Freire
Date:
On Thu, Mar 22, 2018 at 3:29 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
> On Thu, Mar 22, 2018 at 7:22 AM, Claudio Freire <klaussfreire@gmail.com>
> wrote:
>>
>>
>> If you're not planning on making any further changes, would you mind
>> posting a coalesced patch?
>>
>> Notice that I removed the last offset thing only halfway, so it would
>> need some cleanup.
>
>
> Here is an updated patch. I removed the last offset caching thing completely
> and integrated your changes for conditional lock access. Some other
> improvements to test cases  too. I realised that we must truncate and
> re-insert to test index fastpath correctly.

Thanks.

Some comments

+    /*
+     * It's very common to have an index on an auto-incremented or
+     * monotonically increasing value. In such cases, every insertion happens
+     * towards the end of the index. We try to optimise that case by caching
+     * the right-most block of the index. If our cached block is still the
+     * rightmost block, has enough free space to accommodate a new entry and
+     * the insertion key is greater or equal to the first key in this page,
+     * then we can safely conclude that the new key will be inserted in the
+     * cached block. So we simply search within the cached page and insert the
+     * key at the appropriate location. We call it a fastpath.

It should say "the insertion key is strictly greater than the first key"

Equal cases cannot be accepted since it would break uniqueness checks,
so the comment should say that. The code already is correctly checking
for strict inequality, it's just the comment that is out of sync.

You might accept equal keys for non-unique indexes, but I don't
believe making that distinction in the code is worth the hassle.

Also, "rightmost block" != "rightmost leaf page" ("leaf" being the key
difference). So it should say "rightmost leaf page".


+            if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
+                    !P_INCOMPLETE_SPLIT(lpageop) &&
+                    !P_IGNORE(lpageop) &&
+                    (PageGetFreeSpace(page) > itemsz) &&
+                    PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
+                    _bt_compare(rel, natts, itup_scankey, page,
+                        P_FIRSTDATAKEY(lpageop)) > 0)
+            {
+                offset = InvalidOffsetNumber;
+                fastpath = true;
+            }
...later...
+    if (!fastpath)
+    {
+        /* find the first page containing this key */
+        stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE,
+                           NULL);
+
+        offset = InvalidOffsetNumber;

Setting "offset = InvalidOffsetNumber" in that contorted way is
unnecessary. You can remove the first assignment and instead
initialize unconditionally right after the fastpath block (that
doesn't make use of offset anyway):

In indexing.out:

+explain select sum(a) from fastpath where a = 6456;
+                                     QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate  (cost=4.31..4.32 rows=1 width=8)
+   ->  Index Only Scan using fpindex1 on fastpath  (cost=0.29..4.30 rows=1 width=4)
+         Index Cond: (a = 6456)
+(3 rows)

Having costs in explain tests can be fragile. Better use "explain
(costs off)". If you run "make check" continuously in a loop, you
should get some failures related to that pretty quickly.

+truncate fastpath;
+insert into fastpath select y.x, 'b' || (y.x/10)::text, 100 from
(select generate_series(1,10000) as x) y;
+vacuum fastpath;

Vacuum can also be a pain. In parallel tests, vacuum won't do what you
expect it to do, as other concurrent transactions will prevent it
from, say, marking pages all-visible.

In fact, you can get those spurious failures by running "make -j8
check-world" repeatedly (with tap tests enabled). That causes enough
concurrency to cause trouble and sporadic failures.

The question is whether you need it at all. I'm not sure why you need the
explain tests, which I imagine are the reason you need that vacuum there.

If you do need it, I successfully used the following helper in another patch:

CREATE FUNCTION wait_barrier() RETURNS void LANGUAGE plpgsql AS $$
DECLARE vxids text[];
BEGIN
 -- wait for all transactions to commit to make deleted tuples vacuumable
 SELECT array_agg(DISTINCT virtualtransaction) FROM pg_locks
  WHERE pid <> pg_backend_pid() INTO vxids;
 WHILE (SELECT count(virtualtransaction) FROM pg_locks
         WHERE pid <> pg_backend_pid()
           AND virtualtransaction = ANY(vxids)) > 0 LOOP
    PERFORM pg_sleep(0.1);
 END LOOP;
END
$$;

You call it right before vacuum to make sure all potentially
troublesome transactions finished by the time vacuum runs:

SELECT wait_barrier();
VACUUM blah;

Name the function something else, though, to avoid a name clash with
that patch. Also, that function above is itself still under review,
so take it with a grain of salt. It worked for me.


Re: Faster inserts with mostly-monotonically increasing values

From
Andrew Dunstan
Date:
On Fri, Mar 23, 2018 at 10:20 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Mar 22, 2018 at 3:29 AM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
>>
>> On Thu, Mar 22, 2018 at 7:22 AM, Claudio Freire <klaussfreire@gmail.com>
>> wrote:
>>>
>>>
>>> If you're not planning on making any further changes, would you mind
>>> posting a coalesced patch?
>>>
>>> Notice that I removed the last offset thing only halfway, so it would
>>> need some cleanup.
>>
>>
>> Here is an updated patch. I removed the last offset caching thing completely
>> and integrated your changes for conditional lock access. Some other
>> improvements to test cases  too. I realised that we must truncate and
>> re-insert to test index fastpath correctly.


This patch looks in pretty good shape. I have been trying hard to
think of some failure mode but haven't been able to come up with one.

We can put off for another day consideration of if/when we can skip the
check for serialization conflicts.

> Some comments
>
> +    /*
> +     * It's very common to have an index on an auto-incremented or
> +     * monotonically increasing value. In such cases, every insertion happens
> +     * towards the end of the index. We try to optimise that case by caching
> +     * the right-most block of the index. If our cached block is still the
> +     * rightmost block, has enough free space to accommodate a new entry and
> +     * the insertion key is greater or equal to the first key in this page,
> +     * then we can safely conclude that the new key will be inserted in the
> +     * cached block. So we simply search within the cached page and insert the
> +     * key at the appropriate location. We call it a fastpath.
>
> It should say "the insertion key is strictly greater than the first key"


right.

>
> Equal keys cannot be accepted, since that would break uniqueness checks,
> so the comment should say that. The code is already correctly checking
> for strict inequality; it's just the comment that is out of sync.
>
> You might accept equal keys for non-unique indexes, but I don't
> believe making that distinction in the code is worth the hassle.
>
> Also, "rightmost block" != "rightmost leaf page" ("leaf" being the key
> difference). So it should say "rightmost leaf page".


right.

[...]
>
> Setting "offset = InvalidOffsetNumber" in that contorted way is
> unnecessary. You can remove the first assignment and instead
> initialize unconditionally right after the fastpath block (that
> doesn't make use of offset anyway):


Yes, I noticed that and it's confusing. Just set it at the top.



>
> In indexing.out:
>
> +explain select sum(a) from fastpath where a = 6456;
> +                                     QUERY PLAN
> +------------------------------------------------------------------------------------
> + Aggregate  (cost=4.31..4.32 rows=1 width=8)
> +   ->  Index Only Scan using fpindex1 on fastpath  (cost=0.29..4.30
> rows=1 width=4)
> +         Index Cond: (a = 6456)
> +(3 rows)
>
> Having costs in explain tests can be fragile. Better use "explain
> (costs off)". If you run "make check" continuously in a loop, you
> should get some failures related to that pretty quickly.
>



Agree about costs off, but I'm fairly dubious of the value of using
EXPLAIN at all here. Nothing in the patch should result in any change
in EXPLAIN output.

I would probably just have a few regression lines that should be sure
to exercise the code path and leave it at that.


cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

From
Pavan Deolasee
Date:
Hi Andrew and Claudio,

Thanks for the review!

On Fri, Mar 23, 2018 at 10:19 AM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
On Fri, Mar 23, 2018 at 10:20 AM, Claudio Freire <klaussfreire@gmail.com> wrote:


This patch looks in pretty good shape. I have been trying hard to
think of some failure mode but haven't been able to come up with one.


Great!
 

> Some comments
>
> +    /*
> +     * It's very common to have an index on an auto-incremented or
> +     * monotonically increasing value. In such cases, every insertion happens
> +     * towards the end of the index. We try to optimise that case by caching
> +     * the right-most block of the index. If our cached block is still the
> +     * rightmost block, has enough free space to accommodate a new entry and
> +     * the insertion key is greater or equal to the first key in this page,
> +     * then we can safely conclude that the new key will be inserted in the
> +     * cached block. So we simply search within the cached page and insert the
> +     * key at the appropriate location. We call it a fastpath.
>
> It should say "the insertion key is strictly greater than the first key"


Good catch. Fixed.
 

>
> Also, "rightmost block" != "rightmost leaf page" ("leaf" being the key
> difference). So it should say "rightmost leaf page".


right.


Fixed.
 
[...]
>
> Setting "offset = InvalidOffsetNumber" in that contorted way is
> unnecessary. You can remove the first assignment and instead
> initialize unconditionally right after the fastpath block (that
> doesn't make use of offset anyway):


Yes, I noticed that and it's confusing. Just set it at the top.


Good idea. Changed that way.
 
>
> Having costs in explain tests can be fragile. Better use "explain
> (costs off)". If you run "make check" continuously in a loop, you
> should get some failures related to that pretty quickly.
>

Agree about costs off, but I'm fairly dubious of the value of using
EXPLAIN at all here. Nothing in the patch should result in any change
in EXPLAIN output.

I agree. I initially added EXPLAIN to ensure that we're doing index-only scans. But you're correct, we don't need them in the tests themselves.
 

I would probably just have a few regression lines that should be sure
to exercise the code path and leave it at that.


I changed the regression tests to include a few more scenarios, basically using multi-column indexes in different ways and then querying rows with different orderings. I did not take away the vacuum; I believe it will actually help the tests by introducing some fuzziness, i.e. if the vacuum does not do its job, we might execute a different plan and still verify that the output remains unchanged.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments

Re: Faster inserts with mostly-monotonically increasing values

From
Andrew Dunstan
Date:
On Fri, Mar 23, 2018 at 8:27 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>>
>>
>> I would probably just have a few regression lines that should be sure
>> to exercise the code path and leave it at that.
>>
>
> I changed the regression tests to include a few more scenarios, basically
> using multi-column indexes in different ways and then querying rows by
> ordering rows in different ways. I did not take away the vacuum and I
> believe it will actually help the tests by introducing some fuzziness in the
> tests i.e. if the vacuum does not do its job, we might execute a different
> plan and ensure that the output remains unchanged.
>


If we're going to keep the vacuums in there, do we need to add a wait
barrier like Claudio suggested upthread?

Once we decide on that I propose to commit this.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

From
Pavan Deolasee
Date:


On Sun, Mar 25, 2018 at 6:00 AM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
On Fri, Mar 23, 2018 at 8:27 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>>
>>
>> I would probably just have a few regression lines that should be sure
>> to exercise the code path and leave it at that.
>>
>
> I changed the regression tests to include a few more scenarios, basically
> using multi-column indexes in different ways and then querying rows by
> ordering rows in different ways. I did not take away the vacuum and I
> believe it will actually help the tests by introducing some fuzziness in the
> tests i.e. if the vacuum does not do its job, we might execute a different
> plan and ensure that the output remains unchanged.
>


If we're going to keep the vacuums in there, do we need to add a wait
barrier like Claudio suggested upthread?


I don't think we need the wait barrier since we're no longer printing the explain plan. In the worst case, the vacuum may not get to set pages all-visible, and the planner may choose something other than an index-only scan, but I guess that fuzziness actually helps the regression tests: that way we get confirmation of the final result irrespective of the plan chosen.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Faster inserts with mostly-monotonically increasing values

From
Andrew Dunstan
Date:
On Mon, Mar 26, 2018 at 6:06 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
>
> On Sun, Mar 25, 2018 at 6:00 AM, Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
>>
>> On Fri, Mar 23, 2018 at 8:27 PM, Pavan Deolasee
>> <pavan.deolasee@gmail.com> wrote:
>> >>
>> >>
>> >> I would probably just have a few regression lines that should be sure
>> >> to exercise the code path and leave it at that.
>> >>
>> >
>> > I changed the regression tests to include a few more scenarios,
>> > basically
>> > using multi-column indexes in different ways and then querying rows by
>> > ordering rows in different ways. I did not take away the vacuum and I
>> > believe it will actually help the tests by introducing some fuzziness in
>> > the
>> > tests i.e. if the vacuum does not do its job, we might execute a
>> > different
>> > plan and ensure that the output remains unchanged.
>> >
>>
>>
>> If we're going to keep the vacuums in there, do we need to add a wait
>> barrier like Claudio suggested upthread?
>>
>
> I don't think we need the wait barrier since we're no longer printing the
> explain plan. In the worst case, the vacuum may not get to set pages
> all-visible, thus planner choosing something other than an index-only-scan,
> but I guess that fuzziness actually helps the regression tests. That way we
> get confirmation regarding the final result irrespective of the plan chosen.
>


Fair enough. Committed.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

From
Heikki Linnakangas
Date:
>         /* XLOG stuff */
>         if (RelationNeedsWAL(rel))
>         {
>             ...
> 
>             if (P_ISLEAF(lpageop))
>             {
>                 xlinfo = XLOG_BTREE_INSERT_LEAF;
> 
>                 /*
>                  * Cache the block information if we just inserted into the
>                  * rightmost leaf page of the index.
>                  */
>                 if (P_RIGHTMOST(lpageop))
>                     RelationSetTargetBlock(rel, BufferGetBlockNumber(buf));
>             }
>             ...


Why is this RelationSetTargetBlock() call inside the "XLOG stuff" block? 
ISTM that we're failing to take advantage of this optimization for 
unlogged tables, for no particular reason. Just an oversight?

- Heikki


Re: Faster inserts with mostly-monotonically increasing values

From
Claudio Freire
Date:
On Tue, Apr 10, 2018 at 11:10 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>                 /* XLOG stuff */
>>                 if (RelationNeedsWAL(rel))
>>                 {
>>                         ...
>>
>>                         if (P_ISLEAF(lpageop))
>>                         {
>>                                 xlinfo = XLOG_BTREE_INSERT_LEAF;
>>
>>                                 /*
>>                                  * Cache the block information if we just
>> inserted into the
>>                                  * rightmost leaf page of the index.
>>                                  */
>>                                 if (P_RIGHTMOST(lpageop))
>>                                         RelationSetTargetBlock(rel,
>> BufferGetBlockNumber(buf));
>>                         }
>>                         ...
>
>
>
> Why is this RelationSetTargetBlock() call inside the "XLOG stuff" block?
> ISTM that we're failing to take advantage of this optimization for unlogged
> tables, for no particular reason. Just an oversight?
>
> - Heikki

Indeed.

Maybe Pavan knows of one, but I don't see any reason not to apply this
to unlogged tables as well. It slipped through review.
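The intended fix would hoist the cache update out of the WAL-conditional block, so it runs for unlogged relations too. A hedged Python sketch of the desired control flow (the real code is C in nbtinsert.c; the class and function names here are invented stand-ins, not PostgreSQL APIs):

```python
class Relation:
    """Toy stand-in for a relation; tracks the cached target block."""
    def __init__(self, needs_wal):
        self.needs_wal = needs_wal
        self.target_block = None


def finish_leaf_insert(rel, block_number, is_leaf, is_rightmost, wal_log):
    # Cache the rightmost leaf page unconditionally: the fastpath
    # optimisation does not depend on WAL logging, so unlogged
    # tables should benefit as well.
    if is_leaf and is_rightmost:
        rel.target_block = block_number
    # WAL logging stays conditional, exactly as before.
    if rel.needs_wal:
        wal_log(block_number)
```

With the original placement, the `target_block` assignment sat inside the `needs_wal` branch, so an unlogged relation never cached its rightmost leaf and always took the slow descent.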


Re: Faster inserts with mostly-monotonically increasing values

From
Pavan Deolasee
Date:


On Tue, Apr 10, 2018 at 7:49 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Tue, Apr 10, 2018 at 11:10 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

>
> Why is this RelationSetTargetBlock() call inside the "XLOG stuff" block?
> ISTM that we're failing to take advantage of this optimization for unlogged
> tables, for no particular reason. Just an oversight?
>
> - Heikki

Indeed.

Maybe Pavan knows of one, but I don't see any reason not to apply this
to unlogged tables as well. It slipped the review.

Yes, looks like an oversight :-( I will fix it along with the other changes that Peter requested.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services