Discussion: Optimize single tuple fetch from nbtree index


Optimize single tuple fetch from nbtree index

From: Floris Van Nee

Hi hackers,


While I was reviewing some code in another patch, I stumbled upon a possible optimization in the btree index code in nbtsearch.c for queries using 'LIMIT 1'. I have written a small patch that implements this optimization, but I'd like some feedback on the design of the patch, whether it is correct at all to use this optimization, and whether the performance tradeoffs are deemed worth it by the community.


Basically, an example of the case I'd like to optimize is the following. Given a table 'tbl' with an index on columns (k,ts DESC):

SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1;
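
For concreteness, here is a minimal sketch of the schema this assumes (the payload column v and the exact column types are illustrative assumptions, not part of the original setup):

    CREATE TABLE tbl (
        k  int,               -- natural key of the time series
        ts timestamptz,       -- time at which the update for k occurred
        v  double precision   -- some payload column
    );
    CREATE INDEX ON tbl (k, ts DESC);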


And, even more importantly, when this query gets called in an inner loop like:


SELECT * FROM generate_series(:start_ts, :end_ts, :interval) ts -- perhaps thousands of iterations; this could also be a loop over values of 'k' rather than timestamps - just an example

CROSS JOIN LATERAL (

   SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1

) _;


With time-series data, this case often arises because you have a certain natural key for which you store updates as they occur. Getting the state of k at a specific time then boils down to the given query, which is almost always the fastest way to get this information, since the index scan with LIMIT 1 is very fast already. However, there seems to be a possibility to make this even faster (up to nearly 3x faster in test cases that use this nested loop of index scans).

Every time the index scan is done, all tuples from the leaf page are read in nbtsearch.c:_bt_readpage. The idea of this patch is to make an exception to this, but *only* the first time amgettuple gets called. That call goes to _bt_first in nbtsearch.c, which will, if there are scankeys, descend the tree to a leaf page and read just the first tuple (or possibly two). It won't touch the rest of the page yet. If indeed just one tuple was required, there won't be a call to _bt_next and we're done. If we do need more than one tuple, _bt_next will resume reading tuples from the index page at the point where we left off.


There are a few caveats:

- Possible performance decrease for queries that need a small number of tuples (but more than one), because they now need to lock the same page twice. This can happen in several cases, for example: LIMIT 3; LIMIT 1 where the first tuple returned does not match other scan conditions; LIMIT 1 where the tuple returned is not visible; no LIMIT at all but only a few matching rows.

- We need to take into account page splits, insertions and vacuums while we do not have the read-lock in between _bt_first and the first call to _bt_next. This made my patch quite a bit more complicated than my initial implementation.


I did performance tests for some best-case and worst-case test scenarios. TPS results were stable and reproducible in re-runs on my otherwise idle server. Attached are the full results and instructions to reproduce them. I picked test cases that show the best as well as the worst performance compared to master. Summary: the greatest performance improvement can be seen for the cases with the subquery in a nested loop. With a nested loop of 100 iterations, performance is roughly two times better; with 10000 iterations, roughly three times better. For most test cases that don't use LIMIT 1, I couldn't find a noticeable difference, except for the nested loop with a LIMIT 3 (or, similarly, a nested loop without any LIMIT clause that returns just three tuples). This is also theoretically the worst case, because it has to lock the page again and then read it, just for one more tuple. In this case, I saw TPS decrease by 2-3% in a few cases (details in the attached file), due to having to lock/unlock the same page in both _bt_first and _bt_next.


A few concrete questions to the community:

- Does the community also see this as a useful optimization?

- Is the way it is currently implemented safe? I struggled quite a bit to get everything working with respect to page splits and insertions. In particular, I don't know why, in my patch, _bt_find_offset_for_tid needs to consider searching for items with an offset *before* the passed offset. As far as my understanding goes, this could only happen when the index gets vacuumed in the meantime. However, we hold a pin on the buffer the whole time (we even assert this), so vacuum should not have been possible. Still, this code gets triggered sometimes, so it seems to be necessary. Perhaps someone in the community who's more of an expert on this can shed some light on it.

- What are considered acceptable performance tradeoffs in this case? Is a performance degradation in any part generally not acceptable at all?


I'd also welcome any feedback on the process - this is my first patch and while I tried to follow the guidelines, I may have missed something along the way.


Attachments:

- out_limit.txt: pgbench results for patched branch

- out_master.txt: pgbench results for the master branch (can be diffed with out_limit.txt to see the differences at a glance)

- init_test.sql: creates a simple table for the test cases and fills it with data

- test_fwd.sql: the nested loop example, with parameters :nlimit and :nitems to specify how many rows per inner loop to limit to and how many iterations of the loop to run. We return a sum() over a column to make sure that transferring result data over the socket is not the bottleneck - this is to show the worst-case behavior of this patch. When selecting the full row instead of the sum, the performance difference is negligible for the 'bad' cases, while still providing great (up to 2.5x) improvements for the 'good' cases. This can be tested by changing the sum() to select a column per row instead. (A hypothetical reconstruction of this script is sketched after this list.)

- test_fwd_eq.sql: nested loop with simple unique equality select

- test_fwd_single.sql: single query with LIMIT without nested loop
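
For reference, a hypothetical reconstruction of what test_fwd.sql might look like, based on the description above (the actual attachment may differ; the literal key value and timestamps here are made up):

    -- run e.g. as: pgbench -f test_fwd.sql -D nitems=10000 -D nlimit=1 ...
    SELECT sum(s.v)
    FROM generate_series(1, :nitems) i
    CROSS JOIN LATERAL (
        SELECT v FROM tbl
        WHERE k = 1
          AND ts <= timestamp '2019-01-01' + i * interval '1 second'
        ORDER BY k, ts DESC
        LIMIT :nlimit
    ) s;

With :nlimit=1 this exercises the optimized path; with :nlimit=3 it exercises the worst case described earlier.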


-Floris


Attachments

Re: Optimize single tuple fetch from nbtree index

From: Tom Lane
Floris Van Nee <florisvannee@Optiver.com> writes:
> Every time the index scan is done, all tuples from the leaf page are
> read in nbtsearch.c:_bt_readpage. The idea of this patch is to make an
> exception for this *only* the first time amgettuple gets called.

Regardless of whether there's actually a LIMIT 1?  That seems disastrous
for every other case than the narrow one where the optimization wins.
Because every other case is now going to have to touch the index page
twice.  That's more CPU and about double the contention --- if you could
not measure any degradation from that, you're not measuring the right
thing.

In principle, you could pass down knowledge of whether there's a LIMIT,
using the same mechanism used to enable top-N sorting.  But it'd have to
also go through the AM interface layer, so I'm not sure how messy this
would be.

> This calls _bt_first in nbtsearch.c, which will, if there are scankeys, descend the tree to a leaf page and read
> just the first (or possibly two) tuples. It won't touch the rest of the page yet. If indeed just one tuple was
> required, there won't be a call to _bt_next and we're done. If we do need more than one tuple, _bt_next will resume
> reading tuples from the index page at the point where we left off.

How do you know how many index entries you have to fetch to get a tuple
that's live/visible to the query?

> - We need to take into account page splits, insertions and vacuums while we do not have the read-lock in between
> _bt_first and the first call to _bt_next. This made my patch quite a bit more complicated than my initial
> implementation.

Meh.  I think the odds that you got this 100% right are small, and the
odds that it would be maintainable are smaller.  There's too much that
can happen if you're not holding any lock --- and there's a lot of active
work on btree indexes, which could break whatever assumptions you might
make today.

I'm not unalterably opposed to doing something like this, but my sense
is that the complexity and probable negative performance impact on other
cases are not going to look like a good trade-off for optimizing the
case at hand.

BTW, you haven't even really made the case that optimizing a query that
behaves this way is the right thing to be doing ... maybe some other
plan shape that isn't a nestloop around a LIMIT query would be a better
solution.

            regards, tom lane



Re: Optimize single tuple fetch from nbtree index

From: Floris Van Nee
Hi Tom,

Thanks for your quick reply!

> Regardless of whether there's actually a LIMIT 1?  That seems disastrous
> for every other case than the narrow one where the optimization wins.
> Because every other case is now going to have to touch the index page
> twice.  That's more CPU and about double the contention --- if you could
> not measure any degradation from that, you're not measuring the right
> thing.

I thought the same as well at first. Note that I did measure a degradation of 2-3% in some cases, as mentioned, but initially I also expected worse. Do you have any ideas on cases that would suffer the most? I thought the tight inner nested loop that I posted in my performance tests would have this index lookup as its bottleneck. I know it is the bottleneck for the LIMIT 1 query (because that improves by a factor of 2-3 with the patch). And my theory is that for a LIMIT 3, the price paid for this optimization is highest, because it would touch the page twice and read all items from it, while only returning three of them.

> In principle, you could pass down knowledge of whether there's a LIMIT,
> using the same mechanism used to enable top-N sorting.  But it'd have to
> also go through the AM interface layer, so I'm not sure how messy this
> would be.

This was an idea I had as well, and I would be willing to implement such a thing if this is deemed interesting enough by the community. However, I didn't want to do this for the first version of this patch, as it would be quite some extra work, which would be useless if the idea of the patch itself gets rejected. :-) I'd appreciate any pointers in the right direction - I can take a look at how top-N sorting pushes the LIMIT down. Given enough interest in the basic idea of this patch, I will implement it.

>> This calls _bt_first in nbtsearch.c, which will, if there are scankeys, descend the tree to a leaf page and read
>> just the first (or possibly two) tuples. It won't touch the rest of the page yet. If indeed just one tuple was
>> required, there won't be a call to _bt_next and we're done. If we do need more than one tuple, _bt_next will resume
>> reading tuples from the index page at the point where we left off.

> How do you know how many index entries you have to fetch to get a tuple
> that's live/visible to the query?

Indeed we don't know that - that's why this initial patch does not make any assumptions about this and just assumes the good-weather scenario that everything is visible. I'm not sure if it's possible to give an estimation of this, or whether that would be useful. Currently, if it turns out that the tuple is not visible, there'll just be another call to _bt_next, which will resume reading the page as normal. I'm open to implementing any suggestions that may improve this.

>> - We need to take into account page splits, insertions and vacuums while we do not have the read-lock in between
>> _bt_first and the first call to _bt_next. This made my patch quite a bit more complicated than my initial
>> implementation.

> Meh.  I think the odds that you got this 100% right are small, and the
> odds that it would be maintainable are smaller.  There's too much that
> can happen if you're not holding any lock --- and there's a lot of active
> work on btree indexes, which could break whatever assumptions you might
> make today.

Agreed, which is also why I posted this initial version of the patch here already: to get some input from the experts on this topic about what assumptions can be made now and in the future. If it turns out that an optimization like this is completely infeasible because of other constraints in the btree implementation, then we're done pretty quickly here. :-) For what it's worth, the patch at least passes make check consistently - I caught a lot of these edge cases related to page splits and insertions while running the regression tests, which exercise the modified bits of code quite often and in parallel. There may be plenty of edge cases left, however...

> I'm not unalterably opposed to doing something like this, but my sense
> is that the complexity and probable negative performance impact on other
> cases are not going to look like a good trade-off for optimizing the
> case at hand.

I do think it could be a big win if we could get something like this working. Cases with a LIMIT seem common enough to me to justify some extra optimization, especially if that could lead to 2-3x the TPS for these kinds of queries. However, it indeed needs to stay within reasonable complexity. If it turns out that optimizing this requires adding a lot of extra complexity, it may not be worth it.

> BTW, you haven't even really made the case that optimizing a query that
> behaves this way is the right thing to be doing ... maybe some other
> plan shape that isn't a nestloop around a LIMIT query would be a better
> solution.

It is pretty difficult to come up with any faster plans than this, unfortunately. We have a large database with many tables of timeseries data, and when tables get large, there's an efficient multi-column index, and you want to do these kinds of time-based or item-based lookups, the nested loop is generally the fastest option. I can elaborate more on this, but I'm not sure this thread is the best place for that.

I appreciate you taking the time to look at this. I'd be happy to look further into any suggestions you come up with. Working on this has taught me a lot about the internals of Postgres already - I find it really interesting!

-Floris



Re: Optimize single tuple fetch from nbtree index

From: Peter Geoghegan
On Fri, Aug 2, 2019 at 1:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Meh.  I think the odds that you got this 100% right are small, and the
> odds that it would be maintainable are smaller.  There's too much that
> can happen if you're not holding any lock --- and there's a lot of active
> work on btree indexes, which could break whatever assumptions you might
> make today.

I agree that this sounds very scary.

> BTW, you haven't even really made the case that optimizing a query that
> behaves this way is the right thing to be doing ... maybe some other
> plan shape that isn't a nestloop around a LIMIT query would be a better
> solution.

I wonder if some variety of block nested loop join would be helpful
here. I'm not aware of any specific design that would help with
Floris' case, but the idea of reducing the number of scans required on
the inner side by buffering outer side tuples (say based on the
"k=:val" constant) seems like it might generalize well enough. I
suggest Floris look into that possibility. This paper might be worth a
read:

https://dl.acm.org/citation.cfm?id=582278

(Though it also might not be worth a read -- I haven't actually read it myself.)

--
Peter Geoghegan



Re: Optimize single tuple fetch from nbtree index

From: Peter Geoghegan
On Fri, Aug 2, 2019 at 5:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I wonder if some variety of block nested loop join would be helpful
> here. I'm not aware of any specific design that would help with
> Floris' case, but the idea of reducing the number of scans required on
> the inner side by buffering outer side tuples (say based on the
> "k=:val" constant) seems like it might generalize well enough. I
> suggest Floris look into that possibility. This paper might be worth a
> read:
>
> https://dl.acm.org/citation.cfm?id=582278

Actually, having looked at the test case in more detail, that now
seems less likely. The test case seems designed to reward making it
cheaper to access one specific tuple among a fairly large group of
related tuples -- reducing the number of inner scans is not going to
be possible there.

If this really is totally representative of the case that Floris cares
about, I suppose that the approach taken more or less makes sense.
Unfortunately, it doesn't seem like an optimization that many other
users would find compelling, partly because it's only concerned with
fixed overheads, and partly because most queries don't actually look
like this.

-- 
Peter Geoghegan



Re: Optimize single tuple fetch from nbtree index

From: Michail Nikolaev
Hello everyone.

I was also looking into the possibility of such an optimisation a few days ago (an attempt to reduce memcpy overhead on IndexOnlyScan).

One thing I noticed here - the whole page is scanned only if the index quals are "open" on one side.

So, in the case of
     SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1;
the whole index page will be read.

But
    SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp AND ts>=:timestamp - :interval ORDER BY k, ts DESC LIMIT 1;
is semantically the same (provided a matching update exists within :interval), but only the few records within :interval will be processed.

So, you could try to compare such a query in your benchmarks.
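
For example, a quick way to compare the two query shapes (a sketch only; the :parameters are placeholders to be substituted, e.g. via pgbench -D or psql \set, and timings rather than row counts are what should differ):

    -- open-ended qual: all matching items on the leaf page are examined
    EXPLAIN ANALYZE
    SELECT * FROM tbl
    WHERE k = :val AND ts <= :timestamp
    ORDER BY k, ts DESC LIMIT 1;

    -- qual closed on both sides: per the observation above, the scan can
    -- stop after the few items that fall inside :interval
    EXPLAIN ANALYZE
    SELECT * FROM tbl
    WHERE k = :val AND ts <= :timestamp AND ts >= :timestamp - :interval
    ORDER BY k, ts DESC LIMIT 1;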

Also, some info about the current design can be found in src/backend/access/nbtree/README ("To minimize lock/unlock traffic, an index scan always searches a leaf page to identify all the matching items at once").

Thanks,
 Michail.

Re: Optimize single tuple fetch from nbtree index

From: Floris Van Nee
Hi Peter,

> Actually, having looked at the test case in more detail, that now
> seems less likely. The test case seems designed to reward making it
> cheaper to access one specific tuple among a fairly large group of
> related tuples -- reducing the number of inner scans is not going to
> be possible there.

> If this really is totally representative of the case that Floris cares
> about, I suppose that the approach taken more or less makes sense.
> Unfortunately, it doesn't seem like an optimization that many other
> users would find compelling, partly because it's only concerned with
> fixed overheads, and partly because most queries don't actually look
> like this.

Thanks for taking a look. Unfortunately this is exactly the case I care about. I'm a bit puzzled as to why this case wouldn't come up more often for other users though. We have many large tables with timeseries data, and it seems to me that with timeseries data, two of the most common queries are:
(1) What is the state of { a1, a2, a3, ... } at timepoint t? (But you don't know that there's an update *exactly* at timepoint t, so you're left with trying to find the latest update smaller than t.)
(2) What is the state of { a } at timepoints { t1, t2, t3, ... }?
Here a1, a2, a3, ... are independently updating but similar time series (e.g. sensor a1 and sensor a2 both provide a temperature value but update independently of each other).
Both of these can also be done with some kind of DISTINCT ON clause, but this is often already much slower than just doing a nested loop of fast index lookups with LIMIT 1 (this depends on the frequency of the timeseries data itself versus the sampling rate of your query though; for high-frequency time series and/or low-frequency sampling the LIMIT 1 approach is much faster). A sketch of both formulations is given below.
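
To make the comparison concrete, a rough sketch of the two formulations of case (1). The table "keys", holding the key values a1, a2, a3, ..., and the payload column v are hypothetical:

    -- DISTINCT ON formulation: latest state of every key at timepoint :t
    SELECT DISTINCT ON (k) k, ts, v
    FROM tbl
    WHERE ts <= :t
    ORDER BY k, ts DESC;

    -- nested loop of LIMIT 1 index lookups; often much faster for
    -- high-frequency series sampled at a low rate
    SELECT s.*
    FROM keys
    CROSS JOIN LATERAL (
        SELECT k, ts, v
        FROM tbl
        WHERE tbl.k = keys.k AND tbl.ts <= :t
        ORDER BY tbl.k, tbl.ts DESC
        LIMIT 1
    ) s;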

Note that there is actually some related work to this - in the Index Skip Scan thread [1] a function called _bt_read_closest was developed which also partially reads the page. A Skip Scan has a very similar access pattern to the use case I describe here, because it is also very likely to need just one tuple from the page. Even though the implementation in that patch is currently incorrect, performance of the Skip Scan would likely also be quite a bit faster if it had a correct implementation of this partial page read and didn't have to read the full page every time.

I have one further question about these index offsets. There are several comments in master that indicate that it's impossible for an item to move 'left' on a page if we continuously hold a pin on the page. For example, _bt_killitems has a comment like this:

* Note that if we hold a pin on the target page continuously from initially
 * reading the items until applying this function, VACUUM cannot have deleted
 * any items from the page, and so there is no need to search left from the
 * recorded offset.  (This observation also guarantees that the item is still
 * the right one to delete, which might otherwise be questionable since heap
 * TIDs can get recycled.)    This holds true even if the page has been modified
 * by inserts and page splits, so there is no need to consult the LSN.

Still, exactly this case happens in practice. In my tests I was able to get behavior like:
1) pin + lock a page in _bt_first
2) read a tuple, record its indexOffset (for example offset=100) and heap tid
3) unlock the page, but *keep* the pin (end of _bt_first in my patch)
4) lock the page again in _bt_next (we still hold the pin, so vacuum should not have occurred)
5) look inside the current page for the heap tid that we recorded earlier
6) find that this tuple is now at indexOffset=98, i.e. it moved left. This should not be possible.
This case sometimes randomly happens when running 'make check', which is why I added code in my patch to also look left on the page from the previous indexOffset.

However, this is in contradiction with the comments (and code) of _bt_killitems. Is the comment incorrect or outdated, or is there a bug in vacuum or some other part of Postgres that can move index items left even though others hold a pin?

-Floris

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGKW4dXTP9G%2BWBskjT09tzD%2B9aMWEm%3DFpeb6RS5SXfPyKw%40mail.gmail.com#21abe755d5cf36aabaaa048c8a282169



Re: Optimize single tuple fetch from nbtree index

From: Anastasia Lubennikova
05.08.2019 14:34, Floris Van Nee wrote:
> I have one further question about these index offsets. There are several comments in master that indicate that it's
> impossible for an item to move 'left' on a page if we continuously hold a pin on the page. For example, _bt_killitems
> has a comment like this:
>
> * Note that if we hold a pin on the target page continuously from initially
>   * reading the items until applying this function, VACUUM cannot have deleted
>   * any items from the page, and so there is no need to search left from the
>   * recorded offset.  (This observation also guarantees that the item is still
>   * the right one to delete, which might otherwise be questionable since heap
>   * TIDs can get recycled.)    This holds true even if the page has been modified
>   * by inserts and page splits, so there is no need to consult the LSN.
>
> Still, exactly this case happens in practice. In my tests I was able to get behavior like:
> 1) pin + lock a page in _bt_first
> 2) read a tuple, record its indexOffset (for example offset=100) and heap tid
> 3) unlock the page, but *keep* the pin (end of _bt_first in my patch)
> 4) lock the page again in _bt_next (we still hold the pin, so vacuum should not have occurred)
> 5) look inside the current page for the heap tid that we recorded earlier
> 6) find that this tuple is now at indexOffset=98, i.e. it moved left. This should not be possible.
> This case sometimes randomly happens when running 'make check', which is why I added code in my patch to also look
> left on the page from the previous indexOffset.
>
> However, this is in contradiction with the comments (and code) of _bt_killitems.
> Is the comment incorrect or outdated, or is there a bug in vacuum or some other part of Postgres that can move index
> items left even though others hold a pin?

Hello,
welcome to hackers with your first patch)

As far as I understood from the thread above, the design of this optimization is under discussion, so I didn't review the proposed patch itself. Though, I got interested in the comment inconsistency you have found. I added a debug message into this code branch of the patch and was able to see it in regression.diffs after 'make check'. Speaking of your patch, it seems that the buffer was unpinned and pinned again between the two reads, and the condition of holding it continuously has not been met.

I didn't dig into the code, but this line looks suspicious (see my 
findings about BTScanPosIsPinned below):

         /* bump pin on current buffer for assignment to mark buffer */
         if (BTScanPosIsPinned(so->currPos))
             IncrBufferRefCount(so->currPos.buf);


While reading the code to answer your question, I noticed that the BTScanPosIsPinned macro name is misleading. It calls BufferIsValid(), not BufferIsPinned() as one could expect, and the BufferIsValid comment in bufmgr.h explicitly states that it shouldn't be confused with BufferIsPinned. The same goes for BTScanPosUnpinIfPinned().

I propose that we update the BTScanPosIsPinned macro, or at least write a comment explaining why its current behavior is fine. There are a few existing callers that clearly expect this macro to check for a pin, which it doesn't do. I don't quite understand whether that already causes a subtle bug, or whether the current algorithm is fine.

Peter, Tom, what do you think?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Optimize single tuple fetch from nbtree index

From: Peter Geoghegan
On Fri, Aug 23, 2019 at 10:14 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Though, I got interested in the comment inconsistency you have found.
> I added a debug message into this code branch of the patch and was able to
> see it in regression.diffs after 'make check'.
> Speaking of your patch, it seems that the buffer was unpinned and pinned
> again between the two reads,
> and the condition of holding it continuously has not been met.

See commit 2ed5b87f. The code is supposed to do that, but it might do
it more often than is truly necessary. We don't want to block VACUUM
by holding a buffer pin for a very long time, which is theoretically
possible here. Do you think that it is actually unnecessary here? In
other words, do you think that we can fix this without breaking cases
that commit 2ed5b87f cares about?

I have been suspicious of this commit all along. For example, I
noticed that it can cause the kill_prior_tuple mechanism to be
ineffective in a way that didn't happen prior to Postgres 9.5:

https://postgr.es/m/CAH2-Wz=SfAKVMv1x9Jh19EJ8am8TZn9f-yECipS9HrrRqSswnA@mail.gmail.com

That particular complaint was never addressed. I meant to do more on
commit 2ed5b87f.

> I didn't dig into the code, but this line looks suspicious (see my
> findings about BTScanPosIsPinned below):
>
>          /* bump pin on current buffer for assignment to mark buffer */
>          if (BTScanPosIsPinned(so->currPos))
>              IncrBufferRefCount(so->currPos.buf);
>
>
> While reading the code to answer your question, I noticed that
> the BTScanPosIsPinned macro name is misleading.
> It calls BufferIsValid(), not BufferIsPinned() as one could expect,
> and the BufferIsValid comment in bufmgr.h explicitly states that it
> shouldn't be confused with BufferIsPinned.
> The same goes for BTScanPosUnpinIfPinned().

I have always hated this macro. I think that code like the specific
code you quoted might be correct, kind of, but it looks like the
author was trying to change as little as possible about the code as it
existed in 2015, rather than changing things so that everything made
sense. It looks like a messy accretion.

Let me see if I can get it straight:

We're incrementing the ref count on the buffer if and only if it is
pinned (by which we mean valid), though only when the scan is valid
(which is not the same as pinned). Whether or not we increment the
count of a valid scan doesn't affect anything else we do (i.e. we
still restore a marked position either way).

This is just awful.

> I propose that we update the BTScanPosIsPinned macro, or at least write a
> comment explaining why its current behavior is fine.
> There are a few existing callers that clearly expect this macro to
> check for a pin, which it doesn't do.
> I don't quite understand whether that already causes a subtle bug, or
> whether the current algorithm is fine.

I think that you're right -- at a minimum, this requires more
documentation. This code is a few years old, but I still wouldn't be
surprised if it turned out to be slightly wrong in a way that was
important. We still have no way of detecting if a buffer is accessed
without a pin. There have been numerous bugs like that before. (We
have talked about teaching Valgrind to detect the case, but that never
actually happened.)

--
Peter Geoghegan



Re: Optimize single tuple fetch from nbtree index

From: Floris Van Nee
> Hello,
> welcome to hackers with your first patch)

Thank you.

> Though, I got interested in the comment inconsistency you have found.
> I added a debug message into this code branch of the patch and was able to
> see it in regression.diffs after 'make check'.
> Speaking of your patch, it seems that the buffer was unpinned and pinned
> again between the two reads,
> and the condition of holding it continuously has not been met.

May I ask what makes you conclude that the condition of holding the pin continuously has not been met?
Your reply encouraged me to dig a little more into this today. First, I wanted to check whether the pin was indeed held continuously by the backend or not. I added some debug info to ReleaseBuffer for this: it turned out that the pin on the buffer was definitely never released by the backend between the calls to _bt_first and _bt_next. So the buffer got compacted while the backend held a pin on it.
After some more searching I found the following code: _bt_vacuum_one_page in nbtinsert.c. This function compacts one single page without taking a super-exclusive lock. It is used during inserts to make room on a page. I verified that if I comment out the calls to this function, the compacting never happens while I have a pin on the buffer.
So I guess that answers my own question: cleaning up garbage during inserts is one of the cases where compacting may happen even while other backends hold a pin on the buffer. Perhaps this should also be phrased more clearly in the comments of e.g. _bt_killitems? Because currently those comments make it look like this case never occurs.

> While reading the code to answer your question, I noticed that
> the BTScanPosIsPinned macro name is misleading.
> It calls BufferIsValid(), not BufferIsPinned() as one could expect,
> and the BufferIsValid comment in bufmgr.h explicitly states that it
> shouldn't be confused with BufferIsPinned.
> The same goes for BTScanPosUnpinIfPinned().

I agree the name is misleading - it clearly does something other than what its name suggests. However, I don't believe this introduces problems in these particular pieces of code, as long as the macros are always used. BTScanPosIsPinned actually checks whether the position is valid, and not necessarily whether it's pinned, as you mentioned. However, any time the buffer gets unpinned via the macro BTScanPosUnpin, the buffer also gets set to Invalid by that macro. Therefore, any subsequent call to BTScanPosIsPinned should indeed return false. It'd definitely be nice if this gets clarified in comments though.

-Floris



Re: Optimize single tuple fetch from nbtree index

From: Anastasia Lubennikova
25.08.2019 0:59, Floris Van Nee wrote:
>> Though, I got interested in the comment inconsistency you have found.
>> I added a debug message into this code branch of the patch and was able to
>> see it in regression.diffs after 'make check'.
>> Speaking of your patch, it seems that the buffer was unpinned and pinned
>> again between the two reads,
>> and the condition of holding it continuously has not been met.
> May I ask what makes you conclude that the condition of holding the pin continuously has not been met?
> Your reply encouraged me to dig a little more into this today. First, I wanted to check whether the pin was indeed
> held continuously by the backend or not. I added some debug info to ReleaseBuffer for this: it turned out that the pin
> on the buffer was definitely never released by the backend between the calls to _bt_first and _bt_next. So the buffer
> got compacted while the backend held a pin on it.
> After some more searching I found the following code: _bt_vacuum_one_page in nbtinsert.c. This function compacts one
> single page without taking a super-exclusive lock. It is used during inserts to make room on a page. I verified that
> if I comment out the calls to this function, the compacting never happens while I have a pin on the buffer.
> So I guess that answers my own question: cleaning up garbage during inserts is one of the cases where compacting may
> happen even while other backends hold a pin on the buffer. Perhaps this should also be phrased more clearly in the
> comments of e.g. _bt_killitems? Because currently those comments make it look like this case never occurs.

You're right, the pin was not released between page reads. I also added debug output to UnpinBuffer, but now I see that I had interpreted it wrongly.

As far as I understand, the issue with your patch is that it breaks the *scan stops "between" pages* assumption, and thus it interacts unsafely with the _bt_vacuum_one_page() cleanup.

See the README:

> Page read locks are held only for as long as a scan is examining a page.
> To minimize lock/unlock traffic, an index scan always searches a leaf page
> to identify all the matching items at once, copying their heap tuple IDs
> into backend-local storage.  The heap tuple IDs are then processed while
> not holding any page lock within the index.  We do continue to hold a pin
> on the leaf page in some circumstances, to protect against concurrent
> deletions (see below).  In this state the scan is effectively stopped
> "between" pages, either before or after the page it has pinned. This is
> safe in the presence of concurrent insertions and even page splits, because
> items are never moved across pre-existing page boundaries --- so the scan
> cannot miss any items it should have seen, nor accidentally return the same
> item twice.

and

> Once an index tuple has been marked LP_DEAD it can actually be removed
> from the index immediately; since index scans only stop "between" pages,
> no scan can lose its place from such a deletion.

It seems that this contradicts the very idea of your patch, so probably we should look for other ways to optimize this use case. Maybe the restriction can be relaxed for write-only tables that never have to reread the page because of visibility, or something like that. Also, we could probably add information about the expected number of tuples to IndexScanDescData, to allow the index to work more optimally and to avoid the overhead for other loads.

>> While reading the code to answer your question, I noticed that
>> the BTScanPosIsPinned macro name is misleading.
>> It calls BufferIsValid(), not BufferIsPinned() as one could expect,
>> and the BufferIsValid comment in bufmgr.h explicitly states that it
>> shouldn't be confused with BufferIsPinned.
>> The same goes for BTScanPosUnpinIfPinned().
> I agree the name is misleading - it clearly does something other than what its name suggests. However, I don't
> believe this introduces problems in these particular pieces of code, as long as the macros are always used.
> BTScanPosIsPinned actually checks whether the position is valid, and not necessarily whether it's pinned, as you
> mentioned. However, any time the buffer gets unpinned via the macro BTScanPosUnpin, the buffer also gets set to
> Invalid by that macro. Therefore, any subsequent call to BTScanPosIsPinned should indeed return false. It'd
> definitely be nice if this gets clarified in comments though.

That's true. It took me quite some time to understand that the existing code is correct. There is a comment on the structure's field claiming that BufferIsValid is the same as BufferIsPinned in the ScanPos context. The attached patch contains some comment updates. Any suggestions on how to improve them are welcome.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachments

Re: Optimize single tuple fetch from nbtree index

From: Floris Van Nee
> It seems that this contradicts the very idea of your patch, so probably we
> should look for other ways to optimize this use case.
> Maybe the restriction can be relaxed for write-only tables that never
> have to reread the page because of visibility, or something like that.
> Also, we could probably add information about the expected number of tuples
> to IndexScanDescData, to allow the index to work more optimally
> and to avoid the overhead for other loads.

The idea of the patch is exactly to relax this limitation. I forgot to update that README file though. The current implementation of the patch should be correct like this - that's why I added the look-back code on the page, in case the tuple can no longer be found at the same location on the page. Similarly, it will look on the page to the right if it detects a page split. These two measures combined should give a correct implementation of the 'it's possible that a scan stops in the middle of a page' relaxation. However, as Peter and Tom pointed out earlier, they feel that the performance advantage this approach gives does not outweigh the extra complexity at this time. I'd be open to other suggestions though.

> That's true. It took me quite some time to understand that existing code
> is correct.
> There is a comment for the structure's field that claims that
> BufferIsValid is the same that BufferIsPinned in ScanPos context.
> Attached patch contains some comments' updates. Any suggestions on how
> to improve them are welcome.

I'll have a look tomorrow. Thanks a lot for writing this up!

-Floris



Re: Optimize single tuple fetch from nbtree index

From: Floris Van Nee
>> It seems that this contradicts the very idea of your patch, so probably we
>> should look for other ways to optimize this use case.
>> Maybe the restriction can be relaxed for write-only tables that never
>> have to reread the page because of visibility, or something like that.
>> Also, we could probably add information about the expected number of tuples
>> to IndexScanDescData, to allow the index to work more optimally
>> and to avoid the overhead for other loads.

> The idea of the patch is exactly to relax this limitation. I forgot to update that README file though. The current
> implementation of the patch should be correct like this - that's why I added the look-back code on the page, in case
> the tuple can no longer be found at the same location on the page. Similarly, it will look on the page to the right
> if it detects a page split. These two measures combined should give a correct implementation of the 'it's possible
> that a scan stops in the middle of a page' relaxation. However, as Peter and Tom pointed out earlier, they feel that
> the performance advantage this approach gives does not outweigh the extra complexity at this time. I'd be open to
> other suggestions though.

Although now that I think of it - do you mean the case where the tuple that we returned to the caller after _bt_first actually gets deleted (not just moved) from the page? I guess that can theoretically happen if _bt_first returns a non-visible tuple (one not yet marked dead in the index at the time of _bt_first). For my understanding, would a situation like the following lead to this (in my patch)?
1) Backend 1 does an index scan and returns the first tuple from _bt_first - this tuple is actually deleted in the heap already, however it's not marked dead yet in the index.
2) Backend 1 does a heap fetch to check actual visibility and determines the tuple is indeed dead.
3) While backend 1 is busy doing the heap fetch (so in between _bt_first and _bt_next), backend 2 comes in and manages to (a) do a _bt_killitems on the page to mark tuples dead, and (b) compact the items on the page, thereby actually removing this item from the page.
4) Now backend 1 tries to find the next tuple in _bt_next - it first tries to locate the tuple where it left off, but cannot find it anymore because it got removed completely by backend 2.

If this is indeed possible, then it's a bad issue unfortunately, and quite hard to reproduce, as a lot of things need to happen concurrently while doing a visibility check.

As for your patch, I've had some time to take a look at it. For the two TODOs:

+        /* TODO Is it possible that currPage is not valid anymore? */
+        Assert(BTScanPosIsValid(so->currPos))

This Assert already exists a couple of lines earlier, at the start of this function.

+ * TODO It is not clear to me
+ * why to check scanpos validity based on currPage value.
+ * I wonder, if we need currPage at all? Is there any codepath that
+ * assumes that currPage is not the same as BufferGetBlockNumber(buf)?
+ */

The comments in the source mention the following about this:
         * We note the buffer's block number so that we can release the pin later.
         * This allows us to re-read the buffer if it is needed again for hinting.
         */
        so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);

As we figured out earlier, so->currPos.buf gets set to invalid when we release the pin via the unpin macro. So, if we don't store the currPage number somewhere else, we cannot obtain the pin again if we need it during killitems. I think that's the reason currPage is stored.

Other than the two TODOs in the code, I think the comments really help clarify what's going on in the code - I'd be happy if this gets added.

-Floris