
Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan
Date:
I've been thinking about using heap TID as a tie-breaker when
comparing B-Tree index tuples for a while now [1]. I'd like to make
all tuples at the leaf level unique, as assumed by L&Y. This can
enable "retail index tuple deletion", which I think we'll probably end
up implementing in some form or another, possibly as part of the zheap
project. It's also possible that this work will facilitate GIN-style
deduplication based on run length encoding of TIDs, or storing
versioned heap TIDs in an out-of-line nbtree-versioning structure
(unique indexes only). I can see many possibilities, but we have to
start somewhere.
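To make the basic idea concrete, here is a rough sketch of what the
tie-breaker amounts to. This is an illustration only, not the patch's
actual code; the function and its arguments are invented for the example:

/*
 * Simplified sketch of a leaf tuple comparison with heap TID as the
 * tie-breaking last attribute.  Illustration only -- not the patch's
 * actual code.
 */
#include "postgres.h"

#include "access/itup.h"
#include "access/skey.h"
#include "fmgr.h"
#include "storage/itemptr.h"
#include "utils/rel.h"

static int32
leaf_tuple_compare(Relation rel, ScanKey scankeys, int keysz,
                   IndexTuple ltup, IndexTuple rtup)
{
    TupleDesc   itupdesc = RelationGetDescr(rel);
    int         i;

    for (i = 1; i <= keysz; i++)
    {
        ScanKey     key = &scankeys[i - 1];
        Datum       ldatum,
                    rdatum;
        bool        lnull,
                    rnull;
        int32       cmp;

        ldatum = index_getattr(ltup, i, itupdesc, &lnull);
        rdatum = index_getattr(rtup, i, itupdesc, &rnull);

        /* this sketch just puts NULLs last; real code honors sk_flags */
        if (lnull || rnull)
        {
            if (lnull && rnull)
                continue;
            return lnull ? 1 : -1;
        }

        cmp = DatumGetInt32(FunctionCall2Coll(&key->sk_func,
                                              key->sk_collation,
                                              ldatum, rdatum));
        if (cmp != 0)
            return cmp;
    }

    /* all user-visible attributes are equal: heap TID breaks the tie */
    return ItemPointerCompare(&ltup->t_tid, &rtup->t_tid);
}

The essential property is simply that no two leaf tuples can ever
compare as fully equal once the heap TID participates.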

I attach an unfinished prototype of suffix truncation, that also
sometimes *adds* a new attribute in pivot tuples. It adds an extra
heap TID from the leaf level when truncating away non-distinguishing
attributes during a leaf page split, though only when it must. The
patch also has nbtree treat heap TID as a first class part of the key
space of the index. Claudio wrote a patch that did something similar,
though without the suffix truncation part [2] (I haven't studied his
patch, to be honest). My patch is actually a very indirect spin-off of
Anastasia's covering index patch, and I want to show what I have in
mind now, while it's still swapped into my head. I won't do any
serious work on this project unless and until I see a way to implement
retail index tuple deletion, which seems like a multi-year project
that requires the buy-in of multiple senior community members. On its
own, my patch regresses performance unacceptably in some workloads,
probably due to interactions with kill_prior_tuple()/LP_DEAD hint
setting, and interactions with page space management when there are
many "duplicates" (it can still help performance in some pgbench
workloads with non-unique indexes, though).

Note that the approach to suffix truncation that I've taken isn't even
my preferred approach [3] -- it's a medium-term solution that enables
making a heap TID attribute part of the key space, which enables
everything else. Cheap incremental/retail tuple deletion is the real
prize here; don't lose sight of that when looking through my patch. If
we're going to teach nbtree to truncate this new implicit heap TID
attribute, which seems essential, then we might as well teach nbtree
to do suffix truncation of other (user-visible) attributes while we're
at it. This patch isn't a particularly effective implementation of
suffix truncation, because that's not what I'm truly interested in
improving here (plus I haven't even bothered to optimize the logic for
picking a split point in light of suffix truncation).

amcheck
=======

This patch adds amcheck coverage, which seems like essential
infrastructure for developing a feature such as this. Extensive
amcheck coverage gave me confidence in my general approach. The basic
idea, invariant-wise, is to treat truncated attributes (often
including a truncated heap TID attribute in internal pages) as "minus
infinity" attributes, which participate in comparisons if and only if
we reach such attributes before the end of the scan key (a smaller
keysz for the index scan could prevent this). I've generalized the
minus infinity concept that _bt_compare() has always considered as a
special case, extending it to individual attributes. It's actually
possible to remove that old hard-coded _bt_compare() logic with this
patch applied without breaking anything, since we can rely on the
comparison of an explicitly 0-attribute tuple working the same way
(pg_upgrade'd databases will break if we do this, however, so I didn't
go that far).
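Here is the shape of the rule, as a simplified sketch rather than the
patch's real _bt_compare() (the helper below is illustrative only):

/*
 * Sketch of the "minus infinity attribute" rule; simplified, and not the
 * patch's real _bt_compare().  pivot_natts is the number of attributes
 * physically present in the (possibly truncated) pivot tuple.
 */
#include "postgres.h"

#include "access/itup.h"
#include "access/nbtree.h"
#include "fmgr.h"
#include "utils/rel.h"

static int32
compare_scankey_to_pivot(Relation rel, ScanKey scankeys, int keysz,
                         IndexTuple pivot)
{
    TupleDesc   itupdesc = RelationGetDescr(rel);
    int         pivot_natts = BTreeTupleGetNAtts(pivot, rel);
    int         i;

    for (i = 1; i <= keysz; i++)
    {
        ScanKey     key = &scankeys[i - 1];
        Datum       datum;
        bool        isnull;
        int32       cmp;

        if (i > pivot_natts)
        {
            /*
             * Attribute was suffix-truncated away: treat it as "minus
             * infinity", so the scan key is strictly greater no matter
             * what value it holds.
             */
            return 1;
        }

        datum = index_getattr(pivot, i, itupdesc, &isnull);
        /* NULL handling elided for brevity */
        cmp = DatumGetInt32(FunctionCall2Coll(&key->sk_func,
                                              key->sk_collation,
                                              key->sk_argument, datum));
        if (cmp != 0)
            return cmp;
    }

    return 0;                   /* equal on every compared attribute */
}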

Note that I didn't change the logic that has _bt_binsrch() treat
internal pages in a special way when tuples compare as equal. We still
need that logic for cases where keysz is less than the number of
indexed columns. It's only possible to avoid this _bt_binsrch() thing
for internal pages when all attributes, including heap TID, were
specified and compared (an insertion scan key has to have an entry for
every indexed column, including even heap TID). Doing better there
doesn't seem worth the trouble of teaching _bt_compare() to tell the
_bt_binsrch() caller about this as a special case. That means that we
still move left on equality in some cases where it isn't strictly
necessary, contrary to L&Y. However, amcheck verifies that the classic
"Ki < v <= Ki+1" invariant holds (as opposed to "Ki <= v <= Ki+1")
when verifying parent/child relationships, which demonstrates that I
have restored the classic invariant (I just don't find it worthwhile
to take advantage of it within _bt_binsrch() just yet).

Most of this work was done while I was an employee of VMware, though I
joined Crunchy Data on Monday and cleaned it up a bit more since then.
I'm excited about joining Crunchy, but I should also acknowledge
VMware's strong support of my work.

[1]
https://wiki.postgresql.org/wiki/Key_normalization#Making_all_items_in_the_index_unique_by_treating_heap_TID_as_an_implicit_last_attribute
[2] https://postgr.es/m/CAGTBQpZ-kTRQiAa13xG1GNe461YOwrA-s-ycCQPtyFrpKTaDBQ@mail.gmail.com
[3] https://wiki.postgresql.org/wiki/Key_normalization#Suffix_truncation_of_normalized_keys
-- 
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Robert Haas
Date:
On Thu, Jun 14, 2018 at 2:44 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I've been thinking about using heap TID as a tie-breaker when
> comparing B-Tree index tuples for a while now [1]. I'd like to make
> all tuples at the leaf level unique, as assumed by L&Y. This can
> enable "retail index tuple deletion", which I think we'll probably end
> up implementing in some form or another, possibly as part of the zheap
> project. It's also possible that this work will facilitate GIN-style
> deduplication based on run length encoding of TIDs, or storing
> versioned heap TIDs in an out-of-line nbtree-versioning structure
> (unique indexes only). I can see many possibilities, but we have to
> start somewhere.

Yes, retail index deletion is essential for the delete-marking
approach that is proposed for zheap.

It could also be extremely useful in some workloads with the regular
heap.  If the indexes are large -- say, 100GB -- and the number of
tuples that vacuum needs to kill is small -- say, 5 -- scanning them
all to remove the references to those tuples is really inefficient.
If we had retail index deletion, then we could make a cost-based
decision about which approach to use in a particular case.
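Purely to illustrate the kind of decision I mean -- hypothetical names
and a crude cost model, nothing like this exists today:

/*
 * Hypothetical cost comparison -- nothing like this exists today, and
 * the names are made up.  Each retail deletion costs about one
 * root-to-leaf descent; a conventional bulk delete reads every index
 * page once.
 */
#include "postgres.h"

#include "storage/block.h"

typedef struct IndexCleanupCosts
{
    BlockNumber index_blocks;   /* total pages in the index */
    int         tree_height;    /* estimated B-Tree height */
    double      num_dead;       /* dead tuples to remove */
} IndexCleanupCosts;

static bool
prefer_retail_deletion(const IndexCleanupCosts *c)
{
    double      retail_cost = c->num_dead * (c->tree_height + 1);
    double      bulkdel_cost = (double) c->index_blocks;

    return retail_cost < bulkdel_cost;
}

With a 100GB index and only 5 dead tuples, the retail side of that
comparison wins by several orders of magnitude.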

> mind now, while it's still swapped into my head. I won't do any
> serious work on this project unless and until I see a way to implement
> retail index tuple deletion, which seems like a multi-year project
> that requires the buy-in of multiple senior community members.

Can you enumerate some of the technical obstacles that you see?

> On its
> own, my patch regresses performance unacceptably in some workloads,
> probably due to interactions with kill_prior_tuple()/LP_DEAD hint
> setting, and interactions with page space management when there are
> many "duplicates" (it can still help performance in some pgbench
> workloads with non-unique indexes, though).

I think it would be helpful if you could talk more about these
regressions (and the wins).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan
Date:
On Fri, Jun 15, 2018 at 2:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Yes, retail index deletion is essential for the delete-marking
> approach that is proposed for zheap.

Makes sense.

I don't know that much about zheap. I'm sure that retail index tuple
deletion is really important in general, though. The Gray & Reuter
book treats unique keys as a basic assumption, as do other
authoritative reference works and papers. Other database systems
probably make unique indexes simply use the user-visible attributes as
unique values, but appending heap TID as a unique-ifier is probably a
reasonably common design for secondary indexes (it would also be nice
if we could simply not store duplicates for unique indexes, rather
than using heap TID). I generally have a very high opinion of the
nbtree code, but this seems like a problem that ought to be fixed.

I've convinced myself that I basically have the right idea with this
patch, because the classic L&Y invariants have all been tested with an
enhanced amcheck run against all indexes in a regression test
database. There was other stress-testing, too. The remaining problems
are fixable, but I need some guidance.

> It could also be extremely useful in some workloads with the regular
> heap.  If the indexes are large -- say, 100GB -- and the number of
> tuples that vacuum needs to kill is small -- say, 5 -- scanning them
> all to remove the references to those tuples is really inefficient.
> If we had retail index deletion, then we could make a cost-based
> decision about which approach to use in a particular case.

I remember talking to Andres about this in a bar 3 years ago. I can
imagine variations of pruning that do some amount of this when there
are lots of duplicates. Perhaps something like InnoDB's purge threads,
which do things like in-place deletes of secondary indexes after an
updating (or deleting) xact commits. I believe that that mechanism
targets secondary indexes specifically, and that it operates quite
eagerly.

> Can you enumerate some of the technical obstacles that you see?

The #1 technical obstacle is that I simply don't know where I should
try to take this patch, given that it probably needs to be tied to
some much bigger project, such as zheap. I have an open mind, though,
and intend to help if I can. I'm not really sure what the #2 and #3
problems are, because I'd need to be able to see a few steps ahead to
be sure. Maybe #2 is that I'm doing something wonky to avoid breaking
duplicate checking for unique indexes. (The way that duplicate checking
for unique indexes has always worked [1] is perhaps questionable, though.)

> I think it would be helpful if you could talk more about these
> regressions (and the wins).

I think that the performance regressions are due to the fact that when
you have a huge number of duplicates today, it's useful to be able to
claim space to fit further duplicates from almost any of the multiple
leaf pages that contain or have contained duplicates. I'd hoped that
the increased temporal locality that the patch gets would more than
make up for that. As far as I can tell, the problem is that temporal
locality doesn't help enough. I saw that performance was somewhat
improved with extreme Zipf distribution contention, but it went the
other way with less extreme contention. The details are not that fresh
in my mind, since I shelved this patch for a while following limited
performance testing.

The code could certainly use more performance testing, and more
general polishing. I'm not strongly motivated to do that right now,
because I don't quite see a clear path to making this patch useful.
But, as I said, I have an open mind about what the next step should
be.

[1] https://wiki.postgresql.org/wiki/Key_normalization#Avoiding_unnecessary_unique_index_enforcement
-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Claudio Freire
Date:
On Fri, Jun 15, 2018 at 8:47 PM Peter Geoghegan <pg@bowt.ie> wrote:

> > I think it would be helpful if you could talk more about these
> > regressions (and the wins).
>
> I think that the performance regressions are due to the fact that when
> you have a huge number of duplicates today, it's useful to be able to
> claim space to fit further duplicates from almost any of the multiple
> leaf pages that contain or have contained duplicates. I'd hoped that
> the increased temporal locality that the patch gets would more than
> make up for that. As far as I can tell, the problem is that temporal
> locality doesn't help enough. I saw that performance was somewhat
> improved with extreme Zipf distribution contention, but it went the
> other way with less extreme contention. The details are not that fresh
> in my mind, since I shelved this patch for a while following limited
> performance testing.
>
> The code could certainly use more performance testing, and more
> general polishing. I'm not strongly motivated to do that right now,
> because I don't quite see a clear path to making this patch useful.
> But, as I said, I have an open mind about what the next step should
> be.

Way back when I was dabbling in this kind of endeavor, my main idea to
counteract that, and possibly improve performance overall, was a
microvacuum kind of thing that would do some on-demand cleanup to
remove duplicates or make room before page splits. Since nbtree
uniqueification enables efficient retail deletions, that could end up
as a net win.

I never got around to implementing it though, and it does get tricky
if you don't want to allow unbounded latency spikes.


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan
Date:
On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Way back when I was dabbling in this kind of endeavor, my main idea to
> counteract that, and possibly improve performance overall, was a
> microvacuum kind of thing that would do some on-demand cleanup to
> remove duplicates or make room before page splits. Since nbtree
> uniqueification enables efficient retail deletions, that could end up
> as a net win.

That sounds like a mechanism that works a bit like
_bt_vacuum_one_page(), which we run at the last second before a page
split. We do this to see if a page split that looks necessary can
actually be avoided.

I imagine that retail index tuple deletion (the whole point of this
project) would be run by a VACUUM-like process that kills tuples that
are dead to everyone. Even with something like zheap, you cannot just
delete index tuples until you establish that they're truly dead. I
guess that the delete marking stuff that Robert mentioned marks tuples
as dead when the deleting transaction commits. Maybe we could justify
having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if
they're visible to anyone, and if not recycle), because we at least
know that the deleting transaction committed there. That is, they
could be recently dead or dead, and it may be worth going to the extra
trouble of checking which when we know that it's one of the two
possibilities.
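To sketch what I mean, something loosely modeled on
_bt_vacuum_one_page() might look like this (simplified; the function
name is made up, and locking, WAL and error handling are omitted):

/*
 * Rough sketch, loosely modeled on _bt_vacuum_one_page(): before
 * splitting a leaf page, delete items already marked LP_DEAD and check
 * whether the split can be avoided.
 */
#include "postgres.h"

#include "access/nbtree.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

static bool
try_cleanup_before_split(Relation rel, Relation heapRel, Buffer buf,
                         Size newitemsz)
{
    Page        page = BufferGetPage(buf);
    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    OffsetNumber deletable[MaxOffsetNumber];
    int         ndeletable = 0;
    OffsetNumber offnum,
                maxoff = PageGetMaxOffsetNumber(page);

    for (offnum = P_FIRSTDATAKEY(opaque);
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemId = PageGetItemId(page, offnum);

        if (ItemIdIsDead(itemId))
            deletable[ndeletable++] = offnum;
    }

    if (ndeletable > 0)
        _bt_delitems_delete(rel, buf, deletable, ndeletable, heapRel);

    /* the split is avoidable if the incoming tuple now fits */
    return PageGetFreeSpace(page) >= newitemsz;
}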

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Claudio Freire
Date:
On Mon, Jun 18, 2018 at 2:03 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > Way back when I was dabbling in this kind of endeavor, my main idea to
> > counteract that, and possibly improve performance overall, was a
> > microvacuum kind of thing that would do some on-demand cleanup to
> > remove duplicates or make room before page splits. Since nbtree
> > uniqueification enables efficient retail deletions, that could end up
> > as a net win.
>
> That sounds like a mechanism that works a bit like
> _bt_vacuum_one_page(), which we run at the last second before a page
> split. We do this to see if a page split that looks necessary can
> actually be avoided.
>
> I imagine that retail index tuple deletion (the whole point of this
> project) would be run by a VACUUM-like process that kills tuples that
> are dead to everyone. Even with something like zheap, you cannot just
> delete index tuples until you establish that they're truly dead. I
> guess that the delete marking stuff that Robert mentioned marks tuples
> as dead when the deleting transaction commits. Maybe we could justify
> having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if
> they're visible to anyone, and if not recycle), because we at least
> know that the deleting transaction committed there. That is, they
> could be recently dead or dead, and it may be worth going to the extra
> trouble of checking which when we know that it's one of the two
> possibilities.

Yes, but currently _bt_vacuum_one_page() does local work on the pinned
page. Doing dead tuple deletion however involves reading the heap to
check visibility at the very least, and doing it on the whole page
might involve several heap fetches, so it's an order of magnitude
heavier if done naively.

But the idea is to do just that, only not naively.


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Amit Kapila
Date:
On Mon, Jun 18, 2018 at 10:33 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Way back when I was dabbling in this kind of endeavor, my main idea to
>> counteract that, and possibly improve performance overall, was a
>> microvacuum kind of thing that would do some on-demand cleanup to
>> remove duplicates or make room before page splits. Since nbtree
>> uniqueification enables efficient retail deletions, that could end up
>> as a net win.
>
> That sounds like a mechanism that works a bit like
> _bt_vacuum_one_page(), which we run at the last second before a page
> split. We do this to see if a page split that looks necessary can
> actually be avoided.
>
> I imagine that retail index tuple deletion (the whole point of this
> project) would be run by a VACUUM-like process that kills tuples that
> are dead to everyone. Even with something like zheap, you cannot just
> delete index tuples until you establish that they're truly dead. I
> guess that the delete marking stuff that Robert mentioned marks tuples
> as dead when the deleting transaction commits.
>

No, I don't think that is the case because we want to perform in-place
updates for indexed-column-updates.  If we won't delete-mark the index
tuple before performing in-place update, then we will have two tuples
in the index which point to the same heap-TID.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan
Date:
On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I imagine that retail index tuple deletion (the whole point of this
>> project) would be run by a VACUUM-like process that kills tuples that
>> are dead to everyone. Even with something like zheap, you cannot just
>> delete index tuples until you establish that they're truly dead. I
>> guess that the delete marking stuff that Robert mentioned marks tuples
>> as dead when the deleting transaction commits.
>>
>
> No, I don't think that is the case because we want to perform in-place
> updates for indexed-column-updates.  If we won't delete-mark the index
> tuple before performing in-place update, then we will have two tuples
> in the index which point to the same heap-TID.

How can an old MVCC snapshot that needs to find the heap tuple using
some now-obsolete key values get to the heap tuple via an index scan
if there are no index tuples that stick around until "recently dead"
heap tuples become "fully dead"? How can you avoid keeping around both
old and new index tuples at the same time?

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Amit Kapila
Date:
On Tue, Jun 19, 2018 at 11:13 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I imagine that retail index tuple deletion (the whole point of this
>>> project) would be run by a VACUUM-like process that kills tuples that
>>> are dead to everyone. Even with something like zheap, you cannot just
>>> delete index tuples until you establish that they're truly dead. I
>>> guess that the delete marking stuff that Robert mentioned marks tuples
>>> as dead when the deleting transaction commits.
>>>
>>
>> No, I don't think that is the case because we want to perform in-place
>> updates for indexed-column-updates.  If we won't delete-mark the index
>> tuple before performing in-place update, then we will have two tuples
>> in the index which point to the same heap-TID.
>
> How can an old MVCC snapshot that needs to find the heap tuple using
> some now-obsolete key values get to the heap tuple via an index scan
> if there are no index tuples that stick around until "recently dead"
> heap tuples become "fully dead"? How can you avoid keeping around both
> old and new index tuples at the same time?
>

Both values will be present in the index, but the old value will be
delete-marked.  It is correct that we can't remove the value (index
tuple) from the index until it is truly dead (not visible to anyone),
but during a delete or index-update operation, we need to traverse the
index to mark the entries as delete-marked.  See, at this stage, I
don't want to go into too much detail about how delete-marking
will happen in zheap, and I am also not sure this thread is the right
place to discuss details of that technology.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan
Date:
On Tue, Jun 19, 2018 at 8:52 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Both values will be present in the index, but the old value will be
> delete-marked.  It is correct that we can't remove the value (index
> tuple) from the index until it is truly dead (not visible to anyone),
> but during a delete or index-update operation, we need to traverse the
> index to mark the entries as delete-marked.  See, at this stage, I
> don't want to go in too much detail discussion of how delete-marking
> will happen in zheap and also I am not sure this thread is the right
> place to discuss details of that technology.

I don't understand, but okay. I can provide feedback once a design for
delete marking is available.


-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan
Date:
On Thu, Jun 14, 2018 at 11:44 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> I attach an unfinished prototype of suffix truncation, that also
> sometimes *adds* a new attribute in pivot tuples. It adds an extra
> heap TID from the leaf level when truncating away non-distinguishing
> attributes during a leaf page split, though only when it must. The
> patch also has nbtree treat heap TID as a first class part of the key
> space of the index. Claudio wrote a patch that did something similar,
> though without the suffix truncation part [2] (I haven't studied his
> patch, to be honest). My patch is actually a very indirect spin-off of
> Anastasia's covering index patch, and I want to show what I have in
> mind now, while it's still swapped into my head. I won't do any
> serious work on this project unless and until I see a way to implement
> retail index tuple deletion, which seems like a multi-year project
> that requires the buy-in of multiple senior community members. On its
> own, my patch regresses performance unacceptably in some workloads,
> probably due to interactions with kill_prior_tuple()/LP_DEAD hint
> setting, and interactions with page space management when there are
> many "duplicates" (it can still help performance in some pgbench
> workloads with non-unique indexes, though).

I attach a revised version, which is still very much of prototype
quality, but manages to solve a few of the problems that v1 had.
Andrey Lepikhov (CC'd) asked me to post any improved version I might
have for use with his retail index tuple deletion patch, so I thought
I'd post what I have.

The main development for v2 is that the sort order of the implicit
heap TID attribute is flipped. In v1, it was in "ascending" order. In
v2, comparisons of heap TIDs are inverted to make the attribute order
"descending". This has a number of advantages:

* It's almost consistent with the current behavior when there are
repeated insertions of duplicates. Currently, this tends to result in
page splits of the leftmost leaf page among pages that mostly consist
of the same duplicated value. This means that the destabilizing impact
on DROP SCHEMA ... CASCADE regression test output noted before [1] is
totally eliminated. There is now only a single trivial change to
regression test "expected" files, whereas in v1 dozens of "expected"
files had to be changed, often resulting in less useful reports for
the user.

* The performance regression I observed with various pgbench workloads
seems to have gone away, or is now within the noise range. A patch
like this one requires a lot of validation and testing, so this should
be taken with a grain of salt.
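Concretely, the change amounts to inverting the tie-breaker comparison,
roughly like this (sketch only, not the patch's exact code):

/*
 * Sketch of the v2 change: the heap TID tie-breaker now compares in
 * inverted, "descending" order.
 */
#include "postgres.h"

#include "storage/itemptr.h"

static inline int32
heap_tid_tiebreak_desc(ItemPointer ltid, ItemPointer rtid)
{
    /* negate the ordinary TID ordering to get DESC key-space order */
    return -ItemPointerCompare(ltid, rtid);
}

Since newly inserted rows typically carry higher heap TIDs, DESC order
places them toward the start of a run of duplicates, which keeps splits
concentrated on the leftmost leaf page of the run, close to what already
happens today.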

I may have been too quick to give up on my original ambition of
writing a stand-alone patch that can be justified entirely on its own
merits, without being tied to some much more ambitious project like
retail index tuple deletion by VACUUM, or zheap's deletion marking. I
still haven't tried to replace the kludgey handling of unique index
enforcement, even though that would probably have a measurable
additional performance benefit. I think that this patch could become
an unambiguous win.

[1] https://postgr.es/m/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-MvcM_pHN_w@mail.gmail.com
-- 
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan
Date:
Attached is my v3, which has some significant improvements:

* The hinting for unique index inserters within _bt_findinsertloc()
has been restored, more or less.

* Bug fix for case where left side of split comes from tuple being
inserted. We need to pass this to _bt_suffix_truncate() as the left
side of the split, which we previously failed to do. The amcheck
coverage I've added allowed me to catch this issue during a benchmark.
(I use amcheck during benchmarks to get some amount of stress-testing
in.)

* New performance optimization that allows us to descend a downlink
when its user-visible attributes have scankey-equal values. We avoid
an unnecessary move left by using a sentinel scan tid that's less than
any possible real heap TID, but still greater than minus infinity to
_bt_compare().
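To show roughly what I mean by the sentinel (the helper names are
invented for illustration; this is not the real code):

/*
 * Sketch of the sentinel scan TID idea.
 */
#include "postgres.h"

#include "storage/itemptr.h"

/* offset number 0 never appears in a real heap TID, so it marks the sentinel */
static inline bool
scantid_is_sentinel(ItemPointer tid)
{
    return tid->ip_posid == InvalidOffsetNumber;
}

static int32
compare_heap_tid_attribute(ItemPointer scantid, ItemPointer pivot_tid)
{
    if (pivot_tid == NULL)
        return 1;               /* truncated TID attribute is minus infinity */
    if (scantid == NULL)
        return 0;               /* caller supplied no TID: leave the tie alone */
    if (scantid_is_sentinel(scantid))
        return -1;              /* sorts before every real heap TID */

    /* otherwise compare in key-space order (DESC as of v2) */
    return -ItemPointerCompare(scantid, pivot_tid);
}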

I am now considering pursuing this as a project in its own right,
which can be justified without being part of some larger effort to add
retail index tuple deletion (e.g. by VACUUM). I think that I can get
it to the point of being a totally unambiguous win, if I haven't
already. So, this patch is no longer just an interesting prototype of
a new architectural direction we should take. In any case, it has far
fewer problems than v2.

Testing the performance characteristics of this patch has proven
difficult. My home server seems to show a nice win with a pgbench
workload that uses a Gaussian distribution for the pgbench_accounts
queries (script attached). That seems consistent and reproducible. My
home server has 32GB of RAM and a 250GB Samsung 850 EVO SSD. With
shared_buffers set to 12GB, 80-minute runs at
scale 4800 look like this:

Master:

25 clients:
tps = 15134.223357 (excluding connections establishing)

50 clients:
tps = 13708.419887 (excluding connections establishing)

75 clients:
tps = 12951.286926 (excluding connections establishing)

90 clients:
tps = 12057.852088 (excluding connections establishing)

Patch:

25 clients:
tps = 17857.863353 (excluding connections establishing)

50 clients:
tps = 14319.514825 (excluding connections establishing)

75 clients:
tps = 14015.794005 (excluding connections establishing)

90 clients:
tps = 12495.683053 (excluding connections establishing)

I ran this twice, and got pretty consistent results each time (there
were many other benchmarks on my home server -- this was the only one
that tested this exact patch, though). Note that there was only one
pgbench initialization for each set of runs. It looks like a pretty
strong result for the patch - note that the accounts table is about
twice the size of available main memory. The server is pretty well
overloaded in every individual run.

Unfortunately, I have a hard time showing much of any improvement on a
storage-optimized AWS instance with EBS storage, with scaled up
pgbench scale and main memory. I'm using an i3.4xlarge, which has 16
vCPUs, 122 GiB RAM, and 2 SSDs in a software RAID0 configuration. It
appears to more or less make no overall difference there, for reasons
that I have yet to get to the bottom of. I conceived this AWS
benchmark as something that would have far longer run times with a
scaled-up database size. My expectation was that it would confirm the
preliminary result, but it hasn't.

Maybe the issue is that it's far harder to fill the I/O queue on this
AWS instance? Or perhaps it's related to the higher latency of EBS,
compared to the local SSD on my home server? I would welcome any ideas
about how to benchmark the patch. It doesn't necessarily have to be a
huge win for a very generic workload like the one I've tested, since
it would probably still be enough of a win for things like free space
management in secondary indexes [1]. Plus, of course, it seems likely
that we're going to eventually add retail index tuple deletion in some
form or another, which this is a prerequisite for.

For a project like this, I expect an unambiguous, across the board win
from the committed patch, even if it isn't a huge win. I'm encouraged
by the fact that this is starting to look credible as a
stand-alone patch, but I have to admit that there are probably still
significant gaps in my understanding of how it affects real-world
performance. I don't have a lot of recent experience with benchmarking
workloads like this one.

[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
-- 
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Andrey Lepikhov
Date:
I use the v3 version of the patch for retail index tuple deletion, and
from time to time I catch a regression test failure (see attachment).
As I see in regression.diff, the problem is an unstable order of DROP
... CASCADE deletions.
Most frequently I get the failure on the 'updatable views' test.
I check the nbtree invariants during all tests, and the index relations
stay in a consistent state the whole time.
My hypothesis is that the order of logical duplicates in the indexes on
the pg_depend relation is unstable.
But the 'updatable views' test contains no sources of instability
(concurrent insertions, updates, vacuum and so on), which puzzles me.
Maybe you have some ideas about this problem?


On 18.07.2018 00:21, Peter Geoghegan wrote:
> Attached is my v3, which has some significant improvements:
> 
> * The hinting for unique index inserters within _bt_findinsertloc()
> has been restored, more or less.
> 
> * Bug fix for case where left side of split comes from tuple being
> inserted. We need to pass this to _bt_suffix_truncate() as the left
> side of the split, which we previously failed to do. The amcheck
> coverage I've added allowed me to catch this issue during a benchmark.
> (I use amcheck during benchmarks to get some amount of stress-testing
> in.)
> 
> * New performance optimization that allows us to descend a downlink
> when its user-visible attributes have scankey-equal values. We avoid
> an unnecessary move left by using a sentinel scan tid that's less than
> any possible real heap TID, but still greater than minus infinity to
> _bt_compare().
> 
> I am now considering pursuing this as a project in its own right,
> which can be justified without being part of some larger effort to add
> retail index tuple deletion (e.g. by VACUUM). I think that I can get
> it to the point of being a totally unambiguous win, if I haven't
> already. So, this patch is no longer just an interesting prototype of
> a new architectural direction we should take. In any case, it has far
> fewer problems than v2.
> 
> Testing the performance characteristics of this patch has proven
> difficult. My home server seems to show a nice win with a pgbench
> workload that uses a Gaussian distribution for the pgbench_accounts
> queries (script attached). That seems consistent and reproducible. My
> home server has 32GB of RAM, and has a Samsung SSD 850 EVO SSD, with a
> 250GB capacity. With shared_buffers set to 12GB, 80 minute runs at
> scale 4800 look like this:
> 
> Master:
> 
> 25 clients:
> tps = 15134.223357 (excluding connections establishing)
> 
> 50 clients:
> tps = 13708.419887 (excluding connections establishing)
> 
> 75 clients:
> tps = 12951.286926 (excluding connections establishing)
> 
> 90 clients:
> tps = 12057.852088 (excluding connections establishing)
> 
> Patch:
> 
> 25 clients:
> tps = 17857.863353 (excluding connections establishing)
> 
> 50 clients:
> tps = 14319.514825 (excluding connections establishing)
> 
> 75 clients:
> tps = 14015.794005 (excluding connections establishing)
> 
> 90 clients:
> tps = 12495.683053 (excluding connections establishing)
> 
> I ran this twice, and got pretty consistent results each time (there
> were many other benchmarks on my home server -- this was the only one
> that tested this exact patch, though). Note that there was only one
> pgbench initialization for each set of runs. It looks like a pretty
> strong result for the patch - note that the accounts table is about
> twice the size of available main memory. The server is pretty well
> overloaded in every individual run.
> 
> Unfortunately, I have a hard time showing much of any improvement on a
> storage-optimized AWS instance with EBS storage, with scaled up
> pgbench scale and main memory. I'm using an i3.4xlarge, which has 16
> vCPUs, 122 GiB RAM, and 2 SSDs in a software RAID0 configuration. It
> appears to more or less make no overall difference there, for reasons
> that I have yet to get to the bottom of. I conceived this AWS
> benchmark as something that would have far longer run times with a
> scaled-up database size. My expectation was that it would confirm the
> preliminary result, but it hasn't.
> 
> Maybe the issue is that it's far harder to fill the I/O queue on this
> AWS instance? Or perhaps its related to the higher latency of EBS,
> compared to the local SSD on my home server? I would welcome any ideas
> about how to benchmark the patch. It doesn't necessarily have to be a
> huge win for a very generic workload like the one I've tested, since
> it would probably still be enough of a win for things like free space
> management in secondary indexes [1]. Plus, of course, it seems likely
> that we're going to eventually add retail index tuple deletion in some
> form or another, which this is prerequisite to.
> 
> For a project like this, I expect an unambiguous, across the board win
> from the committed patch, even if it isn't a huge win. I'm encouraged
> by the fact that this is starting to look like credible as a
> stand-alone patch, but I have to admit that there's probably still
> significant gaps in my understanding of how it affects real-world
> performance. I don't have a lot of recent experience with benchmarking
> workloads like this one.
> 
> [1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
> 

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan
Date:
On Wed, Aug 1, 2018 at 9:48 PM, Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> I use v3 version of the patch for a Retail Indextuple Deletion and from time
> to time i catch regression test error (see attachment).
> As i see in regression.diff, the problem is instability order of DROP ...
> CASCADE deletions.
> Most frequently i get error on a test called 'updatable views'.
> I check nbtree invariants during all tests, but index relations is in
> consistent state all time.
> My hypothesis is: instability order of logical duplicates in index relations
> on a pg_depend relation.
> But 'updatable views' test not contains any sources of instability:
> concurrent insertions, updates, vacuum and so on. This fact discourage me.
> May be you have any ideas on this problem?

It's caused by an implicit dependency on the order of items in an
index. See https://www.postgresql.org/message-id/20180504022601.fflymidf7eoencb2%40alvherre.pgsql.

I've been making "\set VERBOSITY terse" changes like this whenever it
happens in a new place. It seems to have finally stopped happening.
Note that this is a preexisting issue; there are already places in the
regression tests where we paper over the problem in a similar way. I
notice that it tends to happen when the machine running the regression
tests is heavily loaded.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan
Date:
Attached is v4. I have two goals in mind for this revision, goals that
are of great significance to the project as a whole:

* Making better choices around leaf page split points, in order to
maximize suffix truncation and thereby maximize fan-out. This is
important when there are mostly-distinct index tuples on each leaf
page (i.e. most of the time). Maximizing the effectiveness of suffix
truncation needs to be weighed against the existing/main
consideration: evenly distributing space among each half of a page
split. This is tricky.

* Not regressing the logic that lets us pack leaf pages full when
there are a great many logical duplicates. That is, I still want to
get the behavior I described on the '"Write amplification" is made
worse by "getting tired" while inserting into nbtree secondary
indexes' thread [1]. This is not something that happens as a
consequence of thinking about suffix truncation specifically, and
seems like a fairly distinct thing to me. It's actually a bit similar
to the rightmost 90/10 page split case.

v4 adds significant new logic to make us do better on the first goal,
without hurting the second goal. It's easy to regress one while
focussing on the other, so I've leaned on a custom test suite
throughout development. Previous versions mostly got the first goal
wrong, but got the second goal right. For the time being, I'm
focussing on index size, on the assumption that I'll be able to
demonstrate a nice improvement in throughput or latency later. I can
get the main TPC-C order_line pkey about 7% smaller after an initial
bulk load with the new logic added to get the first goal (note that
the benefits with a fresh CREATE INDEX are close to zero). The index
is significantly smaller, even though the internal page index tuples
can themselves never be any smaller due to alignment -- this is all
about not restricting what can go on each leaf page by too much. 7% is
not as dramatic as the "get tired" case, which saw something like a
50% decrease in bloat for one pathological case, but it's still
clearly well worth having. The order_line primary key is the largest
TPC-C index, and I'm merely doing a standard bulk load to get this 7%
shrinkage. The TPC-C order_line primary key happens to be kind of
adversarial or pathological to B-Tree space management in general, but
it's still fairly realistic.

For the first goal, page splits now weigh what I've called the
"distance" between tuples, with a view to getting the most
discriminating split point -- the leaf split point that maximizes the
effectiveness of suffix truncation, within a range of acceptable split
points (acceptable from the point of view of not implying a lopsided
page split). This is based on probing IndexTuple contents naively when
deciding on a split point, without regard for the underlying
opclass/types. We mostly just use char integer comparisons to probe,
on the assumption that that's a good enough proxy for using real
insertion scankey comparisons (only actual truncation goes to those
lengths, since that's a strict matter of correctness). This distance
business might be considered a bit iffy by some, so I want to get
early feedback. This new "distance" code clearly needs more work, but
I felt that I'd gone too long without posting a new version.
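In case it helps to see the general shape, here is a loose illustration
of the kind of probing I mean. The real v4 code differs in its details
and scoring; this is just to show the byte-wise idea:

/*
 * Loose illustration of "distance" probing: compare the raw data bytes
 * of two adjacent tuples and use the position of the first difference
 * as a proxy for how much suffix truncation a split between them would
 * allow.
 */
#include "postgres.h"

#include "access/itup.h"

static int
split_point_distance(IndexTuple ltup, IndexTuple rtup)
{
    char       *ldata = (char *) ltup + IndexInfoFindDataOffset(ltup->t_info);
    char       *rdata = (char *) rtup + IndexInfoFindDataOffset(rtup->t_info);
    Size        llen = IndexTupleSize(ltup) - IndexInfoFindDataOffset(ltup->t_info);
    Size        rlen = IndexTupleSize(rtup) - IndexInfoFindDataOffset(rtup->t_info);
    Size        minlen = Min(llen, rlen);
    Size        i;

    for (i = 0; i < minlen; i++)
    {
        if (ldata[i] != rdata[i])
            break;
    }

    /*
     * The earlier the tuples diverge, the more likely it is that they
     * differ in a leading attribute, so the more suffix attributes a
     * pivot between them could truncate away.  Bigger result == better
     * split point.
     */
    return (int) (minlen - i);
}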

For the second goal, I've added a new macro that can be enabled for
debugging purposes. This has the implementation sort heap TIDs in ASC
order, rather than DESC order. This nicely demonstrates how my two
goals for v4 are fairly independent; uncommenting "#define
BTREE_ASC_HEAP_TID" will cause a huge regression with cases where many
duplicates are inserted, but won't regress things like the TPC-C
indexes. (Note that BTREE_ASC_HEAP_TID will break the regression
tests, though in a benign way that can safely be ignored.)

Open items:

* Do more traditional benchmarking.

* Add pg_upgrade support.

* Simplify _bt_findsplitloc() logic.

[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
-- 
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan
Date:
Attached is v5, which significantly simplifies the _bt_findsplitloc()
logic. It's now a great deal easier to follow. It would be helpful if
someone could do code-level review of the overhauled
_bt_findsplitloc(). That's the most important part of the patch. It
involves relatively subjective trade-offs around total effort spent
during a page split, space utilization, and avoiding "false sharing"
(I call the situation where a range of duplicate values unnecessarily
straddles two leaf pages "false sharing", since it forces
subsequent index scans to visit two leaf pages rather than just one,
even when that's avoidable.)

This version has slightly improved performance, especially for cases
where an index gets bloated without any garbage being generated. With
the UK land registry data [1], an index on (county, city, locality) is
shrunk by just over 18% by the new logic (I recall that it was shrunk
by ~15% in an earlier version). In concrete terms, it goes from being
1.288 GiB on master to being 1.054 GiB with v5 of the patch. This is
mostly because the patch intelligently packs together duplicate-filled
pages tightly (in particular, it avoids "getting tired"), but also
because it makes pivots less restrictive about where leaf tuples can
go. I still manage to shrink the largest TPC-C and TPC-H indexes by at
least 5% following an initial load performed by successive INSERTs.
Those are unique indexes, so the benefits are certainly not limited to
cases involving many duplicates.

3 modes
-------

My new approach is to teach _bt_findsplitloc() 3 distinct modes of
operation: Regular/default mode, many duplicates mode, and single
value mode. The higher level split code always asks for a default mode
call to _bt_findsplitloc(), so that's always where we start. For leaf
page splits, _bt_findsplitloc() will occasionally call itself
recursively in either many duplicates mode or single value mode. This
happens when the default strategy doesn't work out.

* Default mode almost does what we do already, but remembers the top n
candidate split points, sorted by the delta between left and right
post-split free space, rather than just looking for the overall lowest
delta split point.

Then, we go through a second pass over the temp array of "acceptable"
split points, that considers the needs of suffix truncation.

* Many duplicates mode is used when we fail to find a "distinguishing"
split point in regular mode, but have determined that it's possible to
get one if a new, exhaustive search is performed.

We go to great lengths to avoid having to append a heap TID to the new
left page high key -- that's what I mean by "distinguishing". We're
particularly concerned with false sharing by subsequent point lookup
index scans here.

* Single value mode is used when we see that even many duplicates mode
would be futile, as the leaf page is already *entirely* full of
logical duplicates.

Single value mode isn't exhaustive, since there is clearly nothing to
exhaustively search for. Instead, it packs together as many tuples as
possible on the right side of the split. Since heap TIDs sort in
descending order, this is very much like a "leftmost" split that tries
to free most of the space on the left side, and pack most of the page
contents on the right side. Except that it's leftmost, and in
particular is leftmost among pages full of logical duplicates (as
opposed to being leftmost/rightmost among pages on an entire level of
the tree, as with the traditional rightmost 90:10 split thing).
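To summarize how the three modes relate, the control flow boils down to
roughly this. It is illustrative only; the names are invented, and the
real _bt_findsplitloc() is far more involved:

/*
 * Control-flow sketch of how a leaf split chooses its mode, given what
 * the default pass learned about the page.
 */
#include "postgres.h"

typedef enum
{
    SPLIT_DEFAULT,              /* always tried first */
    SPLIT_MANY_DUPLICATES,      /* exhaustive search for a distinguishing split */
    SPLIT_SINGLE_VALUE          /* page is all one value: pack the right side */
} FindSplitMode;

static FindSplitMode
choose_split_mode(bool found_distinguishing_split, bool page_has_many_values)
{
    if (found_distinguishing_split)
        return SPLIT_DEFAULT;           /* default pass found a good split */
    if (page_has_many_values)
        return SPLIT_MANY_DUPLICATES;   /* worth an exhaustive second pass */
    return SPLIT_SINGLE_VALUE;          /* nothing to search for exhaustively */
}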

Other changes
-------------

* I now explicitly use fillfactor in the manner of a rightmost split
to get the single value mode behavior.

I call these types of splits (rightmost and single value mode splits)
"weighted" splits in the patch. This is much more consistent with our
existing conventions than my previous approach.

* Improved approach to inexpensively determining how effective
suffix truncation will be for a given candidate split point.

I no longer naively probe the contents of index tuples to do char
comparisons.  Instead, I use a tuple descriptor to get offsets to each
attribute in each tuple in turn, then call datumIsEqual() to
determine if they're equal. This is almost as good as a full scan key
comparison. This actually seems to be a bit faster, and also takes
care of INCLUDE indexes without special care (no need to worry about
probing non-key attributes, and reaching a faulty conclusion about
which split point helps with suffix truncation).
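Roughly speaking, the check looks like this (simplified sketch, not the
actual patch code):

/*
 * Count how many leading key attributes two tuples share.  Only key
 * columns are examined, so INCLUDE indexes need no special treatment.
 */
#include "postgres.h"

#include "access/itup.h"
#include "utils/datum.h"
#include "utils/rel.h"

static int
count_shared_leading_attrs(Relation rel, IndexTuple ltup, IndexTuple rtup)
{
    TupleDesc   itupdesc = RelationGetDescr(rel);
    int         nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
    int         attnum;

    for (attnum = 1; attnum <= nkeyatts; attnum++)
    {
        Form_pg_attribute att = TupleDescAttr(itupdesc, attnum - 1);
        Datum       ldatum,
                    rdatum;
        bool        lnull,
                    rnull;

        ldatum = index_getattr(ltup, attnum, itupdesc, &lnull);
        rdatum = index_getattr(rtup, attnum, itupdesc, &rnull);

        if (lnull != rnull)
            break;
        if (!lnull &&
            !datumIsEqual(ldatum, rdatum, att->attbyval, att->attlen))
            break;              /* byte-wise inequality is good enough here */
    }

    /* fewer shared attributes means more scope for suffix truncation */
    return attnum - 1;
}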

I still haven't managed to add pg_upgrade support, but that's my next
step. I am more or less happy with the substance of the patch in v5,
and feel that I can now work backwards towards figuring out the best
way to deal with on-disk compatibility. It shouldn't be too hard --
most of the effort will involve coming up with a good test suite.

[1] https://wiki.postgresql.org/wiki/Sample_Databases
-- 
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Andrey Lepikhov
Date:
I use the v5 version in the quick vacuum strategy and in the heap & index
cleaner (I will post new patches to the corresponding thread a little
later). It works fine and gives quick vacuum a 2-3% performance
improvement over version v3 on my 24-core test server.
Note that the interface of the _bt_moveright() and _bt_binsrch()
functions, with their combination of scankey, scantid and nextkey
parameters, is semantically overloaded.
Every time I read the code I have to spend time remembering what these
functions do exactly.
Maybe the comments need to be rewritten. For example, the
_bt_moveright() comments could include a phrase like:
nextkey=false: traverse to the next suitable index page if the current
page does not contain the value (scan key; scan tid).

What do you think about submitting the patch to the next CF?

On 12.09.2018 23:11, Peter Geoghegan wrote:
> Attached is v4. I have two goals in mind for this revision, goals that
> are of great significance to the project as a whole:
> 
> * Making better choices around leaf page split points, in order to
> maximize suffix truncation and thereby maximize fan-out. This is
> important when there are mostly-distinct index tuples on each leaf
> page (i.e. most of the time). Maximizing the effectiveness of suffix
> truncation needs to be weighed against the existing/main
> consideration: evenly distributing space among each half of a page
> split. This is tricky.
> 
> * Not regressing the logic that lets us pack leaf pages full when
> there are a great many logical duplicates. That is, I still want to
> get the behavior I described on the '"Write amplification" is made
> worse by "getting tired" while inserting into nbtree secondary
> indexes' thread [1]. This is not something that happens as a
> consequence of thinking about suffix truncation specifically, and
> seems like a fairly distinct thing to me. It's actually a bit similar
> to the rightmost 90/10 page split case.
> 
> v4 adds significant new logic to make us do better on the first goal,
> without hurting the second goal. It's easy to regress one while
> focussing on the other, so I've leaned on a custom test suite
> throughout development. Previous versions mostly got the first goal
> wrong, but got the second goal right. For the time being, I'm
> focussing on index size, on the assumption that I'll be able to
> demonstrate a nice improvement in throughput or latency later. I can
> get the main TPC-C order_line pkey about 7% smaller after an initial
> bulk load with the new logic added to get the first goal (note that
> the benefits with a fresh CREATE INDEX are close to zero). The index
> is significantly smaller, even though the internal page index tuples
> can themselves never be any smaller due to alignment -- this is all
> about not restricting what can go on each leaf page by too much. 7% is
> not as dramatic as the "get tired" case, which saw something like a
> 50% decrease in bloat for one pathological case, but it's still
> clearly well worth having. The order_line primary key is the largest
> TPC-C index, and I'm merely doing a standard bulk load to get this 7%
> shrinkage. The TPC-C order_line primary key happens to be kind of
> adversarial or pathological to B-Tree space management in general, but
> it's still fairly realistic.
> 
> For the first goal, page splits now weigh what I've called the
> "distance" between tuples, with a view to getting the most
> discriminating split point -- the leaf split point that maximizes the
> effectiveness of suffix truncation, within a range of acceptable split
> points (acceptable from the point of view of not implying a lopsided
> page split). This is based on probing IndexTuple contents naively when
> deciding on a split point, without regard for the underlying
> opclass/types. We mostly just use char integer comparisons to probe,
> on the assumption that that's a good enough proxy for using real
> insertion scankey comparisons (only actual truncation goes to those
> lengths, since that's a strict matter of correctness). This distance
> business might be considered a bit iffy by some, so I want to get
> early feedback. This new "distance" code clearly needs more work, but
> I felt that I'd gone too long without posting a new version.
> 
> For the second goal, I've added a new macro that can be enabled for
> debugging purposes. This has the implementation sort heap TIDs in ASC
> order, rather than DESC order. This nicely demonstrates how my two
> goals for v4 are fairly independent; uncommenting "#define
> BTREE_ASC_HEAP_TID" will cause a huge regression with cases where many
> duplicates are inserted, but won't regress things like the TPC-C
> indexes. (Note that BTREE_ASC_HEAP_TID will break the regression
> tests, though in a benign way can safely be ignored.)
> 
> Open items:
> 
> * Do more traditional benchmarking.
> 
> * Add pg_upgrade support.
> 
> * Simplify _bt_findsplitloc() logic.
> 
> [1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
> 

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan
Date:
On Wed, Sep 19, 2018 at 9:56 PM, Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> Note, that the interface of _bt_moveright() and _bt_binsrch() functions with
> combination of scankey, scantid and nextkey parameters is too semantic
> loaded.
> Everytime of code reading i spend time to remember, what this functions do
> exactly.
> May be it needed to rewrite comments.

I think that it might be a good idea to create a "BTInsertionScankey"
struct, or similar, since keysz, nextkey, the scankey array and now
scantid are all part of that, and are all common to these 4 or so
functions. It could have a flexible array at the end, so that we still
only need a single palloc(). I'll look into that.
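Something along these lines is what I have in mind -- just a sketch of
a possible layout, with nothing settled about the name or the fields:

/*
 * Rough sketch of a possible struct for bundling the arguments that
 * _bt_moveright(), _bt_binsrch() and friends currently take separately.
 * This code does not exist anywhere yet.
 */
#include "postgres.h"

#include "access/skey.h"
#include "storage/itemptr.h"

typedef struct BTInsertionScankey
{
    bool            nextkey;        /* move right on equal keys? */
    bool            scantid_valid;  /* is scantid meaningful? */
    ItemPointerData scantid;        /* heap TID tie-breaker, if any */
    int             keysz;          /* number of entries in scankeys[] */
    ScanKeyData     scankeys[FLEXIBLE_ARRAY_MEMBER];
} BTInsertionScankey;

/* a single palloc() covers the struct and its scan key array */
#define SizeOfBTInsertionScankey(nkeys) \
    (offsetof(BTInsertionScankey, scankeys) + (nkeys) * sizeof(ScanKeyData))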

> What do you think about submitting the patch to the next CF?

Clearly the project that you're working on is a difficult one. It's
easy for me to understand why you might want to take an iterative
approach, with lots of prototyping. Your patch needs attention to
advance, and IMV the CF is the best way to get that attention. So, I
think that it would be fine to go submit it now.

I must admit that I didn't even notice that your patch lacked a CF
entry. Everyone has a different process, perhaps.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan
Date:
On Wed, Sep 19, 2018 at 11:23 AM Peter Geoghegan <pg@bowt.ie> wrote:
> 3 modes
> -------
>
> My new approach is to teach _bt_findsplitloc() 3 distinct modes of
> operation: Regular/default mode, many duplicates mode, and single
> value mode.

I think that I'll have to add a fourth mode, since I came up with
another strategy that is really effective though totally complementary
to the other 3 -- "multiple insertion point" mode. Credit goes to
Kevin Grittner for pointing out that this technique exists about 2
years ago [1]. The general idea is to pick a split point just after
the insertion point of the new item (the incoming tuple that prompted
a page split) when it looks like there are localized monotonically
increasing ranges.  This is like a rightmost 90:10 page split, except
the insertion point is not at the rightmost page on the level -- it's
rightmost within some local grouping of values.

This makes the two largest TPC-C indexes *much* smaller. Previously,
they were shrunk by a little over 5% by using the new generic
strategy, a win that now seems like small potatoes. With this new
mode, TPC-C's order_line primary key, which is the largest index of
all, is ~45% smaller following a standard initial bulk load at
scalefactor 50. It shrinks from 99,085 blocks (774.10 MiB) to 55,020
blocks (429.84 MiB). It's actually slightly smaller than it would be
after a fresh REINDEX with the new strategy. We see almost as big a
win with the second largest TPC-C index, the stock table's primary key
-- it's ~40% smaller.

Here is the definition of the biggest index, the order line primary key index:

pg@tpcc[3666]=# \d order_line_pkey
     Index "public.order_line_pkey"
  Column   │  Type   │ Key? │ Definition
───────────┼─────────┼──────┼────────────
 ol_w_id   │ integer │ yes  │ ol_w_id
 ol_d_id   │ integer │ yes  │ ol_d_id
 ol_o_id   │ integer │ yes  │ ol_o_id
 ol_number │ integer │ yes  │ ol_number
primary key, btree, for table "public.order_line"

The new strategy/mode works very well because we see monotonically
increasing inserts on ol_number (an order's item number), but those
are grouped by order. It's kind of an adversarial case for our
existing implementation, and yet it seems like it's probably a fairly
common scenario in the real world.

Obviously these are very significant improvements. They really exceed
my initial expectations for the patch. TPC-C is generally considered
to be by far the most influential database benchmark of all time, and
this is something that we need to pay more attention to. My sense is
that the TPC-C benchmark is deliberately designed to almost require
that the system under test have this "multiple insertion point" B-Tree
optimization, suffix truncation, etc. This is exactly the same index
that we've seen reports of out of control bloat on when people run
TPC-C over hours or days [2].

My next task is to find heuristics to make the new page split
mode/strategy kick in when it's likely to help, but not kick in when
it isn't (when we want something close to a generic 50:50 page split).
These heuristics should look similar to what I've already done to get
cases with lots of duplicates to behave sensibly. Anyone have any
ideas on how to do this? I might end up inferring a "multiple
insertion point" case from the fact that there are multiple
pass-by-value attributes for the index, with the new/incoming tuple
having distinct-to-immediate-left-tuple attribute values for the last
column, but not the first few. It also occurs to me to consider the
fragmentation of the page as a guide, though I'm less sure about that.
I'll probably need to experiment with a variety of datasets before I
settle on something that looks good. Forcing the new strategy without
considering any of this actually works surprisingly well on cases
where you'd think it wouldn't, since a 50:50 page split is already
something of a guess about where future insertions will end up.
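
As a toy illustration of that first heuristic (purely hypothetical -- a
real version would look at IndexTuple attributes, pass-by-value-ness, and
probably page fragmentation too):

#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical check: does the incoming tuple match the tuple to its
 * immediate left on every attribute except the last one?  If so, this
 * looks like a localized monotonically increasing insertion pattern.
 */
static bool
looks_like_local_monotonic_insert(const int *newtup, const int *lefttup,
                                  int natts)
{
    for (int att = 0; att < natts - 1; att++)
    {
        if (newtup[att] != lefttup[att])
            return false;
    }
    return newtup[natts - 1] != lefttup[natts - 1];
}

int
main(void)
{
    /* (ol_w_id, ol_d_id, ol_o_id, ol_number) style tuples */
    int     newtup[] = {1, 5, 3001, 7};
    int     lefttup[] = {1, 5, 3001, 6};

    printf("%s\n",
           looks_like_local_monotonic_insert(newtup, lefttup, 4) ? "yes" : "no");
    return 0;
}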

[1] https://postgr.es/m/CACjxUsN5fV0kV=YirXwA0S7LqoOJuy7soPtipDhUCemhgwoVFg@mail.gmail.com
[2] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/
--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Eisentraut
Date:
On 19/09/2018 20:23, Peter Geoghegan wrote:
> Attached is v5,

So.  I don't know much about the btree code, so don't believe anything I
say.

I was very interested in the bloat test case that you posted on
2018-07-09 and I tried to understand it more.  The current method for
inserting a duplicate value into a btree is to go to the leftmost point
for that value and then move right until we find some space or we get
"tired" of searching, in which case we just make some space right there.
The problem is that it's tricky to decide when to stop searching, and
there are scenarios when we stop too soon and repeatedly miss all the
good free space to the right, leading to bloat even though the index is
perhaps quite empty.
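
To spell out what I mean, a toy standalone simulation of that behaviour
might look like this (simplified, and the give-up probability is only my
approximation of what the real code does):

#include <stdio.h>
#include <stdlib.h>

#define NPAGES  100
#define ITEMSZ  16

static int freespace[NPAGES];   /* free bytes on each leaf page of the run */

static int
choose_insert_page(int leftmost)
{
    int page = leftmost;

    while (page < NPAGES - 1 && freespace[page] < ITEMSZ)
    {
        /* roughly a 1-in-100 chance per step of giving up ("getting tired") */
        if (rand() % 100 == 0)
            break;              /* split here, even if space exists further right */
        page++;                 /* keep moving right through the duplicates */
    }
    return page;
}

int
main(void)
{
    /* free space only survives near the right end of the duplicate run */
    for (int i = 0; i < NPAGES; i++)
        freespace[i] = (i < 90) ? 0 : 8192;

    printf("insert lands on page %d\n", choose_insert_page(0));
    return 0;
}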

I tried playing with the getting-tired factor (it could plausibly be a
reloption), but that wasn't very successful.  You can use that to
postpone the bloat, but you won't stop it, and performance becomes terrible.

You propose to address this by appending the tid to the index key, so
each key, even if its "payload" is a duplicate value, is unique and has
a unique place, so we never have to do this "tiresome" search.  This
makes a lot of sense, and the results in the bloat test you posted are
impressive and reproducible.

I tried a silly alternative approach by placing a new duplicate key in a
random location.  This should be equivalent since tids are effectively
random.  I didn't quite get this to fully work yet, but at least it
doesn't blow up, and it gets the same regression test ordering
differences for pg_depend scans that you are trying to paper over. ;-)

As far as the code is concerned, I agree with Andrey Lepikhov that one
more abstraction layer that somehow combines the scankey and the tid or
some combination like that would be useful, instead of passing the tid
as a separate argument everywhere.
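
Something along these lines is what I have in mind (a sketch only -- the
struct and field names are invented here, not taken from the patch):

#include "postgres.h"
#include "access/skey.h"
#include "storage/itemptr.h"

/*
 * Sketch of an insertion scan key that carries the heap TID along with
 * the per-attribute scan keys, instead of passing the TID around as a
 * separate argument.  Purely illustrative.
 */
typedef struct InsertionScanKeyData
{
    int             keysz;                      /* number of user attributes */
    ItemPointerData scantid;                    /* heap TID tie-breaker */
    bool            has_scantid;                /* is scantid valid? */
    ScanKeyData     scankeys[INDEX_MAX_KEYS];   /* per-attribute comparison info */
} InsertionScanKeyData;

typedef InsertionScanKeyData *InsertionScanKey;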

I think it might help this patch move along if it were split up a bit,
for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
 That way, it would also be easier to test out each piece separately.
For example, how much space does suffix truncation save in what
scenario, are there any performance regressions, etc.  In the last few
versions, the patches have still been growing significantly in size and
functionality, and most of the supposed benefits are not readily visible
in tests.

And of course we need to think about how to handle upgrades, but you
have already started a separate discussion about that.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> So.  I don't know much about the btree code, so don't believe anything I
> say.

I think that showing up and reviewing this patch makes you somewhat of
an expert, by default. There just isn't enough expertise in this area.

> I was very interested in the bloat test case that you posted on
> 2018-07-09 and I tried to understand it more.

Up until recently, I thought that I would justify the patch primarily
as a project to make B-Trees less bloated when there are many
duplicates, with maybe as many as a dozen or more secondary benefits.
That's what I thought it would say in the release notes, even though
the patch was always a broader strategic thing. Now I think that the
TPC-C multiple insert point bloat issue might be the primary headline
benefit, though.

I hate to add more complexity to get it to work well, but just look at
how much smaller the indexes are following an initial bulk load (bulk
insertions) using my working copy of the patch:

Master

customer_pkey: 75 MB
district_pkey: 40 kB
idx_customer_name: 107 MB
item_pkey: 2216 kB
new_order_pkey: 22 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 60 MB
oorder_pkey: 78 MB
order_line_pkey: 774 MB
stock_pkey: 181 MB
warehouse_pkey: 24 kB

Patch

customer_pkey: 50 MB
district_pkey: 40 kB
idx_customer_name: 105 MB
item_pkey: 2216 kB
new_order_pkey: 12 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 61 MB
oorder_pkey: 42 MB
order_line_pkey: 429 MB
stock_pkey: 111 MB
warehouse_pkey: 24 kB

All of the indexes used by oltpbench to do TPC-C are listed, so you're
seeing the full picture for TPC-C bulk loading here (actually, there
is another index that has an identical definition to
oorder_o_w_id_o_d_id_o_c_id_o_id_key for some reason, which is omitted
as redundant). As you can see, all the largest indexes are
*significantly* smaller, with the exception of
oorder_o_w_id_o_d_id_o_c_id_o_id_key. You won't be able to see this
improvement until I post the next version, though, since this is a
brand new development. Note that VACUUM hasn't been run at all, and
doesn't need to be run, as there are no dead tuples. Note also that
this has *nothing* to do with getting tired -- almost all of these
indexes are unique indexes.

Note that I'm also testing TPC-E and TPC-H in a very similar way,
which have both been improved noticeably, but to a degree that's much
less compelling than what we see with TPC-C. They have "getting tired"
cases that benefit quite a bit, but those are the minority.

Have you ever used HammerDB? I got this data from oltpbench, but I
think that HammerDB might be the way to go with TPC-C testing
Postgres.

> You propose to address this by appending the tid to the index key, so
> each key, even if its "payload" is a duplicate value, is unique and has
> a unique place, so we never have to do this "tiresome" search.  This
> makes a lot of sense, and the results in the bloat test you posted are
> impressive and reproducible.

Thanks.

> I tried a silly alternative approach by placing a new duplicate key in a
> random location.  This should be equivalent since tids are effectively
> random.

You're never going to get any other approach to work remotely as well,
because while the TIDs may seem to be random in some sense, they have
various properties that are very useful from a high level, data life
cycle point of view. For insertions of duplicates, heap TID has
temporal locality --  you are only going to dirty one or two leaf
pages, rather than potentially dirtying dozens or hundreds.
Furthermore, heap TID is generally strongly correlated with primary
key values, so VACUUM can be much much more effective at killing
duplicates in low cardinality secondary indexes when there are DELETEs
with a range predicate on the primary key. This is a lot more
realistic than the 2018-07-09 test case, but it still could make as
big of a difference.

>  I didn't quite get this to fully work yet, but at least it
> doesn't blow up, and it gets the same regression test ordering
> differences for pg_depend scans that you are trying to paper over. ;-)

FWIW, I actually just added to the papering over, rather than creating
a new problem. There are plenty of instances of "\set VERBOSITY terse"
in the regression tests already, for the same reason. If you run the
regression tests with ignore_system_indexes=on, there are very similar
failures [1].

> As far as the code is concerned, I agree with Andrey Lepikhov that one
> more abstraction layer that somehow combines the scankey and the tid or
> some combination like that would be useful, instead of passing the tid
> as a separate argument everywhere.

I've already drafted this in my working copy. It is a clear
improvement. You can expect it in the next version.

> I think it might help this patch move along if it were split up a bit,
> for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
> That way, it would also be easier to test out each piece separately.
> For example, how much space does suffix truncation save in what
> scenario, are there any performance regressions, etc.

I'll do my best. I don't think I can sensibly split out suffix
truncation from the TID stuff -- those seem truly inseparable, since
my mental model for suffix truncation breaks without fully unique
keys. I can break out all the cleverness around choosing a split point
into its own patch, though -- _bt_findsplitloc() has only been changed
to give weight to several factors that become important. It's the
"brain" of the optimization, where 90% of the complexity actually
lives.

Removing the _bt_findsplitloc() changes will make the performance of
the other stuff pretty poor, and in particular will totally remove the
benefit for cases that "become tired" on the master branch. That could
be slightly interesting, I suppose.

> In the last few
> versions, the patches have still been growing significantly in size and
> functionality, and most of the supposed benefits are not readily visible
> in tests.

I admit that this patch has continued to evolve up until this week,
despite the fact that I thought it would be a lot more settled by now.
It has actually become simpler in recent months, though. And, I think
that the results justify the iterative approach I've taken. This stuff
is inherently very subtle, and I've had to spend a lot of time paying
attention to tiny regressions across a fairly wide variety of test
cases.

> And of course we need to think about how to handle upgrades, but you
> have already started a separate discussion about that.

Right.

[1] https://postgr.es/m/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-MvcM_pHN_w@mail.gmail.com
--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Andrey Lepikhov
Date:
28.09.2018 23:08, Peter Geoghegan wrote:
> On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> I think it might help this patch move along if it were split up a bit,
>> for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
>> That way, it would also be easier to test out each piece separately.
>> For example, how much space does suffix truncation save in what
>> scenario, are there any performance regressions, etc.
> 
> I'll do my best. I don't think I can sensibly split out suffix
> truncation from the TID stuff -- those seem truly inseparable, since
> my mental model for suffix truncation breaks without fully unique
> keys. I can break out all the cleverness around choosing a split point
> into its own patch, though -- _bt_findsplitloc() has only been changed
> to give weight to several factors that become important. It's the
> "brain" of the optimization, where 90% of the complexity actually
> lives.
> 
> Removing the _bt_findsplitloc() changes will make the performance of
> the other stuff pretty poor, and in particular will totally remove the
> benefit for cases that "become tired" on the master branch. That could
> be slightly interesting, I suppose.

I am reviewing this patch too, and I join Peter Eisentraut's opinion
about splitting the patch into a hierarchy of two or three patches:
"functional" - the TID stuff - and "optimizational" - suffix truncation
& splitting. My reasons are simplification of code review,
investigation and benchmarking.
Right now the benchmarking picture is not clear: possible performance
degradation from TID ordering interferes with the positive effects of
the optimizations in a non-trivial way.

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Fri, Sep 28, 2018 at 10:58 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> I am reviewing this patch too, and I join Peter Eisentraut's opinion
> about splitting the patch into a hierarchy of two or three patches:
> "functional" - the TID stuff - and "optimizational" - suffix truncation
> & splitting. My reasons are simplification of code review,
> investigation and benchmarking.

As I mentioned to Peter, I don't think that I can split out the heap
TID stuff from the suffix truncation stuff. At least not without
making the patch even more complicated, for no benefit. I will split
out the "brain" of the patch (the _bt_findsplitloc() stuff, which
decides on a split point using sophisticated rules) from the "brawn"
(the actually changes to how index scans work, including the heap TID
stuff, as well as the code for actually physically performing suffix
truncation). The brain of the patch is where most of the complexity
is, as well as most of the code. The brawn of the patch is _totally
unusable_ without intelligence around split points, but I'll split
things up along those lines anyway. Doing so should make the whole
design a little easier to follow.

> Right now the benchmarking picture is not clear: possible performance
> degradation from TID ordering interferes with the positive effects of
> the optimizations in a non-trivial way.

Is there any evidence of a regression in the last 2 versions? I've
been using pgbench, which didn't show any. That's not a sympathetic
case for the patch, though it would be nice to confirm if there was
some small improvement there. I've seen contradictory results (slight
improvements and slight regressions), but that was with a much earlier
version, so it just isn't relevant now. pgbench is mostly interesting
as a thing that we want to avoid regressing.

Once I post the next version, it would be great if somebody could use
HammerDB's OLTP test, which seems like the best fair use
implementation of TPC-C that's available. I would like to make that
the "this is why you should care, even if you happen to not believe in
the patch's strategic importance" benchmark. TPC-C is clearly the most
influential database benchmark ever, so I think that that's a fair
request. (See the TPC-C commentary at
https://www.hammerdb.com/docs/ch03s02.html, for example.)

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Sun, Sep 30, 2018 at 2:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Right now the benchmarking picture is not clear: possible performance
> > degradation from TID ordering interferes with the positive effects of
> > the optimizations in a non-trivial way.
>
> Is there any evidence of a regression in the last 2 versions?

I did find a pretty clear regression, though only with writes to
unique indexes. Attached is v6, which fixes the issue. More on that
below.

v6 also:

* Adds a new-to-v6 "insert at new item's insertion point"
optimization, which is broken out into its own commit.

This *greatly* improves the index bloat situation with the TPC-C
benchmark in particular, even before the benchmark starts (just with
the initial bulk load). See the relevant commit message for full
details, or a couple of my previous mails on this thread. I will
provide my own TPC-C test data + test case to any reviewer that wants
to see this for themselves. It shouldn't be hard to verify the
improvement in raw index size with any TPC-C implementation, though.
Please make an off-list request if you're interested. The raw dump is
1.8GB.

The exact details of when this new optimization kicks in and how it
works are tentative. They should really be debated. Reviewers should
try to think of edge cases in which my "heap TID adjacency" approach
could make the optimization kick in when it shouldn't -- cases where
it causes bloat rather than preventing it. I couldn't find any such
regressions, but this code was written very recently.

I should also look into using HammerDB to do a real TPC-C benchmark,
and really put the patch to the test...anybody have experience with
it?

* Generally groups everything into a relatively manageable series of
cumulative improvements, starting with the infrastructure required to
physically truncate tuples correctly, without any of the smarts around
selecting a split point.

The base patch is useless on its own, since it's just necessary to
have the split point selection smarts to see a consistent benefit.
Reviewers shouldn't waste their time doing any real benchmarking with
just the first patch applied.

* Adds a lot of new information to the nbtree README, about the
high-level thought process behind the design, including citing the
classic paper that this patch was primarily inspired by.

* Adds a new, dedicated insertion scan key struct --
BTScanInsert[Data]. This is passed around to a number of different
routines (_bt_search(), _bt_binsrch(), _bt_compare(), etc). This was
suggested by Andrey, and also requested by Peter Eisentraut.

While this BTScanInsert work started out as straightforward
refactoring, it actually led to my discovering and fixing the
regression I mentioned. Previously, I passed a lower bound on a binary
search to _bt_binsrch() within _bt_findinsertloc(). This wasn't nearly
as effective as what the master branch does for unique indexes at the
same point -- it usually manages to reuse a result from an earlier
_bt_binsrch() as the offset for the new tuple, since it has no need to
worry about the new tuple's position *among duplicates* on the page.
In earlier versions of my patch, most of the work of a second binary
search took place, despite being redundant and unnecessary. This
happened for every new insertion into a non-unique index -- I could
easily measure the problem with a simple serial test case. I can see
no regression there against master now, though.

My fix for the regression involves including some mutable state in the
new BTScanInsert struct (within v6-0001-*patch), to explicitly
remember and restore some internal details across two binary searches
against the same leaf page. We now remember a useful lower *and* upper
bound within _bt_binsrch(), which is what is truly required to fix the
regression. While there is still a second call to _bt_binsrch() within
_bt_findinsertloc() for unique indexes, it will do no comparisons in
the common case where there are no existing dead duplicate tuples in
the unique index. This means that the number of _bt_compare() calls we
get in this _bt_findinsertloc() unique index path is the same as the
master branch in almost all cases (I instrumented the regression tests
to make sure of this). I also think that having BTScanInsert will ease
things around pg_upgrade support, something that remains an open item.
Changes in this area seem to make everything clearer -- the signature
of _bt_findinsertloc() seemed a bit jumbled to me.
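
Stripped of all nbtree details, the shape of the fix is just a binary
search that can pick up from previously saved bounds (standalone sketch;
in the patch the saved bounds live in the insertion scan key and are only
trusted while the same leaf page stays locked):

#include <stdio.h>

/*
 * Standalone sketch of reusing binary-search bounds across two searches
 * of the same (unchanged) page.  Here the cached bounds are plain in/out
 * parameters; in the patch they are mutable state in the scan key.
 */
static int
binsrch_cached(const int *items, int key, int *low, int *high)
{
    int lo = *low;
    int hi = *high;             /* one past the last candidate offset */

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (items[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    *low = lo;
    *high = hi;
    return lo;                  /* first offset with item >= key */
}

int
main(void)
{
    int items[] = {10, 20, 20, 20, 30, 40};
    int low = 0;
    int high = 6;

    /* the first search establishes tight bounds ... */
    (void) binsrch_cached(items, 20, &low, &high);

    /* ... so the second search over the same page does no comparisons */
    printf("insert offset: %d\n", binsrch_cached(items, 20, &low, &high));
    return 0;
}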

Aside: I think that this BTScanInsert mutable state idea could be
pushed even further in the future. "Dynamic prefix truncation" could
be implemented by taking a similar approach when descending composite
indexes for an index scan (doesn't have to be a unique index). We can
observe that earlier attributes must all be equal to our own scankey's
values once we descend the tree and pass between a pair of pivot
tuples where a common prefix (some number of leading attributes) is
fully equal. It's safe to just not bother comparing these prefix
attributes on lower levels, because we can reason about their values
transitively; _bt_compare() can be told to always skip the first
attribute or two during later/lower-in-the-tree binary searches. This
idea will not be implemented for Postgres v12 by me, though.
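
As a toy illustration of the idea (nothing like this is in the current
patch series), the comparison loop would simply start at the first
attribute that might still differ:

#include <stdio.h>

/*
 * Toy sketch of dynamic prefix truncation: attributes before 'skipatts'
 * are already known to be equal from higher levels of the descent, so
 * the comparison starts beyond them.  Illustrative only.
 */
static int
compare_skipping_prefix(const int *scankey, const int *tuple,
                        int natts, int skipatts)
{
    for (int att = skipatts; att < natts; att++)
    {
        if (scankey[att] < tuple[att])
            return -1;
        if (scankey[att] > tuple[att])
            return 1;
    }
    return 0;
}

int
main(void)
{
    int     scankey[] = {7, 7, 42};
    int     tuple[] = {7, 7, 41};

    /* the first two attributes are known equal; only the third is compared */
    printf("%d\n", compare_skipping_prefix(scankey, tuple, 3, 2));
    return 0;
}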

--
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Wed, Oct 3, 2018 at 4:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I did find a pretty clear regression, though only with writes to
> unique indexes. Attached is v6, which fixes the issue. More on that
> below.

I've been benchmarking my patch using oltpbench's TPC-C benchmark
these past few weeks, which has been very frustrating -- the picture
is very mixed. I'm testing a patch that has evolved from v6, but isn't
too different.

In one way, the patch does exactly what it's supposed to do when these
benchmarks are run: it leaves indexes *significantly* smaller than the
master branch will on the same (rate-limited) workload, without
affecting the size of tables in any noticeable way. The numbers that I
got from my much earlier synthetic single client benchmark mostly hold
up. For example, the stock table's primary key is about 35% smaller,
and the order line index is only about 20% smaller relative to master,
which isn't quite as good as in the synthetic case, but I'll take it
(this is all because of the
v6-0003-Add-split-at-new-tuple-page-split-optimization.patch stuff).
However, despite significant effort, and despite the fact that the
index shrinking is reliable, I cannot yet consistently show an
improvement in either transaction throughput or transaction latency.

I can show a nice improvement in latency on a slightly-rate-limited
TPC-C workload when backend_flush_after=0 (something like a 40%
reduction on average), but that doesn't hold up when oltpbench isn't
rate-limited and/or has backend_flush_after set. Usually, there is a
1% - 2% regression, despite the big improvements in index size, and
despite the big reduction in the amount of buffers that backends must
write out themselves.

The obvious explanation is that throughput is decreased due to our
doing extra work (truncation) while under an exclusive buffer lock.
However, I've worked hard on that, and, as I said, I can sometimes
observe a nice improvement in latency. This makes me doubt the obvious
explanation. My working theory is that this has something to do with
shared_buffers eviction. Maybe we're making worse decisions about
which buffer to evict, or maybe the scalability of eviction is hurt.
Perhaps both.

You can download results from a recent benchmark to get some sense of
this. It includes latency and throughput graphs, plus detailed
statistics collector stats:

https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing

I would welcome any theories as to what could be the problem here. I
think that this is fixable, since the picture for the patch is very
positive, provided you only focus on bgwriter/checkpoint activity and
on-disk sizes. It seems likely that there is a very specific gap in my
understanding of how the patch affects buffer cleaning.

--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Andres Freund
Date:
Hi,

On 2018-10-18 12:54:27 -0700, Peter Geoghegan wrote:
> I can show a nice improvement in latency on a slightly-rate-limited
> TPC-C workload when backend_flush_after=0 (something like a 40%
> reduction on average), but that doesn't hold up when oltpbench isn't
> rate-limited and/or has backend_flush_after set. Usually, there is a
> 1% - 2% regression, despite the big improvements in index size, and
> despite the big reduction in the amount of buffers that backends must
> write out themselves.

What kind of backend_flush_after values were you trying?
backend_flush_after=0 obviously is the default, so I'm not clear on
that.  How large is the database here, and how high is shared_buffers?


> The obvious explanation is that throughput is decreased due to our
> doing extra work (truncation) while under an exclusive buffer lock.
> However, I've worked hard on that, and, as I said, I can sometimes
> observe a nice improvement in latency. This makes me doubt the obvious
> explanation. My working theory is that this has something to do with
> shared_buffers eviction. Maybe we're making worse decisions about
> which buffer to evict, or maybe the scalability of eviction is hurt.
> Perhaps both.

Is it possible that there's new / prolonged cases where a buffer is read
from disk after the patch? Because that might require doing *write* IO
when evicting the previous contents of the victim buffer, and obviously
that can take longer if you're running with backend_flush_after > 0.

I wonder if it'd make sense to hack up a patch that logs when evicting a
buffer while already holding another lwlock. That shouldn't be too hard.


> You can download results from a recent benchmark to get some sense of
> this. It includes latency and throughput graphs, plus detailed
> statistics collector stats:
> 
> https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing

I'm unclear which runs are what here. I assume "public" is your
patchset, and master is master? Do you reset the stats in between runs?

Greetings,

Andres Freund


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
Shared_buffers is 10gb iirc. The server has 32gb of memory. Yes, 'public' is the patch case. Sorry for not mentioning it initially. 

--
Peter Geoghegan
(Sent from my phone)

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Thu, Oct 18, 2018 at 1:44 PM Andres Freund <andres@anarazel.de> wrote:
> What kind of backend_flush_after values were you trying?
> backend_flush_after=0 obviously is the default, so I'm not clear on
> that.  How large is the database here, and how high is shared_buffers?

I *was* trying backend_flush_after=512kB, but it's
backend_flush_after=0 in the benchmark I posted. See the
"postgres*settings" files.

On the master branch, things looked like this after the last run:

pg@tpcc_oltpbench[15547]=# \dt+
                      List of relations
 Schema │    Name    │ Type  │ Owner │   Size   │ Description
────────┼────────────┼───────┼───────┼──────────┼─────────────
 public │ customer   │ table │ pg    │ 4757 MB  │
 public │ district   │ table │ pg    │ 5240 kB  │
 public │ history    │ table │ pg    │ 1442 MB  │
 public │ item       │ table │ pg    │ 10192 kB │
 public │ new_order  │ table │ pg    │ 140 MB   │
 public │ oorder     │ table │ pg    │ 1185 MB  │
 public │ order_line │ table │ pg    │ 19 GB    │
 public │ stock      │ table │ pg    │ 9008 MB  │
 public │ warehouse  │ table │ pg    │ 4216 kB  │
(9 rows)

pg@tpcc_oltpbench[15547]=# \di+
                                         List of relations
 Schema │                 Name                 │ Type  │ Owner │   Table    │  Size   │ Description
────────┼──────────────────────────────────────┼───────┼───────┼────────────┼─────────┼─────────────
 public │ customer_pkey                        │ index │ pg    │ customer   │ 367 MB  │
 public │ district_pkey                        │ index │ pg    │ district   │ 600 kB  │
 public │ idx_customer_name                    │ index │ pg    │ customer   │ 564 MB  │
 public │ idx_order                            │ index │ pg    │ oorder     │ 715 MB  │
 public │ item_pkey                            │ index │ pg    │ item       │ 2208 kB │
 public │ new_order_pkey                       │ index │ pg    │ new_order  │ 188 MB  │
 public │ oorder_o_w_id_o_d_id_o_c_id_o_id_key │ index │ pg    │ oorder     │ 715 MB  │
 public │ oorder_pkey                          │ index │ pg    │ oorder     │ 958 MB  │
 public │ order_line_pkey                      │ index │ pg    │ order_line │ 9624 MB │
 public │ stock_pkey                           │ index │ pg    │ stock      │ 904 MB  │
 public │ warehouse_pkey                       │ index │ pg    │ warehouse  │ 56 kB   │
(11 rows)

> Is it possible that there's new / prolonged cases where a buffer is read
> from disk after the patch? Because that might require doing *write* IO
> when evicting the previous contents of the victim buffer, and obviously
> that can take longer if you're running with backend_flush_after > 0.

Yes, I suppose that that's possible, because the buffer
popularity/usage_count will be affected in ways that cannot easily be
predicted. However, I'm not running with "backend_flush_after > 0"
here -- that was before.

> I wonder if it'd make sense to hack up a patch that logs when evicting a
> buffer while already holding another lwlock. That shouldn't be too hard.

I'll look into this.

Thanks
-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Thu, Oct 18, 2018 at 1:44 PM Andres Freund <andres@anarazel.de> wrote:
> I wonder if it'd make sense to hack up a patch that logs when evicting a
> buffer while already holding another lwlock. That shouldn't be too hard.

I tried this. It looks like we're calling FlushBuffer() with more than
a single LWLock held (not just the single buffer lock) somewhat *less*
with the patch. This is a positive sign for the patch, but also means
that I'm no closer to figuring out what's going on.

I tested a case with a 1GB shared_buffers + a TPC-C database sized at
about 10GB. I didn't want the extra LOG instrumentation to influence
the outcome.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Andrey Lepikhov
Date:

On 19.10.2018 0:54, Peter Geoghegan wrote:
> I would welcome any theories as to what could be the problem here. I
> think that this is fixable, since the picture for the patch is very
> positive, provided you only focus on bgwriter/checkpoint activity and
> on-disk sizes. It seems likely that there is a very specific gap in my
> understanding of how the patch affects buffer cleaning.

I have the same problem with the background heap & index cleaner (based on
your patch). In this case the bottleneck is the WAL record which I need to
write for each cleaned block, and the locks which are held while the
WAL record is written.
Maybe you could do a test without writing any data to disk?

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Tue, Oct 23, 2018 at 11:35 AM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> I have the same problem with the background heap & index cleaner (based on
> your patch). In this case the bottleneck is the WAL record which I need to
> write for each cleaned block, and the locks which are held while the
> WAL record is written.

Part of the problem here is that v6 uses up to 25 candidate split
points, even during regular calls to _bt_findsplitloc(). That was
based on some synthetic test cases. I've found that I can get most of
the benefit in index size with far fewer split points, though. The
extra work done with an exclusive buffer lock held will be
considerably reduced in v7. I'll probably post that in a couple of
weeks, since I'm in Europe for pgConf.EU. I don't fully understand the
problems here, but even still I know that what you were testing wasn't
very well optimized for write-heavy workloads. It would be especially
bad with pgbench, since there isn't much opportunity to reduce the
size of indexes there.

> Maybe you could do a test without writing any data to disk?

Yeah, I should test that on its own. I'm particularly interested in
TPC-C, because it's a particularly good target for my patch. I can
find a way of only executing the read TPC-C queries, to see where they
are on their own. TPC-C is particularly write-heavy, especially
compared to the much more recent though less influential TPC-E
benchmark.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Andrey Lepikhov
Date:
I have done a code review.
For now, it covers the first patch - v6-0001... - dedicated to logical
ordering of duplicates.

The documentation is full and clear. All non-trivial logic is commented
accurately.

The patch applies cleanly on top of current master. Regression tests pass,
and my "Retail Indextuple deletion" use cases work without mistakes.
But I have two comments on the code.
The new BTScanInsert structure reduces the parameter lists of many functions
and looks fine. But it contains an optimization part (the 'restorebinsrch'
field et al.) that is used very locally in the code - the
_bt_findinsertloc()->_bt_binsrch() calls. Maybe you could localize
this logic into a separate struct, passed to _bt_binsrch() as a
pointer; other routines could pass a NULL value. That might simplify
usability of the struct.

Due to the optimization, the _bt_binsrch() code size has roughly doubled.
Maybe you could move this to some service routine?


-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Fri, Nov 2, 2018 at 3:06 AM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> The documentation is full and clear. All non-trivial logic is commented
> accurately.

Glad you think so.

I had the opportunity to discuss this patch at length with Heikki
during pgConf.EU. I don't want to speak on his behalf, but I will say
that he seemed to understand all aspects of the patch series, and
seemed generally well disposed towards the high level design. The
high-level design is the most important aspect -- B-Trees can be
optimized in many ways, all at once, and we must be sure to come up
with something that enables most or all of them. I really care about
the long term perspective.

That conversation with Heikki eventually turned into a conversation
about reimplementing GIN using the nbtree code, which is actually
related to my patch series (sorting on heap TID is the first step to
optional run length encoding for duplicates). Heikki seemed to think
that we can throw out a lot of the optimizations within GIN, and add a
few new ones to nbtree, while still coming out ahead. This made the
general nbtree-as-GIN idea (which we've been talking about casually
for years) seem a lot more realistic to me. Anyway, he requested that
I support this long term goal by getting rid of the DESC TID sort
order thing -- that breaks GIN-style TID compression. It also
increases the WAL volume unnecessarily when a page is split that
contains all duplicates.

The DESC heap TID sort order thing probably needs to go. I'll probably
have to go fix the regression test failures that occur when ASC heap
TID order is used. (Technically those failures are a pre-existing
problem, a problem that I mask by using DESC order...which is weird.
The problem is masked in the master branch by accidental behaviors
around nbtree duplicates, which is something that deserves to die.
DESC order is closer to the accidental current behavior.)

> The patch applies cleanly on top of current master. Regression tests pass,
> and my "Retail Indextuple deletion" use cases work without mistakes.

Cool.

> The new BTScanInsert structure reduces the parameter lists of many functions
> and looks fine. But it contains an optimization part (the 'restorebinsrch'
> field et al.) that is used very locally in the code - the
> _bt_findinsertloc()->_bt_binsrch() calls. Maybe you could localize
> this logic into a separate struct, passed to _bt_binsrch() as a
> pointer; other routines could pass a NULL value. That might simplify
> usability of the struct.

Hmm. I see your point. I did it that way because the knowledge of
having cached an upper and lower bound for a binary search of a leaf
page needs to last for a relatively long time. I'll look into it
again, though.

> Due to the optimization, the _bt_binsrch() code size has roughly doubled.
> Maybe you could move this to some service routine?

Maybe. There are some tricky details that seem to work against it.
I'll see if it's possible to polish that some more, though.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Andrey Lepikhov
Date:

On 03.11.2018 5:00, Peter Geoghegan wrote:
> The DESC heap TID sort order thing probably needs to go. I'll probably
> have to go fix the regression test failures that occur when ASC heap
> TID order is used. (Technically those failures are a pre-existing
> problem, a problem that I mask by using DESC order...which is weird.
> The problem is masked in the master branch by accidental behaviors
> around nbtree duplicates, which is something that deserves to die.
> DESC order is closer to the accidental current behavior.)

I applied your patches on top of master. After test corrections
(related to TID ordering in index relations for the DROP...CASCADE operation)
'make check-world' passed successfully many times.
In the case of the 'create view' regression test - the 'drop cascades to 62
other objects' problem - I verified an Álvaro Herrera hypothesis [1] and
it is true. You can verify it by tracking the
object_address_present_add_flags() routine return value.
Some doubts remain, however, regarding the 'triggers' test.
Could you specify which test failures you mean?

[1] 
https://www.postgresql.org/message-id/20180504022601.fflymidf7eoencb2%40alvherre.pgsql

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Fri, Nov 2, 2018 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I had the opportunity to discuss this patch at length with Heikki
> during pgConf.EU.

> The DESC heap TID sort order thing probably needs to go. I'll probably
> have to go fix the regression test failures that occur when ASC heap
> TID order is used.

I've found that TPC-C testing with ASC heap TID order fixes the
regression that I've been concerned about these past few weeks. Making
this change leaves the patch a little bit faster than the master
branch for TPC-C, while still leaving TPC-C indexes about as small as
they were with v6 of the patch (i.e. much smaller). I now get about a
1% improvement in transaction throughput, an improvement that seems
fairly consistent. It seems likely that the next revision of the patch
series will be an unambiguous across the board win for performance. I
think that I come out ahead with ASC heap TID order because that has
the effect of reducing the volume of WAL generated by page splits.
Page splits are already optimized for splitting right, not left.

I should thank Heikki for pointing me in the right direction here.

--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Sat, Nov 3, 2018 at 8:52 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> I applied your patches on top of master. After test corrections
> (related to TID ordering in index relations for the DROP...CASCADE operation)
> 'make check-world' passed successfully many times.
> In the case of the 'create view' regression test - the 'drop cascades to 62
> other objects' problem - I verified an Álvaro Herrera hypothesis [1] and
> it is true. You can verify it by tracking the
> object_address_present_add_flags() routine return value.

I'll have to go and fix the problem directly, so that ASC sort order
can be used.

> Some doubts remain, however, regarding the 'triggers' test.
> Could you specify which test failures you mean?

Not sure what you mean. The order of items that are listed in the
DETAIL for a cascading DROP can have an "implementation defined"
order. I think that this is an example of the more general problem --
what you call the 'drop cascades to 62 other objects' problem is a
more specific subproblem, or, if you prefer, a more specific symptom
of the same problem.

Since I'm going to have to fix the problem head-on, I'll have to study
it in detail anyway.

--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Andrey Lepikhov
Date:

On 04.11.2018 9:31, Peter Geoghegan wrote:
> On Sat, Nov 3, 2018 at 8:52 PM Andrey Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> I applied your patches on top of master. After test corrections
>> (related to TID ordering in index relations for the DROP...CASCADE operation)
>> 'make check-world' passed successfully many times.
>> In the case of the 'create view' regression test - the 'drop cascades to 62
>> other objects' problem - I verified an Álvaro Herrera hypothesis [1] and
>> it is true. You can verify it by tracking the
>> object_address_present_add_flags() routine return value.
> 
> I'll have to go and fix the problem directly, so that ASC sort order
> can be used.
> 
>> Some doubts remain, however, regarding the 'triggers' test.
>> Could you specify which test failures you mean?
> 
> Not sure what you mean. The order of items that are listed in the
> DETAIL for a cascading DROP can have an "implementation defined"
> order. I think that this is an example of the more general problem --
> what you call the 'drop cascades to 62 other objects' problem is a
> more specific subproblem, or, if you prefer, a more specific symptom
> of the same problem.

I mean that your code does not have any problems that I can detect by
regression tests or by the retail index tuple deletion patch.
The difference in the number of dropped objects is not a problem. It is caused
by pos 2293 - 'else if (thisobj->objectSubId == 0)' - in the file
catalog/dependency.c, and it is legal behavior: the column row object is
deleted without any report because we already decided to drop its whole table.

Also, I checked the triggers test. The difference in the ERROR message
'cannot drop trigger trg1' is caused by a different order of tuples in the
relation with the dependDependerIndexId relid. It is legal behavior and
we can simply replace the test results.

Maybe you know of other problems with the patch?

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Sun, Nov 4, 2018 at 8:21 AM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> I mean that your code does not have any problems that I can detect by
> regression tests or by the retail index tuple deletion patch.
> The difference in the number of dropped objects is not a problem. It is caused
> by pos 2293 - 'else if (thisobj->objectSubId == 0)' - in the file
> catalog/dependency.c, and it is legal behavior: the column row object is
> deleted without any report because we already decided to drop its whole table.

The behavior implied by using ASC heap TID order is always "legal",
but it may cause a regression in certain functionality -- something
that an ordinary user might complain about. There were some changes
when DESC heap TID order is used too, of course, but those were safe
to ignore (it seemed like nobody could ever care). It might have been
okay to just use DESC order, but since it now seems like I must use
ASC heap TID order for performance reasons, I have to tackle a couple
of these issues head-on (e.g.  'cannot drop trigger trg1').

> Also, I checked the triggers test. The difference in the ERROR message
> 'cannot drop trigger trg1' is caused by a different order of tuples in the
> relation with the dependDependerIndexId relid. It is legal behavior and
> we can simply replace the test results.

Let's look at this specific "trg1" case:

"""
 create table trigpart (a int, b int) partition by range (a);
 create table trigpart1 partition of trigpart for values from (0) to (1000);
 create trigger trg1 after insert on trigpart for each row execute
procedure trigger_nothing();
 ...
 drop trigger trg1 on trigpart1; -- fail
-ERROR:  cannot drop trigger trg1 on table trigpart1 because trigger
trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart1 because table
trigpart1 requires it
+HINT:  You can drop table trigpart1 instead.
"""

The original hint suggests "you need to drop the object on the
partition parent instead of its child", which is useful. The new hint
suggests "instead of dropping the trigger on the partition child,
maybe drop the child itself!". That's almost an insult to the user.

Now, I suppose that I could claim that it's not my responsibility to
fix this, since we get the useful behavior only due to accidental
implementation details. I'm not going to take that position, though. I
think that I am obliged to follow both the letter and the spirit of
the law. I'm almost certain that this regression test was written
because somebody specifically cared about getting the original, useful
message. The underlying assumptions may have been a bit shaky, but we
all know how common it is for software to evolve to depend on
implementation-defined details. We've all written code that does it,
but hopefully it didn't hurt us much because we also wrote regression
tests that exercised the useful behavior.

> Maybe you know of other problems with the patch?

Just the lack of pg_upgrade support. That is progressing nicely,
though. I'll probably have that part in the next revision of the
patch. I've found what looks like a workable approach, though I need
to work on a testing strategy for pg_upgrade.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Sun, Nov 4, 2018 at 10:58 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Just the lack of pg_upgrade support.

Attached is v7 of the patch series. Changes:

* Pre-pg_upgrade indexes (indexes of an earlier BTREE_VERSION) are now
supported. Using pg_upgrade will be seamless to users. "Getting tired"
returns, for the benefit of old indexes that regularly have lots of
duplicates inserted.

Notably, the new/proposed version of btree (BTREE_VERSION 4) cannot be
upgraded on-the-fly -- we're changing more than the contents of the
metapage, so that won't work. Version 2 -> version 3 upgrades can
still take place dynamically/on-the-fly. If you want to upgrade to
version 4, you'll need to REINDEX. The performance of the patch with
pg_upgrade'd indexes has been validated; there don't seem to be any
regressions.

amcheck checks both the old invariants, and the new/stricter/L&Y
invariants. Which set is checked depends on the btree version of the
index undergoing verification.

* ASC heap TID order is now used -- not DESC order, as before. This
fixed all performance regressions that I'm aware of, and seems quite a
lot more elegant overall.

I believe that the patch series is now an unambiguous, across the
board win for performance. I could see about a 1% increase in
transaction throughput with my own TPC-C tests, while the big drop in
the size of indexes was preserved. pgbench testing also showed as much
as a 3.5% increase in transaction throughput in some cases with
non-uniform distributions. Thanks for the suggestion, Heikki!

Unfortunately, and as predicted, this change created a new problem
that I need to fix directly: it makes certain diagnostic messages that
accidentally depend on a certain pg_depend scan order say something
different, and less useful (though still technically correct). I'll
tackle that problem over on the dedicated thread I started [1]. (For
now, I include a separate patch to paper over questionable regression
test changes in a controlled way:
v7-0005-Temporarily-paper-over-problematic-regress-output.patch.)

* New optimization that has index scans avoid visiting the next page
by checking the high key -- this is broken out into its own commit
(v7-0002-Weigh-suffix-truncation-when-choosing-a-split-poi.patch).

This is related to an optimization that has been around for years --
we're now using the high key, rather than using a normal (non-pivot)
index tuple. High keys are much more likely to indicate that the scan
doesn't need to visit the next page with the earlier patches in the
patch series applied, since the new logic for choosing a split point
favors a high key with earlier differences. It's pretty easy to take
advantage of that. With a composite index, or a secondary index, it's
particularly likely that we can avoid visiting the next leaf page. In
other words, now that we're being smarter about future locality of
access during page splits, we should take full advantage during index
scans.

The v7-0001-Make-nbtree-indexes-have-unique-keys-in-tuples.patch
commit uses a _bt_lowest_scantid() sentinel value to avoid
unnecessarily visiting a page to the left of the page we actually
ought to go to directly during a descent of a B-Tree -- that
optimization was around in all earlier versions of the patch series.
It seems natural to also have this new-to-v7 optimization. It avoids
unnecessarily going right once we reach the leaf level, so it "does
the same thing on the right side" -- the two optimizations mirror each
other. If you don't get what I mean by that, then imagine a secondary
index where each value appears a few hundred times. Literally every
simple lookup query will either benefit from the first optimization on
the way down the tree, or from the second optimization towards the end
of the scan. (The page split logic ought to pack large groups of
duplicates together, ideally confining them to one leaf page.)
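
Sketched in isolation, the new check amounts to something like this
(illustrative only -- in nbtree the comparison is of course made against
the page's real high key tuple using the scan's key comparison machinery):

#include <stdbool.h>
#include <stdio.h>

/*
 * Illustrative sketch: items on pages to the right of the current leaf
 * page are all >= this page's high key, so if the high key is already
 * past the scan's upper bound there is no point in stepping right.
 */
static bool
must_visit_next_page(int scan_upper_bound, int highkey)
{
    return highkey <= scan_upper_bound;
}

int
main(void)
{
    /* scan wants values <= 100; this leaf page's high key is 150 */
    printf("step right? %s\n", must_visit_next_page(100, 150) ? "yes" : "no");
    return 0;
}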

Andrey: the BTScanInsert struct still has the restorebinsrch stuff
(mutable binary search optimization state) in v7. It seemed to make
sense to keep it there, because I think that we'll be able to add
similar optimizations in the future, that use similar mutable state.
See my remarks on "dynamic prefix truncation" [2]. I think that that
could be very helpful with skip scans, for example, so we'll probably
end up adding it before too long. I hope you don't feel too strongly
about it.

[1] https://postgr.es/m/CAH2-Wzkypv1R+teZrr71U23J578NnTBt2X8+Y=Odr4pOdW1rXg@mail.gmail.com
[2] https://postgr.es/m/CAH2-WzkpKeZJrXvR_p7VSY1b-s85E3gHyTbZQzR0BkJ5LrWF_A@mail.gmail.com
--
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
Attached is v8 of the patch series, which has some relatively minor changes:

* A new commit adds an artificial tie-breaker column to pg_depend
indexes, comprehensively solving the issues with regression test
instability. This is the only really notable change.

* Clean-up of how the design is described in the nbtree README, and
elsewhere. I want to make it clear that we're now more or less using
the Lehman and Yao design. I re-read the Lehman and Yao paper to make
sure that the patch acknowledges what Lehman and Yao say to expect, at
least in cases that seemed to matter.

* Stricter verification by contrib/amcheck. Not likely to catch a case
that wouldn't have been caught by previous revisions, but should make
the design a bit clearer to somebody following L&Y.

* Tweaks to how _bt_findsplitloc() accumulates candidate split points.
We're less aggressive in choosing a smaller tuple during an internal
page split in this revision.

The overall impact of the pg_depend change is that required regression
test output changes are *far* less numerous than they were in v7.
There are now only trivial differences in the output order of items.
And, there are very few diagnostic message changes overall -- we see
exactly 5 changes now, rather than dozens. Importantly, there is no
longer any question about whether I could make diagnostic messages
less useful to users, because the existing behavior for
findDependentObjects() is retained. This is an independent
improvement, since it fixes an independent problem with test
flappiness that we've been papering-over for some time [2] -- I make
the required order actually-deterministic, removing heap TID ordering
as a factor that can cause seemingly-random regression test failures
on slow/overloaded buildfarm animals.

Robert Haas remarked that he thought that the pg_depend index
tie-breaker commit's approach is acceptable [1] -- see the other
thread that Robert weighed in on for all the gory details. The patch's
draft commit message may also be interesting. Note that adding a new
column turns out to have *zero* storage overhead, because we only ever
end up filling up space that was already getting lost to alignment.

The pg_depend thing is clearly a kludge. It's ugly, though in no small
part because it acknowledges the existing reality of how
findDependentObjects() already depends on scan order. I'm optimistic
that I'll be able to push this groundwork commit before too long; it
doesn't hinge on whether or not the nbtree patches are any good.

[1] https://postgr.es/m/CA+TgmoYNeFxdPimiXGL=tCiCXN8zWosUFxUfyDBaTd2VAg-D9w@mail.gmail.com
[2] https://postgr.es/m/11852.1501610262%40sss.pgh.pa.us
--
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Dmitry Dolgov
Date:
> On Sun, Nov 25, 2018 at 12:14 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> Attached is v8 of the patch series, which has some relatively minor changes:

Thank you for working on this patch,

Just for the information, cfbot says there are problems on windows:

src/backend/catalog/pg_depend.c(33): error C2065: 'INT32_MAX' :
undeclared identifier


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Sat, Dec 1, 2018 at 4:10 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Just for the information, cfbot says there are problems on windows:
>
> src/backend/catalog/pg_depend.c(33): error C2065: 'INT32_MAX' :
> undeclared identifier

Thanks. Looks like I should have used PG_INT32_MAX.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Sat, Dec 1, 2018 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Thanks. Looks like I should have used PG_INT32_MAX.

Attached is v9, which does things that way. There are no interesting
changes, though I have set things up so that a later patch in the
series can add "dynamic prefix truncation" -- I do not include any
such patch in v9, though. I'm going to start a new thread on that
topic, and include the patch there, since it's largely unrelated to
this work, and in any case still isn't in scope for Postgres 12 (the
patch is still experimental, for reasons that are of general
interest). If nothing else, Andrey and Peter E. will probably get a
better idea of why I thought that an insertion scan key was a good
place to put mutable state if they go read that other thread -- there
really was a bigger picture to setting things up that way.

--
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Mon, Dec 3, 2018 at 7:10 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v9, which does things that way. There are no interesting
> changes, though I have set things up so that a later patch in the
> series can add "dynamic prefix truncation" -- I do not include any
> such patch in v9, though. I'm going to start a new thread on that
> topic, and include the patch there, since it's largely unrelated to
> this work, and in any case still isn't in scope for Postgres 12 (the
> patch is still experimental, for reasons that are of general
> interest).

The dynamic prefix truncation thread that I started:

https://postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 04/12/2018 05:10, Peter Geoghegan wrote:
> Attached is v9, ...

I spent some time reviewing this. I skipped the first patch, to add a 
column to pg_depend, and I got through patches 2, 3 and 4. Impressive 
results, and the code looks sane.

I wrote a laundry list of little comments on minor things, suggested 
rewordings of comments etc. I hope they're useful, but feel free to 
ignore/override my opinions of any of those, as you see best.

But first, a few slightly bigger (medium-sized?) issues that caught my eye:

1. How about doing the BTScanInsertData refactoring as a separate 
commit, first? It seems like a good thing for readability on its own, 
and would slim the big main patch. (And make sure to credit Andrey for 
that idea in the commit message.)


2. In the "Treat heap TID as part of the nbtree key space" patch:

>   *        Build an insertion scan key that contains comparison data from itup
>   *        as well as comparator routines appropriate to the key datatypes.
>   *
> + *        When itup is a non-pivot tuple, the returned insertion scan key is
> + *        suitable for finding a place for it to go on the leaf level.  When
> + *        itup is a pivot tuple, the returned insertion scankey is suitable
> + *        for locating the leaf page with the pivot as its high key (there
> + *        must have been one like it at some point if the pivot tuple
> + *        actually came from the tree).
> + *
> + *        Note that we may occasionally have to share lock the metapage, in
> + *        order to determine whether or not the keys in the index are expected
> + *        to be unique (i.e. whether or not heap TID is treated as a tie-breaker
> + *        attribute).  Callers that cannot tolerate this can request that we
> + *        assume that this is a heapkeyspace index.
> + *
>   *        The result is intended for use with _bt_compare().
>   */
> -ScanKey
> -_bt_mkscankey(Relation rel, IndexTuple itup)
> +BTScanInsert
> +_bt_mkscankey(Relation rel, IndexTuple itup, bool assumeheapkeyspace)

This 'assumeheapkeyspace' flag feels awkward. What if the caller knows 
that it is a v3 index? There's no way to tell _bt_mkscankey() that. 
(There's no need for that, currently, but seems a bit weird.)

_bt_split() calls _bt_truncate(), which calls _bt_leave_natts(), which 
calls _bt_mkscankey(). It's holding a lock on the page being split. Do 
we risk deadlock by locking the metapage at the same time?

I don't have any great ideas on what to do about this, but it's awkward 
as it is. Can we get away without the new argument? Could we somehow 
arrange things so that rd_amcache would be guaranteed to already be set?


3. In the "Pick nbtree split points discerningly" patch

I find the different modes and the logic in _bt_findsplitloc() very hard 
to understand. I've spent a while looking at it now, and I think I have 
a vague understanding of what things it takes into consideration, but I 
don't understand why it performs those multiple stages, what each stage 
does, and how that leads to an overall strategy. I think a rewrite would 
be in order, to make that more understandable. I'm not sure what exactly 
it should look like, though.

If _bt_findsplitloc() has to fall back to the MANY_DUPLICATES or 
SINGLE_VALUE modes, it has to redo a lot of the work that was done in 
the DEFAULT mode already. That's probably not a big deal in practice, 
performance-wise, but I feel that it's another hint that some 
refactoring would be in order.

One idea on how to restructure that:

Make a single pass over all the offset numbers, considering a split at 
that location. Like the current code does. For each offset, calculate a 
"penalty" based on two factors:

* free space on each side
* the number of attributes in the pivot tuple, and whether it needs to 
store the heap TID

Define the penalty function so that having to add a heap TID to the 
pivot tuple is considered very expensive, more expensive than anything 
else, and truncating away other attributes gives a reward of some size.

However, naively computing the penalty upfront for every offset would be 
a bit wasteful. Instead, start from the middle of the page, and walk 
"outwards" towards both ends, until you find a "good enough" penalty.

Or something like that...
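
To illustrate, the shape I have in mind is roughly this (all names and
constants below are invented -- it's only a sketch, not something I've
tested against the patch):

#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* One entry per candidate split point */
typedef struct SplitCandidate
{
    int         freedelta;      /* imbalance of free space between halves */
    int         pivotnatts;     /* key attributes kept in the new pivot */
    bool        needheaptid;    /* would a heap TID have to be appended? */
} SplitCandidate;

/* Lower is better.  Appending a heap TID dominates everything else. */
static int
split_penalty(const SplitCandidate *c)
{
    if (c->needheaptid)
        return INT_MAX;
    return c->pivotnatts * 1000 + abs(c->freedelta);
}

/* Walk outwards from the middle until a "good enough" penalty is found */
static int
choose_split(const SplitCandidate *cand, int ncand, int goodenough)
{
    int         mid = ncand / 2;
    int         best = mid;

    for (int dist = 0; dist < ncand; dist++)
    {
        int         left = mid - dist;
        int         right = mid + dist;

        if (left >= 0 && split_penalty(&cand[left]) < split_penalty(&cand[best]))
            best = left;
        if (right < ncand && split_penalty(&cand[right]) < split_penalty(&cand[best]))
            best = right;
        if (split_penalty(&cand[best]) <= goodenough)
            break;
    }
    return best;
}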


Now, the laundry list of smaller items:

----- laundry list begins -----

1st commit's commit message:

> Make nbtree treat all index tuples as having a heap TID trailing key
> attribute.  Heap TID becomes a first class part of the key space on all
> levels of the tree.  Index searches can distinguish duplicates by heap
> TID, at least in principle.

What do you mean by "at least in principle"?

> Secondary index insertions will descend
> straight to the leaf page that they'll insert on to (unless there is a
> concurrent page split).

What is a "Secondary" index insertion?

> Naively adding a new attribute to every pivot tuple has unacceptable
> overhead (it bloats internal pages), so suffix truncation of pivot
> tuples is added.  This will generally truncate away the "extra" heap TID
> attribute from pivot tuples during a leaf page split, and may also
> truncate away additional user attributes.  This can increase fan-out,
> especially when there are several attributes in an index.

Suggestion: "when there are several attributes in an index" -> "in a 
multi-column index"

> +/*
> + * Convenience macro to get number of key attributes in tuple in low-context
> + * fashion
> + */
> +#define BTreeTupleGetNKeyAtts(itup, rel)   \
> +    Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
> +

What is "low-context fashion"?

> + * scankeys is an array of scan key entries for attributes that are compared
> + * before scantid (user-visible attributes).  Every attribute should have an
> + * entry during insertion, though not necessarily when a regular index scan
> + * uses an insertion scankey to find an initial leaf page.

Suggestion: Reword to something like "During insertion, there must be a 
scan key for every attribute, but when starting a regular index scan, 
some can be omitted."

> +typedef struct BTScanInsertData
> +{
> +    /*
> +     * Mutable state used by _bt_binsrch() to inexpensively repeat a binary
> +     * search on the leaf level when only scantid has changed.  Only used for
> +     * insertions where _bt_check_unique() is called.
> +     */
> +    bool        savebinsrch;
> +    bool        restorebinsrch;
> +    OffsetNumber low;
> +    OffsetNumber high;
> +
> +    /* State used to locate a position at the leaf level */
> +    bool        heapkeyspace;
> +    bool        nextkey;
> +    ItemPointer scantid;        /* tiebreaker for scankeys */
> +    int            keysz;            /* Size of scankeys */
> +    ScanKeyData scankeys[INDEX_MAX_KEYS];    /* Must appear last */
> +} BTScanInsertData;

It would feel more natural to me, to have the mutable state *after* the 
other fields. Also, it'd feel less error-prone to have 'scantid' be 
ItemPointerData, rather than a pointer to somewhere else. The 
'heapkeyspace' name isn't very descriptive. I understand that it means 
that the heap TID is part of the keyspace. Not sure what to suggest 
instead, though.

> +The requirement that all btree keys be unique is satisfied by treating heap
> +TID as a tiebreaker attribute.  Logical duplicates are sorted in heap item
> +pointer order.

Suggestion: "item pointer" -> TID, to use consistent terms.

> We don't use btree keys to disambiguate downlinks from the
> +internal pages during a page split, though: only one entry in the parent
> +level will be pointing at the page we just split, so the link fields can be
> +used to re-find downlinks in the parent via a linear search.  (This is
> +actually a legacy of when heap TID was not treated as part of the keyspace,
> +but it does no harm to keep things that way.)

I don't understand this paragraph.

> +Lehman and Yao talk about pairs of "separator" keys and downlinks in
> +internal pages rather than tuples or records.  We use the term "pivot"
> +tuple to distinguish tuples which don't point to heap tuples, that are
> +used only for tree navigation.  Pivot tuples include all tuples on
> +non-leaf pages and high keys on leaf pages.

Suggestion: reword to "All tuples on non-leaf pages, and high keys on 
leaf pages, are pivot tuples"

> Note that pivot tuples are
> +only used to represent which part of the key space belongs on each page,
> +and can have attribute values copied from non-pivot tuples that were
> +deleted and killed by VACUUM some time ago.  A pivot tuple may contain a
> +"separator" key and downlink, just a separator key (in practice the
> +downlink will be garbage), or just a downlink.

Rather than store garbage, set it to zeros?

> +Lehman and Yao require that the key range for a subtree S is described by
> +Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page.
> +A search where the scan key is equal to a pivot tuple in an upper tree
> +level must descend to the left of that pivot to ensure it finds any equal
> +keys.  Pivot tuples are always a _strict_ lower bound on items on their
> +downlink page; the equal item(s) being searched for must therefore be to
> +the left of that downlink page on the next level down.  (It's possible to
> +arrange for internal page tuples to be strict lower bounds in all cases
> +because their values come from leaf tuples, which are guaranteed unique by
> +the use of heap TID as a tiebreaker.  We also make use of hard-coded
> +negative infinity values in internal pages.  Rightmost pages don't have a
> +high key, though they conceptually have a positive infinity high key).  A
> +handy property of this design is that there is never any need to
> +distinguish between equality in the case where all attributes/keys are used
> +in a scan from equality where only some prefix is used.

"distringuish between ... from ..." doesn't sound like correct grammar. 
Suggestion: "distinguish between ... and ...", or just "distinguish ... 
from ...". Or rephrase the sentence some other way.

> +We truncate away suffix key attributes that are not needed for a page high
> +key during a leaf page split when the remaining attributes distinguish the
> +last index tuple on the post-split left page as belonging on the left page,
> +and the first index tuple on the post-split right page as belonging on the
> +right page.

That's a very long sentence.

>              * Since the truncated tuple is probably smaller than the
>              * original, it cannot just be copied in place (besides, we want
>              * to actually save space on the leaf page).  We delete the
>              * original high key, and add our own truncated high key at the
>              * same offset.  It's okay if the truncated tuple is slightly
>              * larger due to containing a heap TID value, since pivot tuples
>              * are treated as a special case by _bt_check_third_page().

By "treated as a special case", I assume that _bt_check_third_page() 
always reserves some space for that? Maybe clarify that somehow.

_bt_truncate():
> This is possible when there are
>  * attributes that follow an attribute in firstright that is not equal to the
>  * corresponding attribute in lastleft (equal according to insertion scan key
>  * semantics).

I can't comprehend that sentence. Simpler English, maybe add an example, 
please.

> /*
>  * _bt_leave_natts - how many key attributes to leave when truncating.
>  *
>  * Caller provides two tuples that enclose a split point.  CREATE INDEX
>  * callers must pass build = true so that we may avoid metapage access.  (This
>  * is okay because CREATE INDEX always creates an index on the latest btree
>  * version.)
>  *
>  * This can return a number of attributes that is one greater than the
>  * number of key attributes for the index relation.  This indicates that the
>  * caller must use a heap TID as a unique-ifier in new pivot tuple.
>  */
> static int
> _bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
>                 bool build)

IMHO "keep" would sound better here than "leave".

> +    if (needheaptidspace)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
> +                 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
> +                        itemsz, BTREE_VERSION, BTMaxItemSize(page),
> +                        RelationGetRelationName(rel)),
> +                 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
> +                           ItemPointerGetBlockNumber(&newtup->t_tid),
> +                           ItemPointerGetOffsetNumber(&newtup->t_tid),
> +                           RelationGetRelationName(heap)),
> +                 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
> +                         "Consider a function index of an MD5 hash of the value, "
> +                         "or use full text indexing."),
> +                 errtableconstraint(heap,
> +                                    RelationGetRelationName(rel))));
> +    else
> +        ereport(ERROR,
> +                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
> +                 errmsg("index row size %zu exceeds btree version 3 maximum %zu for index \"%s\"",
> +                        itemsz, BTMaxItemSizeNoHeapTid(page),
> +                        RelationGetRelationName(rel)),
> +                 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
> +                           ItemPointerGetBlockNumber(&newtup->t_tid),
> +                           ItemPointerGetOffsetNumber(&newtup->t_tid),
> +                           RelationGetRelationName(heap)),
> +                 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
> +                         "Consider a function index of an MD5 hash of the value, "
> +                         "or use full text indexing."),
> +                 errtableconstraint(heap,
> +                                    RelationGetRelationName(rel))));

Could restructure this to avoid having two almost identical strings to 
translate.
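
For example, something along these lines might work (an untested sketch,
reusing only the names that already appear in the hunk above):

    ereport(ERROR,
            (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
             errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
                    itemsz,
                    needheaptidspace ? BTREE_VERSION : BTREE_META_VERSION,
                    needheaptidspace ? BTMaxItemSize(page) : BTMaxItemSizeNoHeapTid(page),
                    RelationGetRelationName(rel)),
             errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
                       ItemPointerGetBlockNumber(&newtup->t_tid),
                       ItemPointerGetOffsetNumber(&newtup->t_tid),
                       RelationGetRelationName(heap)),
             errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
                     "Consider a function index of an MD5 hash of the value, "
                     "or use full text indexing."),
             errtableconstraint(heap,
                                RelationGetRelationName(rel))));
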

>  #define BTREE_METAPAGE    0        /* first page is meta */
>  #define BTREE_MAGIC        0x053162    /* magic number of btree pages */
> -#define BTREE_VERSION    3        /* current version number */
> +#define BTREE_VERSION    4        /* current version number */
>  #define BTREE_MIN_VERSION    2    /* minimal supported version number */
> +#define BTREE_META_VERSION    3    /* minimal version with all meta fields */

BTREE_META_VERSION is a strange name for version 3. I think this 
deserves a more verbose comment, above these #defines, to list all the 
versions and their differences.
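
As a sketch of what I mean (the per-version notes would of course need
double-checking):

/*
 * Btree version history:
 *
 *  2   oldest on-disk format that we can still read (BTREE_MIN_VERSION)
 *  3   adds the newer metapage fields (BTREE_META_VERSION); upgrading
 *      from 2 only requires rewriting the metapage, so it can happen
 *      on the fly
 *  4   heap TID becomes a tiebreaker key attribute, and pivot tuples
 *      are suffix-truncated (BTREE_VERSION, this patch)
 */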

v9-0003-Pick-nbtree-split-points-discerningly.patch commit message:
> Add infrastructure to determine where the earliest difference appears
> among a pair of tuples enclosing a candidate split point.

I don't understand this sentence.

> _bt_findsplitloc() is also taught to care about the case where there are
> many duplicates, making it hard to find a distinguishing split point.
> _bt_findsplitloc() may even conclude that it isn't possible to avoid
> filling a page entirely with duplicates, in which case it packs pages
> full of duplicates very tightly.

Hmm. Is the assumption here that if a page is full of duplicates, there 
will be no more insertions into that page? Why?

> The number of cycles added is not very noticeable, which is important,
> since _bt_findsplitloc() is run while an exclusive (leaf page) buffer
> lock is held.  We avoid using authoritative insertion scankey
> comparisons, unlike suffix truncation proper.

What do you do instead, then? memcmp? (Reading the patch: yes.) 
Suggestion: "We use a faster binary comparison, instead of proper 
datatype-aware comparison, for speed."

Aside from performance, it would feel inappropriate to call user-defined 
code while holding a buffer lock, anyway.
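
For reference, my understanding is that the fast path in the patch has
roughly this shape -- only a sketch with an invented name, not the actual
patch code:

#include "postgres.h"

#include "access/itup.h"
#include "utils/datum.h"
#include "utils/rel.h"

/*
 * Count the leading key attributes that are binary (datum image) equal
 * between lastleft and firstright, plus one.  No opclass comparator --
 * i.e. no user-defined code -- is ever invoked.
 */
static int
keep_natts_binary(Relation rel, IndexTuple lastleft, IndexTuple firstright)
{
    TupleDesc   itupdesc = RelationGetDescr(rel);
    int         nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
    int         keepnatts = 1;

    for (int attnum = 1; attnum <= nkeyatts; attnum++)
    {
        Datum       datum1,
                    datum2;
        bool        isNull1,
                    isNull2;
        Form_pg_attribute att = TupleDescAttr(itupdesc, attnum - 1);

        datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
        datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);

        if (isNull1 != isNull2)
            break;
        if (!isNull1 &&
            !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
            break;
        keepnatts++;
    }

    return keepnatts;
}
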

> +There is sophisticated criteria for choosing a leaf page split point.  The
> +general idea is to make suffix truncation effective without unduly
> +influencing the balance of space for each half of the page split.  The
> +choice of leaf split point can be thought of as a choice among points
> +*between* items on the page to be split, at least if you pretend that the
> +incoming tuple was placed on the page already, without provoking a split.

I'd leave out the ", without provoking a split" part. Or maybe reword to 
"if you pretend that the incoming tuple fit and was placed on the page 
already".

> +Choosing the split point between two index tuples with differences that
> +appear as early as possible results in truncating away as many suffix
> +attributes as possible.

It took me a while to understand what the "appear as early as possible" 
means here. It's talking about a multi-column index, and about finding a 
difference in one of the leading key columns. Not, for example, about 
finding a split point early in the page.

> An array of acceptable candidate split points
> +(points that balance free space on either side of the split sufficiently
> +well) is assembled in a pass over the page to be split, sorted by delta.
> +An optimal split point is chosen during a pass over the assembled array.
> +There are often several split points that allow the maximum number of
> +attributes to be truncated away -- we choose whichever one has the lowest
> +free space delta.

Perhaps we should leave out these details in the README, and explain 
this in the comments of the picksplit-function itself? In the README, I 
think a more high-level description of what things are taken into 
account when picking the split point, would be enough.

> +Suffix truncation is primarily valuable because it makes pivot tuples
> +smaller, which delays splits of internal pages, but that isn't the only
> +reason why it's effective.

Suggestion: reword to "... , but that isn't the only benefit" ?

> There are cases where suffix truncation can
> +leave a B-Tree significantly smaller in size than it would have otherwise
> +been without actually making any pivot tuple smaller due to restrictions
> +relating to alignment.

Suggestion: reword to "... smaller in size than it would otherwise be, 
without ..."

and "without making any pivot tuple *physically* smaller, due to alignment".

This sentence is a bit of a cliffhanger: what are those cases, and how 
is that possible?

> The criteria for choosing a leaf page split point
> +for suffix truncation is also predictive of future space utilization.

How so? What does this mean?

> +Furthermore, even truncation that doesn't make pivot tuples smaller still
> +prevents pivot tuples from being more restrictive than truly necessary in
> +how they describe which values belong on which pages.

Ok, I guess these sentences resolve the cliffhanger I complained about. 
But this still feels like magic. When you split a page, all of the 
keyspace must belong on the left or the right page. Why does it make a 
difference to space utilization, where exactly you split the key space?

> +While it's not possible to correctly perform suffix truncation during
> +internal page splits, it's still useful to be discriminating when splitting
> +an internal page.  The split point that implies a downlink be inserted in
> +the parent that's the smallest one available within an acceptable range of
> +the fillfactor-wise optimal split point is chosen.  This idea also comes
> +from the Prefix B-Tree paper.  This process has much in common with to what
> +happens at the leaf level to make suffix truncation effective.  The overall
> +effect is that suffix truncation tends to produce smaller and less
> +discriminating pivot tuples, especially early in the lifetime of the index,
> +while biasing internal page splits makes the earlier, less discriminating
> +pivot tuples end up in the root page, delaying root page splits.

Ok, so this explains it further, I guess. I find this paragraph 
difficult to understand, though. The important thing here is the idea 
that some split points are more "discriminating" than others, but I 
think it needs some further explanation. What makes a split point more 
discriminating? Maybe add an example.

> +Suffix truncation may make a pivot tuple *larger* than the non-pivot/leaf
> +tuple that it's based on (the first item on the right page), since a heap
> +TID must be appended when nothing else distinguishes each side of a leaf
> +split.  Truncation cannot simply reuse the leaf level representation: we
> +must append an additional attribute, rather than incorrectly leaving a heap
> +TID in the generic IndexTuple item pointer field.  (The field is already
> +used by pivot tuples to store their downlink, plus some additional
> +metadata.)

That's not really the fault of suffix truncation as such, but the 
process of turning a leaf tuple into a pivot tuple. It would happen even 
if you didn't truncate anything.

I think this point, that we have to store the heap TID differently in 
pivot tuples, would deserve a comment somewhere else, too. While reading 
the patch, I didn't realize that that's what we're doing, until I read 
this part of the README, even though I saw the new code to deal with 
heap TIDs elsewhere in the code. Not sure where, maybe in 
BTreeTupleGetHeapTID().

> +Adding a heap TID attribute during a leaf page split should only occur when
> +the page to be split is entirely full of duplicates (the new item must also
> +be a duplicate).  The logic for selecting a split point goes to great
> +lengths to avoid heap TIDs in pivots --- "many duplicates" mode almost
> +always manages to pick a split point between two user-key-distinct tuples,
> +accepting a completely lopsided split if it must.

This is the first mention of "many duplicates" mode. Maybe just say 
"_bt_findsplitloc() almost always ..." or "The logic for selecting a 
split point goes to great lengths to avoid heap TIDs in pivots, and 
almost always manages to pick a split point between two 
user-key-distinct tuples, accepting a completely lopsided split if it must."

> Once appending a heap
> +TID to a split's pivot becomes completely unavoidable, there is a fallback
> +strategy --- "single value" mode is used, which makes page splits pack the
> +new left half full by using a high fillfactor.  Single value mode leads to
> +better overall space utilization when a large number of duplicates are the
> +norm, and thereby also limits the total number of pivot tuples with an
> +untruncated heap TID attribute.

This assumes that tuples are inserted in increasing TID order, right? 
Seems like a valid assumption, no complaints there, but it's an 
assumption nevertheless.

I'm not sure if this level of detail is worthwhile in the README. This 
logic on deciding the split point is all within the _bt_findsplitloc() 
function, so maybe put this explanation there. In the README, a more 
high-level explanation of what things _bt_findsplitloc() considers, 
should be enough.

_bt_findsplitloc(), and all its helper structs and subroutines, are 
about 1000 lines of code now, and big part of nbtinsert.c. Perhaps it 
would be a good idea to move it to a whole new nbtsplitloc.c file? It's 
a very isolated piece of code.

In the comment on _bt_leave_natts_fast():

> + * Testing has shown that an approach involving treating the tuple as a
> + * decomposed binary string would work almost as well as the approach taken
> + * here.  It would also be faster.  It might actually be necessary to go that
> + * way in the future, if suffix truncation is made sophisticated enough to
> + * truncate at a finer granularity (i.e. truncate within an attribute, rather
> + * than just truncating away whole attributes).  The current approach isn't
> + * markedly slower, since it works particularly well with the "perfect
> + * penalty" optimization (there are fewer, more expensive calls here).  It
> + * also works with INCLUDE indexes (indexes with non-key attributes) without
> + * any special effort.

That's an interesting tidbit, but I'd suggest just removing this comment 
altogether. It's not really helping to understand the current 
implementation.

v9-0005-Add-high-key-continuescan-optimization.patch commit message:

> Note that even pre-pg_upgrade'd v3 indexes make use of this
> optimization.

.. but we're missing the other optimizations that make it more 
effective, so it probably won't do much for v3 indexes. Does it make 
them slower? It's probably acceptable, even if there's a tiny 
regression, but I'm curious.

----- laundry list ends -----

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Fri, Dec 28, 2018 at 10:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I spent some time reviewing this. I skipped the first patch, to add a
> column to pg_depend, and I got through patches 2, 3 and 4. Impressive
> results, and the code looks sane.

Thanks! I really appreciate your taking the time to do such a thorough review.

You were right to skip the first patch, because there is a fair chance
that it won't be used in the end. Tom is looking into the pg_depend
problem that I paper over with the first patch.

> I wrote a laundry list of little comments on minor things, suggested
> rewordings of comments etc. I hope they're useful, but feel free to
> ignore/override my opinions of any of those, as you see best.

I think that that feedback is also useful, and I'll end up using 95%+
of it. Much of the information I'm trying to get across is very
subtle.

> But first, a few slightly bigger (medium-sized?) issues that caught my eye:
>
> 1. How about doing the BTScanInsertData refactoring as a separate
> commit, first? It seems like a good thing for readability on its own,
> and would slim the big main patch. (And make sure to credit Andrey for
> that idea in the commit message.)

Good idea. I'll do that.

> This 'assumeheapkeyspace' flag feels awkward. What if the caller knows
> that it is a v3 index? There's no way to tell _bt_mkscankey() that.
> (There's no need for that, currently, but seems a bit weird.)

This is there for CREATE INDEX -- we cannot access the metapage during
an index build. We'll only be able to create new v4 indexes with the
patch applied, so we can assume that heap TID is part of the key space
safely.

> _bt_split() calls _bt_truncate(), which calls _bt_leave_natts(), which
> calls _bt_mkscankey(). It's holding a lock on the page being split. Do
> we risk deadlock by locking the metapage at the same time?

I already had vague concerns along the same lines. I am also concerned
about index_getprocinfo() calls that happen in the same code path,
with a buffer lock held. (SP-GiST's doPickSplit() function can be
considered a kind of precedent that makes the second issue okay, I
suppose.)

See also: My later remarks on the use of "authoritative comparisons"
from this same e-mail.

> I don't have any great ideas on what to do about this, but it's awkward
> as it is. Can we get away without the new argument? Could we somehow
> arrange things so that rd_amcache would be guaranteed to already be set?

These are probably safe in practice, but the way that we rely on them
being safe from a distance is a concern. Let me get back to you on
this.

> 3. In the "Pick nbtree split points discerningly" patch
>
> I find the different modes and the logic in _bt_findsplitloc() very hard
> to understand. I've spent a while looking at it now, and I think I have
> a vague understanding of what things it takes into consideration, but I
> don't understand why it performs those multiple stages, what each stage
> does, and how that leads to an overall strategy. I think a rewrite would
> be in order, to make that more understandable. I'm not sure what exactly
> it should look like, though.

I've already refactored that a little bit for the upcoming v10. The
way _bt_findsplitloc() state is initially set up becomes slightly more
streamlined. It still works in the same way, though, so you'll
probably only think that the new version is a minor improvement.
(Actually, v10 focuses on making _bt_splitatnewitem() a bit less
magical, at least right now.)

> If _bt_findsplitloc() has to fall back to the MANY_DUPLICATES or
> SINGLE_VALUE modes, it has to redo a lot of the work that was done in
> the DEFAULT mode already. That's probably not a big deal in practice,
> performance-wise, but I feel that it's another hint that some
> refactoring would be in order.

The logic within _bt_findsplitloc() has been very hard to refactor all
along. You're right that there is a fair amount of redundant-ish work
that the alternative modes (MANY_DUPLICATES + SINGLE_VALUE) perform.
The idea is to not burden the common DEFAULT case, and to keep the
control flow relatively simple.

I'm sure that if I were in your position I'd say something similar. It
is complicated in subtle ways that look like they might not matter,
but actually do. I am working off a fair variety of test cases, which
really came in handy. I remember thinking that I'd simplified it a
couple of times back in August or September, only to realize that I'd
regressed a case that I cared about. I eventually realized that I
needed to come up with a comprehensive though relatively fast test
suite, which seems essential for refactoring _bt_findsplitloc(), and
maybe even for fully understanding how _bt_findsplitloc() works.

Another complicating factor is that I have to worry about the number
of cycles used under a buffer lock (not just the impact on space
utilization).

With all of that said, I am willing to give it another try. You've
seen opportunities to refactor that I missed before now. More than
once.

> One idea on how to restructure that:
>
> Make a single pass over all the offset numbers, considering a split at
> that location. Like the current code does. For each offset, calculate a
> "penalty" based on two factors:
>
> * free space on each side
> * the number of attributes in the pivot tuple, and whether it needs to
> store the heap TID
>
> Define the penalty function so that having to add a heap TID to the
> pivot tuple is considered very expensive, more expensive than anything
> else, and truncating away other attributes gives a reward of some size.

As you go on to say, accessing the tuple to calculate a penalty like
this is expensive, and shouldn't be done exhaustively if at all
possible. We only access item pointer information (that is, lp_len)
in the master branch's _bt_findsplitloc(), and that's all we do within
the patch until the point where we have a (usually quite small) array
of candidate split points, sorted by delta.

Doing a pass over the page to assemble an array of candidate splits,
and then doing a pass over the sorted array of splits with
tolerably-low left/right space deltas works pretty well. "Mixing" the
penalties together up front like that is something I considered, and
decided not to pursue -- it obscures relatively uncommon though
sometimes important large differences that a single DEFAULT-mode
style pass would probably miss. MANY_DUPLICATES mode is totally
exhaustive, because it's worth being totally exhaustive in the extreme
case where there are only a few distinct values, and it's still
possible to avoid a large grouping of values that spans more than one
page. But it's not worth being exhaustive like that most of the time.
That's the useful thing about having 2 alternative modes, that we
"escalate" to if and only if it seems necessary to. MANY_DUPLICATES
can be expensive, because no workload is likely to consistently use
it. Most will almost always use DEFAULT, some will use SINGLE_VALUE
quite a bit -- MANY_DUPLICATES is for when we're "in between" those
two. That seems unlikely to be the steady state.

Maybe we could just have MANY_DUPLICATES mode, and make SINGLE_VALUE
mode something that happens within a DEFAULT pass. It's probably not
worth it, though -- SINGLE_VALUE mode generally wants to split the
page in a way that makes the left page mostly full, and the right page
mostly empty. So eliminating SINGLE_VALUE mode would probably not
simplify the code.

> However, naively computing the penalty upfront for every offset would be
> a bit wasteful. Instead, start from the middle of the page, and walk
> "outwards" towards both ends, until you find a "good enough" penalty.

You can't start at the middle of the page, though.

You have to start at the left (though you could probably start at the
right instead). This is because of page fragmentation -- it's not
correct to assume that the line pointer offset into tuple space on the
page (firstright line pointer lp_off for a candidate split point) tells
you anything about what the space delta will be after the split. You
have to exhaustively add up the free space before the line pointer
(the free space for all earlier line pointers) before seeing if the
line pointer works as a split point, since each previous line
pointer's tuple could be located anywhere in the original page's tuple
space (anywhere to the left or to the right of where it would be in
the simple/unfragmented case).

> 1st commits commit message:
>
> > Make nbtree treat all index tuples as having a heap TID trailing key
> > attribute.  Heap TID becomes a first class part of the key space on all
> > levels of the tree.  Index searches can distinguish duplicates by heap
> > TID, at least in principle.
>
> What do you mean by "at least in principle"?

I mean that we don't really do that currently, because we don't have
something like retail index tuple deletion. However, we do have, uh,
insertion, so I guess that this is just wrong. Will fix.

> > Secondary index insertions will descend
> > straight to the leaf page that they'll insert on to (unless there is a
> > concurrent page split).
>
> What is a "Secondary" index insertion?

Secondary index is how I used to refer to a non-unique index, until I
realized that that was kind of wrong. (In fact, all indexes in
Postgres are secondary indexes, because we always use a heap, never a
clustered index.)

Will fix.

> Suggestion: "when there are several attributes in an index" -> "in a
> multi-column index"

I'll change it to say that.

> > +/*
> > + * Convenience macro to get number of key attributes in tuple in low-context
> > + * fashion
> > + */
> > +#define BTreeTupleGetNKeyAtts(itup, rel)   \
> > +     Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
> > +
>
> What is "low-context fashion"?

I mean that it works with non-pivot tuples in INCLUDE indexes without
special effort on the caller's part, while also fetching the number of
key attributes in any pivot tuple, where it might well be <
IndexRelationGetNumberOfKeyAttributes(). Maybe no comment is necessary
-- BTreeTupleGetNKeyAtts() is exactly what it sounds like to somebody
that already knows about BTreeTupleGetNAtts().

> Suggestion: Reword to something like "During insertion, there must be a
> scan key for every attribute, but when starting a regular index scan,
> some can be omitted."

Will do.

> It would feel more natural to me, to have the mutable state *after* the
> other fields.

I fully agree, but I can't really change it. The struct
BTScanInsertData ends with a flexible array member, though it's sized
INDEX_MAX_KEYS because _bt_first() wants to allocate it on the stack
without special effort.

This was found to make a measurable difference with nested loop joins
-- I used to always allocate BTScanInsertData using palloc(), until I
found a regression. This nestloop join issue must be why commit
d961a568 removed an insertion scan key palloc() from _bt_first(), way
back in 2005. It seems like _bt_first() should remain free of
palloc()s, which it seems to actually manage to do, despite being so
hairy.
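
In other words, the goal is that _bt_first() can end up doing something
like this (a sketch only -- nextkey/keysCount/scankeys are assumed to be
the usual _bt_first() locals, and the field names are from v9):

    BTScanInsertData inskey;    /* automatic storage -- no palloc()/pfree() */

    inskey.savebinsrch = false;
    inskey.restorebinsrch = false;
    inskey.heapkeyspace = true; /* really comes from the metapage/rd_amcache */
    inskey.nextkey = nextkey;
    inskey.scantid = NULL;
    inskey.keysz = keysCount;
    memcpy(inskey.scankeys, scankeys, sizeof(ScanKeyData) * keysCount);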

> Also, it'd feel less error-prone to have 'scantid' be
> ItemPointerData, rather than a pointer to somewhere else.

It's useful for me to be able to set it to NULL, though -- I'd need
another bool to represent the absence of a scantid if the field was
ItemPointerData (the absence could occur when _bt_mkscankey() is
passed a pivot tuple with its heap TID already truncated away, for
example). Besides, the raw scan keys themselves are very often
pointers to an attribute in some index tuple -- a tuple that the
caller needs to keep around for as long as the insertion scan key
needs to be used. Why not do the same thing with scantid? It is more
or less just another attribute, so it's really the same situation as
before.

> The 'heapkeyspace' name isn't very descriptive. I understand that it means
> that the heap TID is part of the keyspace. Not sure what to suggest
> instead, though.

I already changed this once, based on a similar feeling. If you come
up with an even better name than "heapkeyspace", let me know.   :-)

> > +The requirement that all btree keys be unique is satisfied by treating heap
> > +TID as a tiebreaker attribute.  Logical duplicates are sorted in heap item
> > +pointer order.
>
> Suggestion: "item pointer" -> TID, to use consistent terms.

Will do.

> > We don't use btree keys to disambiguate downlinks from the
> > +internal pages during a page split, though: only one entry in the parent
> > +level will be pointing at the page we just split, so the link fields can be
> > +used to re-find downlinks in the parent via a linear search.  (This is
> > +actually a legacy of when heap TID was not treated as part of the keyspace,
> > +but it does no harm to keep things that way.)
>
> I don't understand this paragraph.

I mean that we could now "go full Lehman and Yao" if we wanted to:
it's not necessary to even use the link field like this anymore. We
don't do that because of v3 indexes, but also because it doesn't
actually matter. The current way of re-finding downlinks would
probably even be better in a green field situation, in fact -- it's
just a bit harder to explain in a research paper.

> Suggestion: reword to "All tuples on non-leaf pages, and high keys on
> leaf pages, are pivot tuples"

Will do.

> > Note that pivot tuples are
> > +only used to represent which part of the key space belongs on each page,
> > +and can have attribute values copied from non-pivot tuples that were
> > +deleted and killed by VACUUM some time ago.  A pivot tuple may contain a
> > +"separator" key and downlink, just a separator key (in practice the
> > +downlink will be garbage), or just a downlink.
>
> Rather than store garbage, set it to zeros?

There may be minor forensic value in keeping the item pointer block as
the heap block (but not the heap item pointer) within leaf high keys
(i.e. only changing it when it gets copied over for insertion into the
parent, and the block needs to point to the leaf child). I recall
discussing this with Alexander Korotkov shortly before the INCLUDE
patch went in. I'd rather keep it that way, rather than zeroing.

I could say "undefined" instead of "garbage", though. Not at all
attached to that wording.

> "distringuish between ... from ..." doesn't sound like correct grammar.
> Suggestion: "distinguish between ... and ...", or just "distinguish ...
> from ...". Or rephrase the sentence some other way.

Yeah, I mangled the grammar. Which is kind of surprising, since I make
a very important point about why strict lower bounds are handy in that
sentence!

> > +We truncate away suffix key attributes that are not needed for a page high
> > +key during a leaf page split when the remaining attributes distinguish the
> > +last index tuple on the post-split left page as belonging on the left page,
> > +and the first index tuple on the post-split right page as belonging on the
> > +right page.
>
> That's a very long sentence.

Will restructure.

> >                        * Since the truncated tuple is probably smaller than the
> >                        * original, it cannot just be copied in place (besides, we want
> >                        * to actually save space on the leaf page).  We delete the
> >                        * original high key, and add our own truncated high key at the
> >                        * same offset.  It's okay if the truncated tuple is slightly
> >                        * larger due to containing a heap TID value, since pivot tuples
> >                        * are treated as a special case by _bt_check_third_page().
>
> By "treated as a special case", I assume that _bt_check_third_page()
> always reserves some space for that? Maybe clarify that somehow.

I'll just say that _bt_check_third_page() reserves space for it in the
next revision of the patch.

> _bt_truncate():
> > This is possible when there are
> >  * attributes that follow an attribute in firstright that is not equal to the
> >  * corresponding attribute in lastleft (equal according to insertion scan key
> >  * semantics).
>
> I can't comprehend that sentence. Simpler English, maybe add an example,
> please.

Okay.

> > static int
> > _bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
> >                               bool build)
>
> IMHO "keep" would sound better here than "leave".

WFM.

> Could restructure this to avoid having two almost identical strings to
> translate.

I'll try.

> >  #define BTREE_METAPAGE       0               /* first page is meta */
> >  #define BTREE_MAGIC          0x053162        /* magic number of btree pages */
> > -#define BTREE_VERSION        3               /* current version number */
> > +#define BTREE_VERSION        4               /* current version number */
> >  #define BTREE_MIN_VERSION    2       /* minimal supported version number */
> > +#define BTREE_META_VERSION   3       /* minimal version with all meta fields */
>
> BTREE_META_VERSION is a strange name for version 3. I think this
> deserves a more verbose comment, above these #defines, to list all the
> versions and their differences.

Okay, but what would be better? I'm trying to convey that
BTREE_META_VERSION is the last version where upgrading was a simple
matter of changing the metapage, which can be performed on the fly.
The details of what were added to v3 (what nbtree stuff went into
Postgres 11) are not really interesting enough to have a descriptive
nbtree.h #define name. The metapage-only distinction is actually the
interesting distinction here (if I could do the upgrade on-the-fly,
there'd be no need for a v3 #define at all).

> v9-0003-Pick-nbtree-split-points-discerningly.patch commit message:
> > Add infrastructure to determine where the earliest difference appears
> > among a pair of tuples enclosing a candidate split point.
>
> I don't understand this sentence.

A (candidate) split point is a point *between* two enclosing tuples on
the original page, provided you pretend that the new tuple that caused
the split is already on the original page. I probably don't need to be
(un)clear on that in the commit message, though. I think that I'll
probably end up committing 0002-* and 0003-* in one go anyway (though
not before doing the insertion scan key struct refactoring in a
separate commit, as you suggest).

> > _bt_findsplitloc() is also taught to care about the case where there are
> > many duplicates, making it hard to find a distinguishing split point.
> > _bt_findsplitloc() may even conclude that it isn't possible to avoid
> > filling a page entirely with duplicates, in which case it packs pages
> > full of duplicates very tightly.
>
> Hmm. Is the assumption here that if a page is full of duplicates, there
> will be no more insertions into that page? Why?

This is a really important point, one that should probably have been in
your main feedback, rather than the laundry list. I was hoping you'd
comment on this more, in fact.

Imagine the extreme (and admittedly unrealistic) case first: We have a
page full of duplicates, all of which point to one heap page, and with
a gapless sequence of heap TID item pointers. It's literally
impossible to have another page split in this extreme case, because
VACUUM is guaranteed to kill the tuples in the leaf page before
anybody can insert next time (IOW, there has to be TID recycling
before an insertion into the leaf page is even possible).

Now, I've made the "fillfactor" 99, so I haven't actually assumed that
there will be *no* further insertions on the page. I'm almost assuming
that, but not quite. My thinking was that I should match the greedy
behavior that we already have to some degree, and continue to pack
leaf pages full of duplicates very tight. I am quite willing to
consider whether or not I'm still being too aggressive, all things
considered. If I made it 50:50, that would make indexes with
relatively few distinct values significantly larger than on master,
which would probably be deemed a regression. FWIW, I think that even
that regression in space utilization would be more than made up for in
other ways. The master branch _bt_findinsertloc() stuff is a disaster
with many duplicates for a bunch of reasons that are even more
important than the easy-to-measure bloat issue (FPIs, unnecessary
buffer lock contention... I could go on).

What value do you think works better than 99? 95? 90? I'm open minded
about this. I have my own ideas about why 99 works, but they're based
on intuitions that might fail to consider something important. The
current behavior with many duplicates is pretty awful, so we can at
least be sure that it isn't any worse than that.

> What do you do instead, then? memcmp? (Reading the patch, yes.
> Suggestion: "We use a faster binary comparison, instead of proper
> datatype-aware comparison, for speed".

WFM.

> Aside from performance, it would feel inappropriate to call user-defined
> code while holding a buffer lock, anyway.

But we do that all the time for this particular variety of user
defined code? I mean, we actually *have* to use the authoritative
comparisons at the last moment, once we actually make our mind up
about where to split -- nothing else is truly trustworthy. So, uh, we
actually do this "inappropriate" thing -- just not that much of it.

> I'd leave out the ", without provoking a split" part. Or maybe reword to
> "if you pretend that the incoming tuple fit and was placed on the page
> already".

Okay.

> It took me a while to understand what the "appear as early as possible"
> means here. It's talking about a multi-column index, and about finding a
> difference in one of the leading key columns. Not, for example, about
> finding a split point early in the page.

This is probably a hold-over from when we didn't look at candidate
split point tuples an attribute at a time (months ago, it was
something pretty close to a raw memcmp()). Will fix.

> Perhaps we should leave out these details in the README, and explain
> this in the comments of the picksplit-function itself? In the README, I
> think a more high-level description of what things are taken into
> account when picking the split point, would be enough.

Agreed.

> > +Suffix truncation is primarily valuable because it makes pivot tuples
> > +smaller, which delays splits of internal pages, but that isn't the only
> > +reason why it's effective.
>
> Suggestion: reword to "... , but that isn't the only benefit" ?

WFM.

> > There are cases where suffix truncation can
> > +leave a B-Tree significantly smaller in size than it would have otherwise
> > +been without actually making any pivot tuple smaller due to restrictions
> > +relating to alignment.
>
> Suggestion: reword to "... smaller in size than it would otherwise be,
> without ..."

WFM.

> and "without making any pivot tuple *physically* smaller, due to alignment".

WFM.

> This sentence is a bit of a cliffhanger: what are those cases, and how
> is that possible?

This is something you see with the TPC-C indexes, even without the new
split stuff. The TPC-C stock pk is about 45% smaller with that later
commit, but it's something like 6% or 7% smaller even without it (or
maybe it's the orderlines pk). And without ever managing to make a
pivot tuple physically smaller. This happens because truncating away
trailing attributes allows more stuff to go on the first right half of
a split. In more general terms: suffix truncation avoids committing
ourselves to rules about where values should go that are stricter than
truly necessary. On balance, this improves space utilization quite
noticeably, even without the special cases where really big
improvements are made.

If that still doesn't make sense, perhaps you should just try out the
TPC-C stuff without the new split patch, and see for yourself. The
easiest way to do that is to follow the procedure I describe here:

https://bitbucket.org/openscg/benchmarksql/issues/6/making-it-easier-to-recreate-postgres-tpc

(BenchmarkSQL is by far the best TPC-C implementation I've found that
works with Postgres, BTW. Yes, I also hate Java.)

> Ok, I guess these sentences resolve the cliffhanger I complained about.
> But this still feels like magic. When you split a page, all of the
> keyspace must belong on the left or the right page. Why does it make a
> difference to space utilization, where exactly you split the key space?

You have to think about the aggregate effect, rather than thinking
about a single split at a time. But, like I said, maybe the best thing
is to see the effect for yourself with TPC-C (while reverting the
split-at-new-item patch).

> Ok, so this explains it further, I guess. I find this paragraph
> difficult to understand, though. The important thing here is the idea
> that some split points are more "discriminating" than others, but I
> think it needs some further explanation. What makes a split point more
> discriminating? Maybe add an example.

An understandable example seems really hard, even though the effect is
clear. Maybe I should just say *nothing* about the benefits when pivot
tuples don't actually shrink? I found it pretty interesting, and maybe
even something that makes it more understandable, but maybe that isn't
a good enough reason to keep the explanation.

This doesn't address your exact concern, but I think it might help:

Bayer's Prefix B-tree paper talks about the effect of being more
aggressive in finding a split point. You tend to be able to make an index
have more leaf pages but fewer internal pages as you get more
aggressive about split points. However, both internal pages and leaf
pages eventually become more numerous than they'd be with a reasonable
interval/level of aggression/discernment -- the saving in internal
pages no longer compensates for having more downlinks in internal
pages. Bayer ends up saying next to nothing about how big the "split
interval" should be.

BTW, somebody named Timothy L. Towns wrote the only analysis I've been
able to find on split interval for "simple prefix B-Trees" (suffix
truncation):

https://shareok.org/bitstream/handle/11244/16442/Thesis-1983-T747e.pdf?sequence=1

He is mostly talking about the classic case from Bayer's 77 paper,
where everything is a memcmp()-able string, which is probably what
some systems actually do. On the other hand, I care about attribute
granularity. Anyway, it's pretty clear that this Timothy L. Towns
fellow should have picked a better topic for his thesis, because he
fails to say anything practical about it. Unfortunately, a certain
amount of magic in this area is unavoidable.

> > +Suffix truncation may make a pivot tuple *larger* than the non-pivot/leaf
> > +tuple that it's based on (the first item on the right page), since a heap
> > +TID must be appended when nothing else distinguishes each side of a leaf
> > +split.  Truncation cannot simply reuse the leaf level representation: we
> > +must append an additional attribute, rather than incorrectly leaving a heap
> > +TID in the generic IndexTuple item pointer field.  (The field is already
> > +used by pivot tuples to store their downlink, plus some additional
> > +metadata.)
>
> That's not really the fault of suffix truncation as such, but the
> process of turning a leaf tuple into a pivot tuple. It would happen even
> if you didn't truncate anything.

Fair. Will change.

> I think this point, that we have to store the heap TID differently in
> pivot tuples, would deserve a comment somewhere else, too. While reading
> the patch, I didn't realize that that's what we're doing, until I read
> this part of the README, even though I saw the new code to deal with
> heap TIDs elsewhere in the code. Not sure where, maybe in
> BTreeTupleGetHeapTID().

Okay.

> This is the first mention of "many duplicates" mode. Maybe just say
> "_bt_findsplitloc() almost always ..." or "The logic for selecting a
> split point goes to great lengths to avoid heap TIDs in pivots, and
> almost always manages to pick a split point between two
> user-key-distinct tuples, accepting a completely lopsided split if it must."

Sure.

> > Once appending a heap
> > +TID to a split's pivot becomes completely unavoidable, there is a fallback
> > +strategy --- "single value" mode is used, which makes page splits pack the
> > +new left half full by using a high fillfactor.  Single value mode leads to
> > +better overall space utilization when a large number of duplicates are the
> > +norm, and thereby also limits the total number of pivot tuples with an
> > +untruncated heap TID attribute.
>
> This assumes that tuples are inserted in increasing TID order, right?
> Seems like a valid assumption, no complaints there, but it's an
> assumption nevertheless.

I can be explicit about that. See also: my remarks above about
"fillfactor" with single value mode.

> I'm not sure if this level of detail is worthwhile in the README. This
> logic on deciding the split point is all within the _bt_findsplitloc()
> function, so maybe put this explanation there. In the README, a more
> high-level explanation of what things _bt_findsplitloc() considers,
> should be enough.

Okay.

> _bt_findsplitloc(), and all its helper structs and subroutines, are
> about 1000 lines of code now, and big part of nbtinsert.c. Perhaps it
> would be a good idea to move it to a whole new nbtsplitloc.c file? It's
> a very isolated piece of code.

Good idea. I'll give that a go.

> In the comment on _bt_leave_natts_fast():

> That's an interesting tidbit, but I'd suggest just removing this comment
> altogether. It's not really helping to understand the current
> implementation.

Will do.

> v9-0005-Add-high-key-continuescan-optimization.patch commit message:
>
> > Note that even pre-pg_upgrade'd v3 indexes make use of this
> > optimization.
>
> .. but we're missing the other optimizations that make it more
> effective, so it probably won't do much for v3 indexes. Does it make
> them slower? It's probably acceptable, even if there's a tiny
> regression, but I'm curious.

But v3 indexes get the same _bt_findsplitloc() treatment as v4 indexes
-- the new-item-split stuff works almost as well for v3 indexes, and
the other _bt_findsplitloc() stuff doesn't seem to make much
difference. I'm not sure if that's the right thing to do (probably
doesn't matter very much). Now, to answer your question about v3
indexes + the continuescan optimization: I think that it probably will
help a bit, with or without the _bt_findsplitloc() changes. Much
harder to be sure whether it's worth it on balance, since that's
workload dependent. My sense is that it's a much smaller benefit much
of the time, but the cost is still pretty low. So why not just make it
version-generic, and keep things relatively uncluttered?

Once again, I greatly appreciate your excellent review!
--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 29/12/2018 01:04, Peter Geoghegan wrote:
>> However, naively computing the penalty upfront for every offset would be
>> a bit wasteful. Instead, start from the middle of the page, and walk
>> "outwards" towards both ends, until you find a "good enough" penalty.
>
> You can't start at the middle of the page, though.
> 
> You have to start at the left (though you could probably start at the
> right instead). This is because of page fragmentation -- it's not
> correct to assume that the line pointer offset into tuple space on the
> page (firstright line pointer lp_off for a candidate split point) tells
> you anything about what the space delta will be after the split. You
> have to exhaustively add up the free space before the line pointer
> (the free space for all earlier line pointers) before seeing if the
> line pointer works as a split point, since each previous line
> pointer's tuple could be located anywhere in the original page's tuple
> space (anywhere to the left or to the right of where it would be in
> the simple/unfragmented case).

Right. You'll need to do the free space computations from left to right, 
but once you have done that, you can compute the penalties in any order.

I'm envisioning that you have an array, with one element for each item 
on the page (including the tuple we're inserting, which isn't really on 
the page yet). In the first pass, you count up from left to right, 
filling the array. Next, you compute the complete penalties, starting 
from the middle, walking outwards.

That's not so different from what you're doing now, but I find it more 
natural to explain the algorithm that way.
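
Roughly, with made-up names (just a sketch of the two passes, not tested):

typedef struct SplitCand
{
    int         leftfree;       /* free space on left page after this split */
    int         rightfree;      /* free space on right page after this split */
} SplitCand;

/*
 * First pass: left to right, accumulating the space consumed by the items
 * that end up to the left of each candidate split point.  itemsize[] already
 * includes the incoming tuple at its would-be position.
 */
static void
fill_split_array(SplitCand *cand, const int *itemsize, int nitems,
                 int pagespace, int totalused)
{
    int         usedleft = 0;

    for (int i = 0; i < nitems; i++)
    {
        usedleft += itemsize[i];
        cand[i].leftfree = pagespace - usedleft;
        cand[i].rightfree = pagespace - (totalused - usedleft);
    }
}

/*
 * Second pass (not shown): compute the full penalty -- how many attributes
 * can be truncated, whether a heap TID is needed -- starting from the middle
 * element and walking outwards, stopping at the first "good enough" one.
 */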

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Right. You'll need to do the free space computations from left to right,
> but once you have done that, you can compute the penalties in any order.
>
> I'm envisioning that you have an array, with one element for each item
> on the page (including the tuple we're inserting, which isn't really on
> the page yet). In the first pass, you count up from left to right,
> filling the array. Next, you compute the complete penalties, starting
> from the middle, walking outwards.
>
> That's not so different from what you're doing now, but I find it more
> natural to explain the algorithm that way.

Ah, right. I think I see what you mean now.

I like that this datastructure explicitly has a place for the new
item, so you really do "pretend it's already on the page". Maybe
that's what you liked about it as well.

I'm a little concerned about the cost of maintaining the data
structure. This sounds workable, but we probably don't want to
allocate a buffer most of the time, or even hold on to the information
most of the time. The current design throws away potentially useful
information that it may later have to recreate, but even that has the
benefit of having little storage overhead in the common case.

Leave it with me. I'll need to think about this some more.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Alexander Korotkov
Date:
Hi!

I'm starting to look at this patchset.  I'm not ready to post a detailed
review yet, but I have a couple of questions.

On Wed, Sep 19, 2018 at 9:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I still haven't managed to add pg_upgrade support, but that's my next
> step. I am more or less happy with the substance of the patch in v5,
> and feel that I can now work backwards towards figuring out the best
> way to deal with on-disk compatibility. It shouldn't be too hard --
> most of the effort will involve coming up with a good test suite.

Yes, it shouldn't be too hard, but it seems like we have to keep two
branches of code for different handling of duplicates.  Is that true?

+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()
+ * quantum.  This guarantee is important, since callers need to stay under
+ * the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.

I didn't get the point of this paragraph.  Might it happen that the
first right tuple is under the tuple size restriction, but the new pivot
tuple is beyond that restriction?  If so, would we get an error because
the pivot tuple is too long?  If not, I think this needs to be explained
better.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
Hi Alexander,

On Fri, Jan 4, 2019 at 7:40 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I'm starting to look at this patchset.  I'm not ready to post a detailed
> review yet, but I have a couple of questions.

Thanks for taking a look!

> Yes, it shouldn't be too hard, but it seems like we have to keep two
> branches of code for different handling of duplicates.  Is that true?

Not really. If you take a look at v9, you'll see the approach I've
taken is to make insertion scan keys aware of which rules apply (the
"heapkeyspace" field field controls this). I think that there are
about 5 "if" statements for that outside of amcheck. It's pretty
manageable.

I like to imagine that the existing code already has unique keys, but
nobody ever gets to look at the final attribute. It works that way
most of the time -- the only exception is insertion with user keys
that aren't unique already. Note that the way we move left on equal
pivot tuples, rather than right (rather than following the pivot's
downlink) wasn't invented by Postgres to deal with the lack of unique
keys. That's actually a part of the Lehman and Yao design itself.
Almost all of the special cases are optimizations rather than truly
necessary infrastructure.

> I didn't get the point of this paragraph.  Might it happen that the
> first right tuple is under the tuple size restriction, but the new pivot
> tuple is beyond that restriction?  If so, would we get an error because
> the pivot tuple is too long?  If not, I think this needs to be explained
> better.

The v9 version of the function _bt_check_third_page() shows what it
means (comments on this will be improved in v10, too). The old limit
of 2712 bytes still applies to pivot tuples, while a new, lower limit
of 2704 bytes applies to non-pivot tuples. This difference is
necessary because an extra MAXALIGN() quantum could be needed to add a
heap TID to a pivot tuple during truncation in the worst case. To
users, the limit is 2704 bytes, because that's the limit that actually
needs to be enforced during insertion.
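
To spell out the arithmetic (these worked numbers are mine, assuming the
stock 8192-byte BLCKSZ and 8-byte MAXALIGN; the real limits come from
macros in nbtree.h):

/*
 * Space usable for items on an empty leaf page:
 *
 *    8192 - MAXALIGN(24-byte page header + 3 line pointers)   (= 8192 - 40)
 *         - MAXALIGN(sizeof(BTPageOpaqueData))                (=      - 16)
 *         = 8136
 *
 * Old "1/3 of a page" limit, which still applies to pivot tuples:
 *
 *    8136 / 3 = 2712
 *
 * New limit enforced for non-pivot tuples, which reserves one MAXALIGN()
 * quantum in case truncation must append a heap TID to the new pivot:
 *
 *    2712 - MAXALIGN(sizeof(ItemPointerData)) = 2712 - 8 = 2704
 */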

We never actually say "1/3 of a page means 2704 bytes" in the docs,
since the definition was always a bit fuzzy. There will need to be a
compatibility note in the release notes, though.
-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Fri, Dec 28, 2018 at 3:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > I'm envisioning that you have an array, with one element for each item
> > on the page (including the tuple we're inserting, which isn't really on
> > the page yet). In the first pass, you count up from left to right,
> > filling the array. Next, you compute the complete penalties, starting
> > from the middle, walking outwards.

> Ah, right. I think I see what you mean now.

> Leave it with me. I'll need to think about this some more.

Attached is v10 of the patch series, which has many changes based on
your feedback. However, I didn't end up refactoring _bt_findsplitloc()
in the way you described, because it seemed hard to balance all of the
concerns there. I still have an open mind on this question, and
recognize the merit in what you suggested. Perhaps it's possible to
reach a compromise here.

I did refactor the _bt_findsplitloc() stuff to make the division of
work clearer, though -- I think that you'll find that to be a clear
improvement, even though it's less than what you asked for. I also
moved all of the _bt_findsplitloc() stuff (old and new) into its own
.c file, nbtsplitloc.c, as you suggested.

Other significant changes
=========================

* Creates a new commit that changes routines like _bt_search() and
_bt_binsrch() to use a dedicated insertion scankey struct, per request
from Heikki.

* As I mentioned in passing, many other small changes to comments, the
nbtree README, and the commit messages based on your (Heikki's) first
round of review.

* v10 generalizes the previous _bt_lowest_scantid() logic for adding a
tie-breaker on equal pivot tuples during a descent of a B-Tree.

The new code works with any truncated attribute, not just a truncated
heap TID (I removed _bt_lowest_scantid() entirely). This also allowed
me to remove a couple of places that previously opted in to
_bt_lowest_scantid(), since the new approach can work without anybody
explicitly opting in. As a bonus, the new approach makes the patch
faster, since remaining queries where we unnecessarily follow an
equal-though-truncated downlink are fixed (it's usually only the heap
TID that's truncated when we can do this, but not always).

The idea behind this new generalized approach is to recognize that
minus infinity is an artificial/sentinel value that doesn't appear in
real keys (it only appears in pivot tuples). The majority of callers
(all callers aside from VACUUM's leaf page deletion code) can
therefore go to the right of a pivot that has all-equal attributes, if
and only if:

1. The pivot has at least one truncated/minus infinity attribute *and*

2. The number of attributes matches the scankey.

In other words, we tweak the comparison logic to add a new
tie-breaker. There is no change to the on-disk structures compared to
v9 of the patch -- I've only made index scans able to take advantage
of minus infinity values in *all* cases.
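
Roughly, as a sketch (this is not the patch's actual _bt_compare() code,
and all of the names here are illustrative only):

/*
 * At this point every scankey attribute has compared equal against the
 * untruncated attributes of a pivot tuple on an internal page.
 */
if (result == 0 &&
    key->keysz == ntupatts &&       /* scankey covers every untruncated
                                     * attribute of the pivot... */
    pivot_has_truncated_attr)       /* ...and at least one attribute
                                     * (often just the heap TID) was
                                     * truncated to minus infinity */
    result = 1;                     /* tie-break: descend to the right.
                                     * VACUUM's page deletion is the one
                                     * caller that must not do this. */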

If this explanation is confusing to somebody less experienced with
nbtree than Heikki: consider the way we descend *between* the values
on internal pages, rather than expecting exact matches. _bt_binsrch()
behaves slightly differently when doing a binary search on an internal
page already: equality actually means "go left" when descending the
tree (though it doesn't work like that on leaf pages, where insertion
scankeys almost always search for a >= match). We want to "go right"
instead in cases where it's clear that tuples of interest to our scan
can only be in that child page (we're rarely searching for a minus
infinity value, since that doesn't appear in real tuples). (Note that
this optimization has nothing to do with "moving right" to recover
from concurrent page splits -- we would have relied on code like
_bt_findinsertloc() and _bt_readpage() to move right once we reach the
leaf level when we didn't have this optimization, but that code isn't
concerned with recovering from concurrent page splits.)

Minor changes
=============

* Addresses Heikki's concerns about locking the metapage more
frequently in a general way. Comments are added to nbtpage.c, and
updated in a number of places that already talk about the same risk.

The master branch seems to be doing much the same thing in similar
situations already (e.g. during a root page split, when we need to
finish an interrupted page split but don't have a usable
parent/ancestor page stack). Importantly, the patch does not change
the dependency graph.

* Small changes to user docs where existing descriptions of things
seem to be made inaccurate by the patch.

Benchmarking
============

I have also recently been doing a lot of automated benchmarking. Here
are results of a BenchmarkSQL benchmark (plus various instrumentation)
as a bz2 archive:

https://drive.google.com/file/d/1RVJUzMtMNDi4USg0-Yo56LNcRItbFg1Q/view?usp=sharing

It completed on my home server last night, against v10 of the patch
series. Note that there were 4 runs for each case (master case +
public/patch case), with each run lasting 2 hours (so the benchmark
took over 8 hours once you include bulk loading time). There were 400
"warehouses" (this is similar to pgbench's scale factor), and 16
terminals/clients. This left the database 110GB+ in size on a server
with 32GB of memory and a fast consumer grade SSD. Autovacuum was
tuned to perform aggressive cleanup of bloat. All the settings used
are available in the bz2 archive (there are "settings" output files,
too).

Summary
-------

See the html "report" files for a quick visual indication of how the
tests progressed. BenchmarkSQL uses R to produce useful graphs, which
is quite convenient. (I have automated a lot of this with my own ugly
shellscript.)

We see a small but consistent increase in transaction throughput here,
as well as a small but consistent decrease in average latency for each
class of transaction. There is also a large and consistent decrease in
the on-disk size of indexes, especially if you just consider the
number of internal pages (diff the "balance" files to see what I
mean). Note that the performance is expected to degrade across runs,
since the database is populated once, at the start, and has more data
added over time; the important thing is that run n on master be
compared to run n on public/patch. Note also that I use my own fork of
BenchmarkSQL that does its CREATE INDEX before initial bulk loading,
not after [1]. It'll take longer to see problems on Postgres master if
the initial bulk load does CREATE INDEX after BenchmarkSQL workers
populate tables (we only need INSERTs to see significant index bloat).
Avoiding pristine indexes at the start of the benchmark makes the
problems on the master branch apparent sooner.

The benchmark results also include things like pg_statio* +
pg_stat_bgwriter view output (reset between test runs), which gives
some insight into what's going on. Checkpoints tend to write out a few
more dirty buffers with the patch, while there is a much larger drop
in the number of buffers written out by backends. There are probably
workloads where we'd see a much larger increase in transaction
throughput -- TPC-C happens to access index pages with significant
locality, and happens to be very write-heavy, especially compared to
the more modern (though less influential) TPC-E benchmark. Plus, the
TPC-C workload isn't at all helped by the fact that the patch will
never "get tired", even though that's the most notable improvement
overall.

[1] https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations
--
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Tue, Jan 8, 2019 at 4:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v10 of the patch series, which has many changes based on
> your feedback. However, I didn't end up refactoring _bt_findsplitloc()
> in the way you described, because it seemed hard to balance all of the
> concerns there. I still have an open mind on this question, and
> recognize the merit in what you suggested. Perhaps it's possible to
> reach a compromise here.

> * Addresses Heikki's concerns about locking the metapage more
> frequently in a general way. Comments are added to nbtpage.c, and
> updated in a number of places that already talk about the same risk.

Attached is v11 of the patch, which removes the comments mentioned
here, and instead finds a way to not do new things with buffer locks.

Changes
=======

* We simply avoid holding buffer locks while accessing the metapage.
(Of course, the old root page split stuff still does this -- it has
worked that way forever.)

* We also avoid calling index_getprocinfo() with any buffer lock held.
We'll still call support function 1 with a buffer lock held to
truncate, but that's not new -- *any* insertion will do that.

For some reason I got stuck on the idea that we need to use a
scankey's own values within _bt_truncate()/_bt_keep_natts() by
generating a new insertion scankey every time. We now simply ignore
those values, and call the comparator with pairs of tuples that each
come from the page directly. Usually, we'll just reuse the insertion
scankey that we were using for the insertion anyway (we no longer
build our own scankey for truncation). Other times, we'll build an
empty insertion scankey (one that has the function pointer and so on,
but no values). The only downside is that I cannot have an assertion
that calls _bt_compare() to make sure we truncated correctly there and
then, since a dedicated insertion scankey is no longer conveniently
available.

I feel rather silly for not having gone this way from the beginning,
because the new approach is quite obviously simpler and safer.
nbtsort.c now gets a reusable, valueless insertion scankey that it
uses for both truncation and for setting up a merge of the two spools
for unique index builds. This approach allows me to remove
_bt_mkscankey_nodata() altogether -- callers build a "nodata"
insertion scankey with empty values by passing _bt_mkscankey() a NULL
tuple. That's equivalent to having an insertion scankey built from an
all-attributes-truncated tuple. IOW, the patch now makes the "nodata"
stuff a degenerate case of building a scankey from a
truncated-attributes tuple. tuplesort.c also uses the new BTScanInsert
struct. There is no longer any case where there is an insertion
scankey that isn't accessed using the BTScanInsert struct.
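
A usage sketch of what that looks like for a caller (paraphrasing the
description above, not actual patch code):

/* Build a valueless ("nodata") insertion scankey: comparator/support
 * function information only, no values */
BTScanInsert inskey = _bt_mkscankey(rel, NULL);

/* nbtsort.c reuses one such scankey both for truncation and for setting
 * up the merge of its two spools during unique index builds */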

* No more pg_depend tie-breaker column commit. That was an ugly hack,
that I'm glad to be rid of -- many thanks to Tom for working through a
number of test instability issues that affected the patch. I do still
need to paper-over one remaining regression test issue/bug that the
patch happens to unmask, pending Tom fixing it directly [1]. This
papering-over is broken out into its own commit
("v11-0002-Paper-over-DEPENDENCY_INTERNAL_AUTO-bug-failures.patch"). I
expect that Tom will fix the bug before too long, at which point the
temporary work around can just be reverted from your local tree.

Outlook
=======

I feel that this version is pretty close to being commitable, since
everything about the design is settled. It completely avoids saying
anything new about buffer locking protocols, LWLock deadlock safety,
etc. VACUUM and crash recovery are also unchanged, so subtle bugs
should at least not be too hard to reproduce when observed once. It's
pretty complementary code: the new logic for picking a split point
builds a list of candidate split points using the old technique, with
a second pass to choose the best one for suffix truncation among the
accumulated list. Hard to see how that could introduce an invalid
split point choice.

I take the risk of introducing new corruption bugs very seriously.
contrib/amcheck now verifies all aspects of the new on-disk
representation. The stricter Lehman & Yao style invariant ("the
subtree S is described by Ki < v <= Ki + 1 ...") allows amcheck to be
stricter in what it will accept (e.g., heap TID needs to be in order
among logical duplicates, we always expect to see a representation of
the number of pivot tuple attributes, and we expect the high key to be
strictly greater than items on internal pages).

Review
======

It would be very helpful if a reviewer such as Heikki or Alexander
could take a look at the patch once more. I suggest that they look at
the following points in the patch:

*  The minusinfkey stuff, which is explained within _bt_compare(), and
within nbtree.h header comments. Page deletion by VACUUM is the only
_bt_search() caller that sets minusinfkey to true (though older
versions of btree and amcheck also set minusinfkey to true).

* Does the value of BTREE_SINGLEVAL_FILLFACTOR make sense? Am I being
a little too aggressive there, possibly hurting workloads where HOT
pruning occurs periodically? Sane duplicate handling is the most
compelling improvement that the patch makes, but I may still have been
a bit too aggressive in packing pages full of duplicates so tightly. I
figured that that was the closest thing to the previous behavior
that's still reasonable.

* Does the _bt_splitatnewitem() criteria for deciding if we should
split at the point the new tuple is positioned at miss some subtlety?
It's important that splitting at the new item when we shouldn't
doesn't happen, or hardly ever happens -- it should be
*self-limiting*. This was tested using BenchmarkSQL/TPC-C [2] -- TPC-C
has a workload where this particular enhancement makes indexes a lot
smaller.

* There was also testing of index bloat following bulk insertions,
based on my own custom test suite. Data and indexes were taken from
TPC-C tables, TPC-H tables, TPC-E tables, UK land registry data [3],
and the Mouse Genome Database Project (Postgres schema + indexes) [4].
This takes almost an hour to run on my development machine, though the
most important tests finish in less than 5 minutes. I can provide
access to all or some of these tests, if reviewers are interested and
are willing to download several gigabytes of sample data that I'll
provide privately.

[1] https://postgr.es/m/19220.1547767251@sss.pgh.pa.us
[2] https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations
[3] https://wiki.postgresql.org/wiki/Sample_Databases
[4] http://www.informatics.jax.org/software.shtml
--
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 09/01/2019 02:47, Peter Geoghegan wrote:
> On Fri, Dec 28, 2018 at 3:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> I'm envisioning that you have an array, with one element for each item
>>> on the page (including the tuple we're inserting, which isn't really on
>>> the page yet). In the first pass, you count up from left to right,
>>> filling the array. Next, you compute the complete penalties, starting
>>> from the middle, walking outwards.
> 
>> Ah, right. I think I see what you mean now.
> 
>> Leave it with me. I'll need to think about this some more.
> 
> Attached is v10 of the patch series, which has many changes based on
> your feedback. However, I didn't end up refactoring _bt_findsplitloc()
> in the way you described, because it seemed hard to balance all of the
> concerns there. I still have an open mind on this question, and
> recognize the merit in what you suggested. Perhaps it's possible to
> reach a compromise here.

I spent some time first trying to understand the current algorithm, and 
then rewriting it in a way that I find easier to understand. I came up 
with the attached. I think it optimizes for the same goals as your 
patch, but the approach  is quite different. At a very high level, I 
believe the goals can be described as:

1. Find out how much suffix truncation is possible, i.e. how many key 
columns can be truncated away, in the best case, among all possible ways 
to split the page.

2. Among all the splits that achieve that optimum suffix truncation, 
find the one with smallest "delta".

For performance reasons, it doesn't actually do it in that order. It's 
more like this:

1. First, scan all split positions, recording the 'leftfree' and 
'rightfree' at every valid split position. The array of possible splits 
is kept in order by offset number. (This scans through all items, but 
the math is simple, so it's pretty fast)

2. Compute the optimum suffix truncation, by comparing the leftmost and 
rightmost keys, among all the possible split positions.

3. Split the array of possible splits in half, and process both halves 
recursively. The recursive process "zooms in" to the place where we'd 
expect to find the best candidate, but will ultimately scan through all 
split candidates, if no "good enough" match is found.


I've only been testing this on leaf splits. I didn't understand how the 
penalty worked for internal pages in your patch. In this version, the 
same algorithm is used for leaf and internal pages. I'm sure this still 
has bugs in it, and could use some polishing, but I think this will be 
more readable way of doing it.


What have you been using to test this? I wrote the attached little test 
extension, to explore what _bt_findsplitloc() decides in different 
scenarios. It's pretty rough, but that's what I've been using while 
hacking on this. It prints output like this:

postgres=# select test_split();
NOTICE:  test 1:
left    2/358: 1 0
left  358/358: 1 356
right   1/ 51: 1 357
right  51/ 51: 1 407  SPLIT TUPLE
split ratio: 12/87

NOTICE:  test 2:
left    2/358: 0 0
left  358/358: 356 356
right   1/ 51: 357 357
right  51/ 51: 407 407  SPLIT TUPLE
split ratio: 12/87

NOTICE:  test 3:
left    2/358: 0 0
left  320/358: 10 10  SPLIT TUPLE
left  358/358: 48 48
right   1/ 51: 49 49
right  51/ 51: 99 99
split ratio: 12/87

NOTICE:  test 4:
left    2/309: 1 100
left  309/309: 1 407  SPLIT TUPLE
right   1/100: 2 0
right 100/100: 2 99
split ratio: 24/75

Each test consists of creating a temp table with one index, and 
inserting rows in a certain pattern, until the root page splits. It then 
prints the first and last tuples on both pages, after the split, as well 
as the tuple that caused the split. I don't know if this is useful to 
anyone but myself, but I thought I'd share it.

- Heikki

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I spent some time first trying to understand the current algorithm, and
> then rewriting it in a way that I find easier to understand. I came up
> with the attached. I think it optimizes for the same goals as your
> patch, but the approach  is quite different. At a very high level, I
> believe the goals can be described as:
>
> 1. Find out how much suffix truncation is possible, i.e. how many key
> columns can be truncated away, in the best case, among all possible ways
> to split the page.
>
> 2. Among all the splits that achieve that optimum suffix truncation,
> find the one with smallest "delta".

Thanks for going to the trouble of implementing what you have in mind!

I agree that the code that I wrote within nbtsplitloc.c is very
subtle, and I do think that I have further work to do to make its
design clearer. I think that this high level description of the goals
of the algorithm is inaccurate in subtle but important ways, though.
Hopefully there will be a way of making it more understandable that
preserves certain important characteristics. If you had my test
cases/data that would probably help you a lot (more on that later).

The algorithm I came up with almost always does these two things in
the opposite order, with each considered in clearly separate phases.
There are good reasons for this. We start with the same criteria as
the old algorithm. We assemble a small array of candidate split
points, rather than one split point, but otherwise it's almost exactly
the same (the array is sorted by delta). Then, at the very end, we
iterate through the small array to find the best choice for suffix
truncation. IOW, we only consider suffix truncation as a *secondary*
goal. The delta is still by far the most important thing 99%+ of the
time. I assume it's fairly rare to not have two distinct tuples within
9 or so tuples of the delta-wise optimal split position -- 99% is
probably a low estimate, at least in OLTP app, or within unique
indexes. I see that you do something with a "good enough" delta that
seems like it also makes delta the most important thing, but that
doesn't appear to be, uh, good enough. ;-)

Now, it's true that my approach does occasionally work in a way close
to what you describe above -- it does this when we give up on default
mode and check "how much suffix truncation is possible?" exhaustively,
for every possible candidate split point. "Many duplicates" mode kicks
in when we need to be aggressive about suffix truncation. Even then,
the exact goals are different to what you have in mind in subtle but
important ways. While "truncating away the heap TID" isn't really a
special case in other places, it is a special case for my version of
nbtsplitloc.c, which more or less avoids it at all costs. Truncating
away heap TID is more important than truncating away any other
attribute by a *huge* margin. Many duplicates mode *only* specifically
cares about truncating the final TID attribute. That is the only thing
that is ever treated as more important than delta, though even there
we don't forget about delta entirely. That is, we assume that the
"perfect penalty" is nkeyatts when in many duplicates mode, because we
don't care about suffix truncation beyond heap TID truncation by then.
So, if we find 5 split points out of 250 in the final array that avoid
appending heap TID, we use the earliest/lowest delta out of those 5.
We're not going to try to maximize the number of *additional*
attributes that get truncated, because that can make the leaf pages
unbalanced in an *unbounded* way. None of these 5 split points are
"good enough", but the distinction between their deltas still matters
a lot. We strongly prefer a split with a *mediocre* delta to a split
with a *terrible* delta -- a bigger high key is the least of our
worries here. (I made similar mistakes myself months ago, BTW.)
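
As a hypothetical sketch of that selection rule (not the actual
nbtsplitloc.c code; the names and the helper are made up, and splits[]
is assumed to already be sorted by delta):

for (int i = 0; i < nsplits; i++)
{
    /* "Perfect penalty" is nkeyatts: no heap TID must be appended */
    if (split_penalty(state, &splits[i]) <= nkeyatts)
        return i;               /* lowest delta among qualifying splits */
}

/* No split point avoids appending a heap TID -- fall back on the split
 * with the lowest delta overall */
return 0;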

Your version of the algorithm makes a test case of mine (UK land
registry test case [1]) go from having an index that's 1101 MB with my
version of the algorithm/patch and 1329 MB on the master branch to an
index that's 3030 MB in size. I think that this happens because it
effectively fails to give any consideration to delta at all at certain
points. On leaf pages with lots of unique keys, your algorithm does
about as well as mine because all possible split points look equally
good suffix-truncation-wise, plus you have the "good enough" test, so
delta isn't ignored. I think that your algorithm also works well when
there are many duplicates but only one non-TID index column, since the
heap TID truncation versus other truncation issue does not arise. The
test case I used is an index on "(county, city, locality)", though --
low cardinality, but more than a single column. That's a *combination*
of two separate considerations that seem to get conflated. I don't
think that you can avoid doing "a second pass" in some sense, because
these really are separate considerations.

There is an important middle-ground that your algorithm fails to
handle with this test case. You end up maximizing the number of
attributes that are truncated when you shouldn't -- leaf page splits
are totally unbalanced much of the time. Pivot tuples are smaller on
average, but are also far far more numerous, because there are more
leaf page splits as a result of earlier leaf page splits being
unbalanced. If instead you treated heap TID truncation as the only
thing that you were willing to go to huge lengths to prevent, then
unbalanced splits become *self-limiting*. The next split will probably
end up being a single value mode split, which packs pages full of
duplicates at tightly as possible.

Splits should "degrade gracefully" from default mode to many
duplicates mode to single value mode in cases where the number of
distinct values is constant (or almost constant), but the total number
of tuples grows over time.

> I've only been testing this on leaf splits. I didn't understand how the
> penalty worked for internal pages in your patch. In this version, the
> same algorithm is used for leaf and internal pages.

The approach that I use for internal pages is only slightly different
to what we've always done -- I split very near the delta-wise optimal
point, with a slight preference for a tuple that happens to be
smaller. And, there is no way in which the delta-optimal point can be
different to what it would have been on master with internal pages
(they only use default mode). I don't think it's appropriate to use
the same algorithm for leaf and internal page splits at all. We cannot
perform suffix truncation on internal pages.

> What have you been using to test this? I wrote the attached little test
> extension, to explore what _bt_findsplitloc() decides in different
> scenarios.

I've specifically tested the _bt_findsplitloc() stuff using a couple
of different techniques. Primarily, I've been using lots of real world
data and TPC benchmark test data, with expected/test output generated
by a contrib/pageinspect query that determines the exact number of
leaf blocks and internal page blocks from each index in a test
database. Just bash and SQL. I'm happy to share that with you, if
you're able to accept a couple of gigabytes worth of dumps that are
needed to make the scripts work. Details:

pg@bat:~/hdd/sample-data$ ll land_registry.custom.dump
-rw------- 1 pg pg 1.1G Mar  3  2018 land_registry.custom.dump
pg@bat:~/hdd/sample-data$ ll tpcc_2018-07-20_unlogged.dump
-rw-rw-r-- 1 pg pg 1.8G Jul 20  2018 tpcc_2018-07-20_unlogged.dump

(The only other components for these "fast" tests are simple bash scripts.)

I think that you'd find it a lot easier to work with me on these
issues if you at least had these tests -- my understanding of the
problems was shaped by the tests. I strongly recommend that you try
out my UK land registry test and the TPC-C test as a way of
understanding the design I've used for _bt_findsplitloc(). It
shouldn't be that inconvenient to get it over to you. I have several
more tests besides these two, but they're much more cumbersome and
much less valuable. I have a script that I can run in 5 minutes that
probably catches all the regressions. The long running stuff, like my
TPC-E test case (the stuff that I won't bother sending) hasn't caught
any regressions that the fast tests didn't catch as well.

Separately, I also have a .gdbinit function that looks like this:

define dump_page
  dump binary memory /tmp/gdb_postgres_page.dump $arg0 ($arg0 + 8192)
  echo Invoking pg_hexedit + wxHexEditor on page...\n
  ! ~/code/pg_hexedit/pg_hexedit -n 1 /tmp/gdb_postgres_page.dump > /tmp/gdb_postgres_page.dump.tags
  ! ~/code/wxHexEditor/wxHexEditor /tmp/gdb_postgres_page.dump &> /dev/null
end

This allows me to see an arbitrary page from an interactive gdb
session using my pg_hexedit tool. I can simply "dump_page page" from
most functions in the nbtree source code. At various points I found it
useful to add optimistic assertions to the split point choosing
routines that failed. I could then see why they failed by using gdb
with the resulting core dump. I could look at the page image using
pg_hexedit/wxHexEditor from there. This allowed me to understand one
or two corner cases. For example, this is how I figured out the exact
details at the end of _bt_perfect_penalty(), when it looks like we're
about to go into a second pass of the page.

> It's pretty rough, but that's what I've been using while
> hacking on this. It prints output like this:

Cool! I did have something that would LOG the new high key in an easy
to interpret way at one point, which was a little like this.

[1] https://postgr.es/m/CAH2-Wzn5XbCzk6u0GL+uPnCp1tbrp2pJHJ=3bYT4yQ0_zzHxmw@mail.gmail.com
--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Mon, Jan 28, 2019 at 1:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Thanks for going to the trouble of implementing what you have in mind!
>
> I agree that the code that I wrote within nbtsplitloc.c is very
> subtle, and I do think that I have further work to do to make its
> design clearer. I think that this high level description of the goals
> of the algorithm is inaccurate in subtle but important ways, though.
> Hopefully there will be a way of making it more understandable that
> preserves certain important characteristics.

Heikki and I had the opportunity to talk about this recently. We found
an easy way forward. I believe that the nbtsplitloc.c algorithm itself
is fine -- the code will need to be refactored, though.

nbtsplitloc.c can be refactored to assemble a list of legal split
points up front, before deciding which one to go with in a separate
pass (using one of two "alternative modes", as before). I now
understand that Heikki simply wants to separate the questions of "Is
this candidate split point legal?" from "Is this known-legal candidate
split point good/ideal based on my current criteria?". This seems like
a worthwhile goal to me. Heikki accepts the need for multiple
modes/passes, provided recursion isn't used in the implementation.

It's clear to me that the algorithm should start off trying to split
towards the middle of the page (or towards the end in the rightmost
case), while possibly making a small compromise on the exact split
point to maximize the effectiveness of suffix truncation. We must
change strategy entirely if and only if the middle of the page (or
wherever we'd like to split initially) is found to be completely full
of duplicates -- that's where the need for a second pass comes in.
This should almost never happen in most applications. Even when it
happens, we only care about not splitting inside a group of
duplicates. That's not the same thing as caring about maximizing the
number of attributes truncated away. Those two things seem similar,
but are actually very different.

It might have sounded like Heikki and I disagreed on the design of the
algorithm at a high level, or what its goals ought to be. That is not
the case, though. (At least not so far.)

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Tue, Feb 5, 2019 at 4:50 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Heikki and I had the opportunity to talk about this recently. We found
> an easy way forward. I believe that the nbtsplitloc.c algorithm itself
> is fine -- the code will need to be refactored, though.

Attached v12 does not include this change, though I have every
intention of doing the refactoring described for v13. The
nbtsplitloc.c/split algorithm refactoring would necessitate
revalidating the patch's performance, though, which didn't seem worth
blocking on. Besides, there was bit rot that needed to be fixed.

Notable improvements in v12:

* No more papering-over regression test differences caused by
pg_depend issues, thanks to recent work by Tom (today's commit
1d92a0c9).

* I simplified the code added to _bt_binsrch() to deal with saving and
restoring binary search bounds for _bt_check_unique()-caller
insertions (this is from the first patch, "Refactor nbtree insertion
scankeys"). I also improved matters within _bt_check_unique() itself: the
early "break" there (based on reaching the known strict upper bound
from the cached binary search) works in terms of the existing
_bt_check_unique() loop invariant.

This even allowed me to add a new assertion that makes sure that
breaking out of the loop early is correct -- we call _bt_isequal() for
the next item on assert-enabled builds when we break having reached the
strict upper bound established by the initial binary search. In other words,
_bt_check_unique() ends up doing the same number of _bt_isequal()
calls as it did on the master branch, provided assertions are enabled.

* I've restored regression test coverage that the patch previously
inadvertently took away. Suffix truncation made deliberately-tall
B-Tree indexes from the regression tests much shorter, making the
tests fail to test the code paths the tests originally targeted. I
needed to find ways to "defeat" suffix truncation so I still ended up
with a fairly tall tree that hit various code paths.

I think that we went from having 8 levels in btree_tall_idx (i.e.
ridiculously many) to having only a single root page when I first
caught the problem! Now btree_tall_idx only has 3 levels, which is all
we really need. Even multi-level page deletion didn't have any
coverage in previous versions. I used gcov to specifically verify that
we have good multi-level page deletion coverage. I also used gcov to
make sure that we have coverage of the v11 "cache rightmost block"
optimization, since I noticed that that was missing (though present on
the master branch) -- that's actually all that the btree_tall_idx
tests in the patch, since multi-level page deletion is covered by a
covering-indexes-era test. Finally, I made sure that we have coverage
of fast root splits. In general, I preserved the original intent
behind the existing tests, all of which I was fairly familiar with
from previous projects.

* I've added a new "relocate" bt_index_parent_check()/amcheck option,
broken out in a separate commit. This new option makes verification
relocate each and every leaf page tuple, starting from the root each
time. This means that there will be at least one piece of code that
specifically relies on "every tuple should have a unique key" from the
start, which seems like a good idea.

This enhancement to amcheck allows me to detect various forms of
corruption that no other existing verification option would catch. In
particular, I can catch various very subtle "cross-cousin
inconsistencies" that require that we verify a page using its
grandparent rather than its parent [1] (existing checks catch some but
not all "cousin problem" corruption). Simply put, this amcheck
enhancement allows me to detect corruption of the least significant
byte in a key value in the root page -- that kind of corruption will
cause index scans to miss only a small number of tuples at the leaf
level. Maybe this scenario isn't realistic, but I'd rather not take
any chances.

* I rethought the "single value mode" fillfactor, which I've been
suspicious of for a while now. It's now 96, down from 99.

Micro-benchmarks involving concurrent sessions inserting into a low
cardinality index led me to the conclusion that 99 was aggressively
high. It was not that hard to get excessive page splits with these
microbenchmarks, since insertions with monotonically increasing heap
TIDs arrived a bit out of order with a lot of concurrency. 99 worked a
bit better than 96 with only one session, but significantly worse with
concurrent sessions. I still think that it's a good idea to be more
aggressive than default leaf fillfactor, but reducing "single value
mode" fillfactor to 90 (or whatever the user set general leaf
fillfactor to) wouldn't be so bad.

[1] http://subs.emis.de/LNI/Proceedings/Proceedings144/32.pdf
--
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Mon, Feb 11, 2019 at 12:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Notable improvements in v12:

I've been benchmarking v12, once again using a slightly modified
BenchmarkSQL that doesn't do up-front CREATE INDEX builds [1], since
the problems with index bloat don't take so long to manifest
themselves when the indexes are inserted into incrementally from the
very beginning. This benchmarking process took over 20 hours, with a
database that started off at about 90GB (700 TPC-C/BenchmarkSQL
warehouses were used). That easily exceeded available main memory on
my test server, which was 32GB. This is a pretty I/O bound workload,
and a fairly write-heavy one at that. I used a Samsung 970 PRO 512GB,
NVMe PCIe M.2 2280 SSD for both pg_wal and the default and only
tablespace.

Importantly, I figured out that I should disable both hash joins and
merge joins with BenchmarkSQL, in order to force all joins to be
nested loop joins. Otherwise, the "stock level" transaction eventually
starts to use a hash join, even though that's about 10x slower than a
nestloop join (~4ms vs. ~40ms on this machine) -- the hash join
produces a lot of noise without really testing anything. It usually
takes a couple of hours before we start to get obviously-bad plans,
but it also usually takes about that long until the patch series
starts to noticeably overtake the master branch. I don't think that
TPC-C will ever benefit from using a hash join or a merge join, since
it's supposed to be a pure OLTP benchmark, and is a benchmark that
MySQL is known to do at least respectably-well on.

This is the first benchmark I've published that was considerably I/O
bound. There are significant improvements in performance across the
board, on every measure, though it takes several hours for that to
really show. The benchmark was not rate-limited. 16
clients/"terminals" are used throughout. There were 5 runs for master
and 5 for patch, interlaced, each lasting 2 hours. Initialization
occurred once, so it's expected that both databases will gradually get
larger across runs.

Summary (appears in same order as the execution of each run) -- each
run is 2 hours, so 20 hours total excluding initial load time (2 hours
* 5 runs for master + 2 hours * 5 runs for patch):

Run 1 -- master: Measured tpmTOTAL = 90063.79, Measured tpmC
(NewOrders) = 39172.37
Run 1 -- patch: Measured tpmTOTAL = 90922.63, Measured tpmC
(NewOrders) = 39530.2

Run 2 -- master: Measured tpmTOTAL = 77091.63, Measured tpmC
(NewOrders) = 33530.66
Run 2 -- patch: Measured tpmTOTAL = 83905.48, Measured tpmC
(NewOrders) = 36508.38    <-- 8.8% increase in tpmTOTAL/throughput

Run 3 -- master: Measured tpmTOTAL = 71224.25, Measured tpmC
(NewOrders) = 30949.24
Run 3 -- patch: Measured tpmTOTAL = 78268.29, Measured tpmC
(NewOrders) = 34021.98   <-- 9.8% increase in tpmTOTAL/throughput

Run 4 -- master: Measured tpmTOTAL = 71671.96, Measured tpmC
(NewOrders) = 31163.29
Run 4 -- patch: Measured tpmTOTAL = 73097.42, Measured tpmC
(NewOrders) = 31793.99

Run 5 -- master: Measured tpmTOTAL = 66503.38, Measured tpmC
(NewOrders) = 28908.8
Run 5 -- patch: Measured tpmTOTAL = 71072.3, Measured tpmC (NewOrders)
= 30885.56  <-- 6.9% increase in tpmTOTAL/throughput

There were *also* significant reductions in transaction latency for
the patch -- see the full html reports in the provided tar archive for
full details (URL provided below). The html reports have nice SVG
graphs, generated by BenchmarkSQL using R -- one for transaction
throughput, and another for transaction latency. The overall picture
is that the patched version starts out ahead, and has a much more
gradual decline as the database becomes larger and more bloated.

Note also that the statistics collector stats show a *big* reduction
in blocks read into shared_buffers for the duration of these runs. For
example, here is what pg_stat_database shows for run 3 (I reset the
stats between runs):

master: blks_read = 78,412,640, blks_hit = 4,022,619,556
patch: blks_read = 70,033,583, blks_hit = 4,505,308,517  <-- 10.7%
reduction in blks_read/logical I/O

This suggests an indirect benefit, likely related to how buffers are
evicted in each case. pg_stat_bgwriter indicates that more buffers are
written out during checkpoints, while fewer are written out by
backends. I won't speculate further on what all of this means right
now, though.

You can find the raw details for blks_read for each and every run in
the full tar archive. It is available for download from:

https://drive.google.com/file/d/1kN4fDmh1a9jtOj8URPrnGYAmuMPmcZax/view?usp=sharing

There are also dumps of the other pg_stat* views at the end of each
run, logs for each run, etc. There's more information than anybody
else is likely to find interesting.

If anyone needs help in recreating this benchmark, then I'd be happy
to assist in that. There is a shell script (zsh) included in the tar
archive, although that will need to be changed a bit to point to the
correct installations and so on. Independent validation of the
performance of the patch series on this and other benchmarks is very
welcome.

[1] https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations
--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I spent some time first trying to understand the current algorithm, and
> then rewriting it in a way that I find easier to understand. I came up
> with the attached. I think it optimizes for the same goals as your
> patch, but the approach  is quite different.

Attached is v13 of the patch series, which significantly refactors
nbtsplitloc.c to implement the algorithm using the approach from your
prototype posted on January 28 -- I now take a "top down" approach
that materializes all legal split points up-front, as opposed to the
initial "bottom up" approach that used recursion, and weighed
everything (balance of free space, suffix truncation, etc) all at
once. Some of the code is directly lifted from your prototype, so
there is now a question about whether or not you should be listed as a
co-author. (I think that you should be credited as a secondary author
of the nbtsplitloc.c patch, and as a secondary author in the release
notes for the feature as a whole, which I imagine will be rolled into
one item there.)

I always knew that a "top down" approach would be simpler, but I
underestimated how much better it would be overall, and how manageable
the downsides are -- the added cycles are not actually noticeable when
compared to the master branch, even during microbenchmarks. Thanks for
suggesting this approach!

I don't even need to simulate recursion with a loop or a goto;
everything is structured as a linear series of steps now. There are
still the same modes as before, though; the algorithm is essentially
unchanged. All of my tests show that it's at least as effective as v12
was in terms of how effective the final _bt_findsplitloc() results are
in reducing index size. The new approach will make more sophisticated
suffix truncation costing much easier to implement in a future
release, when suffix truncation is taught to truncate *within*
individual datums/attributes (e.g. generate the text string "new m"
given a split point between "new jersey" and "new york", by using some
new opclass infrastructure). "Top down" also makes the implementation
of the "split after new item" optimization safer and simpler -- we
already have all split points conveniently available, so we can seek
out an exact match instead of interpolating where it ought to appear
later using a dynamic fillfactor. We can back out of the "split after
new item" optimization in the event of the *precise* split point we
want to use not being legal. That shouldn't be necessary, and isn't
necessary in practice, but it seems like a good idea be defensive with
something so delicate as this.

I'm using qsort() to sort the candidate split points array. I'm not
trying to do something clever to avoid the up-front effort of sorting
everything, even though we could probably get away with that much of
the time (e.g. by doing a top-N sort in default mode). Testing has
shown that using an inlined qsort() routine in the style of
tuplesort.c would make my serial test cases/microbenchmarks faster,
without adding much complexity. We're already competitive with the
master branch even without this microoptimization, so I've put that
off for now. What do you think of the idea of specializing an
inlineable qsort() for nbtsplitloc.c?
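
For reference, the plain qsort() arrangement looks roughly like this
(sketch only; the field and struct names are illustrative, not lifted
from the patch):

static int
splitpoint_delta_cmp(const void *arg1, const void *arg2)
{
    const SplitPoint *split1 = (const SplitPoint *) arg1;
    const SplitPoint *split2 = (const SplitPoint *) arg2;

    /* Sort candidate split points so the smallest delta comes first */
    if (split1->curdelta < split2->curdelta)
        return -1;
    if (split1->curdelta > split2->curdelta)
        return 1;
    return 0;
}

...

qsort(state->splits, state->nsplits, sizeof(SplitPoint),
      splitpoint_delta_cmp);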

Performance is at least as good as v12 with a more relevant workload,
such as BenchmarkSQL. Transaction throughput is 5% - 10% greater in my
most recent tests (benchmarks for v13 specifically).

Unlike in your prototype, v13 makes the array for holding candidate
split points into a single big allocation that is always exactly
BLCKSZ. The idea is that palloc() can thereby recycle the big
_bt_findsplitloc() allocation within _bt_split(). palloc() considers
8KiB to be the upper limit on the size of individual blocks it
manages, and we'll always go on to palloc(BLCKSZ) through the
_bt_split() call to PageGetTempPage(). In a sense, we're not even
allocating memory that we weren't allocating already. (Not sure that
this really matters, but it is easy to do it that way.)

Other changes from your prototype:

*  I found a more efficient representation than a pair of raw
IndexTuple pointers for each candidate split. Actually, I use the same
old representation (firstoldonright + newitemonleft) in each split,
and provide routines to work backwards from that to get the lastleft
and firstright tuples. This approach is far more space efficient, and
space efficiency matters when you're allocating space for hundreds of
items in a critical path like this.

* You seemed to refactor _bt_checksplitloc() in passing, making it not
do the newitemisfirstonright thing. I changed that back. Did I miss
something that you intended here?

* Fixed a bug in the loop that adds split points. Your refactoring
made the main loop responsible for new item space handling, as just
mentioned, but it didn't create a split where the new item is first on
the page, and the split puts the new item on the left page on its own,
with all existing items on the new right page. I couldn't prove that
this caused failures to find a legal split, but it still seemed like a
bug.

In general, I think that we should generate our initial list of split
points in exactly the same manner as we do so already. The only
difference as far as split legality/feasibility goes is that we
pessimistically assume that suffix truncation will have to add a heap
TID in all cases. I don't see any advantage to going further than
that.

Changes to my own code since v12:

* Simplified "Add "split after new tuple" optimization" commit, and
made it more consistent with associated code. This is something that
was made a lot easier by the new approach. It would be great to hear
what you think of this.

* Removed subtly wrong assertion in nbtpage.c, concerning VACUUM's
page deletion. Even a page that is about to be deleted can be filled
up again and split when we release and reacquire a lock, however
unlikely that may be.

* Rename _bt_checksplitloc() to _bt_recordsplit(). I think that it
makes more sense to make that about recording a split point, rather
than about making sure a split point is legal. It still does that, but
perhaps 99%+ of calls to _bt_recordsplit()/_bt_checksplitloc() result
in the split being deemed legal, so the new name makes much more
sense.

--
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 26/02/2019 12:31, Peter Geoghegan wrote:
> On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I spent some time first trying to understand the current algorithm, and
>> then rewriting it in a way that I find easier to understand. I came up
>> with the attached. I think it optimizes for the same goals as your
>> patch, but the approach is quite different.
> 
> Attached is v13 of the patch series, which significantly refactors
> nbtsplitloc.c to implement the algorithm using the approach from your
> prototype posted on January 28 -- I now take a "top down" approach
> that materializes all legal split points up-front, as opposed to the
> initial "bottom up" approach that used recursion, and weighed
> everything (balance of free space, suffix truncation, etc) all at
> once.

Great, looks much simpler now, indeed! Now I finally understand the 
algorithm.

> I'm using qsort() to sort the candidate split points array. I'm not
> trying to do something clever to avoid the up-front effort of sorting
> everything, even though we could probably get away with that much of
> the time (e.g. by doing a top-N sort in default mode). Testing has
> shown that using an inlined qsort() routine in the style of
> tuplesort.c would make my serial test cases/microbenchmarks faster,
> without adding much complexity. We're already competitive with the
> master branch even without this microoptimization, so I've put that
> off for now. What do you think of the idea of specializing an
> inlineable qsort() for nbtsplitloc.c?

If the performance is acceptable without it, let's not bother. We can 
optimize later.

What would be the worst case scenario for this? Splitting a page that 
has as many tuples as possible, I guess, so maybe inserting into a table 
with a single-column index, with 32k BLCKSZ. Have you done performance 
testing on something like that?

> Unlike in your prototype, v13 makes the array for holding candidate
> split points into a single big allocation that is always exactly
> BLCKSZ. The idea is that palloc() can thereby recycle the big
> _bt_findsplitloc() allocation within _bt_split(). palloc() considers
> 8KiB to be the upper limit on the size of individual blocks it
> manages, and we'll always go on to palloc(BLCKSZ) through the
> _bt_split() call to PageGetTempPage(). In a sense, we're not even
> allocating memory that we weren't allocating already. (Not sure that
> this really matters, but it is easy to do it that way.)

Rounding up the allocation to BLCKSZ seems like a premature 
optimization. Even if it saved some cycles, I don't think it's worth the 
trouble of having to explain all that in the comment.

I think you could change the curdelta, leftfree, and rightfree fields in 
SplitPoint to int16, to make the array smaller.
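
Something like this, I mean (sketch of the suggestion only; the
firstoldonright/newitemonleft representation is from your description,
the rest is illustrative):

typedef struct SplitPoint
{
    OffsetNumber firstoldonright;   /* first pre-existing item on right page */
    bool         newitemonleft;     /* does the new item go on the left? */
    int16        curdelta;          /* current delta for this split */
    int16        leftfree;          /* space left free on left page */
    int16        rightfree;         /* space left free on right page */
} SplitPoint;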

> Other changes from your prototype:
> 
> *  I found a more efficient representation than a pair of raw
> IndexTuple pointers for each candidate split. Actually, I use the same
> old representation (firstoldonright + newitemonleft) in each split,
> and provide routines to work backwards from that to get the lastleft
> and firstright tuples. This approach is far more space efficient, and
> space efficiency matters when you're allocating space for hundreds of
> items in a critical path like this.

Ok.

> * You seemed to refactor _bt_checksplitloc() in passing, making it not
> do the newitemisfirstonright thing. I changed that back. Did I miss
> something that you intended here?

My patch treated the new item the same as all the old items, in 
_bt_checksplitloc(), so it didn't need newitemisonright. You still need 
it with your approach.

> Changes to my own code since v12:
> 
> * Simplified "Add "split after new tuple" optimization" commit, and
> made it more consistent with associated code. This is something that
> was made a lot easier by the new approach. It would be great to hear
> what you think of this.

I looked at it very briefly. Yeah, it's pretty simple now. Nice!


About this comment on _bt_findsplitloc():

>/*
> *    _bt_findsplitloc() -- find an appropriate place to split a page.
> *
> * The main goal here is to equalize the free space that will be on each
> * split page, *after accounting for the inserted tuple*.  (If we fail to
> * account for it, we might find ourselves with too little room on the page
> * that it needs to go into!)
> *
> * If the page is the rightmost page on its level, we instead try to arrange
> * to leave the left split page fillfactor% full.  In this way, when we are
> * inserting successively increasing keys (consider sequences, timestamps,
> * etc) we will end up with a tree whose pages are about fillfactor% full,
> * instead of the 50% full result that we'd get without this special case.
> * This is the same as nbtsort.c produces for a newly-created tree.  Note
> * that leaf and nonleaf pages use different fillfactors.
> *
> * We are passed the intended insert position of the new tuple, expressed as
> * the offsetnumber of the tuple it must go in front of (this could be
> * maxoff+1 if the tuple is to go at the end).  The new tuple itself is also
> * passed, since it's needed to give some weight to how effective suffix
> * truncation will be.  The implementation picks the split point that
> * maximizes the effectiveness of suffix truncation from a small list of
> * alternative candidate split points that leave each side of the split with
> * about the same share of free space.  Suffix truncation is secondary to
> * equalizing free space, except in cases with large numbers of duplicates.
> * Note that it is always assumed that caller goes on to perform truncation,
> * even with pg_upgrade'd indexes where that isn't actually the case
> * (!heapkeyspace indexes).  See nbtree/README for more information about
> * suffix truncation.
> *
> * We return the index of the first existing tuple that should go on the
> * righthand page, plus a boolean indicating whether the new tuple goes on
> * the left or right page.  The bool is necessary to disambiguate the case
> * where firstright == newitemoff.
> *
> * The high key for the left page is formed using the first item on the
> * right page, which may seem to be contrary to Lehman & Yao's approach of
> * using the left page's last item as its new high key on the leaf level.
> * It isn't, though: suffix truncation will leave the left page's high key
> * fully equal to the last item on the left page when two tuples with equal
> * key values (excluding heap TID) enclose the split point.  It isn't
> * necessary for a new leaf high key to be equal to the last item on the
> * left for the L&Y "subtree" invariant to hold.  It's sufficient to make
> * sure that the new leaf high key is strictly less than the first item on
> * the right leaf page, and greater than the last item on the left page.
> * When suffix truncation isn't possible, L&Y's exact approach to leaf
> * splits is taken (actually, a tuple with all the keys from firstright but
> * the heap TID from lastleft is formed, so as to not introduce a special
> * case).
> *
> * Starting with the first right item minimizes the divergence between leaf
> * and internal splits when checking if a candidate split point is legal.
> * It is also inherently necessary for suffix truncation, since truncation
> * is a subtractive process that specifically requires lastleft and
> * firstright inputs.
> */

This is pretty good, but I think some copy-editing can make this even 
better. If you look at the old comment, it had this structure:

1. Explain what the function does.
2. Explain the arguments
3. Explain the return value.

The additions to this comment broke the structure. The explanations of 
argument and return value are now in the middle, in 3rd and 4th 
paragraphs. And the 3rd paragraph that explains the arguments, now also 
goes into detail on what the function does with it. I'd suggest moving 
things around to restore the old structure, that was more clear.

The explanation of how the high key for the left page is formed (5th
paragraph), seems out-of-place here, because the high key is not formed 
here.

Somewhere in the 1st or 2nd paragraph, perhaps we should mention that 
the function effectively uses a different fillfactor in some other 
scenarios too, not only when it's the rightmost page.

In the function itself:

>      * maxsplits should never exceed maxoff because there will be at most as
>      * many candidate split points as there are points _between_ tuples, once
>      * you imagine that the new item is already on the original page (the
>      * final number of splits may be slightly lower because not all splits
>      * will be legal).  Even still, add space for an extra two splits out of
>      * sheer paranoia.
>      */
>     state.maxsplits = maxoff + 2;
>     state.splits = palloc(Max(BLCKSZ, sizeof(SplitPoint) * state.maxsplits));
>     state.nsplits = 0;

I wouldn't be that paranoid. The code that populates the array is pretty 
straightforward.

>     /*
>      * Scan through the data items and calculate space usage for a split at
>      * each possible position.  We start at the first data offset rather than
>      * the second data offset to handle the "newitemoff == first data offset"
>      * case (otherwise, a split whose firstoldonright is the first data offset
>      * can't be legal, and won't actually end up being recorded by
>      * _bt_recordsplit).
>      *
>      * Still, it's typical for almost all calls to _bt_recordsplit to
>      * determine that the split is legal, and therefore enter it into the
>      * candidate split point array for later consideration.
>      */

Suggestion: Remove the "Still" word. The observation that typically all 
split points are legal is valid, but it seems unrelated to the first 
paragraph. (Do we need to mention it at all?)

>    /*
>     * If the new item goes as the last item, record the split point that
>     * leaves all the old items on the left page, and the new item on the
>     * right page.  This is required because a split that leaves the new item
>     * as the firstoldonright won't have been reached within the loop.  We
>     * always record every possible split point.
>     */

Suggestion: Remove the last sentence. An earlier comment already said 
that we calculate space usage for a split at each possible position, 
that seems sufficient. Like it was before this patch.

>    /*
>     * Find lowest possible penalty among split points currently regarded as
>     * acceptable -- the "perfect" penalty.  The perfect penalty often saves
>     * _bt_bestsplitloc() additional work around calculating penalties.  This
>     * is also a convenient point to determine if default mode worked out, or
>     * if we should instead reassess which split points should be considered
>     * acceptable (split interval, and possibly fillfactormult).
>     */
>    perfectpenalty = _bt_perfect_penalty(rel, page, &state, newitemoff,
>                                         newitem, &secondmode);

ISTM that figuring out which "mode" we want to operate in is actually 
the *primary* purpose of _bt_perfect_penalty(). We only really use the 
penalty as an optimization that we pass on to _bt_bestsplitloc(). So I'd 
suggest changing the function name to something like _bt_choose_mode(), 
and have secondmode be the primary return value from it, with 
perfectpenalty being the secondary result through a pointer.

It doesn't really choose the mode, either, though. At least after the 
next "Add split after new tuple optimization" patch. The caller has a 
big part in choosing what to do. So maybe split _bt_perfect_penalty into 
two functions: _bt_perfect_penalty(), which just computes the perfect 
penalty, and _bt_analyze_split_interval(), which determines how many 
duplicates there are in the top-N split points.

BTW, I like the word "strategy", like you called it in the comment on 
SplitPoint struct, better than "mode".

> +        if (usemult)
> +            delta = fillfactormult * split->leftfree -
> +                (1.0 - fillfactormult) * split->rightfree;
> +        else
> +            delta = split->leftfree - split->rightfree;
> 

How about removing the "usemult" variable, and just check if 
fillfactormult == 0.5?
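
In other words, something along these lines (just a sketch of the suggestion; 0.5 is exactly representable, so the equality test is safe):

/* fillfactormult == 0.5 means "split space evenly", so no flag is needed */
if (fillfactormult != 0.5)
    delta = fillfactormult * split->leftfree -
        (1.0 - fillfactormult) * split->rightfree;
else
    delta = split->leftfree - split->rightfree;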

>     /*
>      * There are a much smaller number of candidate split points when
>      * splitting an internal page, so we can afford to be exhaustive.  Only
>      * give up when pivot that will be inserted into parent is as small as
>      * possible.
>      */
>     if (!state->is_leaf)
>         return MAXALIGN(sizeof(IndexTupleData) + 1);

Why are there fewer candidate split points on an internal page?

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
Some comments on 
v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below. 
Mostly about code comments. In general, I think a round of copy-editing 
the comments, to use simpler language, would do good. The actual code 
changes look good to me.

> /*
>  *    _bt_findinsertloc() -- Finds an insert location for a tuple
>  *
>  *        On entry, *bufptr contains the page that the new tuple unambiguously
>  *        belongs on.  This may not be quite right for callers that just called
>  *        _bt_check_unique(), though, since they won't have initially searched
>  *        using a scantid.  They'll have to insert into a page somewhere to the
>  *        right in rare cases where there are many physical duplicates in a
>  *        unique index, and their scantid directs us to some page full of
>  *        duplicates to the right, where the new tuple must go.  (Actually,
>  *        since !heapkeyspace pg_upgraded'd non-unique indexes never get a
>  *        scantid, they too may require that we move right.  We treat them
>  *        somewhat like unique indexes.)

Seems confusing to first say assertively that "*bufptr contains the page 
that the new tuple unambiguously belongs to", and then immediately go on 
to list a whole bunch of exceptions. Maybe just remove "unambiguously".

> @@ -759,7 +787,10 @@ _bt_findinsertloc(Relation rel,
>               * If this page was incompletely split, finish the split now. We
>               * do this while holding a lock on the left sibling, which is not
>               * good because finishing the split could be a fairly lengthy
> -             * operation.  But this should happen very seldom.
> +             * operation.  But this should only happen when inserting into a
> +             * unique index that has more than an entire page for duplicates
> +             * of the value being inserted.  (!heapkeyspace non-unique indexes
> +             * are an exception, once again.)
>               */
>              if (P_INCOMPLETE_SPLIT(lpageop))
>              {

This happens very seldom, because you only get an incomplete split if 
you crash in the middle of a page split, and that should be very rare. I 
don't think we need to list more fine-grained conditions here, that just 
confuses the reader.

> /*
>  *    _bt_useduplicatepage() -- Settle for this page of duplicates?
>  *
>  *        Prior to PostgreSQL 12/Btree version 4, heap TID was never treated
>  *        as a part of the keyspace.  If there were many tuples of the same
>  *        value spanning more than one leaf page, a new tuple of that same
>  *        value could legally be placed on any one of the pages.
>  *
>  *        This function handles the question of whether or not an insertion
>  *        of a duplicate into a pg_upgrade'd !heapkeyspace index should
>  *        insert on the page contained in buf when a choice must be made.
>  *        Preemptive microvacuuming is performed here when that could allow
>  *        caller to insert on to the page in buf.
>  *
>  *        Returns true if caller should proceed with insert on buf's page.
>  *        Otherwise, caller should move on to the page to the right (caller
>  *        must always be able to still move right following call here).
>  */

So, this function is only used for legacy pg_upgraded indexes. The 
comment implies that, but doesn't actually say it.

> /*
>  * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
>  * and non-pivot tuples, despite differences in how heap TID is represented.
>  *
>  * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
>  * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
>  * (since all pivot tuples must as of BTREE_VERSION 4).  When non-pivot
>  * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
>  * probably also contain a heap TID at the end of the tuple.  We currently
>  * assume that a tuple with INDEX_ALT_TID_MASK set is a pivot tuple within
>  * heapkeyspace indexes (and that a tuple without it set must be a non-pivot
>  * tuple), but it might also be used by non-pivot tuples in the future.
>  * pg_upgrade'd !heapkeyspace indexes only set INDEX_ALT_TID_MASK in pivot
>  * tuples that actually originated with the truncation of one or more
>  * attributes.
>  */
> #define BTreeTupleGetHeapTID(itup) ...

The comment claims that "all pivot tuples must be as of BTREE_VERSION 
4". I thought that all internal tuples are called pivot tuples, even on 
version 3. I think what this means to say is that this macro is only 
used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only 
have a heap TID in BTREE_VERSION 4 indexes.

This macro (and many others in nbtree.h) is quite complicated. A static 
inline function might be easier to read.
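
For example, a static inline version of BTreeTupleGetHeapTID() might look roughly like this. It is only a sketch based on the rules described in the quoted comment, not the patch's exact code, and it assumes the patch's INDEX_ALT_TID_MASK and BT_HEAP_TID_ATTR flag bits:

static inline ItemPointer
BTreeTupleGetHeapTID(IndexTuple itup)
{
    if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
    {
        /* Non-pivot tuple: t_tid points straight at the heap */
        return &itup->t_tid;
    }

    /*
     * Pivot tuple: a heap TID is only present if the flag bit in the
     * offset-number field says so, in which case it is stored at the
     * very end of the tuple.
     */
    if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_HEAP_TID_ATTR) != 0)
        return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
                              sizeof(ItemPointerData));

    return NULL;                /* heap TID truncated away, or never stored */
}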

> @@ -1114,6 +1151,8 @@ _bt_insertonpg(Relation rel,
>  
>              if (BufferIsValid(metabuf))
>              {
> +                Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
> +                xlmeta.version = metad->btm_root;
>                  xlmeta.root = metad->btm_root;
>                  xlmeta.level = metad->btm_level;
>                  xlmeta.fastroot = metad->btm_fastroot;

'xlmeta.version' is set incorrectly.

> /*
>  * Btree version 4 (used by indexes initialized by PostgreSQL v12) made
>  * general changes to the on-disk representation to add support for
>  * heapkeyspace semantics, necessitating a REINDEX to get heapkeyspace
>  * semantics in pg_upgrade scenarios.  We continue to offer support for
>  * BTREE_MIN_VERSION in order to support upgrades from PostgreSQL versions
>  * up to and including v10 to v12+ without requiring a REINDEX.
>  * Similarly, we continue to offer support for BTREE_NOVAC_VERSION to
>  * support upgrades from v11 to v12+ without requiring a REINDEX.
>  *
>  * We maintain PostgreSQL v11's ability to upgrade from BTREE_MIN_VERSION
>  * to BTREE_NOVAC_VERSION automatically.  v11's "no vacuuming" enhancement
>  * (the ability to skip full index scans during vacuuming) only requires
>  * two new metapage fields, which makes it possible to upgrade at any
>  * point that the metapage must be updated anyway (e.g. during a root page
>  * split).  Note also that there happened to be no changes in metapage
>  * layout for btree version 4.  All current metapage fields should have
>  * valid values set when a metapage WAL record is replayed.
>  *
>  * It's convenient to consider the "no vacuuming" enhancement (metapage
>  * layout compatibility) separately from heapkeyspace semantics, since
>  * each issue affects different areas.  This is just a convention; in
>  * practice a heapkeyspace index is always also a "no vacuuming" index.
>  */
> #define BTREE_METAPAGE  0               /* first page is meta */
> #define BTREE_MAGIC             0x053162        /* magic number of btree pages */
> #define BTREE_VERSION   4               /* current version number */
> #define BTREE_MIN_VERSION       2       /* minimal supported version number */
> #define BTREE_NOVAC_VERSION     3       /* minimal version with all meta fields */

I find this comment difficult to read. I suggest rewriting it to:

/*
  * The current Btree version is 4. That's what you'll get when you create
  * a new index.
  *
  * Btree version 3 was used in PostgreSQL v11. It is mostly the same as
  * version 4, but heap TIDs were not part of the keyspace. Index tuples
  * with duplicate keys could be stored in any order. We continue to
  * support reading and writing Btree version 3, so that they don't need
  * to be immediately re-indexed at pg_upgrade. In order to get the new
  * heapkeyspace semantics, however, a REINDEX is needed.
  *
  * Btree version 2 is the same as version 3, except for two new fields
  * in the metapage that were introduced in version 3. A version 2 metapage
  * will be automatically upgraded to version 3 on the first insert to it.
  */



Now that the index tuple format becomes more complicated, I feel that 
there should be some kind of an overview explaining the format. All the 
information is there, in the comments in nbtree.h, but you have to piece 
together all the details to get the overall picture. I wrote this to 
keep my head straight:

B-tree tuple format
===================

Leaf tuples
-----------

     t_tid | t_info | key values | INCLUDE columns if any

t_tid points to the heap TID.


Pivot tuples
------------

All tuples on internal pages are pivot tuples. Also, the high keys on 
leaf pages.

     t_tid | t_info | key values | [heap TID]

The INDEX_ALT_TID_MASK bit in t_info is set.

The block number in 't_tid' points to the lower B-tree page.

The lower bits in 't_tid.ip_posid' store the number of keys stored (it 
can be less than the number of keys in the index, if some keys have been 
suffix-truncated). If BT_HEAP_TID_ATTR flag is set, there's an extra 
heap TID field after the key datums.

(In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set. In 
that case, the number of keys is implicitly the same as the number of keys 
in the index, and there is no heap TID.)
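
To make the pivot layout above concrete, the key count could be read out along these lines (a sketch; BT_N_KEYS_OFFSET_MASK stands in for whatever mask the patch ends up using for the low bits of ip_posid):

/* How many key attributes does this tuple actually store? (sketch) */
static inline int
index_tuple_natts(IndexTuple itup, Relation rel)
{
    if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
    {
        /* Plain tuple: all key columns are present implicitly */
        return IndexRelationGetNumberOfKeyAttributes(rel);
    }

    /* Pivot tuple: low bits of ip_posid hold the (possibly truncated) count */
    return ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_N_KEYS_OFFSET_MASK;
}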


I think adding something like this in nbtree.h would be good.

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Sun, Mar 3, 2019 at 5:41 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Great, looks much simpler now, indeed! Now I finally understand the
> algorithm.

Glad to hear it. Thanks for the additional review!

Attached is v14, which has changes based on your feedback. This
includes changes based on your more recent feedback on
v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch, though
I'll respond to those points directly in a later email.

v14 also changes the logic that decides whether an alternative strategy
should be used: it now uses the leftmost and rightmost splits for the
entire page, rather than accessing the page directly. We always handle the
newitem-at-end edge case correctly now, since the new "top down"
approach has all legal splits close at hand. This is more elegant,
more obviously correct, and even more effective, at least in some
cases -- it's another example of why "top down" is the superior
approach for nbtsplitloc.c. This made my "UK land registry data" index
have about 2.5% fewer leaf pages than with v13, which is small but
significant.

Separately, I made most of the new nbtsplitloc.c functions use a
FindSplitData argument in v14, which simplifies their signatures quite
a bit.

> What would be the worst case scenario for this? Splitting a page that
> has as many tuples as possible, I guess, so maybe inserting into a table
> with a single-column index, with 32k BLCKSZ. Have you done performance
> testing on something like that?

I'll test that (added to my project TODO list), though it's not
obvious that that's the worst case. Page splits will be less frequent,
and have better choices about where to split.

> Rounding up the allocation to BLCKSZ seems like a premature
> optimization. Even if it saved some cycles, I don't think it's worth the
> trouble of having to explain all that in the comment.

Removed that optimization.

> I think you could change the curdelta, leftfree, and rightfree fields in
> SplitPoint to int16, to make the array smaller.

Added this alternative optimization to replace the BLCKSZ allocation
thing. I even found a way to get the array elements down to 8 bytes,
but that made the code noticeably slower with "many duplicates"
splits, so I didn't end up doing that (I used bitfields, plus the same
pragmas that we use to make sure that item pointers are packed).

> > * You seemed to refactor _bt_checksplitloc() in passing, making it not
> > do the newitemisfirstonright thing. I changed that back. Did I miss
> > something that you intended here?
>
> My patch treated the new item the same as all the old items, in
> _bt_checksplitloc(), so it didn't need newitemisonright. You still need
> it with your approach.

I would feel better about it if we stuck to the same method for
calculating if a split point is legal as before (the only difference
being that we pessimistically add heap TID overhead to new high key on
leaf level). That seems safer to me.

> > Changes to my own code since v12:
> >
> > * Simplified "Add "split after new tuple" optimization" commit, and
> > made it more consistent with associated code. This is something that
> > was made a lot easier by the new approach. It would be great to hear
> > what you think of this.
>
> I looked at it very briefly. Yeah, it's pretty simple now. Nice!

I can understand why it might be difficult to express an opinion on
the heuristics themselves. The specific cut-off points (e.g. details
of what "heap TID adjacency" actually means) are not that easy to
defend with a theoretical justification, though they have been
carefully tested. I think it's worth comparing the "split after new
tuple" optimization to the traditional leaf fillfactor of 90, which is
a very similar situation. Why should it be 90? Why not 85, or 95? Why
is it okay to assume that the rightmost page shouldn't be split 50/50?

The answers to all of these questions about the well established idea
of a leaf fillfactor boil down to this: it's very likely to be correct
on average, and when it isn't correct the problem is self-limiting,
and has an infinitesimally small chance of continually recurring
(unless you imagine an *adversarial* case). Similarly, it doesn't
matter if these new heuristics get it wrong once every 1000 splits (a
very pessimistic estimate), because even then those will cancel each
other out in the long run. It is necessary to take a holistic view of
things. We're talking about an optimization that makes the two largest
TPC-C indexes over 40% smaller -- I can hold my nose if I must in
order to get that benefit. There were also a couple of indexes in the
real-world mouse genome database that this made much smaller, so this
will clearly help in the real world.

Besides all this, the "split after new tuple" optimization fixes an
existing worst case, rather than being an optimization, at least in my
mind. It's not supposed to be possible to have leaf pages that are all
only 50% full without any deletes, and yet we allow it to happen in
this one weird case. Even completely random insertions result in 65% -
70% average space utilization, so the existing worst case really is
exceptional. We are forced to take a holistic view, and infer
something about the pattern of insertions over time, even though
holistic is a dirty word.

> > (New header comment block over _bt_findsplitloc())
>
> This is pretty good, but I think some copy-editing can make this even
> better

I've restored the old structure of the _bt_findsplitloc() header comments.

> The explanation of how the high key for the left page is formed (5th
> paragraph), seems out-of-place here, because the high key is not formed
> here.

Moved that to _bt_split(), which seems like a good compromise.

> Somewhere in the 1st or 2nd paragraph, perhaps we should mention that
> the function effectively uses a different fillfactor in some other
> scenarios too, not only when it's the rightmost page.

Done.

> >       state.maxsplits = maxoff + 2;
> >       state.splits = palloc(Max(BLCKSZ, sizeof(SplitPoint) * state.maxsplits));
> >       state.nsplits = 0;
>
> I wouldn't be that paranoid. The code that populates the array is pretty
> straightforward.

Done that way. But are you sure? Some of the attempts to create a new
split point are bound to fail, because they try to put everything
(including the new item) on one side of the split. I'll leave the
assertion there.

> >        * Still, it's typical for almost all calls to _bt_recordsplit to
> >        * determine that the split is legal, and therefore enter it into the
> >        * candidate split point array for later consideration.
> >        */
>
> Suggestion: Remove the "Still" word. The observation that typically all
> split points are legal is valid, but it seems unrelated to the first
> paragraph. (Do we need to mention it at all?)

Removed the second paragraph entirely.

> >       /*
> >        * If the new item goes as the last item, record the split point that
> >        * leaves all the old items on the left page, and the new item on the
> >        * right page.  This is required because a split that leaves the new item
> >        * as the firstoldonright won't have been reached within the loop.  We
> >        * always record every possible split point.
> >        */
>
> Suggestion: Remove the last sentence.

Agreed. Removed.

> ISTM that figuring out which "mode" we want to operate in is actually
> the *primary* purpose of _bt_perfect_penalty(). We only really use the
> penalty as an optimization that we pass on to _bt_bestsplitloc(). So I'd
> suggest changing the function name to something like _bt_choose_mode(),
> and have secondmode be the primary return value from it, with
> perfectpenalty being the secondary result through a pointer.

I renamed _bt_perfect_penalty() to _bt_strategy(), since I agree that
its primary purpose is to decide on a strategy (which is what I'm now
calling a mode, per your request a bit further down). It still returns
perfectpenalty, though, since that seemed more natural to me, probably
because its style matches the style of caller/_bt_findsplitloc().
perfectpenalty isn't a mere optimization -- it is important to prevent
many duplicates mode from going overboard with suffix truncation. It
does more than just save _bt_bestsplitloc() cycles, which I've tried
to make clearer in v14.

> It doesn't really choose the mode, either, though. At least after the
> next "Add split after new tuple optimization" patch. The caller has a
> big part in choosing what to do. So maybe split _bt_perfect_penalty into
> two functions: _bt_perfect_penalty(), which just computes the perfect
> penalty, and _bt_analyze_split_interval(), which determines how many
> duplicates there are in the top-N split points.

Hmm. I didn't create a _bt_analyze_split_interval(), because now
_bt_perfect_penalty()/_bt_strategy() is responsible for setting the
perfect penalty in all cases. It was a mistake for me to move some
perfect penalty stuff for alternative modes/strategies out to the
caller in v13. In v14, we never make _bt_findsplitloc() change its
perfect penalty -- it only changes its split interval, based on the
strategy/mode, possibly after sorting. Let me know what you think of
this.

> BTW, I like the word "strategy", like you called it in the comment on
> SplitPoint struct, better than "mode".

I've adopted that terminology in v14 -- it's always "strategy", never "mode".

> How about removing the "usemult" variable, and just check if
> fillfactormult == 0.5?

I need to use "usemult" to determine if the "split after new tuple"
optimization should apply leaf fillfactor, or should instead split at
the exact point after the newly inserted tuple -- it's very natural to
have a single bool flag for that. It's seems simpler to continue to
use "usemult" for everything, and not distinguish "split after new
tuple" as a special case later on. (Besides, the master branch already
uses a bool for this, even though it handles far fewer things.)

> >       /*
> >        * There are a much smaller number of candidate split points when
> >        * splitting an internal page, so we can afford to be exhaustive.  Only
> >        * give up when pivot that will be inserted into parent is as small as
> >        * possible.
> >        */
> >       if (!state->is_leaf)
> >               return MAXALIGN(sizeof(IndexTupleData) + 1);
>
> Why are there fewer candidate split points on an internal page?

The comment should say that there is typically a much smaller split
interval (this used to be controlled by limiting the size of the array
initially -- should have updated this for v13 of the patch). I believe
that you understand that, and are interested in why the split interval
itself is different on internal pages. Or why we are more conservative
with internal pages in general. Assuming that's what you meant, here
is my answer:

The "Prefix B-Tree" paper establishes the idea that there are
different split intervals for leaf pages and internal pages (which it
calls branch pages). We care about different things in each case. With
leaf pages, we care about choosing the split point that allows us to
create the smallest possible pivot tuple as our secondary goal
(the primary goal is balancing space). With internal pages, we care about
choosing the smallest tuple to insert into the parent of the internal
page (often the root) as our secondary goal, but don't care about
truncation, because _bt_split() won't truncate the new pivot. That's why
the definition of "penalty" varies according to whether we're
splitting an internal page or a leaf page. Clearly the idea of having
separate split intervals is well established, and makes sense.
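
Sketched as code, the difference in the definition of "penalty" is roughly this (hypothetical function shape; it assumes a helper like the patch's _bt_keep_natts_fast(), which counts how many leading attributes a truncated pivot would have to keep):

/*
 * Penalty of a candidate split point -- smaller is better.  Units differ
 * between the leaf and internal cases, but candidates are only ever
 * compared against others of the same kind, so that is fine.
 */
static int
split_penalty(Relation rel, bool is_leaf,
              IndexTuple lastleft, IndexTuple firstright)
{
    if (!is_leaf)
    {
        /*
         * Internal split: no suffix truncation happens, so all we can do
         * is prefer the smallest firstright tuple to insert into the
         * parent page.
         */
        return IndexTupleSize(firstright);
    }

    /*
     * Leaf split: prefer the split point whose new high key (the
     * truncated pivot) would need to keep the fewest attributes.
     */
    return _bt_keep_natts_fast(rel, lastleft, firstright);
}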

It's fair to ask if I'm being too conservative (or not conservative
enough) with split interval in either case. Unfortunately, the Prefix
B-Tree paper never seems to give practical advice about how to come up
with an interval. They say:

"We have not analyzed the influence of sigma L [leaf interval] or
sigma B [branch/internal interval] on the performance of the trees. We
expect such an analysis to be quite involved and difficult. We are
quite confident, however, that small split intervals improve
performance considerably. Sets of keys that arise in practical
applications are often far from random, and clusters of similar keys
differing only in the last few letters (e.g. plural forms) are quite
common."

I am aware of another, not-very-notable paper that tries to impose
some theory here, but doesn't really help much [1]. Anyway, I've found
that I was too conservative with split interval for internal pages. It
pays to make the internal interval higher than the leaf interval, because
internal pages cover a much bigger portion of the key space than leaf
pages, which will tend to get filled up one way or another. Leaf pages
cover a tight part of the key space, in contrast. In v14, I've increased
the internal page split interval to 18, a big increase from 3, and twice
what it is for leaf splits (still 9 -- no change there). This mostly isn't
that different from 3, since there usually are pivot tuples that are all
the same size anyway. However, with cases where suffix truncation
makes pivot tuples a lot smaller (e.g. UK land registry test case),
this makes the items in the root a lot smaller on average, and even
makes the whole index smaller. My entire test suite has a few cases
that are noticeably improved by this change, and no cases that are
hurt.

I'm going to have to revalidate the performance of long-running
benchmarks with this change, so this should be considered provisional.
I think that it will probably be kept, though. Not expecting it to
noticeably impact either BenchmarkSQL or pgbench benchmarks.

[1] https://shareok.org/bitstream/handle/11244/16442/Thesis-1983-T747e.pdf?sequence=1
--
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Sun, Mar 3, 2019 at 10:02 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Some comments on
> v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below.
> Mostly about code comments. In general, I think a round of copy-editing
> the comments, to use simpler language, would do good. The actual code
> changes look good to me.

I'm delighted that the code looks good to you, and makes sense
overall. I worked hard to make the patch a natural adjunct to the
existing code, which wasn't easy.

> Seems confusing to first say assertively that "*bufptr contains the page
> that the new tuple unambiguously belongs to", and then immediately go on
> to list a whole bunch of exceptions. Maybe just remove "unambiguously".

This is fixed in v14 of the patch series.

> This happens very seldom, because you only get an incomplete split if
> you crash in the middle of a page split, and that should be very rare. I
> don't think we need to list more fine-grained conditions here, that just
> confuses the reader.

Fixed in v14.

> > /*
> >  *    _bt_useduplicatepage() -- Settle for this page of duplicates?

> So, this function is only used for legacy pg_upgraded indexes. The
> comment implies that, but doesn't actually say it.

I made that more explicit in v14.

> > /*
> >  * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
> >  * and non-pivot tuples, despite differences in how heap TID is represented.

> > #define BTreeTupleGetHeapTID(itup) ...

I fixed up the comments above BTreeTupleGetHeapTID() significantly.

> The comment claims that "all pivot tuples must be as of BTREE_VERSION
> 4". I thought that all internal tuples are called pivot tuples, even on
> version 3.

In my mind, "pivot tuple" is a term that describes any tuple that
contains a separator key, which could apply to any nbtree version.
It's useful to have a distinct term (to not just say "separator key
tuple") because Lehman and Yao think of separator keys as separate and
distinct from downlinks. Internal page splits actually split *between*
a separator key and a downlink. So nbtree internal page splits must
split "inside a pivot tuple", leaving its separator on the left hand
side (new high key), and its downlink on the right hand side (new
minus infinity tuple).

Pivot tuples may contain a separator key and a downlink, just a
downlink, or just a separator key (sometimes this is implicit, and the
block number is garbage). I am particular about the terminology
because the pivot tuple vs. downlink vs. separator key thing causes a
lot of confusion, particularly when you're using Lehman and Yao (or
Lanin and Shasha) to understand how things work in Postgres.

We want to have a broad term that refers to the tuples that describe
the keyspace (pivot tuples), since it's often helpful to refer to them
collectively, without seeming to contradict Lehman and Yao.

> I think what this means to say is that this macro is only
> used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only
> have a heap TID in BTREE_VERSION 4 indexes.

My high level approach to pg_upgrade/versioning is for index scans to
*pretend* that every nbtree index (even on v2 and v3) has a heap
attribute that actually makes the keys unique. The difference is that
v4 gets to use a scantid, and actually rely on the sort order of heap
TIDs, whereas pg_upgrade'd indexes "are not allowed to look at the
heap attribute", and must never provide a scantid (they also cannot
use the !minusinfkey optimization, but this is only an optimization
that v4 indexes don't truly need). They always do the right thing
(move left) on otherwise-equal pivot tuples, since they have no
scantid.

That's why _bt_compare() can use BTreeTupleGetHeapTID() without caring
about the version the index uses. It might be NULL for a pivot tuple
in a v3 index, even though we imagine/pretend that it should have a
value set. But that doesn't matter, because higher level code knows
that !heapkeyspace indexes should never get a scantid (_bt_compare()
does Assert() that they got that detail right, though). We "have no
reason to peak", because we don't have a scantid, so index scans work
essentially the same way, regardless of the version in use.

There are a few specific cross-version things that we need think about
outside of making sure that there is never a scantid (and !minusinfkey
optimization is unused) in < v4 indexes, but these are all related to
unique indexes. "Pretending that all indexes have a heap TID" is a
very useful mental model. Nothing really changes, even though you
might guess that changing the classic "Subtree S is described by Ki <
v <= Ki+1" invariant would need to break code in
_bt_binsrch()/_bt_compare(). Just pretend that the classic invariant
was there since the Berkeley days, and don't do anything that breaks
the useful illusion on versions before v4.

> This macro (and many others in nbtree.h) is quite complicated. A static
> inline function might be easier to read.

I agree that the macros are complicated, but that seems to be because
the rules are complicated. I'd rather leave the macros in place, and
improve the commentary on the rules.

> 'xlmeta.version' is set incorrectly.

Oops. Fixed in v14.

> I find this comment difficult to read. I suggest rewriting it to:
>
> /*
>   * The current Btree version is 4. That's what you'll get when you create
>   * a new index.

I used your wording for this in v14, almost verbatim.

> Now that the index tuple format becomes more complicated, I feel that
> there should be some kind of an overview explaining the format. All the
> information is there, in the comments in nbtree.h, but you have to piece
> together all the details to get the overall picture. I wrote this to
> keep my head straight:

v14 uses your diagrams in nbtree.h, and expands some existing
discussion of INCLUDE indexes/non-key attributes/tuple format. Let me
know what you think.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
I'm looking at the first patch in the series now. I'd suggest that you 
commit that very soon. It's useful on its own, and seems pretty much 
ready to be committed already. I don't think it will be much affected by 
whatever changes we make to the later patches, anymore.

I did some copy-editing of the code comments, see attached patch which 
applies on top of v14-0001-Refactor-nbtree-insertion-scankeys.patch. 
Mostly, to use more Plain English: use active voice instead of passive, 
split long sentences, avoid difficult words.

I also had a few comments and questions on some details. I added them in 
the same patch, marked with "HEIKKI:". Please take a look.

- Heikki

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm looking at the first patch in the series now. I'd suggest that you
> commit that very soon. It's useful on its own, and seems pretty much
> ready to be committed already. I don't think it will be much affected by
> whatever changes we make to the later patches, anymore.

I agree that the parts covered by the first patch in the series are
very unlikely to need changes, but I hesitate to commit it weeks ahead
of the other patches. Some of the things that make _bt_findinsertloc()
fast are missing for v3 indexes. The "consider secondary factors
during nbtree splits" patch actually more than compensates for that
with v3 indexes, at least in some cases, but the first patch applied
on its own will slightly regress performance. At least, I benchmarked
the first patch on its own several months ago and noticed a small
regression at the time, though I don't have the exact details at hand.
It might have been an invalid result, because I wasn't particularly
thorough at the time.

We do make some gains in the first patch  (the _bt_check_unique()
thing), but we also check the high key more than we need to within
_bt_findinsertloc() for non-unique indexes. Plus, the microvacuuming
thing isn't as streamlined.

It's a lot of work to validate and revalidate the performance of a
patch like this, and I'd rather commit the first three patches within
a couple of days of each other (I can validate v3 indexes and v4
indexes separately, though). We can put off the other patches for
longer, and treat them as independent. I guess I'd also push the final
amcheck patch following the first three -- no point in holding back on
that. Then we'd be left with "Add "split after new tuple"
optimization", and "Add high key "continuescan" optimization" as
independent improvements that can be pushed at the last minute of the
final CF.

> I also had a few comments and questions on some details. I added them in
> the same patch, marked with "HEIKKI:". Please take a look.

Will respond now. Any point that I haven't responded to directly has
been accepted.

> +HEIKKI: 'checkingunique' is a local variable in the function. Seems a bit
> +weird to talk about it in the function comment. I didn't understand what
> +the point of adding this sentence was, so I removed it.

Maybe there is no point in the comment you reference here, but I like
the idea of "checkingunique", because that symbol name is a common
thread between a number of functions that coordinate with each other.
It's not just a local variable in one function.

> @@ -588,6 +592,17 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
>             if (P_RIGHTMOST(opaque))
>                 break;
>             highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
> +
> +           /*
> +            * HEIKKI: This assertion might fire if the user-defined opclass
> +            * is broken. It's just an assertion, so maybe that's ok. With a
> +            * broken opclass, it's obviously "garbage in, garbage out", but
> +            * we should try to behave sanely anyway. I don't remember what
> +            * our general policy on that is; should we assert, elog(ERROR),
> +            * or continue silently in that case? An elog(ERROR) or
> +            * elog(WARNING) would feel best to me, but I don't remember what
> +            * we usually do.
> +            */
>             Assert(highkeycmp <= 0);
>             if (highkeycmp != 0)
>                 break;

We don't really have a general policy on it. However, I don't have any
sympathy for the idea of trying to soldier on with a corrupt index. I
also don't think that it's worth making this a "can't happen" error.
Like many of my assertions, this assertion is intended to document an
invariant. I don't actually anticipate that it could ever really fail.

> +Should we mention explicitly that this binary-search reuse is only applicable
> +if unique checks were performed? It's kind of implied by the fact that it's
> +_bt_check_unique() that saves the state, but perhaps we should be more clear
> +about it.

I guess so.

> +What is a "garbage duplicate"? Same as a "dead duplicate"?

Yes.

> +The last sentence, about garbage duplicates, seems really vague. Why do we
> +ever do any comparisons that are not strictly necessary? Perhaps it's best to
> +just remove that last sentence.

Okay -- will remove.

> +
> +HEIKKI: I don't buy the argument that microvacuuming has to happen here. You
> +could easily imagine a separate function that does microvacuuming, and resets
> +(or even updates) the binary-search cache in the insertion key. I agree this
> +is a convenient place to do it, though.

It wasn't supposed to be a water-tight argument. I'll just say that
it's convenient.

> +/* HEIKKI:
> +Do we need 'checkunique' as an argument? If unique checks were not
> +performed, the insertion key will simply not have saved state.
> +*/

We need it in the next patch in the series, because it's also useful
for optimizing away the high key check with non-unique indexes. We
know that _bt_moveright() was called at the leaf level, with scantid
filled in, so there is no question of needing to move right within
_bt_findinsertloc() (provided it's a heapkeyspace index).

Actually, we even need it in the first patch: we only restore a binary
search because we know that there is something to restore, and must
ask for it to be restored explicitly (anything else seems unsafe).
Maybe we can't restore it because it's not a unique index, or maybe we
can't restore it because we microvacuumed, or moved right to get free
space. I don't think that it'll be helpful to make _bt_findinsertloc()
pretend that it doesn't know exactly where the binary search bounds
come from -- it already knows plenty about unique indexes
specifically, and about how it may have to invalidate the bounds. The
whole way that it couples buffer locks is only useful for unique
indexes, so it already knows *plenty* about unique indexes
specifically.

I actually like the idea of making certain insertion scan key mutable
state relating to search bounds hidden in the case of "dynamic prefix
truncation" [1]. Doesn't seem to make sense here, though.

> +   /* HEIKKI: I liked this comment that we used to have here, before this patch: */
> +   /*----------
> +    * If we will need to split the page to put the item on this page,
> +    * check whether we can put the tuple somewhere to the right,
> +    * instead.  Keep scanning right until we

> +   /* HEIKKI: Maybe it's not relevant with the later patches, but at least
> +    * with just this first patch, it's still valid. I noticed that the
> +    * comment is now in _bt_useduplicatepage, it seems a bit out-of-place
> +    * there. */

I don't think it matters, because I don't think that the first patch
can be justified as an independent piece of work. I like the idea of
breaking up the patch series, because it makes it all easier to
understand, but the first three patches are kind of intertwined.

> +HEIKKI: In some scenarios, if the BTP_HAS_GARBAGE flag is falsely set, we would
> +try to microvacuum the page twice: first in _bt_useduplicatepage, and second
> +time here. That's because _bt_vacuum_one_page() doesn't clear the flag, if
> +there are in fact no LP_DEAD items. That's probably insignificant and not worth
> +worrying about, but I thought I'd mention it.

Right. It's also true that all future insertions will reach
_bt_vacuum_one_page() and do the same again, until there either is
garbage, or until the page splits.

> -    * rightmost page case), all the items on the right half will be user data
> -    * (there is no existing high key that needs to be relocated to the new
> -    * right page).
> +    * rightmost page case), all the items on the right half will be user
> +    * data.
> +    *
> +HEIKKI: I don't think the comment change you made here was needed or
> +helpful, so I reverted it.

I thought it added something when you're looking at it from a
WAL-logging point of view. But I can live without this.

> - * starting a regular index scan some can be omitted.  The array is used as a
> + * starting a regular index scan, some can be omitted.  The array is used as a
>   * flexible array member, though it's sized in a way that makes it possible to
>   * use stack allocations.  See nbtree/README for full details.
> +
> +HEIKKI: I don't see anything in the README about stack allocations. What
> +exactly does the README reference refer to? No code seems to actually allocate
> +this in the stack, so we don't really need that.

The README discusses insertion scankeys in general, though. I think
that you read it that way because you're focussed on my changes, and
not because it actually implies that the README talks about the stack
thing specifically. (But I can change it if you like.)

There is a stack allocation in _bt_first(). This was once just another
dynamic allocation, that called _bt_mkscankey(), but that regressed
nested loop joins, so I had to make it work the same way as before. I
noticed this about six months ago, because there was a clear impact on
the TPC-C "Stock level" transaction, which is now sometimes twice as
fast with the patch series. Note also that commit d961a568, from 2005,
changed the _bt_first() code to use a stack allocation. Besides,
sticking to a stack allocation makes the changes to _bt_first()
simpler, even though it has to duplicate a few things from
_bt_mkscankey().

I could get you a v15 that integrates your changes pretty quickly, but
I'll hold off on that for at least a few days. I have a feeling that
you'll have more feedback for me to work through before too long.

[1] https://postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Robert Haas
Date:
On Tue, Mar 5, 2019 at 3:03 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I agree that the parts covered by the first patch in the series are
> very unlikely to need changes, but I hesitate to commit it weeks ahead
> of the other patches.

I know I'm stating the obvious here, but we don't have many weeks left
at this point.  I have not reviewed any code, but I have been
following this thread and I'd really like to see this work go into
PostgreSQL 12, assuming it's in good enough shape.  It sounds like
really good stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Wed, Mar 6, 2019 at 1:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I know I'm stating the obvious here, but we don't have many weeks left
> at this point.  I have not reviewed any code, but I have been
> following this thread and I'd really like to see this work go into
> PostgreSQL 12, assuming it's in good enough shape.  It sounds like
> really good stuff.

Thanks!

Barring any objections, I plan to commit the first 3 patches (plus the
amcheck "relocate" patch) within 7 - 10 days (that's almost
everything). Heikki hasn't reviewed 'Add high key "continuescan"
optimization' yet, and it seems like he should take a look at that
before I proceed with it. But that seems like the least controversial
enhancement within the entire patch series, so I'm not very worried
about it.

I'm currently working on v15, which has comment-only revisions
requested by Heikki. I expect to continue to work with him to make
sure that he is happy with the presentation. I'll also need to
revalidate the performance of the patch series following recent minor
changes to the logic for choosing a split point. That can take days.
This is why I don't want to commit the first patch without committing
at least the first three all at once -- it increases the amount of
performance validation work I'll have to do considerably. (I have to
consider both v4 and v3 indexes already, which seems like enough
work.)

Two of the later patches (one of which I plan to push as part of the
first batch of commits) use heuristics to decide where to split the
page. As a Postgres contributor, I have learned to avoid inventing
heuristics, so this automatically makes me a bit uneasy. However, I
don't feel so bad about it here, on reflection. The on-disk size of
the TPC-C indexes are reduced by 35% with the 'Add "split after new
tuple" optimization' patch (I think that the entire database is
usually about 12% smaller). There simply isn't a fundamentally better
way to get the same benefit, and I'm sure that nobody will argue that
we should just accept the fact that the most influential database
benchmark of all time has a big index bloat problem with Postgres.
That would be crazy.

That said, it's not impossible that somebody will shout at me because
my heuristics made their index bloated. I can't see how that could
happen, but I am prepared. I can always adjust the heuristics when new
information comes to light. I have fairly thorough test cases that
should allow me to do this without regressing anything else. This is a
risk that can be managed sensibly.

There is no gnawing ambiguity about the on-disk changes laid down in
the second patch (nor the first patch), though. Making on-disk changes
is always a bit scary, but making the keys unique is clearly a big
improvement architecturally, as it brings nbtree closer to the Lehman
& Yao design without breaking anything for v3 indexes (v3 indexes
simply aren't allowed to use a heap TID in their scankey). Unique keys
also allow amcheck to relocate every tuple in the index from the root
page, using the same code path as regular index scans. We'll be
relying on the uniqueness of keys within amcheck from the beginning,
before anybody teaches nbtree to perform retail index tuple deletion.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 06/03/2019 04:03, Peter Geoghegan wrote:
> On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I'm looking at the first patch in the series now. I'd suggest that you
>> commit that very soon. It's useful on its own, and seems pretty much
>> ready to be committed already. I don't think it will be much affected by
>> whatever changes we make to the later patches, anymore.

After staring at the first patch for bit longer, a few things started to 
bother me:

* The new struct is called BTScanInsert, but it's used for searches, 
too. It makes sense when you read the README, which explains the 
difference between "search scan keys" and "insertion scan keys", but now 
that we have a separate struct for this, perhaps we call insertion scan 
keys with a less confusing name. I don't know what to suggest, though. 
"Positioning key"?

* We store the binary search bounds in BTScanInsertData, but they're 
only used during insertions.

* The binary search bounds are specific for a particular buffer. But 
that buffer is passed around separately from the bounds. It seems easy 
to have them go out of sync, so that you try to use the cached bounds 
for a different page. The savebinsrch and restorebinsrch are used to deal 
with that, but it is pretty complicated.


I came up with the attached (against master), which addresses the 2nd 
and 3rd points. I added a whole new BTInsertStateData struct, to hold 
the binary search bounds. BTScanInsert now only holds the 'scankeys' 
array, and the 'nextkey' flag. The new BTInsertStateData struct also 
holds the current buffer we're considering to insert to, and a 
'bounds_valid' flag to indicate if the saved bounds are valid for the 
current buffer. That way, it's more straightforward to clear the 
'bounds_valid' flag whenever we move right.
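
In outline, something like this (a sketch of the struct just described; the attached patch may differ in details, such as whether the tuple being inserted is carried here too):

typedef struct BTInsertStateData
{
    IndexTuple  itup;           /* tuple we are trying to insert */
    Size        itemsz;         /* size of itup (MAXALIGN'd) */
    BTScanInsert itup_key;      /* insertion scan key: scankeys + nextkey */

    /* Buffer containing the page currently being considered for the insert */
    Buffer      buf;

    /*
     * Cached binary search bounds for buf, established by
     * _bt_binsrch_insert().  bounds_valid must be cleared whenever buf
     * changes, e.g. when moving right.
     */
    bool        bounds_valid;
    OffsetNumber low;
    OffsetNumber stricthigh;
} BTInsertStateData;

typedef BTInsertStateData *BTInsertState;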

I made a copy of _bt_binsrch, called _bt_binsrch_insert. It does the binary 
search like _bt_binsrch does, but the bounds caching is only done in 
_bt_binsrch_insert. It seems clearer to have separate functions for them 
now, even though there's some duplication.

>> +/* HEIKKI:
>> +Do we need 'checkunique' as an argument? If unique checks were not
>> +performed, the insertion key will simply not have saved state.
>> +*/
> 
> We need it in the next patch in the series, because it's also useful
> for optimizing away the high key check with non-unique indexes. We
> know that _bt_moveright() was called at the leaf level, with scantid
> filled in, so there is no question of needing to move right within
> _bt_findinsertloc() (provided it's a heapkeyspace index).

Hmm. Perhaps it would be better to move the call to _bt_binsrch (or 
_bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So 
that _bt_findinsertloc would only be responsible for finding the correct 
page to insert to. So the overall code, after patch #2, would be like:

/*
  * Do the insertion. First move right to find the correct page to
  * insert to, if necessary. If we're inserting to a non-unique index,
  * _bt_search() already did this when it checked if a move to the
  * right was required for leaf page.  Insertion scankey's scantid
  * would have been filled out at the time. On a unique index, the
  * current buffer is the first buffer containing duplicates, however,
  * so we may need to move right to the correct location for this
  * tuple.
  */
if (checkingunique || !itup_key->heapkeyspace)
    _bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup, 
newitemoff, false);

Does this make sense?

> Actually, we even need it in the first patch: we only restore a binary
> search because we know that there is something to restore, and must
> ask for it to be restored explicitly (anything else seems unsafe).
> Maybe we can't restore it because it's not a unique index, or maybe we
> can't restore it because we microvacuumed, or moved right to get free
> space. I don't think that it'll be helpful to make _bt_findinsertloc()
> pretend that it doesn't know exactly where the binary search bounds
> come from -- it already knows plenty about unique indexes
> specifically, and about how it may have to invalidate the bounds. The
> whole way that it couples buffer locks is only useful for unique
> indexes, so it already knows *plenty* about unique indexes
> specifically.

The attached new version simplifies this, IMHO. The bounds and the 
current buffer go together in the same struct, so it's easier to keep 
track whether the bounds are valid or not.

>> - * starting a regular index scan some can be omitted.  The array is used as a
>> + * starting a regular index scan, some can be omitted.  The array is used as a
>>    * flexible array member, though it's sized in a way that makes it possible to
>>    * use stack allocations.  See nbtree/README for full details.
>> +
>> +HEIKKI: I don't see anything in the README about stack allocations. What
>> +exactly does the README reference refer to? No code seems to actually allocate
>> +this in the stack, so we don't really need that.
> 
> The README discusses insertion scankeys in general, though. I think
> that you read it that way because you're focussed on my changes, and
> not because it actually implies that the README talks about the stack
> thing specifically. (But I can change it if you like.)
> 
> There is a stack allocation in _bt_first(). This was once just another
> dynamic allocation, that called _bt_mkscankey(), but that regressed
> nested loop joins, so I had to make it work the same way as before.

Ah, gotcha, I missed that.

- Heikki

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> After staring at the first patch for bit longer, a few things started to
> bother me:
>
> * The new struct is called BTScanInsert, but it's used for searches,
> too. It makes sense when you read the README, which explains the
> difference between "search scan keys" and "insertion scan keys", but now
> that we have a separate struct for this, perhaps we call insertion scan
> keys with a less confusing name. I don't know what to suggest, though.
> "Positioning key"?

I think that insertion scan key is fine. It's been called that for
almost twenty years. It's not like it's an intuitive concept that
could be conveyed easily if only we came up with a new, pithy name.

> * We store the binary search bounds in BTScanInsertData, but they're
> only used during insertions.
>
> * The binary search bounds are specific for a particular buffer. But
> that buffer is passed around separately from the bounds. It seems easy
> to have them go out of sync, so that you try to use the cached bounds
> for a different page. The savebinsrch and restorebinsrch is used to deal
> with that, but it is pretty complicated.

That might be an improvement, but I do think that using mutable state
in the insertion scankey to restrict a search is an idea that could
work well in at least one other way. That really isn't a once-off
thing, even though it looks that way.

> I came up with the attached (against master), which addresses the 2nd
> and 3rd points. I added a whole new BTInsertStateData struct, to hold
> the binary search bounds. BTScanInsert now only holds the 'scankeys'
> array, and the 'nextkey' flag.

It will also have to store heapkeyspace, of course. And minusinfkey.
BTW, I would like to hear what you think of the idea of minusinfkey
(and the !minusinfkey optimization) specifically.

> The new BTInsertStateData struct also
> holds the current buffer we're considering to insert to, and a
> 'bounds_valid' flag to indicate if the saved bounds are valid for the
> current buffer. That way, it's more straightforward to clear the
> 'bounds_valid' flag whenever we move right.

I'm not sure that that's an improvement. Moving right should be very
rare with my patch. gcov shows that we never move right here anymore
with the regression tests, or within _bt_check_unique() -- not once.
For a second, I thought that you forgot to invalidate the bounds_valid
flag, because you didn't pass it directly, by value, to 
_bt_useduplicatepage().

> I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
> search like _bt_binsrch does, but the bounds caching is only done in
> _bt_binsrch_insert. Seems more clear to have separate functions for them
> now, even though there's some duplication.

I'll have to think about that some more. Having a separate
_bt_binsrch_insert() may be worth it, but I'll need to do some
profiling.

> Hmm. Perhaps it would be to move the call to _bt_binsrch (or
> _bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So
> that _bt_findinsertloc would only be responsible for finding the correct
> page to insert to. So the overall code, after patch #2, would be like:

Maybe, but as I said it's not like _bt_findinsertloc() doesn't know
all about unique indexes already. This is pointed out in a comment in
_bt_doinsert(), even. I guess that it might have to be changed to say
_bt_findinsertpage() instead, with your new approach.

> /*
>   * Do the insertion. First move right to find the correct page to
>   * insert to, if necessary. If we're inserting to a non-unique index,
>   * _bt_search() already did this when it checked if a move to the
>   * right was required for leaf page.  Insertion scankey's scantid
>   * would have been filled out at the time. On a unique index, the
>   * current buffer is the first buffer containing duplicates, however,
>   * so we may need to move right to the correct location for this
>   * tuple.
>   */
> if (checkingunique || itup_key->heapkeyspace)
>         _bt_findinsertpage(rel, &insertstate, stack, heapRel);
>
> newitemoff = _bt_binsrch_insert(rel, &insertstate);
>
> _bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
> newitemoff, false);
>
> Does this make sense?

I guess you're saying this because you noticed that the for (;;) loop
in _bt_findinsertloc() doesn't do that much in many cases, because of
the fastpath.

I suppose that this could be an improvement, provided we keep all the
assertions that verify that the work "_bt_findinsertpage()" would have
done, had it been called, was in fact unnecessary (e.g., checking the
high key/rightmost-ness).

> The attached new version simplifies this, IMHO. The bounds and the
> current buffer go together in the same struct, so it's easier to keep
> track whether the bounds are valid or not.

I'll look into integrating this with my current draft v15 tomorrow.
Need to sleep on it.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Wed, Mar 6, 2019 at 10:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
> It will also have to store heapkeyspace, of course. And minusinfkey.
> BTW, I would like to hear what you think of the idea of minusinfkey
> (and the !minusinfkey optimization) specifically.

> I'm not sure that that's an improvement. Moving right should be very
> rare with my patch. gcov shows that we never move right here anymore
> with the regression tests, or within _bt_check_unique() -- not once.
> For a second, I thought that you forgot to invalidate the bounds_valid
> flag, because you didn't pass it directly, by value to
> _bt_useduplicatepage().

BTW, the !minusinfkey optimization is why we literally never move
right within _bt_findinsertloc() while the regression tests run. We
always land on the correct leaf page to begin with. (It works with
unique index insertions, where scantid is NULL when we descend the
tree.)

In general, there are two good reasons for us to move right:

* There was a concurrent page split (or page deletion), and we just
missed the downlink in the parent, and need to recover.

* We omit some columns from our scan key (at least scantid), and there
are perhaps dozens of matches -- this is not relevant to
_bt_doinsert() code.

The single value strategy used by nbtsplitloc.c does a good job of
making it unlikely that _bt_check_unique()-wise duplicates will cross
leaf pages, so there will almost always be one leaf page to visit.
And, the !minusinfkey optimization ensures that the only reason we'll
move right is because of a concurrent page split, within
_bt_moveright().

The buffer lock coupling move to the right that _bt_findinsertloc()
does should be considered an edge case with all of these measures, at
least with v4 indexes.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 07/03/2019 14:54, Peter Geoghegan wrote:
> On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> After staring at the first patch for bit longer, a few things started to
>> bother me:
>>
>> * The new struct is called BTScanInsert, but it's used for searches,
>> too. It makes sense when you read the README, which explains the
>> difference between "search scan keys" and "insertion scan keys", but now
>> that we have a separate struct for this, perhaps we call insertion scan
>> keys with a less confusing name. I don't know what to suggest, though.
>> "Positioning key"?
> 
> I think that insertion scan key is fine. It's been called that for
> almost twenty years. It's not like it's an intuitive concept that
> could be conveyed easily if only we came up with a new, pithy name.

Yeah. It's been like that forever, but I must confess I hadn't paid any 
attention to it, until now. I had not understood that text in the README 
explaining the difference between search and insertion scan keys, before 
looking at this patch. Not sure I ever read it with any thought. Now 
that I understand it, I don't like the "insertion scan key" name.

> BTW, I would like to hear what you think of the idea of minusinfkey
> (and the !minusinfkey optimization) specifically.

I don't understand it :-(. I guess that's valuable feedback on its own. 
I'll spend more time reading the code around that, but meanwhile, if you 
can think of a simpler way to explain it in the comments, that'd be good.

>> The new BTInsertStateData struct also
>> holds the current buffer we're considering to insert to, and a
>> 'bounds_valid' flag to indicate if the saved bounds are valid for the
>> current buffer. That way, it's more straightforward to clear the
>> 'bounds_valid' flag whenever we move right.
> 
> I'm not sure that that's an improvement. Moving right should be very
> rare with my patch. gcov shows that we never move right here anymore
> with the regression tests, or within _bt_check_unique() -- not once.

I haven't given performance much thought, really. I don't expect this 
method to be any slower, but the point of the refactoring is to make the 
code easier to understand.

>> /*
>>    * Do the insertion. First move right to find the correct page to
>>    * insert to, if necessary. If we're inserting to a non-unique index,
>>    * _bt_search() already did this when it checked if a move to the
>>    * right was required for leaf page.  Insertion scankey's scantid
>>    * would have been filled out at the time. On a unique index, the
>>    * current buffer is the first buffer containing duplicates, however,
>>    * so we may need to move right to the correct location for this
>>    * tuple.
>>    */
>> if (checkingunique || itup_key->heapkeyspace)
>>          _bt_findinsertpage(rel, &insertstate, stack, heapRel);
>>
>> newitemoff = _bt_binsrch_insert(rel, &insertstate);
>>
>> _bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
>> newitemoff, false);
>>
>> Does this make sense?
> 
> I guess you're saying this because you noticed that the for (;;) loop
> in _bt_findinsertloc() doesn't do that much in many cases, because of
> the fastpath.

The idea is that _bt_findinsertpage() would not need to know whether the 
unique checks were performed or not. I'd like to encapsulate all the 
information about the "insert position we're considering" in the 
BTInsertStateData struct. Passing 'checkingunique' as a separate 
argument violates that, because when it's set, the key means something 
slightly different.

Hmm. Actually, with patch #2, _bt_findinsertloc() could look at whether 
'scantid' is set, instead of 'checkingunique'. That would seem better. 
If it looks at 'checkingunique', it's making the assumption that if 
unique checks were not performed, then we are already positioned on the 
correct page, according to the heap TID. But looking at 'scantid' seems 
like a more direct way of getting the same information. And then we 
won't need to pass the 'checkingunique' flag as an "out-of-band" argument.

So I'm specifically suggesting that we replace this, in _bt_findinsertloc:

        if (!checkingunique && itup_key->heapkeyspace)
            break;

With this:

        if (itup_key->scantid)
            break;

And remove the 'checkingunique' argument from _bt_findinsertloc.

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 07/03/2019 15:41, Heikki Linnakangas wrote:
> On 07/03/2019 14:54, Peter Geoghegan wrote:
>> On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> After staring at the first patch for bit longer, a few things started to
>>> bother me:
>>>
>>> * The new struct is called BTScanInsert, but it's used for searches,
>>> too. It makes sense when you read the README, which explains the
>>> difference between "search scan keys" and "insertion scan keys", but now
>>> that we have a separate struct for this, perhaps we call insertion scan
>>> keys with a less confusing name. I don't know what to suggest, though.
>>> "Positioning key"?
>>
>> I think that insertion scan key is fine. It's been called that for
>> almost twenty years. It's not like it's an intuitive concept that
>> could be conveyed easily if only we came up with a new, pithy name.
> 
> Yeah. It's been like that forever, but I must confess I hadn't paid any
> attention to it, until now. I had not understood that text in the README
> explaining the difference between search and insertion scan keys, before
> looking at this patch. Not sure I ever read it with any thought. Now
> that I understand it, I don't like the "insertion scan key" name.
> 
>> BTW, I would like to hear what you think of the idea of minusinfkey
>> (and the !minusinfkey optimization) specifically.
> 
> I don't understand it :-(. I guess that's valuable feedback on its own.
> I'll spend more time reading the code around that, but meanwhile, if you
> can think of a simpler way to explain it in the comments, that'd be good.
> 
>>> The new BTInsertStateData struct also
>>> holds the current buffer we're considering to insert to, and a
>>> 'bounds_valid' flag to indicate if the saved bounds are valid for the
>>> current buffer. That way, it's more straightforward to clear the
>>> 'bounds_valid' flag whenever we move right.
>>
>> I'm not sure that that's an improvement. Moving right should be very
>> rare with my patch. gcov shows that we never move right here anymore
>> with the regression tests, or within _bt_check_unique() -- not once.
> 
> I haven't given performance much thought, really. I don't expect this
> method to be any slower, but the point of the refactoring is to make the
> code easier to understand.
> 
>>> /*
>>>     * Do the insertion. First move right to find the correct page to
>>>     * insert to, if necessary. If we're inserting to a non-unique index,
>>>     * _bt_search() already did this when it checked if a move to the
>>>     * right was required for leaf page.  Insertion scankey's scantid
>>>     * would have been filled out at the time. On a unique index, the
>>>     * current buffer is the first buffer containing duplicates, however,
>>>     * so we may need to move right to the correct location for this
>>>     * tuple.
>>>     */
>>> if (checkingunique || itup_key->heapkeyspace)
>>>           _bt_findinsertpage(rel, &insertstate, stack, heapRel);
>>>
>>> newitemoff = _bt_binsrch_insert(rel, &insertstate);
>>>
>>> _bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
>>> newitemoff, false);
>>>
>>> Does this make sense?
>>
>> I guess you're saying this because you noticed that the for (;;) loop
>> in _bt_findinsertloc() doesn't do that much in many cases, because of
>> the fastpath.
> 
> The idea is that _bt_findinsertpage() would not need to know whether the
> unique checks were performed or not. I'd like to encapsulate all the
> information about the "insert position we're considering" in the
> BTInsertStateData struct. Passing 'checkingunique' as a separate
> argument violates that, because when it's set, the key means something
> slightly different.
> 
> Hmm. Actually, with patch #2, _bt_findinsertloc() could look at whether
> 'scantid' is set, instead of 'checkingunique'. That would seem better.
> If it looks at 'checkingunique', it's making the assumption that if
> unique checks were not performed, then we are already positioned on the
> correct page, according to the heap TID. But looking at 'scantid' seems
> like a more direct way of getting the same information. And then we
> won't need to pass the 'checkingunique' flag as an "out-of-band" argument.
> 
> So I'm specifically suggesting that we replace this, in _bt_findinsertloc:
> 
>         if (!checkingunique && itup_key->heapkeyspace)
>             break;
> 
> With this:
> 
>         if (itup_key->scantid)
>             break;
> 
> And remove the 'checkingunique' argument from _bt_findinsertloc.

Ah, scratch that. By the time we call _bt_findinsertloc(), scantid has 
already been restored, even if it was not set originally when we did 
_bt_search.

My dislike here is that passing 'checkingunique' as a separate argument 
acts like a "modifier", slightly changing the meaning of the insertion 
scan key. If it's not set, we know we're positioned on the correct page. 
Otherwise, we might not be. And it's a pretty indirect way of saying 
that, as it also depends on 'heapkeyspace'. Perhaps add a flag to 
BTInsertStateData, to indicate the same thing more explicitly. Something 
like "bool is_final_insertion_page; /* when set, no need to move right */".

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 05/03/2019 05:16, Peter Geoghegan wrote:
> Attached is v14, which has changes based on your feedback. 
As a quick check of the backwards-compatibility code, i.e. 
!heapkeyspace, I hacked _bt_initmetapage to force the version number to 
3, and ran the regression tests. It failed an assertion in the 
'create_index' test:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f2943f9a535 in __GI_abort () at abort.c:79
#2  0x00005622c7d9d6b4 in ExceptionalCondition 
(conditionName=0x5622c7e4cbe8 "!(_bt_check_natts(rel, key->heapkeyspace, 
page, offnum))", errorType=0x5622c7e4c62a "FailedAssertion",
     fileName=0x5622c7e4c734 "nbtsearch.c", lineNumber=511) at assert.c:54
#3  0x00005622c78627fb in _bt_compare (rel=0x5622c85afbe0, 
key=0x7ffd7a996db0, page=0x7f293d433780 "", offnum=2) at nbtsearch.c:511
#4  0x00005622c7862640 in _bt_binsrch (rel=0x5622c85afbe0, 
key=0x7ffd7a996db0, buf=4622) at nbtsearch.c:432
#5  0x00005622c7861ec9 in _bt_search (rel=0x5622c85afbe0, 
key=0x7ffd7a996db0, bufP=0x7ffd7a9976d4, access=1, 
snapshot=0x5622c8353740) at nbtsearch.c:142
#6  0x00005622c7863a44 in _bt_first (scan=0x5622c841e828, 
dir=ForwardScanDirection) at nbtsearch.c:1183
#7  0x00005622c785f8b0 in btgettuple (scan=0x5622c841e828, 
dir=ForwardScanDirection) at nbtree.c:245
#8  0x00005622c78522e3 in index_getnext_tid (scan=0x5622c841e828, 
direction=ForwardScanDirection) at indexam.c:542
#9  0x00005622c7a67784 in IndexOnlyNext (node=0x5622c83ad280) at 
nodeIndexonlyscan.c:120
#10 0x00005622c7a438d5 in ExecScanFetch (node=0x5622c83ad280, 
accessMtd=0x5622c7a67254 <IndexOnlyNext>, recheckMtd=0x5622c7a67bc9 
<IndexOnlyRecheck>) at execScan.c:95
#11 0x00005622c7a4394a in ExecScan (node=0x5622c83ad280, 
accessMtd=0x5622c7a67254 <IndexOnlyNext>, recheckMtd=0x5622c7a67bc9 
<IndexOnlyRecheck>) at execScan.c:145
#12 0x00005622c7a67c73 in ExecIndexOnlyScan (pstate=0x5622c83ad280) at 
nodeIndexonlyscan.c:322
#13 0x00005622c7a41814 in ExecProcNodeFirst (node=0x5622c83ad280) at 
execProcnode.c:445
#14 0x00005622c7a501a5 in ExecProcNode (node=0x5622c83ad280) at 
../../../src/include/executor/executor.h:231
#15 0x00005622c7a50693 in fetch_input_tuple (aggstate=0x5622c83acdd0) at 
nodeAgg.c:406
#16 0x00005622c7a529d9 in agg_retrieve_direct (aggstate=0x5622c83acdd0) 
at nodeAgg.c:1737
#17 0x00005622c7a525a9 in ExecAgg (pstate=0x5622c83acdd0) at nodeAgg.c:1552
#18 0x00005622c7a41814 in ExecProcNodeFirst (node=0x5622c83acdd0) at 
execProcnode.c:445
#19 0x00005622c7a3621d in ExecProcNode (node=0x5622c83acdd0) at 
../../../src/include/executor/executor.h:231
#20 0x00005622c7a38bd9 in ExecutePlan (estate=0x5622c83acb78, 
planstate=0x5622c83acdd0, use_parallel_mode=false, operation=CMD_SELECT, 
sendTuples=true, numberTuples=0,
     direction=ForwardScanDirection, dest=0x5622c8462088, 
execute_once=true) at execMain.c:1645
#21 0x00005622c7a36872 in standard_ExecutorRun 
(queryDesc=0x5622c83a9eb8, direction=ForwardScanDirection, count=0, 
execute_once=true) at execMain.c:363
#22 0x00005622c7a36696 in ExecutorRun (queryDesc=0x5622c83a9eb8, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
execMain.c:307
#23 0x00005622c7c357dc in PortalRunSelect (portal=0x5622c8336778, 
forward=true, count=0, dest=0x5622c8462088) at pquery.c:929
#24 0x00005622c7c3546f in PortalRun (portal=0x5622c8336778, 
count=9223372036854775807, isTopLevel=true, run_once=true, 
dest=0x5622c8462088, altdest=0x5622c8462088,
     completionTag=0x7ffd7a997d50 "") at pquery.c:770
#25 0x00005622c7c2f029 in exec_simple_query (query_string=0x5622c82cf508 
"SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 
IS NULL;") at postgres.c:1215
#26 0x00005622c7c3369a in PostgresMain (argc=1, argv=0x5622c82faee0, 
dbname=0x5622c82fac50 "regression", username=0x5622c82c81e8 "heikki") at 
postgres.c:4256
#27 0x00005622c7b8bcf2 in BackendRun (port=0x5622c82f3d80) at 
postmaster.c:4378
#28 0x00005622c7b8b45b in BackendStartup (port=0x5622c82f3d80) at 
postmaster.c:4069
#29 0x00005622c7b87633 in ServerLoop () at postmaster.c:1699
#30 0x00005622c7b86e61 in PostmasterMain (argc=3, argv=0x5622c82c6160) 
at postmaster.c:1372
#31 0x00005622c7aa9925 in main (argc=3, argv=0x5622c82c6160) at main.c:228

I haven't investigated any deeper, but apparently something's broken. 
This was with patch v14, without any further changes.

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Thu, Mar 7, 2019 at 12:14 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I haven't investigated any deeper, but apparently something's broken.
> This was with patch v14, without any further changes.

Try it with my patch -- attached.

I think that you missed that the INCLUDE indexes thing within
nbtsort.c needs to be changed back.

-- 
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Wed, Mar 6, 2019 at 11:41 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > BTW, I would like to hear what you think of the idea of minusinfkey
> > (and the !minusinfkey optimization) specifically.
>
> I don't understand it :-(. I guess that's valuable feedback on its own.
> I'll spend more time reading the code around that, but meanwhile, if you
> can think of a simpler way to explain it in the comments, that'd be good.

Here is another way of explaining it:

When I drew you that picture while we were in Lisbon, I mentioned to
you that the patch sometimes used a sentinel scantid value that was
greater than minus infinity, but less than any real scantid. This
could be used to force an otherwise-equal-to-pivot search to go left
rather than uselessly going right. I explained this about 30 minutes
in, when I was drawing you a picture.

Well, that sentinel heap TID thing doesn't exist any more, because it
was replaced by the !minusinfkey optimization, which is a
*generalization* of the same idea, which extends it to all columns
(not just the heap TID column). That way, you never have to go to two
pages just because you searched for a value that happened to be right
at the edge of a leaf page.

Page deletion wants to assume that truncated attributes from the high
key of the page being deleted have actual negative infinity values --
negative infinity is a value, just like any other, albeit one that can
only appear in pivot tuples. This is simulated by VACUUM using
"minusinfkey = true". We go left in the parent, not right, and land on
the correct leaf page. Technically we don't compare the negative
infinity values in the pivot to the negative infinity values in the
scankey, but we return 0 just as if we had, and found them equal.
Similarly, v3 indexes specify "minusinfkey = true" in all cases,
because they always want to go left -- just like in old Postgres
versions. They don't have negative infinity values (matches can be on
either side of the all-equal pivot, so they must go left).

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Thu, Mar 7, 2019 at 12:37 AM Peter Geoghegan <pg@bowt.ie> wrote:
> When I drew you that picture while we were in Lisbon, I mentioned to
> you that the patch sometimes used a sentinel scantid value that was
> greater than minus infinity, but less than any real scantid. This
> could be used to force an otherwise-equal-to-pivot search to go left
> rather than uselessly going right. I explained this about 30 minutes
> in, when I was drawing you a picture.

I meant the opposite: it could be used to go right, instead of going
left when descending the tree and unnecessarily moving right on the
leaf level.

As I said, moving right on the leaf level (rather than during the
descent) should only happen when it's necessary, such as when there is
a concurrent page split. It shouldn't happen reliably when searching
for the same value, unless there really are matches across multiple
leaf pages, and that's just what we have to do.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Wed, Mar 6, 2019 at 11:41 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I don't understand it :-(. I guess that's valuable feedback on its own.
> I'll spend more time reading the code around that, but meanwhile, if you
> can think of a simpler way to explain it in the comments, that'd be good.

One more thing on this: If you force bitmap index scans (by disabling
index-only scans and index scans with the "enable_" GUCs), then you
get EXPLAIN (ANALYZE, BUFFERS) instrumentation for the index alone
(and the heap, separately). No visibility map accesses, which obscure
the same numbers for a similar index-only scan.

You can then observe that most searches of a single value will touch
the bare minimum number of index pages. For example, if there are 3
levels in the index, you should access only 3 index pages total,
unless there are literally hundreds of matches, and cannot avoid
storing them on more than one leaf page. You'll see that the scan
touches the minimum possible number of index pages, because of:

* Many duplicates strategy. (Not single value strategy, which I
incorrectly mentioned in relation to this earlier.)

* The !minusinfkey optimization, which ensures that we go to the
right of an otherwise-equal pivot tuple in an internal page, rather
than left.

* The "continuescan" high key patch, which ensures that the scan
doesn't go to the right from the first leaf page to try to find even
more matches. The high key on the same leaf page will indicate that
the scan is over, without actually visiting the sibling. (Again, I'm
assuming that your search is for a single value.)

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I came up with the attached (against master), which addresses the 2nd
> and 3rd points. I added a whole new BTInsertStateData struct, to hold
> the binary search bounds. BTScanInsert now only holds the 'scankeys'
> array, and the 'nextkey' flag. The new BTInsertStateData struct also
> holds the current buffer we're considering to insert to, and a
> 'bounds_valid' flag to indicate if the saved bounds are valid for the
> current buffer. That way, it's more straightforward to clear the
> 'bounds_valid' flag whenever we move right.
>
> I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
> search like _bt_binsrch does, but the bounds caching is only done in
> _bt_binsrch_insert. Seems more clear to have separate functions for them
> now, even though there's some duplication.

Attached is v15, which does not yet integrate these changes. However,
it does integrate earlier feedback that you posted for v14. I also
cleaned up some comments within nbtsplitloc.c.

I would like to work through these other items with you
(_bt_binsrch_insert() and so on), but I think that it would be helpful
if you made an effort to understand the minusinfkey stuff first. I
spent a lot of time improving the explanation of that within
_bt_compare(). It's important.

The !minusinfkey optimization is more than just a "nice to have".
Suffix truncation makes pivot tuples less restrictive about what can
go on each page, but that might actually hurt performance if we're not
also careful to descend directly to the leaf page where matches will
first appear (rather than descending to a page to its left). If we
needlessly descend to a page that's to the left of the leaf page we
really ought to go straight to, then there are cases that are
regressed rather than helped -- especially cases where splits use the
"many duplicates" strategy. You continue to get correct answers when
the !minusinfkey optimization is ripped out, but it seems almost
essential that we include it. While it's true that we've always had to
descend too far to the left like this, it's also true that suffix
truncation will make
it happen much more frequently. It would happen (without the
!minusinfkey optimization) most often where suffix truncation makes
pivot tuples smallest.

Once you grok the minusinfkey stuff, then we'll be in a better
position to work through the feedback about _bt_binsrch_insert() and
so on, I think. You may lack all of the context of how the second
patch goes on to use the new insertion scan key struct, so it will
probably make life easier if we're both on the same page. (Pun very
much intended.)

Thanks again!
-- 
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 08/03/2019 12:22, Peter Geoghegan wrote:
> I would like to work through these other items with you
> (_bt_binsrch_insert() and so on), but I think that it would be helpful
> if you made an effort to understand the minusinfkey stuff first. I
> spent a lot of time improving the explanation of that within
> _bt_compare(). It's important.

Ok, after thinking about it for a while, I think I understand the minus 
infinity stuff now. Let me try to explain it in my own words:

Imagine that you have an index with two key columns, A and B. The index 
has two leaf pages, with the following items:

+--------+   +--------+
| Page 1 |   | Page 2 |
|        |   |        |
|    1 1 |   |    2 1 |
|    1 2 |   |    2 2 |
|    1 3 |   |    2 3 |
|    1 4 |   |    2 4 |
|    1 5 |   |    2 5 |
+--------+   +--------+

The key space is neatly split on the first key column - probably thanks 
to the new magic in the page split code.

Now, what do we have as the high key of page 1? Answer: "2 -inf". The 
"-inf" is not stored in the key itself, the second key column is just 
omitted, and the search code knows to treat it implicitly as a value 
that's lower than any real value. Hence, "minus infinity".

However, during page deletion, we need to perform a search to find the 
downlink pointing to a leaf page. We do that by using the leaf page's 
high key as the search key. But the search needs to treat it slightly 
differently in that case. Normally, searching with a single key value, 
"2", we would land on page 2, because any real value beginning with "2" 
would be on that page, but in the page deletion case, we want to find 
page 1. Setting the BTScanInsert.minusinfkey flag tells the search code 
to do that.
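
To spell out the two behaviours, here's a toy model of the comparison
(standalone C with plain integer attributes -- not the real _bt_compare()):

#include <stdbool.h>

/*
 * Toy model.  'indnatts' is the total number of attributes in the index's
 * key space; a pivot with pivotnatts < indnatts has had its suffix
 * truncated away.  Return value follows the usual convention: <0, 0, >0
 * for scan key <, =, > pivot.
 */
static int
toy_compare_pivot(const int *scankey, int keysz,
                  const int *pivot, int pivotnatts,
                  int indnatts, bool minusinfkey)
{
    int natts = (keysz < pivotnatts) ? keysz : pivotnatts;

    for (int i = 0; i < natts; i++)
    {
        if (scankey[i] < pivot[i])
            return -1;
        if (scankey[i] > pivot[i])
            return 1;
    }

    /* attributes present in both compared equal */

    if (keysz > pivotnatts)
        return 1;       /* real scan key value beats truncated "-inf" */

    /*
     * The scan key ran out of attributes where the pivot was truncated.
     * Normally we still report "greater", and the search moves right past
     * the pivot: everything to its left is strictly below the untruncated
     * prefix, so there can be no matches there.  Page deletion, which
     * re-locates a leaf page using that page's own high key as the search
     * key, sets minusinfkey and treats this as a match instead, so the
     * search descends to the left of the pivot and lands on that page.
     */
    if (keysz == pivotnatts && pivotnatts < indnatts && !minusinfkey)
        return 1;

    return 0;
}

With the two-page picture above: a scan key (2) against page 1's high key
(2, <truncated>) compares as "greater" in the normal case, so the search
descends to page 2; with minusinfkey set it compares as "equal", so the
search descends left and finds page 1.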

Question: Wouldn't it be more straightforward to use "1 +inf" as page 
1's high key? I.e. treat any missing columns as positive infinity. That 
way, the search for page deletion wouldn't need to be treated 
differently. That's also how this used to work: all tuples on a page 
used to be <= high key, not strictly < high key. And it would also make 
the rightmost page less of a special case: you could pretend the 
rightmost page has a pivot tuple with all columns truncated away, which 
means positive infinity.

You have this comment _bt_split which touches the subject:

>     /*
>      * The "high key" for the new left page will be the first key that's going
>      * to go into the new right page, or possibly a truncated version if this
>      * is a leaf page split.  This might be either the existing data item at
>      * position firstright, or the incoming tuple.
>      *
>      * The high key for the left page is formed using the first item on the
>      * right page, which may seem to be contrary to Lehman & Yao's approach of
>      * using the left page's last item as its new high key when splitting on
>      * the leaf level.  It isn't, though: suffix truncation will leave the
>      * left page's high key fully equal to the last item on the left page when
>      * two tuples with equal key values (excluding heap TID) enclose the split
>      * point.  It isn't actually necessary for a new leaf high key to be equal
>      * to the last item on the left for the L&Y "subtree" invariant to hold.
>      * It's sufficient to make sure that the new leaf high key is strictly
>      * less than the first item on the right leaf page, and greater than or
>      * equal to (not necessarily equal to) the last item on the left leaf
>      * page.
>      *
>      * In other words, when suffix truncation isn't possible, L&Y's exact
>      * approach to leaf splits is taken.  (Actually, even that is slightly
>      * inaccurate.  A tuple with all the keys from firstright but the heap TID
>      * from lastleft will be used as the new high key, since the last left
>      * tuple could be physically larger despite being opclass-equal in respect
>      * of all attributes prior to the heap TID attribute.)
>      */

But it doesn't explain why it's done like that.

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Fri, Mar 8, 2019 at 2:12 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Now, what do we have as the high key of page 1? Answer: "2 -inf". The
> "-inf" is not stored in the key itself, the second key column is just
> omitted, and the search code knows to treat it implicitly as a value
> that's lower than any real value. Hence, "minus infinity".

Right.

> However, during page deletion, we need to perform a search to find the
> downlink pointing to a leaf page. We do that by using the leaf page's
> high key as the search key. But the search needs to treat it slightly
> differently in that case. Normally, searching with a single key value,
> "2", we would land on page 2, because any real value beginning with "2"
> would be on that page, but in the page deletion case, we want to find
> page 1. Setting the BTScanInsert.minusinfkey flag tells the search code
> to do that.

Right.

> Question: Wouldn't it be more straightforward to use "1 +inf" as page
> 1's high key? I.e treat any missing columns as positive infinity.

That might also work, but it wouldn't be more straightforward on
balance. This is because:

* We have always taken the new high key from the firstright item, and
we also continue to do that on internal pages -- same as before. It
would certainly complicate the nbtsplitloc.c code to have to deal with
this new special case now (leaf and internal pages would have to have
far different handling, not just slightly different handling).

* We have always had "-inf" values as the first item on an internal
page, which is explicitly truncated to zero attributes as of Postgres
v11. It seems ugly to me to make truncated attributes mean negative
infinity in that context, but positive infinity in every other
context.

* Another reason that I prefer "-inf" to "+inf" is that you can
imagine an implementation that makes pivot tuples into normalized
binary keys, that are truncated using generic/opclass-agnostic logic,
and compared using strcmp(). If the scankey binary string is longer
than the pivot tuple, then it's greater according to strcmp() -- that
just works. And, you can truncate the original binary strings built
using opclass infrastructure without having to understand where
attributes begin and end (though this relies on encoding things like
NULL-ness a certain way). If we define truncation to be "+inf" now,
then none of this works.
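
A toy illustration of that last point (the normalization itself is purely
hypothetical, nothing the patch does): a truncated pivot is simply a prefix
of the normalized scan key, and memcmp over the common prefix plus a
"longer key wins ties" rule gives exactly the minus-infinity behaviour:

#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* hypothetical normalized two-attribute key, with a separator byte */
    const unsigned char scankey[] = {'f', 'o', 'o', 0x01, 'b', 'a', 'r'};
    /* the same key as a suffix-truncated pivot: only the first attribute */
    const unsigned char pivot[] = {'f', 'o', 'o'};

    int     cmp = memcmp(pivot, scankey, sizeof(pivot));

    if (cmp == 0 && sizeof(pivot) < sizeof(scankey))
        cmp = -1;               /* truncated pivot sorts lower: "-inf" */

    printf("pivot %c scankey\n", cmp < 0 ? '<' : (cmp == 0 ? '=' : '>'));
    return 0;
}

The scan key being "longer" than the truncated pivot is what makes it
compare greater, which is the strcmp() point above.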

All of that said, maybe it would be clearer if page deletion was not
the special case that has to opt in to minusinfkey semantics. Perhaps
it would make more sense for everyone else to opt out of minusinfkey
semantics, and to get the !minusinfkey optimization as a result of
that. I only did it the other way around because that meant that only
nbtpage.c had to acknowledge the special case.

Even calling it minusinfkey is misleading in one way, because we're
not so much searching for "-inf" values as we are searching for the
first page that could have tuples for the untruncated attributes. But
isn't that how this has always worked, given that we've had to deal
with duplicate pivot tuples on the same level before now? As I said,
we're not doing an extra thing when minusinfkey is true (during page 
deletion) -- it's the other way around. Saying that we're searching
for minus infinity values for the truncated attributes is kind of a
lie, although the search does behave that way.

> That way, the search for page deletion wouldn't need to be treated
> differently. That's also how this used to work: all tuples on a page
> used to be <= high key, not strictly < high key.

That isn't accurate -- it still works that way on the leaf level. The
alternative that you've described is possible, I think, but the key
space works just the same with either of our approaches. You've merely
thought of an alternative way of generating new high keys that satisfy
the same invariants as my own scheme. Provided the new separator for
high key is >= last item on the left and < first item on the right,
everything works.

As you point out, the original Lehman and Yao rule for leaf pages
(which Postgres kinda followed before) is that the high key is <=
items on the leaf level. But this patch makes nbtree follow that rule
fully and properly.

Maybe you noticed that amcheck tests < on internal pages, and only
checks <= on leaf pages. Perhaps it led you to believe that I did
things differently. Actually, this is classic Lehman and Yao. The keys
in internal pages are all "separators" as far as Lehman and Yao are
concerned, so the high key is less of a special case on internal
pages. We check < on internal pages because all separators are
supposed to be unique on a level. But, as I said, we do check <= on
the leaf level.

Take a look at "Fig. 7 A B-Link Tree" in the Lehman and Yao paper if
this is unclear. That shows that internal pages have unique keys -- we
can therefore expect the high key to be < items in internal pages. It
also shows that leaf pages copy the high key from the last item on the
left page -- we can expect the high key to be <= items there. Just
like with the patch, in effect. The comment from _bt_split() that you
quoted explains why what we do is like what Lehman and Yao do when
suffix truncation cannot truncate anything -- the new high key on the
left page comes from the last item on the left page.

> And it would also make
> the rightmost page less of a special case: you could pretend the
> rightmost page has a pivot tuple with all columns truncated away, which
> means positive infinity.

But we do already pretend that. How is that not the case already?

> But it doesn't explain why it's done like that.

It's done this way because that's equivalent to what Lehman and Yao
do, while also avoiding adding the special cases that I mentioned (in
nbtsplitloc.c, and so on).

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > Question: Wouldn't it be more straightforward to use "1 +inf" as page
> > 1's high key? I.e treat any missing columns as positive infinity.
>
> That might also work, but it wouldn't be more straightforward on
> balance. This is because:

I thought of another reason:

* The 'Add high key "continuescan" optimization' is effective because
the high key of a leaf page tends to look relatively dissimilar to
other items on the page. The optimization would almost never help if
the high key was derived from the lastleft item at the time of a split
-- that's no more informative than the lastleft item itself.

As things stand with the patch, a high key usually has a value for its
last untruncated attribute that can only appear on the page to the
right, and never the current page. We'd quite like to be able to
conclude that the page to the right can't be interesting there and
then, without needing to visit it. Making new leaf high keys "as close
as possible to items on the right, without actually touching them"
makes the optimization quite likely to work out with the TPC-C
indexes, when we search for orderline items for an order that is
rightmost of a leaf page in the orderlines primary key.

And another reason:

* This makes it likely that any new items that would go between the
original lastleft and firstright items end up on the right page when
they're inserted after the lastleft/firstright split. This is
generally a good thing, because we've optimized WAL-logging for new
pages that go on the right. (You pointed this out to me in Lisbon, in
fact.)

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan <pg@bowt.ie> wrote:
> All of that said, maybe it would be clearer if page deletion was not
> the special case that has to opt in to minusinfkey semantics. Perhaps
> it would make more sense for everyone else to opt out of minusinfkey
> semantics, and to get the !minusinfkey optimization as a result of
> that. I only did it the other way around because that meant that only
> nbtpage.c had to acknowledge the special case.

This seems like a good idea -- we should reframe the !minusinfkey
optimization, without actually changing the behavior. Flip it around.

The minusinfkey field within the insertion scankey struct would be
called something like "descendrighttrunc" instead. Same idea, but with
the definition inverted. Most _bt_search() callers (all of those
outside of nbtpage.c and amcheck) would be required to opt in to that
optimization to get it.

Under this arrangement, nbtpage.c/page deletion would not ask for the
"descendrighttrunc" optimization, and would therefore continue to do
what it has always done: find the first leaf page that its insertion
scankey values could be on (we don't lie about searching for negative
infinity, or having a negative infinity sentinel value in scan key).
The only difference for page deletion between v3 indexes and v4
indexes is that with v4 indexes we'll relocate the same leaf page
reliably, since every separator key value is guaranteed to be unique
on its level (including the leaf level/leaf high keys). This is just a
detail, though, and not one that's even worth pointing out; we're not
*relying* on that being true on v4 indexes anyway (we check that the
block number is a match too, which is strictly necessary for v3
indexes and seems like a good idea for v4 indexes).

This is also good because it makes it clear that the unique index code
within _bt_doinsert() (that temporarily sets scantid to NULL) benefits
from the descendrighttrunc/!minusinfkey optimization -- it should be
"honest" and ask for it explicitly. We can make _bt_doinsert() opt in
to the optimization for unique indexes, but not for other indexes,
where scantid is set from the start. The
descendrighttrunc/!minusinfkey optimization cannot help when scantid
is set from the start, because we'll always have an attribute value in
insertion scankey that breaks the tie for us instead. We'll always
move right of a heap-TID-truncated separator key whose untruncated
attributes are all equal to a prefix of our insertion scankey values.

(This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for
unique indexes matters more than you might think -- we do really badly
with things like Zipfian distributions currently, and reducing the
contention goes some way towards helping with that. Postgres pro
noticed this a couple of years back, and analyzed it in detail at that
time. It's really nice that we very rarely have to move right within
code like _bt_check_unique() and _bt_findsplitloc() with the patch.)

Does that make sense to you? Can you live with the name
"descendrighttrunc", or do you have a better one?

Thanks
-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 08/03/2019 23:21, Peter Geoghegan wrote:
> On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan <pg@bowt.ie> wrote:
>> All of that said, maybe it would be clearer if page deletion was not
>> the special case that has to opt in to minusinfkey semantics. Perhaps
>> it would make more sense for everyone else to opt out of minusinfkey
>> semantics, and to get the !minusinfkey optimization as a result of
>> that. I only did it the other way around because that meant that only
>> nbtpage.c had to acknowledge the special case.
> 
> This seems like a good idea -- we should reframe the !minusinfkey
> optimization, without actually changing the behavior. Flip it around.
>
> The minusinfkey field within the insertion scankey struct would be
> called something like "descendrighttrunc" instead. Same idea, but with
> the definition inverted. Most _bt_search() callers (all of those
> outside of nbtpage.c and amcheck) would be required to opt in to that
> optimization to get it.
> 
> Under this arrangement, nbtpage.c/page deletion would not ask for the
> "descendrighttrunc" optimization, and would therefore continue to do
> what it has always done: find the first leaf page that its insertion
> scankey values could be on (we don't lie about searching for negative
> infinity, or having a negative infinity sentinel value in scan key).
> The only difference for page deletion between v3 indexes and v4
> indexes is that with v4 indexes we'll relocate the same leaf page
> reliably, since every separator key value is guaranteed to be unique
> on its level (including the leaf level/leaf high keys). This is just a
> detail, though, and not one that's even worth pointing out; we're not
> *relying* on that being true on v4 indexes anyway (we check that the
> block number is a match too, which is strictly necessary for v3
> indexes and seems like a good idea for v4 indexes).
> 
> This is also good because it makes it clear that the unique index code
> within _bt_doinsert() (that temporarily sets scantid to NULL) benefits
> from the descendrighttrunc/!minusinfkey optimization -- it should be
> "honest" and ask for it explicitly. We can make _bt_doinsert() opt in
> to the optimization for unique indexes, but not for other indexes,
> where scantid is set from the start. The
> descendrighttrunc/!minusinfkey optimization cannot help when scantid
> is set from the start, because we'll always have an attribute value in
> insertion scankey that breaks the tie for us instead. We'll always
> move right of a heap-TID-truncated separator key whose untruncated
> attributes are all equal to a prefix of our insertion scankey values.
> 
> (This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for
> unique indexes matters more than you might think -- we do really badly
> with things like Zipfian distributions currently, and reducing the
> contention goes some way towards helping with that. Postgres pro
> noticed this a couple of years back, and analyzed it in detail at that
> time. It's really nice that we very rarely have to move right within
> code like _bt_check_unique() and _bt_findsplitloc() with the patch.)
> 
> Does that make sense to you? Can you live with the name
> "descendrighttrunc", or do you have a better one?

"descendrighttrunc" doesn't make much sense to me, either. I don't 
understand it. Maybe a comment would make it clear, though.

I don't feel like this is an optimization. It's a natural consequence of 
what the high key means. I guess you can think of it as an optimization, 
in the same way that not fully scanning the whole index for every search 
is an optimization, but that's not how I think of it :-).

If we don't flip the meaning of the flag, then maybe calling it 
something like "searching_for_leaf_page" would make sense:

/*
  * When set, we're searching for the leaf page with the given high key,
  * rather than leaf tuples matching the search keys.
  *
  * Normally, when !searching_for_pivot_tuple, if a page's high key
  * has truncated columns, and the columns that are present are equal to
  * the search key, the search will not descend to that page. For
  * example, if an index has two columns, and a page's high key is
  * ("foo", <omitted>), and the search key is also ("foo," <omitted>),
  * the search will not descend to that page, but its right sibling. The
  * omitted column in the high key means that all tuples on the page must
  * be strictly < "foo", so we don't need to visit it. However, sometimes
  * we perform a search to find the parent of a leaf page, using the leaf
  * page's high key as the search key. In that case, when we search for
  * ("foo", <omitted>), we do want to land on that page, not its right
  * sibling.
  */
bool    searching_for_leaf_page;


As the patch stands, you're also setting minusinfkey when dealing with 
v3 indexes. I think it would be better to only set 
searching_for_leaf_page in nbtpage.c. In general, I think BTScanInsert 
should describe the search key, regardless of whether it's a V3 or V4 
index. Properties of the index belong elsewhere. (We're violating that 
by storing the 'heapkeyspace' flag in BTScanInsert. That wart is 
probably OK, it is pretty convenient to have it there. But in principle...)

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Sun, Mar 10, 2019 at 7:09 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> "descendrighttrunc" doesn't make much sense to me, either. I don't
> understand it. Maybe a comment would make it clear, though.

It's not an easily grasped concept. I don't think that any name will
easily convey the idea to the reader, though. I'm happy to go with
whatever name you prefer.

> I don't feel like this is an optimization. It's natural consequence of
> what the high key means. I guess you can think of it as an optimization,
> in the same way that not fully scanning the whole index for every search
> is an optimization, but that's not how I think of it :-).

I would agree with this in a green field situation, where we don't
have to consider the legacy of v3 indexes. But that's not the case
here.

> If we don't flip the meaning of the flag, then maybe calling it
> something like "searching_for_leaf_page" would make sense:
>
> /*
>   * When set, we're searching for the leaf page with the given high key,
>   * rather than leaf tuples matching the search keys.
>   *
>   * Normally, when !searching_for_pivot_tuple, if a page's high key

I guess you meant to say "searching_for_pivot_tuple" both times (not
"searching_for_leaf_page"). After all, we always search for a leaf
page. :-)

I'm fine with "searching_for_pivot_tuple", I think. The underscores
are not really stylistically consistent with other stuff in nbtree.h,
but I can use something very similar to your suggestion that is
consistent.

>   * has truncated columns, and the columns that are present are equal to
>   * the search key, the search will not descend to that page. For
>   * example, if an index has two columns, and a page's high key is
>   * ("foo", <omitted>), and the search key is also ("foo," <omitted>),
>   * the search will not descend to that page, but its right sibling. The
>   * omitted column in the high key means that all tuples on the page must
>   * be strictly < "foo", so we don't need to visit it. However, sometimes
>   * we perform a search to find the parent of a leaf page, using the leaf
>   * page's high key as the search key. In that case, when we search for
>   * ("foo", <omitted>), we do want to land on that page, not its right
>   * sibling.
>   */
> bool    searching_for_leaf_page;

That works for me (assuming you meant searching_for_pivot_tuple).

> As the patch stands, you're also setting minusinfkey when dealing with
> v3 indexes. I think it would be better to only set
> searching_for_leaf_page in nbtpage.c.

That would mean I would have to check both heapkeyspace and
minusinfkey within _bt_compare(). I would rather just keep the
assertion that makes sure that !heapkeyspace callers are also
minusinfkey callers, and the comments that explain why that is. It
might even matter to performance -- having an extra condition in
_bt_compare() is something we should avoid.

> In general, I think BTScanInsert
> should describe the search key, regardless of whether it's a V3 or V4
> index. Properties of the index belong elsewhere. (We're violating that
> by storing the 'heapkeyspace' flag in BTScanInsert. That wart is
> probably OK, it is pretty convenient to have it there. But in principle...)

The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
they too have a heap TID attribute. nbtsearch.c code is not allowed to
rely on its value, though, and must use
minusinfkey/searching_for_pivot_tuple semantics (relying on its value
being minus infinity is still relying on its value being something).

Now, it's also true that there are a number of things that we have to
do within nbtinsert.c for !heapkeyspace that don't really concern the
key space as such. Even still, thinking about everything with
reference to the keyspace, and keeping that as similar as possible
between v3 and v4 is a good thing. It is up to high level code (such
as _bt_first()) to not allow a !heapkeyspace index scan to do
something that won't work for it. It is not up to low level code like
_bt_compare() to worry about these differences (beyond asserting that
caller got it right). If page deletion didn't need minusinfkey
semantics (if nobody but v3 indexes needed that), then I would just
make the "move right of separator" !minusinfkey code within
_bt_compare() test heapkeyspace. But we do have a general need for
minusinfkey semantics, so it seems simpler and more future-proof to
keep heapkeyspace out of low-level nbtsearch.c code.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 10/03/2019 20:53, Peter Geoghegan wrote:
> On Sun, Mar 10, 2019 at 7:09 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> If we don't flip the meaning of the flag, then maybe calling it
>> something like "searching_for_leaf_page" would make sense:
>>
>> /*
>>    * When set, we're searching for the leaf page with the given high key,
>>    * rather than leaf tuples matching the search keys.
>>    *
>>    * Normally, when !searching_for_pivot_tuple, if a page's high key
> 
> I guess you meant to say "searching_for_pivot_tuple" both times (not
> "searching_for_leaf_page"). After all, we always search for a leaf
> page. :-)

Ah, yeah. Not sure. I wrote it as "searching_for_pivot_tuple" first, but 
changed to "searching_for_leaf_page" at the last minute. My thinking was 
that in the page-deletion case, you're trying to re-locate a particular 
leaf page. Otherwise, you're searching for matching tuples, not a 
particular page. Although during insertion, I guess you are also 
searching for a particular page, the page to insert to.

>> As the patch stands, you're also setting minusinfkey when dealing with
>> v3 indexes. I think it would be better to only set
>> searching_for_leaf_page in nbtpage.c.
> 
> That would mean I would have to check both heapkeyspace and
> minusinfkey within _bt_compare().

Yeah.

> I would rather just keep the
> assertion that makes sure that !heapkeyspace callers are also
> minusinfkey callers, and the comments that explain why that is. It
> might even matter to performance -- having an extra condition in
> _bt_compare() is something we should avoid.

It's a hot codepath, but I doubt it's *that* hot that it matters, 
performance-wise...

>> In general, I think BTScanInsert
>> should describe the search key, regardless of whether it's a V3 or V4
>> index. Properties of the index belong elsewhere. (We're violating that
>> by storing the 'heapkeyspace' flag in BTScanInsert. That wart is
>> probably OK, it is pretty convenient to have it there. But in principle...)
> 
> The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
> they too have a heap TID attribute. nbtsearch.c code is not allowed to
> rely on its value, though, and must use
> minusinfkey/searching_for_pivot_tuple semantics (relying on its value
> being minus infinity is still relying on its value being something).

Yeah. I find that's a complicated way to think about it. My mental model 
is that v4 indexes store heap TIDs, and every tuple is unique thanks to 
that. But on v3, we don't store heap TIDs, and duplicates are possible.

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Sun, Mar 10, 2019 at 12:53 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Ah, yeah. Not sure. I wrote it as "searching_for_pivot_tuple" first, but
> changed to "searching_for_leaf_page" at the last minute. My thinking was
> that in the page-deletion case, you're trying to re-locate a particular
> leaf page. Otherwise, you're searching for matching tuples, not a
> particular page. Although during insertion, I guess you are also
> searching for a particular page, the page to insert to.

I prefer something like "searching_for_pivot_tuple", because it's
unambiguous. Okay with that?

> It's a hot codepath, but I doubt it's *that* hot that it matters,
> performance-wise...

I'll figure that out. Although I am currently looking into a
regression in workloads that fit in shared_buffers, which my
micro-benchmarks didn't catch initially. Indexes are still much
smaller, but we get a ~2% regression all the same. OTOH, we get a
7.5%+ increase in throughput when the workload is I/O bound, and
latency is generally no worse, and often better, in any workload.

I suspect that the nice top-down approach to nbtsplitloc.c has its
costs...will let you know more when I know more.

> > The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
> > they too have a heap TID attribute. nbtsearch.c code is not allowed to
> > rely on its value, though, and must use
> > minusinfkey/searching_for_pivot_tuple semantics (relying on its value
> > being minus infinity is still relying on its value being something).
>
> Yeah. I find that's a complicated way to think about it. My mental model
> is that v4 indexes store heap TIDs, and every tuple is unique thanks to
> that. But on v3, we don't store heap TIDs, and duplicates are possible.

I'll try it that way, then.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Sun, Mar 10, 2019 at 1:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
> > > they too have a heap TID attribute. nbtsearch.c code is not allowed to
> > > rely on its value, though, and must use
> > > minusinfkey/searching_for_pivot_tuple semantics (relying on its value
> > > being minus infinity is still relying on its value being something).
> >
> > Yeah. I find that's a complicated way to think about it. My mental model
> > is that v4 indexes store heap TIDs, and every tuple is unique thanks to
> > that. But on v3, we don't store heap TIDs, and duplicates are possible.
>
> I'll try it that way, then.

Attached is v16, which does it that way instead. There are simpler
comments, still located within _bt_compare(). These are based on your
suggested wording, with some changes. I think that I prefer it this
way too. Please let me know what you think.

Other changes:

* nbtsplitloc.c failed to consider the full range of values in the
split interval when deciding perfect penalty. It considered from the
middle to the left or right edge, rather than from the left edge to
the right edge. This didn't seem to really affect the quality of its
decisions very much, but it was still wrong. This is fixed by a new
function that determines the left and right edges of the split
interval -- _bt_interval_edges().

* We now record the smallest observed tuple during our pass over the
page to record split points. This is used by internal page splits, to
get a more useful "perfect penalty", saving cycles in the common case
where there isn't much variability in the size of tuples on the page
being split. The same field is used within the "split after new item"
optimization as a further crosscheck -- it's now impossible to fool it
into thinking that the page has equisized tuples.

The regression that I mentioned earlier isn't in pgbench type
workloads (even when the distribution is something more interesting
than the uniform distribution default). It is only in workloads with
lots of page splits and lots of index churn, where we get most of the
benefit of the patch, but also where the costs are most apparent.
Hopefully it can be fixed, but if not I'm inclined to think that it's
a price worth paying. This certainly still needs further analysis and
discussion, though. This revision of the patch does not attempt to
address that problem in any way.

-- 
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Sun, Mar 10, 2019 at 5:17 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The regression that I mentioned earlier isn't in pgbench type
> workloads (even when the distribution is something more interesting
> that the uniform distribution default). It is only in workloads with
> lots of page splits and lots of index churn, where we get most of the
> benefit of the patch, but also where the costs are most apparent.
> Hopefully it can be fixed, but if not I'm inclined to think that it's
> a price worth paying. This certainly still needs further analysis and
> discussion, though. This revision of the patch does not attempt to
> address that problem in any way.

I believe that I've figured out what's going on here.

At first, I thought that this regression was due to the cycles that
have been added to page splits, but that doesn't seem to be the case
at all. Nothing that I did to make page splits faster helped (e.g.
temporarily going back to doing them "bottom up" made no difference). CPU
utilization was consistently slightly *higher* with the master branch
(patch spent slightly more CPU time idle). I now believe that the
problem is with LWLock/buffer lock contention on index pages, and that
that's an inherent cost with a minority of write-heavy high contention
workloads. A cost that we should just accept.

Making the orderline primary key about 40% smaller increases
contention when BenchmarkSQL is run with this particular
configuration. The latency for the NEW_ORDER transaction went from
~4ms average on master to ~5ms average with the patch, while the
latency for other types of transactions was either unchanged or
improved. It's noticeable, but not that noticeable. This needs to be
put in context. The final transactions per minute for this
configuration was 250,000, with a total of only 100 warehouses. What
this boils down to is that the throughput per warehouse is about 8000%
of the maximum valid NOPM specified by the TPC-C spec [1]. In other
words, the database is too small relative to the machine, by a huge
amount, making the result totally and utterly invalid if you go on
what the TPC-C spec says. This exaggerates the LWLock/buffer lock
contention on index pages.

TPC-C is supposed to simulate a real use case with a plausible
configuration, but the details here are totally unrealistic. For
example, there are 3 million customers here (there are always 30k
customers per warehouse). 250k TPM means that there were about 112k
new orders per minute. It's hard to imagine a population of 3 million
customers making 112k orders per minute. That's over 20 million orders
in the first 3 hour long run that I got these numbers from. Each of
these orders has an average of about 10 line items. These people must
be very busy, and must have an awful lot of storage space in their
homes! (There are various other factors here, such as skew, and the
details will never be completely realistic anyway, but you take my
point. TPC-C is *designed* to be a realistic distillation of a real
use case, going so far as to require usable GUI input terminals when
evaluating a formal benchmark submission.)

The benchmark that I posted in mid-February [2] (which showed better
performance across the board) was much closer to what the TPC-C spec
requires -- that was only ~400% of maximum valid NOPM (the
BenchmarkSQL html reports will tell you this if you download the
archive I posted), and had 2,000 warehouses. TPC-C is *supposed* to be
I/O bound, and I/O bound workloads are what the patch helps with the
most. The general idea with TPC-C's NOPM is that you're required to
increase the number of warehouses as throughput increases. This stops
you from getting an unrealistically favorable result by churning
through a small amount of data, from the same few warehouses.

The only benchmark that I ran that actually satisfied TPC-C's NOPM
requirements had a total of 7,000 warehouses, and was almost a full
terabyte in size on the master branch. This was run on an i3.4xlarge
high I/O AWS ec2 instance. That was substantially I/O bound, and had
an improvement in throughput that was very similar to the mid-February
results which came from my home server -- we see a ~7.5% increase in
transaction throughput after a few hours. I attach a graph of block
device reads/writes for the second 4 hour run for this same 7,000
warehouse benchmark (master and patch). This shows a substantial
reduction in I/O according to OS-level instrumentation. (Note that the
same FS/logical block device was used for both WAL and database
files.)

In conclusion: I think that this regression is a cost worth accepting.
The regression in throughput is relatively small (2% - 3%), and the
NEW_ORDER transaction seems to be the only problem (NEW_ORDER happens
to be used for 45% of all transactions with TPC-C, and inserts into
the largest index by far, without reading much). The patch overtakes
master after a few hours anyway -- the patch will still win after
about 6 hours, once the database gets big enough, despite all the
contention. As I said, I think that we see a regression *because* the
indexes are much smaller, not in spite of the fact that they're
smaller. The TPC-C/BenchmarkSQL indexes never fail to be about 40%
smaller than they are on master, no matter the details, even after
many hours.

I'm not seeing the problem when pgbench is run with a small scale
factor but with a high client count. pgbench doesn't have the benefit
of much smaller indexes, so it also doesn't bear any cost when
contention is ramped up. The pgbench_accounts primary key (which is by
far the largest index) is *precisely* the same size as it is on
master, though the other indexes do seem to be a lot smaller. They
were already tiny, though. OTOH, the TPC-C NEW_ORDER transaction does
a lot of straight inserts, localized by warehouse, with skewed access.

[1] https://youtu.be/qYeRHK6oq7g?t=1340
[2] https://www.postgresql.org/message-id/CAH2-WzmsK-1qVR8xC86DXv8U0cHwfPcuH6hhA740fCeEu3XsVg@mail.gmail.com

--
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 12/03/2019 04:47, Peter Geoghegan wrote:
> In conclusion: I think that this regression is a cost worth accepting.
> The regression in throughput is relatively small (2% - 3%), and the
> NEW_ORDER transaction seems to be the only problem (NEW_ORDER happens
> to be used for 45% of all transactions with TPC-C, and inserts into
> the largest index by far, without reading much). The patch overtakes
> master after a few hours anyway -- the patch will still win after
> about 6 hours, once the database gets big enough, despite all the
> contention. As I said, I think that we see a regression*because*  the
> indexes are much smaller, not in spite of the fact that they're
> smaller. The TPC-C/BenchmarkSQL indexes never fail to be about 40%
> smaller than they are on master, no matter the details, even after
> many hours.

Yeah, that's fine. I'm curious, though, could you bloat the indexes back 
to the old size by setting the fillfactor?

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Mon, Mar 11, 2019 at 11:30 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Yeah, that's fine. I'm curious, though, could you bloat the indexes back
> to the old size by setting the fillfactor?

I think that that might work, though it's hard to say for sure offhand.

The "split after new item" optimization is supposed to be a variation
of rightmost splits, of course. We apply fillfactor in the same way
much of the time. You would still literally split immediately after
the new item some of the time, though, which makes it unclear how much
bloat there would be without testing it.

Some indexes mostly apply fillfactor in non-rightmost pages, while
other indexes mostly split at the exact point past the new item,
depending on details like the size of the groupings.

I am currently doing a multi-day 6,000 warehouse benchmark, since I
want to be sure that the bloat resistance will hold up over days. I
think that it will, because there aren't that many updates, and
they're almost all HOT-safe. I'll put the idea of a 50/50 fillfactor
benchmark with the high-contention/regressed workload on my TODO list,
though.
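
To be concrete, the experiment would just use a lower leaf fillfactor
on the largest index. Something like the following sketch, though the
60 below is only a guess -- the exact setting needed to cancel out the
~40% size reduction would have to be found by testing:

    -- illustrative only; the fillfactor value would need tuning
    ALTER INDEX bmsql_order_line_pkey SET (fillfactor = 60);
    -- fillfactor only affects future page splits and rebuilds, so
    -- rebuild to apply it to existing pages
    REINDEX INDEX bmsql_order_line_pkey;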

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Robert Haas
Date:
On Mon, Mar 11, 2019 at 10:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Sun, Mar 10, 2019 at 5:17 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > The regression that I mentioned earlier isn't in pgbench type
> > workloads (even when the distribution is something more interesting
> > that the uniform distribution default). It is only in workloads with
> > lots of page splits and lots of index churn, where we get most of the
> > benefit of the patch, but also where the costs are most apparent.
> > Hopefully it can be fixed, but if not I'm inclined to think that it's
> > a price worth paying. This certainly still needs further analysis and
> > discussion, though. This revision of the patch does not attempt to
> > address that problem in any way.
>
> I believe that I've figured out what's going on here.
>
> At first, I thought that this regression was due to the cycles that
> have been added to page splits, but that doesn't seem to be the case
> at all. Nothing that I did to make page splits faster helped (e.g.
> temporarily go back to doing them "bottom up" made no difference). CPU
> utilization was consistently slightly *higher* with the master branch
> (patch spent slightly more CPU time idle). I now believe that the
> problem is with LWLock/buffer lock contention on index pages, and that
> that's an inherent cost with a minority of write-heavy high contention
> workloads. A cost that we should just accept.

If I wanted to try to say this in fewer words, would it be fair to say
that reducing the size of an index by 40% without changing anything
else can increase contention on the remaining pages?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Tue, Mar 12, 2019 at 11:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> If I wanted to try to say this in fewer words, would it be fair to say
> that reducing the size of an index by 40% without changing anything
> else can increase contention on the remaining pages?

Yes.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Robert Haas
Date:
On Tue, Mar 12, 2019 at 2:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Mar 12, 2019 at 11:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > If I wanted to try to say this in fewer words, would it be fair to say
> > that reducing the size of an index by 40% without changing anything
> > else can increase contention on the remaining pages?
>
> Yes.

Hey, I understood something today!

I think it's pretty clear that we have to view that as acceptable.  I
mean, we could reduce contention even further by finding a way to make
indexes 40% larger, but I think it's clear that nobody wants that.
Now, maybe in the future we'll want to work on other techniques for
reducing contention, but I don't think we should make that the problem
of this patch, especially because the regressions are small and go
away after a few hours of heavy use.  We should optimize for the case
where the user intends to keep the database around for years, not
hours.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Hey, I understood something today!

And I said something that could be understood!

> I think it's pretty clear that we have to view that as acceptable.  I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.
> Now, maybe in the future we'll want to work on other techniques for
> reducing contention, but I don't think we should make that the problem
> of this patch, especially because the regressions are small and go
> away after a few hours of heavy use.  We should optimize for the case
> where the user intends to keep the database around for years, not
> hours.

I think so too. There is a feature in other database systems called
"reverse key indexes", which deals with this problem in a rather
extreme way. This situation is very similar to the situation with
rightmost page splits, where fillfactor is applied to pack leaf pages
full. The only difference is that there are multiple groupings, not
just one single implicit grouping (everything in the index). You could
probably make very similar observations about rightmost page splits
that apply leaf fillfactor.

Here is an example of how the largest index looks for master with the
100 warehouse case that was slightly regressed:

    table_name    |      index_name       | page_type |  npages   | avg_live_items | avg_dead_items | avg_item_size
------------------+-----------------------+-----------+-----------+----------------+----------------+---------------
 bmsql_order_line | bmsql_order_line_pkey | R         |         1 |         54.000 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | I         |    11,482 |        143.200 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | L         | 1,621,316 |        139.458 |          0.003 |        24.000

Here is what we see with the patch:

    table_name    |      index_name       | page_type | npages  | avg_live_items | avg_dead_items | avg_item_size
------------------+-----------------------+-----------+---------+----------------+----------------+---------------
 bmsql_order_line | bmsql_order_line_pkey | R         |       1 |         29.000 |          0.000 |        22.000
 bmsql_order_line | bmsql_order_line_pkey | I         |   5,957 |        159.149 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | L         | 936,170 |        233.496 |          0.052 |        23.999

REINDEX would leave bmsql_order_line_pkey with 262 items, and we see
here that it has 233 after several hours, which is pretty good given
the amount of contention. The index actually looks very much like it
was just REINDEXED when initial bulk loading finishes, before we get
any updates, even though that happens using retail insertions.

Notice that the number of internal pages is reduced by almost a full
50% -- it's somewhat better than the reduction in the number of leaf
pages, because the benefits compound (items in the root are even a bit
smaller, because of this compounding effect, despite alignment
effects). Internal pages are the most important pages to have cached,
but also potentially the biggest points of contention.
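
(In case anyone wants to reproduce this kind of breakdown: the query I
actually used isn't shown here, but a sketch along these lines, using
the pageinspect extension, should produce a summary of the same shape.
It assumes the default 8kB block size, and skips block 0, the metapage;
page types come out as lowercase letters.)

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    SELECT type AS page_type,
           count(*) AS npages,
           round(avg(live_items), 3) AS avg_live_items,
           round(avg(dead_items), 3) AS avg_dead_items,
           round(avg(avg_item_size), 3) AS avg_item_size
    FROM generate_series(1, pg_relation_size('bmsql_order_line_pkey') / 8192 - 1) AS blkno,
         bt_page_stats('bmsql_order_line_pkey', blkno::int)
    GROUP BY type
    ORDER BY type;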

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Andres Freund
Date:
Hi,

On 2019-03-11 19:47:29 -0700, Peter Geoghegan wrote:
> I now believe that the problem is with LWLock/buffer lock contention
> on index pages, and that that's an inherent cost with a minority of
> write-heavy high contention workloads. A cost that we should just
> accept.

Have you looked at an offwake or lwlock wait graph (bcc tools) or
something in that vein? Would be interesting to see what is waiting for
what most often...

Greetings,

Andres Freund


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Tue, Mar 12, 2019 at 12:40 PM Andres Freund <andres@anarazel.de> wrote:
> Have you looked at an offwake or lwlock wait graph (bcc tools) or
> something in that vein? Would be interesting to see what is waiting for
> what most often...

Not recently, though I did use your BCC script for this very purpose
quite a few months ago. I don't remember it helping that much at the
time, but then that was with a version of the patch that lacked a
couple of important optimizations that we have now. We're now very
careful to not descend to the left with an equal pivot tuple. We
descend right instead when that's definitely the only place we'll find
matches (a high key doesn't count as a match in almost all cases!).
Edge-cases where we unnecessarily move left then right, or
unnecessarily move right a second time once on the leaf level have
been fixed. I fixed the regression I was worried about at the time,
without getting much benefit from the BCC script, and moved on.

These minutiae are more important than they sound. I have used
EXPLAIN (ANALYZE, BUFFERS) instrumentation to make sure that I
understand where every single block access comes from with these
edge-cases, paying close attention to the structure of the index, and
how the key space is broken up (the values of pivot tuples in internal
pages). It is one thing to make the index smaller, and another thing
to take full advantage of that -- I have both. This is one of the
reasons why I believe that this minor regression cannot be avoided,
short of simply allowing the index to get bloated: I'm simply not
doing things that differently outside of the page split code, and what
I am doing differently is clearly superior. Both in general, and for
the NEW_ORDER transaction in particular.
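
For example, a point lookup like the one below (column names assumed
from the BenchmarkSQL schema) shows exactly which index and heap
buffers a single descent touches:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM bmsql_order_line
    WHERE ol_w_id = 1 AND ol_d_id = 1 AND ol_o_id = 42;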

I'll make that another TODO item -- this regression will be revisited
using BCC instrumentation. I am currently performing a multi-day
benchmark on a very large TPC-C/BenchmarkSQL database, and it will
have to wait for that. (I would like to use the same environment as
before.)

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Andres Freund
Date:
On 2019-03-12 14:15:06 -0700, Peter Geoghegan wrote:
> On Tue, Mar 12, 2019 at 12:40 PM Andres Freund <andres@anarazel.de> wrote:
> > Have you looked at an offwake or lwlock wait graph (bcc tools) or
> > something in that vein? Would be interesting to see what is waiting for
> > what most often...
> 
> Not recently, though I did use your BCC script for this very purpose
> quite a few months ago. I don't remember it helping that much at the
> time, but then that was with a version of the patch that lacked a
> couple of important optimizations that we have now. We're now very
> careful to not descend to the left with an equal pivot tuple. We
> descend right instead when that's definitely the only place we'll find
> matches (a high key doesn't count as a match in almost all cases!).
> Edge-cases where we unnecessarily move left then right, or
> unnecessarily move right a second time once on the leaf level have
> been fixed. I fixed the regression I was worried about at the time,
> without getting much benefit from the BCC script, and moved on.
> 
> This kind of minutiae is more important than it sounds. I have used
> EXPLAIN(ANALYZE, BUFFERS) instrumentation to make sure that I
> understand where every single block access comes from with these
> edge-cases, paying close attention to the structure of the index, and
> how the key space is broken up (the values of pivot tuples in internal
> pages). It is one thing to make the index smaller, and another thing
> to take full advantage of that -- I have both. This is one of the
> reasons why I believe that this minor regression cannot be avoided,
> short of simply allowing the index to get bloated: I'm simply not
> doing things that differently outside of the page split code, and what
> I am doing differently is clearly superior. Both in general, and for
> the NEW_ORDER transaction in particular.
> 
> I'll make that another TODO item -- this regression will be revisited
> using BCC instrumentation. I am currently performing a multi-day
> benchmark on a very large TPC-C/BenchmarkSQL database, and it will
> have to wait for that. (I would like to use the same environment as
> before.)

I'm basically just curious which buffers have most of the additional
contention. Is it the lower number of leaf pages, the inner pages, or
(somewhat inexplicably) the meta page, or ...?  I was thinking that the
callstack that e.g. my lwlock tool gives should be able to explain what
callstack most of the waits are occurring on.

(I should work a bit on that script; I locally had a version that showed
both waiters and the waking-up callstack, but I can't find it anymore.)

Greetings,

Andres Freund


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Tue, Mar 12, 2019 at 2:22 PM Andres Freund <andres@anarazel.de> wrote:
> I'm basically just curious which buffers have most of the additional
> contention. Is it the lower number of leaf pages, the inner pages, or
> (somewhat unexplicably) the meta page, or ...?  I was thinking that the
> callstack that e.g. my lwlock tool gives should be able to explain what
> callstack most of the waits are occuring on.

Right -- that's exactly what I'm interested in, too. If we can
characterize the contention in terms of the types of nbtree blocks
that are involved (their level), that could be really helpful. There
are 200x+ more leaf blocks than internal blocks, so the internal
blocks are a lot hotter. But there are also a lot fewer splits of
internal pages, because you need hundreds of leaf page splits to get
one internal split.

Is the problem contention caused by internal page splits, or is it
contention in internal pages that can be traced back to leaf splits,
which insert a downlink into their parent page? It would be really
cool to have some idea of the answers to questions like these.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
> search like _bt_binsrch does, but the bounds caching is only done in
> _bt_binsrch_insert. Seems more clear to have separate functions for them
> now, even though there's some duplication.

> /*
>   * Do the insertion. First move right to find the correct page to
>   * insert to, if necessary. If we're inserting to a non-unique index,
>   * _bt_search() already did this when it checked if a move to the
>   * right was required for leaf page.  Insertion scankey's scantid
>   * would have been filled out at the time. On a unique index, the
>   * current buffer is the first buffer containing duplicates, however,
>   * so we may need to move right to the correct location for this
>   * tuple.
>   */
> if (checkingunique || itup_key->heapkeyspace)
>         _bt_findinsertpage(rel, &insertstate, stack, heapRel);
>
> newitemoff = _bt_binsrch_insert(rel, &insertstate);

> The attached new version simplifies this, IMHO. The bounds and the
> current buffer go together in the same struct, so it's easier to keep
> track whether the bounds are valid or not.

Now that you have a full understanding of how the negative infinity
sentinel values work, and how page deletion's leaf page search and
!heapkeyspace indexes need to be considered, I think that we should
come back to this _bt_binsrch()/_bt_findsplitloc() stuff. My sense is
that you now have a full understanding of all the subtleties of the
patch, including those that affect unique index insertion. That
will make it much easier to talk about these unresolved questions.

My current sense is that it isn't useful to store the current buffer
alongside the binary search bounds/hint. It'll hardly ever need to be
invalidated, because we'll hardly ever have to move right within
_bt_findsplitloc() when doing unique index insertion (as I said
before, the regression tests *never* have to do this according to
gcov). We're talking about a very specific set of conditions here, so
I'd like something that's lightweight and specialized. I agree that
the savebinsrch/restorebinsrch fields are a bit ugly, though. I can't
think of anything that's better offhand. Perhaps you can suggest
something that is both lightweight, and an improvement on
savebinsrch/restorebinsrch.

I'm of the opinion that having a separate _bt_binsrch_insert() does
not make anything clearer. Actually, I think that saving the bounds
within the original _bt_binsrch() makes the design of that function
clearer, not less clear. It's all quite confusing at the moment, given
the rightmost/!leaf/page empty special cases. Seeing how the bounds
are reused (or not reused) outside of _bt_binsrch() helps with that.

The first 3 patches seem committable now, but I think that it's
important to be sure that I've addressed everything you raised
satisfactorily before pushing. Or that everything works in a way that
you can live with, at least.

It would be great if you could take a look at the 'Add high key
"continuescan" optimization' patch, which is the only one you haven't
commented on so far (excluding the amcheck "relocate" patch, which is
less important). I can put that one off for a while after the first 3
go in. I will also put off the "split after new item" commit for at
least a week or two. I'm sure that the idea behind the "continuescan"
patch will now seem pretty obvious to you -- it's just taking
advantage of the high key when an index scan on the leaf level (which
uses a search style scankey, not an insertion style scankey) looks
like it may have to move to the next leaf page, but we'd like to avoid
it where possible. Checking the high key there is now much more likely
to result in the index scan not going to the next page, since we're
more careful when considering a leaf split point these days. The high
key often looks like the items on the page to the right, not the items
on the same page.

Thanks
-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think it's pretty clear that we have to view that as acceptable.  I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.

I found this analysis of bloat in the production database of Gitlab in
January 2019 fascinating:

https://about.gitlab.com/handbook/engineering/infrastructure/blueprint/201901-postgres-bloat/

They determined that their tables consisted of about 2% bloat, whereas
indexes were 51% bloat (determined by running VACUUM FULL, and
measuring how much smaller indexes and tables were afterwards). That
in itself may not be that telling. What is telling is that index bloat
disproportionately affects certain kinds of indexes. As they put it,
"Indexes that do not serve a primary key constraint make up 95% of the
overall index bloat". In other words, the vast majority of all bloat
occurs within non-unique indexes, with most remaining bloat in unique
indexes.
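
(Their methodology is easy enough to approximate on any database that
you suspect is bloated. Taking their merge_requests table as an
example, and with the obvious caveat that VACUUM FULL takes an
exclusive lock, a rough sketch:)

    SELECT pg_size_pretty(pg_table_size('merge_requests'))   AS table_size,
           pg_size_pretty(pg_indexes_size('merge_requests')) AS index_size;

    VACUUM FULL merge_requests;   -- also rebuilds the table's indexes

    SELECT pg_size_pretty(pg_table_size('merge_requests'))   AS table_size,
           pg_size_pretty(pg_indexes_size('merge_requests')) AS index_size;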

One factor that could be relevant is that unique indexes get a lot
more opportunistic LP_DEAD killing. Unique indexes don't rely on the
similar-but-distinct kill_prior_tuple optimization --  a lot more
tuples can be killed within _bt_check_unique() than with
kill_prior_tuple in realistic cases. That's probably not really that
big a factor, though. It seems almost certain that "getting tired" is
the single biggest problem.

The blog post drills down further, and cites examples of several
*extremely* bloated single-column indexes, which obviously have low
cardinality. This includes an index on a boolean field, and an index
on an enum-like text field. In my experience, having many indexes like
that is very common in real world applications, though not at all
common in popular benchmarks (with the exception of TPC-E).

It also looks like they may benefit from the "split after new item"
optimization, at least among the few unique indexes that were very
bloated, such as merge_requests_pkey:

https://gitlab.com/snippets/1812014

Gitlab is open source, so it should be possible to confirm my theory
about the "split after new item" optimization (I am certain about
"getting tired", though). Their schema is defined here:

https://gitlab.com/gitlab-org/gitlab-ce/blob/master/db/schema.rb

I don't have time to confirm all this right now, but I am pretty
confident that they have both problems. And almost as confident that
they'd notice substantial benefits from this patch series.
-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 13/03/2019 03:28, Peter Geoghegan wrote:
> On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
>> search like _bt_binsrch does, but the bounds caching is only done in
>> _bt_binsrch_insert. Seems more clear to have separate functions for them
>> now, even though there's some duplication.
> 
>> /*
>>    * Do the insertion. First move right to find the correct page to
>>    * insert to, if necessary. If we're inserting to a non-unique index,
>>    * _bt_search() already did this when it checked if a move to the
>>    * right was required for leaf page.  Insertion scankey's scantid
>>    * would have been filled out at the time. On a unique index, the
>>    * current buffer is the first buffer containing duplicates, however,
>>    * so we may need to move right to the correct location for this
>>    * tuple.
>>    */
>> if (checkingunique || itup_key->heapkeyspace)
>>          _bt_findinsertpage(rel, &insertstate, stack, heapRel);
>>
>> newitemoff = _bt_binsrch_insert(rel, &insertstate);
> 
>> The attached new version simplifies this, IMHO. The bounds and the
>> current buffer go together in the same struct, so it's easier to keep
>> track whether the bounds are valid or not.
> 
> Now that you have a full understanding of how the negative infinity
> sentinel values work, and how page deletion's leaf page search and
> !heapkeyspace indexes need to be considered, I think that we should
> come back to this _bt_binsrch()/_bt_findsplitloc() stuff. My sense is
> that you now have a full understanding of all the subtleties of the
> patch, including those that that affect unique index insertion. That
> will make it much easier to talk about these unresolved questions.
> 
> My current sense is that it isn't useful to store the current buffer
> alongside the binary search bounds/hint. It'll hardly ever need to be
> invalidated, because we'll hardly ever have to move right within
> _bt_findsplitloc() when doing unique index insertion (as I said
> before, the regression tests *never* have to do this according to
> gcov).

It doesn't matter how often it happens, the code still needs to deal 
with it. So let's try to make it as readable as possible.

> We're talking about a very specific set of conditions here, so
> I'd like something that's lightweight and specialized. I agree that
> the savebinsrch/restorebinsrch fields are a bit ugly, though. I can't
> think of anything that's better offhand. Perhaps you can suggest
> something that is both lightweight, and an improvement on
> savebinsrch/restorebinsrch.

Well, IMHO holding the buffer and the bounds in the new struct is cleaner
than the savebinsrch/restorebinsrch flags. That's exactly why I
suggested it. I don't know what else to suggest. I haven't done any 
benchmarking, but I doubt there's any measurable difference.

> I'm of the opinion that having a separate _bt_binsrch_insert() does
> not make anything clearer. Actually, I think that saving the bounds
> within the original _bt_binsrch() makes the design of that function
> clearer, not less clear. It's all quite confusing at the moment, given
> the rightmost/!leaf/page empty special cases. Seeing how the bounds
> are reused (or not reused) outside of _bt_binsrch() helps with that.

Ok. I think having some code duplication is better than one function 
that tries to do many things, but I'm not wedded to that.

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 13/03/2019 03:28, Peter Geoghegan wrote:
> It would be great if you could take a look at the 'Add high key
> "continuescan" optimization' patch, which is the only one you haven't
> commented on so far (excluding the amcheck "relocate" patch, which is
> less important). I can put that one off for a while after the first 3
> go in. I will also put off the "split after new item" commit for at
> least a week or two. I'm sure that the idea behind the "continuescan"
> patch will now seem pretty obvious to you -- it's just taking
> advantage of the high key when an index scan on the leaf level (which
> uses a search style scankey, not an insertion style scankey) looks
> like it may have to move to the next leaf page, but we'd like to avoid
> it where possible. Checking the high key there is now much more likely
> to result in the index scan not going to the next page, since we're
> more careful when considering a leaf split point these days. The high
> key often looks like the items on the page to the right, not the items
> on the same page.

Oh yeah, that makes perfect sense. I wonder why we haven't done it like 
that before? The new page split logic makes it more likely to help, but 
even without that, I don't see any downside.

I find it a bit confusing that the logic is now split between 
_bt_checkkeys() and _bt_readpage(). For a forward scan, _bt_readpage() 
does the high-key check, but the corresponding "first-key" check in a 
backward scan is done in _bt_checkkeys(). I'd suggest moving the logic 
completely to _bt_readpage(), so that it's in one place. With that, 
_bt_checkkeys() can always check the keys as it's told, without looking 
at the LP_DEAD flag. Like the attached.

- Heikki

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Thu, Mar 14, 2019 at 4:00 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Oh yeah, that makes perfect sense. I wonder why we haven't done it like
> that before? The new page split logic makes it more likely to help, but
> even without that, I don't see any downside.

The only downside is that we spend a few extra cycles, and that might
not work out. This optimization would have always worked, though. The
new page split logic clearly makes checking the high key in the
"continuescan" path an easy win.

> I find it a bit confusing, that the logic is now split between
> _bt_checkkeys() and _bt_readpage(). For a forward scan, _bt_readpage()
> does the high-key check, but the corresponding "first-key" check in a
> backward scan is done in _bt_checkkeys(). I'd suggest moving the logic
> completely to _bt_readpage(), so that it's in one place. With that,
> _bt_checkkeys() can always check the keys as it's told, without looking
> at the LP_DEAD flag. Like the attached.

I'm convinced. I'd like to go a bit further, and also pass tupnatts to
_bt_checkkeys().  That makes it closer to the similar
_bt_check_rowcompare() function that _bt_checkkeys() must sometimes
call. It also allows us to only call BTreeTupleGetNAtts() for the high
key, while passing down a generic, loop-invariant
IndexRelationGetNumberOfAttributes() value for non-pivot tuples.

I'll do it that way in the next revision.

Thanks
-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Thu, Mar 14, 2019 at 2:21 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> It doesn't matter how often it happens, the code still needs to deal
> with it. So let's try to make it as readable as possible.

> Well, IMHO holding the buffer and the bounds in the new struct is more
> clean than the savebinsrc/restorebinsrch flags. That's exactly why I
> suggested it. I don't know what else to suggest. I haven't done any
> benchmarking, but I doubt there's any measurable difference.

Fair enough. Attached is v17, which does it using the approach taken
in your earlier prototype. I even came around to your view on
_bt_binsrch_insert() -- I kept that part, too. Note, however, that I
still pass checkingunique to _bt_findinsertloc(), because that's a
condition distinct from whether or not the bounds were cached (they
happen to be the same thing right now, but I don't want to assume that).

This revision also integrates your approach to the "continuescan"
optimization patch, with the small tweak I mentioned yesterday (we
also pass ntupatts). I also prefer this approach.

I plan on committing the first few patches early next week, barring
any objections, or any performance problems noticed during an
additional, final round of performance validation. I won't expect
feedback from you until Monday at the earliest. It would be nice if
you could take a look at the amcheck "relocate" patch. My intention is
to push patches up to and including the amcheck "relocate" patch on
the same day (I'll leave a few hours between the first two patches, to
confirm that the first patch doesn't break the buildfarm).

BTW, my multi-day, large BenchmarkSQL benchmark continues, with some
interesting results. The first round of 12 hour long runs showed the
patch nearly 6% ahead in terms of transaction throughput, with a
database that's almost 1 terabyte. The second round, which completed
yesterday and reuses the database initialized for the first round
showed that the patch had 10.7% higher throughput. That's a new record
for the patch. I'm going to leave this benchmark running for a few
more days, at least until it stops being interesting. I wonder how
long it will be before the master branch throughput stops declining
relative to throughput with the patched version. I expect that the
master branch will reach "index bloat saturation point" sooner or
later. Indexes in the patch's data directory continue to get larger,
as expected, but the amount of bloat accumulated over time is barely
noticeable (i.e. the pages remain tightly packed with tuples, and that
density barely declines over time).
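
One quick way to keep an eye on that density over time is pgstattuple's
pgstatindex() -- for example (using the largest BenchmarkSQL index):

    CREATE EXTENSION IF NOT EXISTS pgstattuple;

    SELECT avg_leaf_density, leaf_pages, internal_pages, deleted_pages
    FROM pgstatindex('bmsql_order_line_pkey');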

This version of the patch series has attributions/credits at the end
of the commit messages. I have listed you as a secondary author on a
couple of the patches, where code was lifted from your feedback
patches. Let me know if you think that I have it right.

Thanks
-- 
Peter Geoghegan

Attachments

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 16/03/2019 06:16, Peter Geoghegan wrote:
> On Thu, Mar 14, 2019 at 2:21 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> It doesn't matter how often it happens, the code still needs to deal
>> with it. So let's try to make it as readable as possible.
> 
>> Well, IMHO holding the buffer and the bounds in the new struct is more
>> clean than the savebinsrc/restorebinsrch flags. That's exactly why I
>> suggested it. I don't know what else to suggest. I haven't done any
>> benchmarking, but I doubt there's any measurable difference.
> 
> Fair enough. Attached is v17, which does it using the approach taken
> in your earlier prototype. I even came around to your view on
> _bt_binsrch_insert() -- I kept that part, too. Note, however, that I
> still pass checkingunique to _bt_findinsertloc(), because that's a
> distinct condition to whether or not bounds were cached (they happen
> to be the same thing right now, but I don't want to assume that).
> 
> This revision also integrates your approach to the "continuescan"
> optimization patch, with the small tweak I mentioned yesterday (we
> also pass ntupatts). I also prefer this approach.

Great, thank you!

> It would be nice if you could take a look at the amcheck "relocate"
> patch
When I started looking at this, I thought that "relocate" means "move". 
So I thought that the new mode would actually move tuples, i.e. that it 
would modify the index. That sounded horrible. Of course, it doesn't 
actually do that. It just checks that each tuple can be re-found, or 
"relocated", by descending the tree from the root. I'd suggest changing 
the language to avoid that confusion.

It seems like a nice way to catch all kinds of index corruption issues. 
Although, we already check that the tuples are in order within the page. 
Is it really necessary to traverse the tree for every tuple, as well? 
Maybe do it just for the first and last item?

> + * This routine can detect very subtle transitive consistency issues across
> + * more than one level of the tree.  Leaf pages all have a high key (even the
> + * rightmost page has a conceptual positive infinity high key), but not a low
> + * key.  Their downlink in parent is a lower bound, which along with the high
> + * key is almost enough to detect every possible inconsistency.  A downlink
> + * separator key value won't always be available from parent, though, because
> + * the first items of internal pages are negative infinity items, truncated
> + * down to zero attributes during internal page splits.  While it's true that
> + * bt_downlink_check() and the high key check can detect most imaginable key
> + * space problems, there are remaining problems it won't detect with non-pivot
> + * tuples in cousin leaf pages.  Starting a search from the root for every
> + * existing leaf tuple detects small inconsistencies in upper levels of the
> + * tree that cannot be detected any other way.  (Besides all this, it's
> + * probably a useful testing strategy to exhaustively verify that all
> + * non-pivot tuples can be relocated in the index using the same code paths as
> + * those used by index scans.)

I don't understand this. Can you give an example of this kind of 
inconsistency?

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > It would be nice if you could take a look at the amcheck "relocate"
> > patch
> When I started looking at this, I thought that "relocate" means "move".
> So I thought that the new mode would actually move tuples, i.e. that it
> would modify the index. That sounded horrible. Of course, it doesn't
> actually do that. It just checks that each tuple can be re-found, or
> "relocated", by descending the tree from the root. I'd suggest changing
> the language to avoid that confusion.

Okay. What do you suggest? :-)

> It seems like a nice way to catch all kinds of index corruption issues.
> Although, we already check that the tuples are in order within the page.
> Is it really necessary to traverse the tree for every tuple, as well?
> Maybe do it just for the first and last item?

It's mainly intended as a developer option. I want it to be possible
to detect any form of corruption, however unlikely. It's an
adversarial mindset that will at least make me less nervous about the
patch.

> I don't understand this. Can you give an example of this kind of
> inconsistency?

The commit message gives an example, but I suggest trying it out for
yourself. Corrupt the least significant key byte of a root page of a
B-Tree using pg_hexedit. Say it's an index on a text column, then
you'd corrupt the last byte to be something slightly wrong. Then, the
only way to catch it is with "relocate" verification. You'll only miss
a few tuples on a cousin page at the leaf level (those on either side
of the high key that the corrupted separator key in the root was
originally copied from).

The regular checks won't catch this, because the keys are similar
enough one level down. The "minus infinity" item is a kind of a blind
spot, because we cannot do a parent check of its children, because we
don't have the key (it's truncated when the item becomes a right page
minus infinity item, during an internal page split).

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Heikki Linnakangas
Date:
On 16/03/2019 10:51, Peter Geoghegan wrote:
> On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> It would be nice if you could take a look at the amcheck "relocate"
>>> patch
>> When I started looking at this, I thought that "relocate" means "move".
>> So I thought that the new mode would actually move tuples, i.e. that it
>> would modify the index. That sounded horrible. Of course, it doesn't
>> actually do that. It just checks that each tuple can be re-found, or
>> "relocated", by descending the tree from the root. I'd suggest changing
>> the language to avoid that confusion.
> 
> Okay. What do you suggest? :-)

Hmm. "re-find", maybe? We use that term in a few error messages already, 
to mean something similar.

>> It seems like a nice way to catch all kinds of index corruption issues.
>> Although, we already check that the tuples are in order within the page.
>> Is it really necessary to traverse the tree for every tuple, as well?
>> Maybe do it just for the first and last item?
> 
> It's mainly intended as a developer option. I want it to be possible
> to detect any form of corruption, however unlikely. It's an
> adversarial mindset that will at least make me less nervous about the
> patch.

Fair enough.

At first, I thought this would be horrendously expensive, but thinking 
about it a bit more, nearby tuples will always follow the same path 
through the upper nodes, so it'll all be cached. So maybe it's not quite 
so bad.

>> I don't understand this. Can you give an example of this kind of
>> inconsistency?
> 
> The commit message gives an example, but I suggest trying it out for
> yourself. Corrupt the least significant key byte of a root page of a
> B-Tree using pg_hexedit. Say it's an index on a text column, then
> you'd corrupt the last byte to be something slightly wrong. Then, the
> only way to catch it is with "relocate" verification. You'll only miss
> a few tuples on a cousin page at the leaf level (those on either side
> of the high key that the corrupted separator key in the root was
> originally copied from).
>
> The regular checks won't catch this, because the keys are similar
> enough one level down. The "minus infinity" item is a kind of a blind
> spot, because we cannot do a parent check of its children, because we
> don't have the key (it's truncated when the item becomes a right page
> minus infinity item, during an internal page split).

Hmm. So, the initial situation would be something like this:

                  +-----------+
                  | 1: root   |
                  |           |
                  | -inf -> 2 |
                  | 20   -> 3 |
                  |           |
                  +-----------+

         +-------------+ +-------------+
         | 2: internal | | 3: internal |
         |             | |             |
         | -inf -> 4   | | -inf -> 6   |
         | 10   -> 5   | | 30   -> 7   |
         |             | |             |
         | hi: 20      | |             |
         +-------------+ +-------------+

+---------+ +---------+ +---------+ +---------+
| 4: leaf | | 5: leaf | | 6: leaf | | 7: leaf |
|         | |         | |         | |         |
| 1       | | 11      | | 21      | | 31      |
| 5       | | 15      | | 25      | | 35      |
| 9       | | 19      | | 29      | | 39      |
|         | |         | |         | |         |
| hi: 10  | | hi: 20  | | hi: 30  | |         |
+---------+ +---------+ +---------+ +---------+

Then, a cosmic ray changes the 20 on the root page to 18. That causes
the leaf tuple 19 to become non-re-findable; if you descend the tree for
19, you'll incorrectly land on page 6. But it also causes the high key
on page 2 to be different from the downlink on the root page. Wouldn't 
the existing checks catch this?

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm. "re-find", maybe? We use that term in a few error messages already,
> to mean something similar.

WFM.

> At first, I thought this would be horrendously expensive, but thinking
> about it a bit more, nearby tuples will always follow the same path
> through the upper nodes, so it'll all be cached. So maybe it's not quite
> so bad.

That's deliberate, though you could call bt_relocate_from_root() from
anywhere else if you wanted to. It's a bit like a big nested loop
join, where the inner side has locality.

> Then, a cosmic ray changes the 20 on the root page to 18. That causes
> the the leaf tuple 19 to become non-re-findable; if you descend the for
> 19, you'll incorrectly land on page 6. But it also causes the high key
> on page 2 to be different from the downlink on the root page. Wouldn't
> the existing checks catch this?

No, the existing checks will not check that. I suppose something
closer to the existing approach *could* detect this issue, by making
sure that the "seam of identical high keys" from the root to the leaf
is consistent, but we don't use the high key outside of its own page.
Besides, there is something useful about having the code actually rely
on _bt_search().

There are other similar cases, where it's not obvious how you can do
verification without either 1) crossing multiple levels, or 2)
retaining a "low key" as well as a high key (this is what Goetz Graefe
calls "retaining fence keys to solve the cousin verification problem"
in Modern B-Tree Techniques). What if the corruption was in the leaf
page 6 from your example, which had a 20 at the start? We wouldn't
have compared the downlink from the parent to the child, because leaf
page 6 is the leftmost child, and so we only have "-inf". The lower
bound actually comes from the root page, because we truncate "-inf"
attributes during page splits, even though we don't have to. Most of
the time they're not "absolute minus infinity" -- they're "minus
infinity in this subtree".

Maybe you could actually do something with the high key from leaf page
5 to detect the stray value "20" in leaf page 6, but again, we don't
do anything like that right now.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From
Peter Geoghegan
Date:
On Sat, Mar 16, 2019 at 9:55 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > Hmm. "re-find", maybe? We use that term in a few error messages already,
> > to mean something similar.
>
> WFM.

Actually, how about "rootsearch", or "rootdescend"? You're supposed to
hyphenate "re-find", and so it doesn't really work as a function
argument name.
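
To be clear about how this is exposed: the idea is an extra boolean
argument to amcheck's bt_index_parent_check() (the exact name is what
we're deciding here), so usage would look something like this sketch:

    CREATE EXTENSION IF NOT EXISTS amcheck;

    SELECT bt_index_parent_check('bmsql_order_line_pkey'::regclass,
                                 heapallindexed => true,
                                 rootdescend => true);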

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Heikki Linnakangas
Дата:
On 16/03/2019 19:32, Peter Geoghegan wrote:
> On Sat, Mar 16, 2019 at 9:55 AM Peter Geoghegan <pg@bowt.ie> wrote:
>> On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Hmm. "re-find", maybe? We use that term in a few error messages already,
>>> to mean something similar.
>>
>> WFM.
> 
> Actually, how about "rootsearch", or "rootdescend"? You're supposed to
> hyphenate "re-find", and so it doesn't really work as a function
> argument name.

Works for me.

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Heikki Linnakangas
Дата:
On 16/03/2019 18:55, Peter Geoghegan wrote:
> On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Then, a cosmic ray changes the 20 on the root page to 18. That causes
>> the leaf tuple 19 to become non-re-findable; if you descend the tree
>> for 19, you'll incorrectly land on page 6. But it also causes the high
>> key on page 2 to be different from the downlink on the root page.
>> Wouldn't the existing checks catch this?
> 
> No, the existing checks will not check that. I suppose something
> closer to the existing approach *could* detect this issue, by making
> sure that the "seam of identical high keys" from the root to the leaf
> are a match, but we don't use the high key outside of its own page.
> Besides, there is something useful about having the code actually rely
> on _bt_search().
> 
> There are other similar cases, where it's not obvious how you can do
> verification without either 1) crossing multiple levels, or 2)
> retaining a "low key" as well as a high key (this is what Goetz Graefe
> calls "retaining fence keys to solve the cousin verification problem"
> in Modern B-Tree Techniques). What if the corruption was in the leaf
> page 6 from your example, which had a 20 at the start? We wouldn't
> have compared the downlink from the parent to the child, because leaf
> page 6 is the leftmost child, and so we only have "-inf". The lower
> bound actually comes from the root page, because we truncate "-inf"
> attributes during page splits, even though we don't have to. Most of
> the time they're not "absolute minus infinity" -- they're "minus
> infinity in this subtree".

AFAICS, there is a copy of every page's high key in its immediate 
parent. Either in the downlink of the right sibling, or as the high key 
of the parent page itself. Cross-checking those would catch any 
corruption in high keys.

Note that this would potentially catch some corruption that the 
descend-from-root check would not. If you have a mismatch between the 
high key of a leaf page and its parent or grandparent, all the current 
tuples might pass the rootdescend check. But a tuple might get inserted 
to the wrong location later.

> Maybe you could actually do something with the high key from leaf page
> 5 to detect the stray value "20" in leaf page 6, but again, we don't
> do anything like that right now.

Hmm, yeah, to check for stray values, you could follow the left-link, 
get the high key of the left sibling, and compare against that.

- Heikki


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Peter Geoghegan
Дата:
On Sat, Mar 16, 2019 at 1:33 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> AFAICS, there is a copy of every page's high key in its immediate
> parent. Either in the downlink of the right sibling, or as the high key
> of the parent page itself. Cross-checking those would catch any
> corruption in high keys.

I agree that it's always true that the high key is also in the parent,
and we could cross-check that within the child. Actually, I should
probably figure out a way of arranging for the Bloom filter used
within bt_relocate_from_root() (which has been around since PG v11) to
include the key itself when fingerprinting. That would probably
necessitate that we don't truncate "negative infinity" items (it was
actually that way about 18 years ago), just for the benefit of
verification. This is almost the same thing as what Graefe argues for
(I don't think that you need a low key on the leaf level, since you can
cross a single level there). I wonder if truncating the negative
infinity item in internal pages to zero attributes is actually worth
it, since a low key might be useful for a number of reasons.

> Note that this would potentially catch some corruption that the
> descend-from-root check would not. If you have a mismatch between the
> high key of a leaf page and its parent or grandparent, all the current
> tuples might pass the rootdescend check. But a tuple might get inserted
> to the wrong location later.

I also agree with this. However, the rootdescend check will always
work better than this in some cases that you can at least imagine, for
as long as there are negative infinity items to worry about. (And,
even if we decided not to truncate to support easy verification, there
is still a good argument to be made for involving the authoritative
_bt_search() code at some point).

> > Maybe you could actually do something with the high key from leaf page
> > 5 to detect the stray value "20" in leaf page 6, but again, we don't
> > do anything like that right now.
>
> Hmm, yeah, to check for stray values, you could follow the left-link,
> get the high key of the left sibling, and compare against that.

Graefe argues for retaining a low key, even in leaf pages (the left
page's old high key becomes the left page's low key during a split,
and the left page's new high key becomes the new right page's low key
at the same time). It's useful for what he calls "write-optimized
B-Trees", and maybe even for optional compression. As I said earlier,
I guess you can just go left on the leaf level if you need to, and you
have all you need. But I'd need to think about it some more.

Point taken; rootdescend isn't enough to make verification exactly
perfect. But it gets verification close to perfect, because you're
going to get right answers to queries as long as it passes (I think).
There could be corruption that only bites a future insertion, which in
principle could be detected, but currently can't be. But you'd have to
be extraordinarily unlucky to have that situation persist for any
amount of time. Unlucky even by my own extremely paranoid standard.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Peter Geoghegan
Дата:
On Sat, Mar 16, 2019 at 1:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I agree that it's always true that the high key is also in the parent,
> and we could cross-check that within the child. Actually, I should
> probably figure out a way of arranging for the Bloom filter used
> within bt_relocate_from_root() (which has been around since PG v11) to
> include the key itself when fingerprinting. That would probably
> necessitate that we don't truncate "negative infinity" items (it was
> actually that way about 18 years ago), just for the benefit of
> verification.

Clarification: You'd fingerprint an entire pivot tuple -- key, block
number, everything. Then, you'd probe the Bloom filter for the high
key one level down, with the downlink block in the high key set to
point to the current sibling on the same level (the child level). As I
said, I think that the only reason that that cannot be done at present
is because of the micro-optimization of truncating the first item on
the right page to zero attributes during an internal page split. (We
can retain the key without getting rid of the hard-coded logic for
negative infinity within _bt_compare()).

bt_relocate_from_root() already has smarts around interrupted page
splits (with the incomplete split bit set).

Finally, you'd also make bt_downlink_check follow every downlink, not
all-but-one downlink (no more excuse for leaving out the first one if
we don't truncate during internal page splits).

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Peter Geoghegan
Дата:
On Sat, Mar 16, 2019 at 2:01 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Sat, Mar 16, 2019 at 1:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I agree that it's always true that the high key is also in the parent,
> > and we could cross-check that within the child. Actually, I should
> > probably figure out a way of arranging for the Bloom filter used
> > within bt_relocate_from_root() (which has been around since PG v11) to
> > include the key itself when fingerprinting. That would probably
> > necessitate that we don't truncate "negative infinity" items (it was
> > actually that way about 18 years ago), just for the benefit of
> > verification.
>
> Clarification: You'd fingerprint an entire pivot tuple -- key, block
> number, everything. Then, you'd probe the Bloom filter for the high
> key one level down, with the downlink block in the high key set to
> point to the current sibling on the same level (the child level). As I
> said, I think that the only reason that that cannot be done at present
> is because of the micro-optimization of truncating the first item on
> the right page to zero attributes during an internal page split. (We
> can retain the key without getting rid of the hard-coded logic for
> negative infinity within _bt_compare()).
>
> bt_relocate_from_root() already has smarts around interrupted page
> splits (with the incomplete split bit set).

Clarification to my clarification: I meant
bt_downlink_missing_check(), not bt_relocate_from_root(). The former
really has been around since v11, unlike the latter, which is part of
this new amcheck patch we're discussing.


-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Peter Geoghegan
Дата:
On Sat, Mar 16, 2019 at 1:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > Actually, how about "rootsearch", or "rootdescend"? You're supposed to
> > hyphenate "re-find", and so it doesn't really work as a function
> > argument name.
>
> Works for me.

Attached is v18 of the patch series, which calls the new verification
option "rootdescend" verification.

As previously stated, I intend to commit the first 4 patches (up to
and including this amcheck "rootdescend" patch) during the workday
tomorrow, Pacific time.

Other changes:

* Further consolidation of the nbtree.h comments from the second patch,
so that the on-disk representation overview that you requested a while
back has all the details. A couple of these were moved from macro
comments (also in nbtree.h) that were missed earlier.

* Tweaks to comments on _bt_binsrch_insert() and its callers.
Streamlined to reflect the fact that it doesn't need to talk so much
about cases that only apply to internal pages. Explicitly stated
requirements for caller.

* Made _bt_binsrch_insert() set InvalidOffsetNumber for bounds in cases
where valid bounds cannot be established initially. This seemed like a
good idea.

* A few more defensive assertions were added to nbtinsert.c (also
related to _bt_binsrch_insert()).

Thanks
-- 
Peter Geoghegan

Вложения

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Heikki Linnakangas
Дата:
On 18/03/2019 02:59, Peter Geoghegan wrote:
> On Sat, Mar 16, 2019 at 1:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Actually, how about "rootsearch", or "rootdescend"? You're supposed to
>>> hyphenate "re-find", and so it doesn't really work as a function
>>> argument name.
>>
>> Works for me.
> 
> Attached is v18 of patch series, which calls the new verification
> option "rootdescend" verification.

Thanks!

I'm getting a regression failure from the 'create_table' test with this:

> --- /home/heikki/git-sandbox/postgresql/src/test/regress/expected/create_table.out      2019-03-11 14:41:41.382759197 +0200
> +++ /home/heikki/git-sandbox/postgresql/src/test/regress/results/create_table.out       2019-03-18 13:49:49.480249055 +0200
> @@ -413,18 +413,17 @@
>         c text,
>         d text
>  ) PARTITION BY RANGE (a oid_ops, plusone(b), c collate "default", d collate "C");
> +ERROR:  function plusone(integer) does not exist
> +HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

Are you seeing that?

Looking at the patches 1 and 2 again:

I'm still not totally happy with the program flow and all the conditions 
in _bt_findsplitloc(). I have a hard time following which codepaths are 
taken when. I refactored that, so that there is a separate copy of the 
loop for V3 and V4 indexes. So, when the code used to be something like 
this:

_bt_findsplitloc(...)
{
     ...

     /* move right, if needed */
     for(;;)
     {
         /*
          * various conditions for when to stop. Different conditions
          * apply depending on whether it's a V3 or V4 index.
          */
     }

     ...
}

it is now:

_bt_findsplitloc(...)
{
     ...

     if (heapkeyspace)
     {
         /*
          * If 'checkingunique', move right to the correct page.
          */
         for (;;)
         {
             ...
         }
     }
     else
     {
         /*
          * Move right, until we find a page with enough space or "get
          * tired"
          */
         for (;;)
         {
             ...
         }
     }

     ...
}

I think this makes the logic easier to understand. Although there is 
some commonality, the conditions for when to move right on a V3 vs V4 
index are quite different, so it seems better to handle them separately. 
There is some code duplication, but it's not too bad. I moved the common
code for stepping to the next page into a new helper function,
_bt_stepright(), which actually seems like a good idea in any case.

See attached patches with those changes, plus some minor comment 
kibitzing. It's still failing the 'create_table' regression test, though.

- Heikki

PS. The commit message of the first patch needs updating, now that 
BTInsertState is different from BTScanInsert.

Вложения

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Peter Geoghegan
Дата:
On Mon, Mar 18, 2019 at 4:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm getting a regression failure from the 'create_table' test with this:

> Are you seeing that?

Yes -- though the bug is in your revised v18, not the original v18,
which passed CFTester. Your revision fails on Travis/Linux, which is
pretty close to what I see locally, and much less subtle than the test
failures you mentioned:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/507816665

"make check" did pass locally for me with your patch, but "make
check-world" (parallel recipe) did not.

The original v18 passed both CFTester tests about 15 hours ago:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/507643402

I see the bug. You're not supposed to test this way with a heapkeyspace index:

> +               if (P_RIGHTMOST(lpageop) ||
> +                   _bt_compare(rel, itup_key, page, P_HIKEY) != 0)
> +                   break;

This is because the presence of scantid makes it almost certain that
you'll break out of the loop when you shouldn't. You're doing it the
old way, which is
inappropriate for a heapkeyspace index. Note that it would probably
take much longer to notice this bug if the "consider secondary
factors" patch was also applied, because then you would rarely have
cause to step right here (duplicates would never occupy more than a
single page in the regression tests). The test failures are probably
also timing sensitive, though they happen very reliably on my machine.
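
(In case it's useful while poking at this: pageinspect's bt_metap()
reports the on-disk version that heapkeyspace is keyed off of, so it's
a quick way to tell a v4 index from a pg_upgrade'd v3 one -- roughly:)

-- requires contrib/pageinspect
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- version 4 indexes are heapkeyspace; pg_upgrade'd version 3 (and
-- earlier) indexes are not
SELECT version FROM bt_metap('pgbench_accounts_pkey');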

> Looking at the patches 1 and 2 again:
>
> I'm still not totally happy with the program flow and all the conditions
> in _bt_findsplitloc(). I have a hard time following which codepaths are
> taken when. I refactored that, so that there is a separate copy of the
> loop for V3 and V4 indexes.

The big difference is that you make the possible call to
_bt_stepright() conditional on this being a checkingunique index --
the duplicate code is indented in that branch of _bt_findsplitloc().
Whereas I break early in the loop when "checkingunique &&
heapkeyspace".

The flow of the original loop not only had less code. It also
contrasted the important differences between heapkeyspace and
!heapkeyspace cases:

        /* If this is the page that the tuple must go on, stop */
        if (P_RIGHTMOST(lpageop))
            break;
        cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
        if (itup_key->heapkeyspace)
        {
            if (cmpval <= 0)
                break;
        }
        else
        {
            /*
             * pg_upgrade'd !heapkeyspace index.
             *
             * May have to handle legacy case where there is a choice of which
             * page to place new tuple on, and we must balance space
             * utilization as best we can.  Note that this may invalidate
             * cached bounds for us.
             */
            if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, insertstate))
                break;
        }

I thought it was obvious that the "cmpval <= 0" code was different for
a reason. It now seems that this at least needs a comment.

I still believe that the best way to handle the !heapkeyspace case is
to make it similar to the heapkeyspace checkingunique case, regardless
of whether or not we're checkingunique. The fact that this bug slipped
in supports that view. Besides, the alternative that you suggest
treats !heapkeyspace indexes as if they were just as important to the
reader as heapkeyspace ones, which seems inappropriate (better to make
the legacy case follow the new case, not the other way around). I'm
fine with the
comment tweaks that you made that are not related to
_bt_findsplitloc(), though.

I won't push the patches today, to give you the opportunity to
respond. I am not at all convinced right now, though.

--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Peter Geoghegan
Дата:
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think it's pretty clear that we have to view that as acceptable.  I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.
> Now, maybe in the future we'll want to work on other techniques for
> reducing contention, but I don't think we should make that the problem
> of this patch, especially because the regressions are small and go
> away after a few hours of heavy use.  We should optimize for the case
> where the user intends to keep the database around for years, not
> hours.

I came back to the question of contention recently. I don't think it's
okay to make contention worse in workloads where indexes are mostly
the same size as before, such as almost any workload that pgbench can
simulate. I have made a lot of the fact that the TPC-C indexes are
~40% smaller, in part because lots of people outside the community
find TPC-C interesting, and in part because this patch series is
focused on cases where we currently do unusually badly (cases where
good intuitions about how B-Trees are supposed to perform break down
[1]). These pinpointable problems must affect a lot of users some of
the time, but certainly not all users all of the time.

The patch series is actually supposed to *improve* the situation with
index buffer lock contention in general, and it looks like it manages
to do that with pgbench, which doesn't do inserts into indexes, except
for those required for non-HOT updates. pgbench requires relatively
few page splits, but is in every other sense a high contention
workload.

With pgbench scale factor 20, here are results for patch and master
with a Gaussian distribution on my 8 thread/4 core home server, with
each reported run lasting 10 minutes, repeated twice for client
counts 1, 2, 8, 16, and 64, on both the patch and the master branch:

\set aid random_gaussian(1, 100000 * :scale, 20)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
(:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

1st pass
========

(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7203.983289 (including connections establishing)
tps = 7204.020457 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.026 ms
1 client patch:
tps = 7012.575167 (including connections establishing)
tps = 7012.590007 (excluding connections establishing)
latency average = 0.143 ms
latency stddev = 0.020 ms

2 clients master:
tps = 13434.043832 (including connections establishing)
tps = 13434.076194 (excluding connections establishing)
latency average = 0.149 ms
latency stddev = 0.032 ms
2 clients patch:
tps = 13105.620223 (including connections establishing)
tps = 13105.654109 (excluding connections establishing)
latency average = 0.153 ms
latency stddev = 0.033 ms

8 clients master:
tps = 27126.852038 (including connections establishing)
tps = 27126.986978 (excluding connections establishing)
latency average = 0.295 ms
latency stddev = 0.095 ms
8 clients patch:
tps = 27945.457965 (including connections establishing)
tps = 27945.565242 (excluding connections establishing)
latency average = 0.286 ms
latency stddev = 0.089 ms

16 clients master:
tps = 32297.612323 (including connections establishing)
tps = 32297.743929 (excluding connections establishing)
latency average = 0.495 ms
latency stddev = 0.185 ms
16 clients patch:
tps = 33434.889405 (including connections establishing)
tps = 33435.021738 (excluding connections establishing)
latency average = 0.478 ms
latency stddev = 0.167 ms

64 clients master:
tps = 25699.029787 (including connections establishing)
tps = 25699.217022 (excluding connections establishing)
latency average = 2.482 ms
latency stddev = 1.715 ms
64 clients patch:
tps = 26513.816673 (including connections establishing)
tps = 26514.013638 (excluding connections establishing)
latency average = 2.405 ms
latency stddev = 1.690 ms

2nd pass
========

(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7172.995796 (including connections establishing)
tps = 7173.013472 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.022 ms
1 client patch:
tps = 7024.724365 (including connections establishing)
tps = 7024.739237 (excluding connections establishing)
latency average = 0.142 ms
latency stddev = 0.021 ms

2 clients master:
tps = 13489.016303 (including connections establishing)
tps = 13489.047968 (excluding connections establishing)
latency average = 0.148 ms
latency stddev = 0.032 ms
2 clients patch:
tps = 13210.292833 (including connections establishing)
tps = 13210.321528 (excluding connections establishing)
latency average = 0.151 ms
latency stddev = 0.029 ms

8 clients master:
tps = 27470.112858 (including connections establishing)
tps = 27470.229891 (excluding connections establishing)
latency average = 0.291 ms
latency stddev = 0.093 ms
8 clients patch:
tps = 28132.981815 (including connections establishing)
tps = 28133.096414 (excluding connections establishing)
latency average = 0.284 ms
latency stddev = 0.081 ms

16 clients master:
tps = 32409.399669 (including connections establishing)
tps = 32409.533400 (excluding connections establishing)
latency average = 0.493 ms
latency stddev = 0.182 ms
16 clients patch:
tps = 33678.304986 (including connections establishing)
tps = 33678.427420 (excluding connections establishing)
latency average = 0.475 ms
latency stddev = 0.168 ms

64 clients master:
tps = 25864.453485 (including connections establishing)
tps = 25864.639098 (excluding connections establishing)
latency average = 2.466 ms
latency stddev = 1.698 ms
64 clients patch:
tps = 26382.926218 (including connections establishing)
tps = 26383.166692 (excluding connections establishing)
latency average = 2.417 ms
latency stddev = 1.678 ms

There was a third run which has been omitted, because it's practically
the same as the first two. The order that results appear in is the
order things actually ran in (I like to interlace master and patch
runs closely).

Analysis
========

There seems to be a ~2% regression with one or two clients, but we
more than make up for that as the client count goes up -- the 8 and 64
client cases improve throughput by ~2.5%, and the 16 client case
improves throughput by ~4%. This seems like a totally reasonable
trade-off to me. As I said already, the patch isn't really about
workloads that we already do acceptably well on, such as this one, so
you're not expected to be impressed with these numbers. My goal is to
show that boring workloads that fit everything in shared_buffers
appear to be fine. I think that that's a reasonable conclusion, based
on these numbers. Lower client count cases are generally considered
less interesting, and also lose less in throughput than we go on to
gain later as more clients are added. I'd be surprised if anybody
complained.

I think that the explanation for the regression with one or two
clients boils down to this: We're making better decisions about where
to split pages, and even about how pages are accessed by index scans
(more on that in the next paragraph). However, this isn't completely
free (particularly the page split stuff), and it doesn't pay for
itself until the number of clients ramps up. However, not being more
careful about that stuff is penny wise, pound foolish. I even suspect
that there are priority inversion issues when there is high contention
during unique index enforcement, which might be a big problem on
multi-socket machines with hundreds of clients. I am not in a position
to confirm that right now, but we have heard reports that are
consistent with this explanation at least once before now [2]. Zipfian
was also somewhat better when I last measured it, using the same
fairly modest machine -- I didn't repeat that here because I wanted
something simple and widely studied.

The patch establishes the principle that there is only one good reason
to visit more than one leaf page within index scans like those used by
pgbench: a concurrent page split, where the scan simply must go right
to find matches that were just missed in the first leaf page. That
should be very rare. We should never visit two leaf pages because
we're confused about where there might be matches. There is simply no
good reason for there to be any ambiguity or confusion.

The patch could still make index scans like these visit more than a
single leaf page for a bad reason, at least in theory: when there are
at least ~400 duplicates in a unique index, and we therefore can't
possibly store them all on one leaf page, index scans will of course
have to visit more than one leaf page. Again, that should be very
rare. All index scans can now check the high key on the leaf level,
and avoid going right when they happen to be very close to the right
edge of the leaf page's key space. And, we never have to take the
scenic route when descending the tree on an equal internal page key,
since that condition has practically been eliminated by suffix
truncation. No new tuple can be equal to negative infinity, and
negative infinity appears in every pivot tuple. There is a place for
everything, and everything is in its place.

[1] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/
[2] https://postgr.es/m/BF3B6F54-68C3-417A-BFAB-FB4D66F2B410@postgrespro.ru
--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Robert Haas
Дата:
On Mon, Mar 18, 2019 at 7:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
> With pgbench scale factor 20, here are results for patch and master
> with a Gaussian distribution on my 8 thread/4 core home server, with
> each run reported lasting 10 minutes, repeating twice for client
> counts 1, 2, 8, 16, and 64, patch and master branch:
>
> 1 client master:
> tps = 7203.983289 (including connections establishing)
> 1 client patch:
> tps = 7012.575167 (including connections establishing)
>
> 2 clients master:
> tps = 13434.043832 (including connections establishing)
> 2 clients patch:
> tps = 13105.620223 (including connections establishing)

Blech.  I think the patch has enough other advantages that it's worth
accepting that, but it's not great.  We seem to keep finding reasons
to reduce single client performance in the name of scalability, which
is often reasonable but not wonderful.

> However, this isn't completely
> free (particularly the page split stuff), and it doesn't pay for
> itself until the number of clients ramps up.

I don't really understand that explanation.  It makes sense that more
intelligent page split decisions could require more CPU cycles, but it
is not evident to me why more clients would help better page split
decisions pay off.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Peter Geoghegan
Дата:
On Mon, Mar 18, 2019 at 5:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Blech.  I think the patch has enough other advantages that it's worth
> accepting that, but it's not great.  We seem to keep finding reasons
> to reduce single client performance in the name of scalability, which
> is often reasonable but not wonderful.

The good news is that the quicksort that we now perform in
nbtsplitloc.c is not optimized at all. Heikki thought it premature to
optimize that, for example by inlining/specializing the quicksort. I
can make it about 3x faster fairly easily, which could well change the
picture here. The code will be uglier that way, but not much more
complicated. I even prototyped this, and it made the serial
microbenchmarks I've used noticeably faster; the quicksort clearly
showed up in perf profiles during serial bulk loads.

> > However, this isn't completely
> > free (particularly the page split stuff), and it doesn't pay for
> > itself until the number of clients ramps up.
>
> I don't really understand that explanation.  It makes sense that more
> intelligent page split decisions could require more CPU cycles, but it
> is not evident to me why more clients would help better page split
> decisions pay off.

Smarter choices on page splits pay off with higher client counts
because they reduce contention at likely hot points. It's kind of
crazy that the code in _bt_check_unique() sometimes has to move right,
while holding an exclusive buffer lock on the original page and a
shared buffer lock on its sibling page at the same time. It then has
to hold a third buffer lock concurrently, this time on any heap pages
it is interested in. Each in turn, to check if they're possibly
conflicting. gcov shows that that never happens with the regression
tests once the patch is applied (you can at least get away with only
having one buffer lock on a leaf page at all times in practically all
cases).

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Peter Geoghegan
Дата:
On Mon, Mar 18, 2019 at 5:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Smarter choices on page splits pay off with higher client counts
> because they reduce contention at likely hot points. It's kind of
> crazy that the code in _bt_check_unique() sometimes has to move right,
> while holding an exclusive buffer lock on the original page and a
> shared buffer lock on its sibling page at the same time. It then has
> to hold a third buffer lock concurrently, this time on any heap pages
> it is interested in.

Actually, by the time we get to 16 clients, this workload does make
the indexes and tables smaller. Here is pg_buffercache output
collected after the first 16 client case:

Master
======

        relname        │ relforknumber │ size_main_rel_fork_blocks │ buffer_count │   avg_buffer_usg
───────────────────────┼───────────────┼───────────────────────────┼──────────────┼────────────────────
 pgbench_history       │             0 │                   123,484 │      123,484 │ 4.9989715266755207
 pgbench_accounts      │             0 │                    34,665 │       10,682 │ 4.4948511514697622
 pgbench_accounts_pkey │             0 │                     5,708 │        1,561 │ 4.8731582319026265
 pgbench_tellers       │             0 │                       489 │          489 │ 5.0000000000000000
 pgbench_branches      │             0 │                       284 │          284 │ 5.0000000000000000
 pgbench_tellers_pkey  │             0 │                        56 │           56 │ 5.0000000000000000
....

Patch
=====

        relname        │ relforknumber │ size_main_rel_fork_blocks │ buffer_count │   avg_buffer_usg
───────────────────────┼───────────────┼───────────────────────────┼──────────────┼────────────────────
 pgbench_history       │             0 │                   127,864 │      127,864 │ 4.9980447975974473
 pgbench_accounts      │             0 │                    33,933 │        9,614 │ 4.3517786561264822
 pgbench_accounts_pkey │             0 │                     5,487 │        1,322 │ 4.8857791225416036
 pgbench_tellers       │             0 │                       204 │          204 │ 4.9803921568627451
 pgbench_branches      │             0 │                       198 │          198 │ 4.3535353535353535
 pgbench_tellers_pkey  │             0 │                        14 │           14 │ 5.0000000000000000
....

The main fork for pgbench_history is larger with the patch, obviously,
but that's good -- more transactions completed means more history rows
inserted. pgbench_accounts_pkey is about 4% smaller, which is
probably the most interesting observation that can be made here, but
the tables are also smaller. pgbench_accounts itself is ~2% smaller.
pgbench_branches is ~30% smaller, and pgbench_tellers is 60% smaller.
Of course, the smaller tables were already very small, so maybe that
isn't important. I think that this is due to more effective pruning,
possibly because we get better lock arbitration as a consequence of
better splits, and also because duplicates are in heap TID order. I
haven't observed this effect with larger databases, which have been my
focus.

It isn't weird that shared_buffers doesn't have all the
pgbench_accounts blocks, since, of course, this is highly skewed by
design -- most blocks were never accessed from the table.

This effect seems to be robust, at least with this workload. The
second round of benchmarks (which have their own pgbench -i
initialization) show very similar amounts of bloat at the same point.
It may not be that significant, but it's also not a fluke.
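
(The output above comes from pg_buffercache; a query along these lines
reproduces the same columns -- a sketch, not necessarily the exact
query I ran:)

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT c.relname,
       b.relforknumber,
       pg_relation_size(c.oid, 'main') /
           current_setting('block_size')::int AS size_main_rel_fork_blocks,
       count(*) AS buffer_count,
       avg(b.usagecount) AS avg_buffer_usg
FROM pg_buffercache b
JOIN pg_class c ON pg_relation_filenode(c.oid) = b.relfilenode
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
GROUP BY c.oid, c.relname, b.relforknumber
ORDER BY buffer_count DESC;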

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Peter Geoghegan
Дата:
On Mon, Mar 18, 2019 at 10:17 AM Peter Geoghegan <pg@bowt.ie> wrote:
> The big difference is that you make the possible call to
> _bt_stepright() conditional on this being a checkingunique index --
> the duplicate code is indented in that branch of _bt_findsplitloc().
> Whereas I break early in the loop when "checkingunique &&
> heapkeyspace".

Heikki and I discussed this issue privately, over IM, and reached
final agreement on remaining loose ends. I'm going to use his code for
_bt_findsplitloc(). Plan to push a final version of the first four
patches tomorrow morning PST.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Peter Geoghegan
Дата:
On Tue, Mar 19, 2019 at 4:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Heikki and I discussed this issue privately, over IM, and reached
> final agreement on remaining loose ends. I'm going to use his code for
> _bt_findsplitloc(). Plan to push a final version of the first four
> patches tomorrow morning PST.

I've committed the first 4 patches. Many thanks to Heikki for his very
valuable help! Thanks also to the other reviewers.

I'll likely push the remaining two patches on Sunday or Monday.

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Peter Geoghegan
Дата:
On Thu, Mar 21, 2019 at 10:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I've committed the first 4 patches. Many thanks to Heikki for his very
> valuable help! Thanks also to the other reviewers.
>
> I'll likely push the remaining two patches on Sunday or Monday.

I noticed that if I initdb and run "make installcheck" with and
without the "split after new tuple" optimization patch, the largest
system catalog indexes shrink quite noticeably:

Master
======
pg_depend_depender_index 1456 kB
pg_depend_reference_index 1416 kB
pg_class_tblspc_relfilenode_index 224 kB

Patch
=====
pg_depend_depender_index 1088 kB   -- ~25% smaller
pg_depend_reference_index 1136 kB   -- ~20% smaller
pg_class_tblspc_relfilenode_index 160 kB -- 28% smaller
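
(Sizes like the ones above can be pulled with something along these
lines -- a sketch, not the literal command I used:)

SELECT c.relname, pg_size_pretty(pg_relation_size(c.oid)) AS size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'pg_catalog' AND c.relkind = 'i'
ORDER BY pg_relation_size(c.oid) DESC
LIMIT 10;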

This is interesting to me because it is further evidence that the
problem that the patch targets is reasonably common. It's also
interesting to me because we benefit despite the fact there are a lot
of duplicates in parts of these indexes; we vary our strategy at
different parts of the key space, which works well. We pack pages
tightly where they're full of duplicates, using the "single value"
strategy that I've already committed, whereas the apply the "split
after new tuple" optimization in parts of the index with localized
monotonically increasing insertions. If there were no duplicates in
the indexes, then they'd be about 40% smaller, which is exactly what
we see with the TPC-C indexes (they're all unique indexes, with very
few physical duplicates). Looks like the duplicates are mostly
bootstrap mode entries. Lots of the pg_depend_depender_index
duplicates look like "(classid, objid, objsubid)=(0, 0, 0)", for
example.
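
(That's easy to confirm from SQL; a quick sketch that shows the most
duplicated keys in that index's keyspace:)

SELECT classid, objid, objsubid, count(*) AS n
FROM pg_depend
GROUP BY classid, objid, objsubid
ORDER BY n DESC
LIMIT 3;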

I also noticed one further difference: the pg_shdepend_depender_index
index grew from 40 kB to 48 kB. I guess that might count as a
regression, though I'm not sure that it should. I think that we would
do better if the volume of data in the underlying table was greater.
contrib/pageinspect shows that a small number of the leaf pages in the
improved cases are not very full at all, which is more than made up
for by the fact that many more pages are packed as if they were
created by a rightmost split (262 items of 24-byte tuples is exactly
consistent with that). IOW, I suspect that the extra page in
pg_shdepend_depender_index is due to a "local minimum".
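
(The pageinspect observation can be reproduced by running
bt_page_stats() over every block; roughly like this -- block 0 is the
metapage, so it's skipped:)

-- requires contrib/pageinspect
SELECT s.blkno, s.type, s.live_items, s.avg_item_size, s.free_size
FROM generate_series(1,
         pg_relation_size('pg_shdepend_depender_index') /
         current_setting('block_size')::int - 1) AS blkno,
     LATERAL bt_page_stats('pg_shdepend_depender_index', blkno::int) AS s
WHERE s.type = 'l'
ORDER BY s.free_size DESC;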

-- 
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От
Peter Geoghegan
Дата:
On Fri, Mar 22, 2019 at 2:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Mar 21, 2019 at 10:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > I'll likely push the remaining two patches on Sunday or Monday.
>
> I noticed that if I initdb and run "make installcheck" with and
> without the "split after new tuple" optimization patch, the largest
> system catalog indexes shrink quite noticeably:

I pushed this final patch a week ago, as commit f21668f3, concluding
work on integrating the patch series.

I have some closing thoughts that I would like to close out the
project on. I was casually discussing this project over IM with Robert
the other day. I was asked a question I'd often asked myself about the
"split after new item" heuristics: What if you're wrong? What if some
"black swan" type workload fools your heuristics into bloating an
index uncontrollably?

I gave an answer to his question that may have seemed kind of
inscrutable. My intuition about the worst case for the heuristics is
based on its similarity to the worst case for quicksort. Any
real-world instance of quicksort going quadratic is essentially a case
where we *consistently* do the wrong thing when selecting a pivot. A
random pivot selection will still perform reasonably well, because
we'll still choose the median pivot on average. A malicious actor will
always be able to fool any quicksort implementation into going
quadratic [1] in certain circumstances. We're defending against
Murphy, not Machiavelli, though, so that's okay.

I think that I can produce a more tangible argument than this, though.
Attached patch removes every heuristic that limits the application of
the "split after new item" optimization (it doesn't force the
optimization in the case of rightmost splits, or in the case where the
new item happens to be first on the page, since the caller isn't prepared
for that). This is an attempt to come up with a wildly exaggerated
worst case. Nevertheless, the consequences are not actually all that
bad. Summary:

* The "UK land registry" test case that I leaned on a lot for the
patch has a final index that's about 1% larger. However, it was about
16% smaller compared to Postgres without the patch, so this is not a
problem.

* Most of the TPC-C indexes are actually slightly smaller, because we
didn't quite go as far as we could have (TPC-C strongly rewards this
optimization). 8 out of the 10 indexes are either smaller or
unchanged. The customer name index is about 28% larger, though. The
oorder table index is also about 28% larger.

* TPC-E never benefits from the "split after new item" optimization,
and yet the picture isn't so bad here either. The holding history PK
is about 40% bigger, which is quite bad, and the biggest regression
overall. However, in other affected cases indexes are about 15%
larger, which is not that bad.

Also attached are the regressions from my test suite in the form of
diff files -- these are the full details of the regression, just in
case that's interesting to somebody.

This isn't the final word. I'm not asking anybody to accept with total
certainty that there can never be a "black swan" workload that the
heuristics consistently mishandle, leading to pathological
performance. However, I think it's fair to say that the risk of that
happening has been managed well. The attached test patch literally
removes any restraint on applying the optimization, and yet we
arguably do no worse than Postgres 11 would overall.

Once again, I would like to thank my collaborators for all their help,
especially Heikki.

[1] https://www.cs.dartmouth.edu/~doug/mdmspe.pdf
-- 
Peter Geoghegan

Вложения