Обсуждение: Making all nbtree entries unique by having heap TIDs participate in comparisons

Поиск

Список

Период

Сортировка

Making all nbtree entries unique by having heap TIDs participate in comparisons

От

Peter Geoghegan

Дата:

15 июня 2018 г., 00:44:46

I've been thinking about using heap TID as a tie-breaker when
comparing B-Tree index tuples for a while now [1]. I'd like to make
all tuples at the leaf level unique, as assumed by L&Y. This can
enable "retail index tuple deletion", which I think we'll probably end
up implementing in some form or another, possibly as part of the zheap
project. It's also possible that this work will facilitate GIN-style
deduplication based on run length encoding of TIDs, or storing
versioned heap TIDs in an out-of-line nbtree-versioning structure
(unique indexes only). I can see many possibilities, but we have to
start somewhere.

I attach an unfinished prototype of suffix truncation, that also
sometimes *adds* a new attribute in pivot tuples. It adds an extra
heap TID from the leaf level when truncating away non-distinguishing
attributes during a leaf page split, though only when it must. The
patch also has nbtree treat heap TID as a first class part of the key
space of the index. Claudio wrote a patch that did something similar,
though without the suffix truncation part [2] (I haven't studied his
patch, to be honest). My patch is actually a very indirect spin-off of
Anastasia's covering index patch, and I want to show what I have in
mind now, while it's still swapped into my head. I won't do any
serious work on this project unless and until I see a way to implement
retail index tuple deletion, which seems like a multi-year project
that requires the buy-in of multiple senior community members. On its
own, my patch regresses performance unacceptably in some workloads,
probably due to interactions with kill_prior_tuple()/LP_DEAD hint
setting, and interactions with page space management when there are
many "duplicates" (it can still help performance in some pgbench
workloads with non-unique indexes, though).

Note that the approach to suffix truncation that I've taken isn't even
my preferred approach [3] -- it's a medium-term solution that enables
making a heap TID attribute part of the key space, which enables
everything else. Cheap incremental/retail tuple deletion is the real
prize here; don't lose sight of that when looking through my patch. If
we're going to teach nbtree to truncate this new implicit heap TID
attribute, which seems essential, then we might as well teach nbtree
to do suffix truncation of other (user-visible) attributes while we're
at it. This patch isn't a particularly effective implementation of
suffix truncation, because that's not what I'm truly interested in
improving here (plus I haven't even bothered to optimize the logic for
picking a split point in light of suffix truncation).

amcheck
=======

This patch adds amcheck coverage, which seems like essential
infrastructure for developing a feature such as this. Extensive
amcheck coverage gave me confidence in my general approach. The basic
idea, invariant-wise, is to treat truncated attributes (often
including a truncated heap TID attribute in internal pages) as "minus
infinity" attributes, which participate in comparisons if and only if
we reach such attributes before the end of the scan key (a smaller
keysz for the index scan could prevent this). I've generalized the
minus infinity concept that _bt_compare() has always considered as a
special case, extending it to individual attributes. It's actually
possible to remove that old hard-coded _bt_compare() logic with this
patch applied without breaking anything, since we can rely on the
comparison of an explicitly 0-attribute tuple working the same way
(pg_upgrade'd databases will break if we do this, however, so I didn't
go that far).

Note that I didn't change the logic that has _bt_binsrch() treat
internal pages in a special way when tuples compare as equal. We still
need that logic for cases where keysz is less than the number of
indexed columns. It's only possible to avoid this _bt_binsrch() thing
for internal pages when all attributes, including heap TID, were
specified and compared (an insertion scan key has to have an entry for
every indexed column, including even heap TID). Doing better there
doesn't seem worth the trouble of teaching _bt_compare() to tell the
_bt_binsrch() caller about this as a special case. That means that we
still move left on equality in some cases where it isn't strictly
necessary, contrary to L&Y. However, amcheck verifies that the classic
"Ki < v <= Ki+1" invariant holds (as opposed to "Ki <= v <= Ki+1")
when verifying parent/child relationships, which demonstrates that I
have restored the classic invariant (I just don't find it worthwhile
to take advantage of it within _bt_binsrch() just yet).

Most of this work was done while I was an employee of VMware, though I
joined Crunchy data on Monday and cleaned it up a bit more since then.
I'm excited about joining Crunchy, but I should also acknowledge
VMware's strong support of my work.

[1]
https://wiki.postgresql.org/wiki/Key_normalization#Making_all_items_in_the_index_unique_by_treating_heap_TID_as_an_implicit_last_attribute
[2] https://postgr.es/m/CAGTBQpZ-kTRQiAa13xG1GNe461YOwrA-s-ycCQPtyFrpKTaDBQ@mail.gmail.com
[3] https://wiki.postgresql.org/wiki/Key_normalization#Suffix_truncation_of_normalized_keys
--
Peter Geoghegan

Вложения

0001-Ensure-nbtree-leaf-tuple-keys-are-always-unique.patch

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Robert Haas

Дата:

16 июня 2018 г., 03:36:10

On Thu, Jun 14, 2018 at 2:44 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I've been thinking about using heap TID as a tie-breaker when
> comparing B-Tree index tuples for a while now [1]. I'd like to make
> all tuples at the leaf level unique, as assumed by L&Y. This can
> enable "retail index tuple deletion", which I think we'll probably end
> up implementing in some form or another, possibly as part of the zheap
> project. It's also possible that this work will facilitate GIN-style
> deduplication based on run length encoding of TIDs, or storing
> versioned heap TIDs in an out-of-line nbtree-versioning structure
> (unique indexes only). I can see many possibilities, but we have to
> start somewhere.

Yes, retail index deletion is essential for the delete-marking
approach that is proposed for zheap.

It could also be extremely useful in some workloads with the regular
heap.  If the indexes are large -- say, 100GB -- and the number of
tuples that vacuum needs to kill is small -- say, 5 -- scanning them
all to remove the references to those tuples is really inefficient.
If we had retail index deletion, then we could make a cost-based
decision about which approach to use in a particular case.

> mind now, while it's still swapped into my head. I won't do any
> serious work on this project unless and until I see a way to implement
> retail index tuple deletion, which seems like a multi-year project
> that requires the buy-in of multiple senior community members.

Can you enumerate some of the technical obstacles that you see?

> On its
> own, my patch regresses performance unacceptably in some workloads,
> probably due to interactions with kill_prior_tuple()/LP_DEAD hint
> setting, and interactions with page space management when there are
> many "duplicates" (it can still help performance in some pgbench
> workloads with non-unique indexes, though).

I think it would be helpful if you could talk more about these
regressions (and the wins).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

16 июня 2018 г., 05:46:44

On Fri, Jun 15, 2018 at 2:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Yes, retail index deletion is essential for the delete-marking
> approach that is proposed for zheap.

Makes sense.

I don't know that much about zheap. I'm sure that retail index tuple
deletion is really important in general, though. The Gray & Reuter
book treats unique keys as a basic assumption, as do other
authoritative reference works and papers. Other database systems
probably make unique indexes simply use the user-visible attributes as
unique values, but appending heap TID as a unique-ifier is probably a
reasonably common design for secondary indexes (it would also be nice
if we could simply not store duplicates for unique indexes, rather
than using heap TID). I generally have a very high opinion on the
nbtree code, but this seems like a problem that ought to be fixed.

I've convinced myself that I basically have the right idea with this
patch, because the classic L&Y invariants have all been tested with an
enhanced amcheck run against all indexes in a regression test
database. There was other stress-testing, too. The remaining problems
are fixable, but I need some guidance.

> It could also be extremely useful in some workloads with the regular
> heap.  If the indexes are large -- say, 100GB -- and the number of
> tuples that vacuum needs to kill is small -- say, 5 -- scanning them
> all to remove the references to those tuples is really inefficient.
> If we had retail index deletion, then we could make a cost-based
> decision about which approach to use in a particular case.

I remember talking to Andres about this in a bar 3 years ago. I can
imagine variations of pruning that do some amount of this when there
are lots of duplicates. Perhaps something like InnoDB's purge threads,
which do things like in-place deletes of secondary indexes after an
updating (or deleting) xact commits. I believe that that mechanism
targets secondary indexes specifically, and that is operates quite
eagerly.

> Can you enumerate some of the technical obstacles that you see?

The #1 technical obstacle is that I simply don't know where I should
try to take this patch, given that it probably needs to be tied to
some much bigger project, such as zheap. I have an open mind, though,
and intend to help if I can. I'm not really sure what the #2 and #3
problems are, because I'd need to be able to see a few steps ahead to
be sure. Maybe #2 is that I'm doing something wonky to avoid breaking
duplicate checking for unique indexes. (The way that unique duplicate
checking has always worked [1] is perhaps questionable, though.)

> I think it would be helpful if you could talk more about these
> regressions (and the wins).

I think that the performance regressions are due to the fact that when
you have a huge number of duplicates today, it's useful to be able to
claim space to fit further duplicates from almost any of the multiple
leaf pages that contain or have contained duplicates. I'd hoped that
the increased temporal locality that the patch gets would more than
make up for that. As far as I can tell, the problem is that temporal
locality doesn't help enough. I saw that performance was somewhat
improved with extreme Zipf distribution contention, but it went the
other way with less extreme contention. The details are not that fresh
in my mind, since I shelved this patch for a while following limited
performance testing.

The code could certainly use more performance testing, and more
general polishing. I'm not strongly motivated to do that right now,
because I don't quite see a clear path to making this patch useful.
But, as I said, I have an open mind about what the next step should
be.

[1] https://wiki.postgresql.org/wiki/Key_normalization#Avoiding_unnecessary_unique_index_enforcement
-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Claudio Freire

Дата:

18 июня 2018 г., 20:57:12

On Fri, Jun 15, 2018 at 8:47 PM Peter Geoghegan <pg@bowt.ie> wrote:

> > I think it would be helpful if you could talk more about these
> > regressions (and the wins).
>
> I think that the performance regressions are due to the fact that when
> you have a huge number of duplicates today, it's useful to be able to
> claim space to fit further duplicates from almost any of the multiple
> leaf pages that contain or have contained duplicates. I'd hoped that
> the increased temporal locality that the patch gets would more than
> make up for that. As far as I can tell, the problem is that temporal
> locality doesn't help enough. I saw that performance was somewhat
> improved with extreme Zipf distribution contention, but it went the
> other way with less extreme contention. The details are not that fresh
> in my mind, since I shelved this patch for a while following limited
> performance testing.
>
> The code could certainly use more performance testing, and more
> general polishing. I'm not strongly motivated to do that right now,
> because I don't quite see a clear path to making this patch useful.
> But, as I said, I have an open mind about what the next step should
> be.

Way back when I was dabbling in this kind of endeavor, my main idea to
counteract that, and possibly improve performance overall, was a
microvacuum kind of thing that would do some on-demand cleanup to
remove duplicates or make room before page splits. Since nbtree
uniqueification enables efficient retail deletions, that could end up
as a net win.

I never got around to implementing it though, and it does get tricky
if you don't want to allow unbounded latency spikes.

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

18 июня 2018 г., 23:03:20

On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Way back when I was dabbling in this kind of endeavor, my main idea to
> counteract that, and possibly improve performance overall, was a
> microvacuum kind of thing that would do some on-demand cleanup to
> remove duplicates or make room before page splits. Since nbtree
> uniqueification enables efficient retail deletions, that could end up
> as a net win.

That sounds like a mechanism that works a bit like
_bt_vacuum_one_page(), which we run at the last second before a page
split. We do this to see if a page split that looks necessary can
actually be avoided.

I imagine that retail index tuple deletion (the whole point of this
project) would be run by a VACUUM-like process that kills tuples that
are dead to everyone. Even with something like zheap, you cannot just
delete index tuples until you establish that they're truly dead. I
guess that the delete marking stuff that Robert mentioned marks tuples
as dead when the deleting transaction commits. Maybe we could justify
having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if
they're visible to anyone, and if not recycle), because we at least
know that the deleting transaction committed there. That is, they
could be recently dead or dead, and it may be worth going to the extra
trouble of checking which when we know that it's one of the two
possibilities.

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Claudio Freire

Дата:

18 июня 2018 г., 23:39:14

On Mon, Jun 18, 2018 at 2:03 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > Way back when I was dabbling in this kind of endeavor, my main idea to
> > counteract that, and possibly improve performance overall, was a
> > microvacuum kind of thing that would do some on-demand cleanup to
> > remove duplicates or make room before page splits. Since nbtree
> > uniqueification enables efficient retail deletions, that could end up
> > as a net win.
>
> That sounds like a mechanism that works a bit like
> _bt_vacuum_one_page(), which we run at the last second before a page
> split. We do this to see if a page split that looks necessary can
> actually be avoided.
>
> I imagine that retail index tuple deletion (the whole point of this
> project) would be run by a VACUUM-like process that kills tuples that
> are dead to everyone. Even with something like zheap, you cannot just
> delete index tuples until you establish that they're truly dead. I
> guess that the delete marking stuff that Robert mentioned marks tuples
> as dead when the deleting transaction commits. Maybe we could justify
> having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if
> they're visible to anyone, and if not recycle), because we at least
> know that the deleting transaction committed there. That is, they
> could be recently dead or dead, and it may be worth going to the extra
> trouble of checking which when we know that it's one of the two
> possibilities.

Yes, but currently bt_vacuum_one_page does local work on the pinned
page. Doing dead tuple deletion however involves reading the heap to
check visibility at the very least, and doing it on the whole page
might involve several heap fetches, so it's an order of magnitude
heavier if done naively.

But the idea is to do just that, only not naively.

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Amit Kapila

Дата:

19 июня 2018 г., 17:03:40

On Mon, Jun 18, 2018 at 10:33 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Way back when I was dabbling in this kind of endeavor, my main idea to
>> counteract that, and possibly improve performance overall, was a
>> microvacuum kind of thing that would do some on-demand cleanup to
>> remove duplicates or make room before page splits. Since nbtree
>> uniqueification enables efficient retail deletions, that could end up
>> as a net win.
>
> That sounds like a mechanism that works a bit like
> _bt_vacuum_one_page(), which we run at the last second before a page
> split. We do this to see if a page split that looks necessary can
> actually be avoided.
>
> I imagine that retail index tuple deletion (the whole point of this
> project) would be run by a VACUUM-like process that kills tuples that
> are dead to everyone. Even with something like zheap, you cannot just
> delete index tuples until you establish that they're truly dead. I
> guess that the delete marking stuff that Robert mentioned marks tuples
> as dead when the deleting transaction commits.
>

No, I don't think that is the case because we want to perform in-place
updates for indexed-column-updates.  If we won't delete-mark the index
tuple before performing in-place update, then we will have two tuples
in the index which point to the same heap-TID.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

19 июня 2018 г., 23:43:10

On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I imagine that retail index tuple deletion (the whole point of this
>> project) would be run by a VACUUM-like process that kills tuples that
>> are dead to everyone. Even with something like zheap, you cannot just
>> delete index tuples until you establish that they're truly dead. I
>> guess that the delete marking stuff that Robert mentioned marks tuples
>> as dead when the deleting transaction commits.
>>
>
> No, I don't think that is the case because we want to perform in-place
> updates for indexed-column-updates.  If we won't delete-mark the index
> tuple before performing in-place update, then we will have two tuples
> in the index which point to the same heap-TID.

How can an old MVCC snapshot that needs to find the heap tuple using
some now-obsolete key values get to the heap tuple via an index scan
if there are no index tuples that stick around until "recently dead"
heap tuples become "fully dead"? How can you avoid keeping around both
old and new index tuples at the same time?

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Amit Kapila

Дата:

20 июня 2018 г., 09:52:49

On Tue, Jun 19, 2018 at 11:13 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I imagine that retail index tuple deletion (the whole point of this
>>> project) would be run by a VACUUM-like process that kills tuples that
>>> are dead to everyone. Even with something like zheap, you cannot just
>>> delete index tuples until you establish that they're truly dead. I
>>> guess that the delete marking stuff that Robert mentioned marks tuples
>>> as dead when the deleting transaction commits.
>>>
>>
>> No, I don't think that is the case because we want to perform in-place
>> updates for indexed-column-updates.  If we won't delete-mark the index
>> tuple before performing in-place update, then we will have two tuples
>> in the index which point to the same heap-TID.
>
> How can an old MVCC snapshot that needs to find the heap tuple using
> some now-obsolete key values get to the heap tuple via an index scan
> if there are no index tuples that stick around until "recently dead"
> heap tuples become "fully dead"? How can you avoid keeping around both
> old and new index tuples at the same time?
>

Both values will be present in the index, but the old value will be
delete-marked.  It is correct that we can't remove the value (index
tuple) from the index until it is truly dead (not visible to anyone),
but during a delete or index-update operation, we need to traverse the
index to mark the entries as delete-marked.  See, at this stage, I
don't want to go in too much detail discussion of how delete-marking
will happen in zheap and also I am not sure this thread is the right
place to discuss details of that technology.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

20 июня 2018 г., 10:16:55

On Tue, Jun 19, 2018 at 8:52 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Both values will be present in the index, but the old value will be
> delete-marked.  It is correct that we can't remove the value (index
> tuple) from the index until it is truly dead (not visible to anyone),
> but during a delete or index-update operation, we need to traverse the
> index to mark the entries as delete-marked.  See, at this stage, I
> don't want to go in too much detail discussion of how delete-marking
> will happen in zheap and also I am not sure this thread is the right
> place to discuss details of that technology.

I don't understand, but okay. I can provide feedback once a design for
delete marking is available.


-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

02 июля 2018 г., 23:43:30

On Thu, Jun 14, 2018 at 11:44 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> I attach an unfinished prototype of suffix truncation, that also
> sometimes *adds* a new attribute in pivot tuples. It adds an extra
> heap TID from the leaf level when truncating away non-distinguishing
> attributes during a leaf page split, though only when it must. The
> patch also has nbtree treat heap TID as a first class part of the key
> space of the index. Claudio wrote a patch that did something similar,
> though without the suffix truncation part [2] (I haven't studied his
> patch, to be honest). My patch is actually a very indirect spin-off of
> Anastasia's covering index patch, and I want to show what I have in
> mind now, while it's still swapped into my head. I won't do any
> serious work on this project unless and until I see a way to implement
> retail index tuple deletion, which seems like a multi-year project
> that requires the buy-in of multiple senior community members. On its
> own, my patch regresses performance unacceptably in some workloads,
> probably due to interactions with kill_prior_tuple()/LP_DEAD hint
> setting, and interactions with page space management when there are
> many "duplicates" (it can still help performance in some pgbench
> workloads with non-unique indexes, though).

I attach a revised version, which is still very much of prototype
quality, but manages to solve a few of the problems that v1 had.
Andrey Lepikhov (CC'd) asked me to post any improved version I might
have for use with his retail index tuple deletion patch, so I thought
I'd post what I have.

The main development for v2 is that the sort order of the implicit
heap TID attribute is flipped. In v1, it was in "ascending" order. In
v2, comparisons of heap TIDs are inverted to make the attribute order
"descending". This has a number of advantages:

* It's almost consistent with the current behavior when there are
repeated insertions of duplicates. Currently, this tends to result in
page splits of the leftmost leaf page among pages that mostly consist
of the same duplicated value. This means that the destabilizing impact
on DROP SCHEMA ... CASCADE regression test output noted before [1] is
totally eliminated. There is now only a single trivial change to
regression test "expected" files, whereas in v1 dozens of "expected"
files had to be changed, often resulting in less useful reports for
the user.

* The performance regression I observed with various pgbench workloads
seems to have gone away, or is now within the noise range. A patch
like this one requires a lot of validation and testing, so this should
be taken with a grain of salt.

I may have been too quick to give up on my original ambition of
writing a stand-alone patch that can be justified entirely on its own
merits, without being tied to some much more ambitious project like
retail index tuple deletion by VACUUM, or zheap's deletion marking. I
still haven't tried to replace the kludgey handling of unique index
enforcement, even though that would probably have a measurable
additional performance benefit. I think that this patch could become
an unambiguous win.

[1] https://postgr.es/m/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-MvcM_pHN_w@mail.gmail.com
-- 
Peter Geoghegan

Вложения

v2-0001-Ensure-nbtree-leaf-tuple-keys-are-always-unique.patch

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

18 июля 2018 г., 01:21:27

Attached is my v3, which has some significant improvements:

* The hinting for unique index inserters within _bt_findinsertloc()
has been restored, more or less.

* Bug fix for case where left side of split comes from tuple being
inserted. We need to pass this to _bt_suffix_truncate() as the left
side of the split, which we previously failed to do. The amcheck
coverage I've added allowed me to catch this issue during a benchmark.
(I use amcheck during benchmarks to get some amount of stress-testing
in.)

* New performance optimization that allows us to descend a downlink
when its user-visible attributes have scankey-equal values. We avoid
an unnecessary move left by using a sentinel scan tid that's less than
any possible real heap TID, but still greater than minus infinity to
_bt_compare().

I am now considering pursuing this as a project in its own right,
which can be justified without being part of some larger effort to add
retail index tuple deletion (e.g. by VACUUM). I think that I can get
it to the point of being a totally unambiguous win, if I haven't
already. So, this patch is no longer just an interesting prototype of
a new architectural direction we should take. In any case, it has far
fewer problems than v2.

Testing the performance characteristics of this patch has proven
difficult. My home server seems to show a nice win with a pgbench
workload that uses a Gaussian distribution for the pgbench_accounts
queries (script attached). That seems consistent and reproducible. My
home server has 32GB of RAM, and has a Samsung SSD 850 EVO SSD, with a
250GB capacity. With shared_buffers set to 12GB, 80 minute runs at
scale 4800 look like this:

Master:

25 clients:
tps = 15134.223357 (excluding connections establishing)

50 clients:
tps = 13708.419887 (excluding connections establishing)

75 clients:
tps = 12951.286926 (excluding connections establishing)

90 clients:
tps = 12057.852088 (excluding connections establishing)

Patch:

25 clients:
tps = 17857.863353 (excluding connections establishing)

50 clients:
tps = 14319.514825 (excluding connections establishing)

75 clients:
tps = 14015.794005 (excluding connections establishing)

90 clients:
tps = 12495.683053 (excluding connections establishing)

I ran this twice, and got pretty consistent results each time (there
were many other benchmarks on my home server -- this was the only one
that tested this exact patch, though). Note that there was only one
pgbench initialization for each set of runs. It looks like a pretty
strong result for the patch - note that the accounts table is about
twice the size of available main memory. The server is pretty well
overloaded in every individual run.

Unfortunately, I have a hard time showing much of any improvement on a
storage-optimized AWS instance with EBS storage, with scaled up
pgbench scale and main memory. I'm using an i3.4xlarge, which has 16
vCPUs, 122 GiB RAM, and 2 SSDs in a software RAID0 configuration. It
appears to more or less make no overall difference there, for reasons
that I have yet to get to the bottom of. I conceived this AWS
benchmark as something that would have far longer run times with a
scaled-up database size. My expectation was that it would confirm the
preliminary result, but it hasn't.

Maybe the issue is that it's far harder to fill the I/O queue on this
AWS instance? Or perhaps its related to the higher latency of EBS,
compared to the local SSD on my home server? I would welcome any ideas
about how to benchmark the patch. It doesn't necessarily have to be a
huge win for a very generic workload like the one I've tested, since
it would probably still be enough of a win for things like free space
management in secondary indexes [1]. Plus, of course, it seems likely
that we're going to eventually add retail index tuple deletion in some
form or another, which this is prerequisite to.

For a project like this, I expect an unambiguous, across the board win
from the committed patch, even if it isn't a huge win. I'm encouraged
by the fact that this is starting to look like credible as a
stand-alone patch, but I have to admit that there's probably still
significant gaps in my understanding of how it affects real-world
performance. I don't have a lot of recent experience with benchmarking
workloads like this one.

[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
--
Peter Geoghegan

Вложения

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Andrey Lepikhov

Дата:

02 августа 2018 г., 10:48:34

I use v3 version of the patch for a Retail Indextuple Deletion and from 
time to time i catch regression test error (see attachment).
As i see in regression.diff, the problem is instability order of DROP 
... CASCADE deletions.
Most frequently i get error on a test called 'updatable views'.
I check nbtree invariants during all tests, but index relations is in 
consistent state all time.
My hypothesis is: instability order of logical duplicates in index 
relations on a pg_depend relation.
But 'updatable views' test not contains any sources of instability: 
concurrent insertions, updates, vacuum and so on. This fact discourage me.
May be you have any ideas on this problem?


18.07.2018 00:21, Peter Geoghegan пишет:
> Attached is my v3, which has some significant improvements:
> 
> * The hinting for unique index inserters within _bt_findinsertloc()
> has been restored, more or less.
> 
> * Bug fix for case where left side of split comes from tuple being
> inserted. We need to pass this to _bt_suffix_truncate() as the left
> side of the split, which we previously failed to do. The amcheck
> coverage I've added allowed me to catch this issue during a benchmark.
> (I use amcheck during benchmarks to get some amount of stress-testing
> in.)
> 
> * New performance optimization that allows us to descend a downlink
> when its user-visible attributes have scankey-equal values. We avoid
> an unnecessary move left by using a sentinel scan tid that's less than
> any possible real heap TID, but still greater than minus infinity to
> _bt_compare().
> 
> I am now considering pursuing this as a project in its own right,
> which can be justified without being part of some larger effort to add
> retail index tuple deletion (e.g. by VACUUM). I think that I can get
> it to the point of being a totally unambiguous win, if I haven't
> already. So, this patch is no longer just an interesting prototype of
> a new architectural direction we should take. In any case, it has far
> fewer problems than v2.
> 
> Testing the performance characteristics of this patch has proven
> difficult. My home server seems to show a nice win with a pgbench
> workload that uses a Gaussian distribution for the pgbench_accounts
> queries (script attached). That seems consistent and reproducible. My
> home server has 32GB of RAM, and has a Samsung SSD 850 EVO SSD, with a
> 250GB capacity. With shared_buffers set to 12GB, 80 minute runs at
> scale 4800 look like this:
> 
> Master:
> 
> 25 clients:
> tps = 15134.223357 (excluding connections establishing)
> 
> 50 clients:
> tps = 13708.419887 (excluding connections establishing)
> 
> 75 clients:
> tps = 12951.286926 (excluding connections establishing)
> 
> 90 clients:
> tps = 12057.852088 (excluding connections establishing)
> 
> Patch:
> 
> 25 clients:
> tps = 17857.863353 (excluding connections establishing)
> 
> 50 clients:
> tps = 14319.514825 (excluding connections establishing)
> 
> 75 clients:
> tps = 14015.794005 (excluding connections establishing)
> 
> 90 clients:
> tps = 12495.683053 (excluding connections establishing)
> 
> I ran this twice, and got pretty consistent results each time (there
> were many other benchmarks on my home server -- this was the only one
> that tested this exact patch, though). Note that there was only one
> pgbench initialization for each set of runs. It looks like a pretty
> strong result for the patch - note that the accounts table is about
> twice the size of available main memory. The server is pretty well
> overloaded in every individual run.
> 
> Unfortunately, I have a hard time showing much of any improvement on a
> storage-optimized AWS instance with EBS storage, with scaled up
> pgbench scale and main memory. I'm using an i3.4xlarge, which has 16
> vCPUs, 122 GiB RAM, and 2 SSDs in a software RAID0 configuration. It
> appears to more or less make no overall difference there, for reasons
> that I have yet to get to the bottom of. I conceived this AWS
> benchmark as something that would have far longer run times with a
> scaled-up database size. My expectation was that it would confirm the
> preliminary result, but it hasn't.
> 
> Maybe the issue is that it's far harder to fill the I/O queue on this
> AWS instance? Or perhaps its related to the higher latency of EBS,
> compared to the local SSD on my home server? I would welcome any ideas
> about how to benchmark the patch. It doesn't necessarily have to be a
> huge win for a very generic workload like the one I've tested, since
> it would probably still be enough of a win for things like free space
> management in secondary indexes [1]. Plus, of course, it seems likely
> that we're going to eventually add retail index tuple deletion in some
> form or another, which this is prerequisite to.
> 
> For a project like this, I expect an unambiguous, across the board win
> from the committed patch, even if it isn't a huge win. I'm encouraged
> by the fact that this is starting to look like credible as a
> stand-alone patch, but I have to admit that there's probably still
> significant gaps in my understanding of how it affects real-world
> performance. I don't have a lot of recent experience with benchmarking
> workloads like this one.
> 
> [1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
> 

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

Вложения

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

02 августа 2018 г., 10:59:48

On Wed, Aug 1, 2018 at 9:48 PM, Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> I use v3 version of the patch for a Retail Indextuple Deletion and from time
> to time i catch regression test error (see attachment).
> As i see in regression.diff, the problem is instability order of DROP ...
> CASCADE deletions.
> Most frequently i get error on a test called 'updatable views'.
> I check nbtree invariants during all tests, but index relations is in
> consistent state all time.
> My hypothesis is: instability order of logical duplicates in index relations
> on a pg_depend relation.
> But 'updatable views' test not contains any sources of instability:
> concurrent insertions, updates, vacuum and so on. This fact discourage me.
> May be you have any ideas on this problem?

It's caused by an implicit dependency on the order of items in an
index. See https://www.postgresql.org/message-id/20180504022601.fflymidf7eoencb2%40alvherre.pgsql.

I've been making "\set VERBOSITY terse" changes like this whenever it
happens in a new place. It seems to have finally stopped happening.
Note that this is a preexisting issue; there are already places in the
regression tests where we paper over the problem in a similar way. I
notice that it tends to happen when the machine running the regression
tests is heavily loaded.

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

13 сентября 2018 г., 00:11:18

Attached is v4. I have two goals in mind for this revision, goals that
are of great significance to the project as a whole:

* Making better choices around leaf page split points, in order to
maximize suffix truncation and thereby maximize fan-out. This is
important when there are mostly-distinct index tuples on each leaf
page (i.e. most of the time). Maximizing the effectiveness of suffix
truncation needs to be weighed against the existing/main
consideration: evenly distributing space among each half of a page
split. This is tricky.

* Not regressing the logic that lets us pack leaf pages full when
there are a great many logical duplicates. That is, I still want to
get the behavior I described on the '"Write amplification" is made
worse by "getting tired" while inserting into nbtree secondary
indexes' thread [1]. This is not something that happens as a
consequence of thinking about suffix truncation specifically, and
seems like a fairly distinct thing to me. It's actually a bit similar
to the rightmost 90/10 page split case.

v4 adds significant new logic to make us do better on the first goal,
without hurting the second goal. It's easy to regress one while
focussing on the other, so I've leaned on a custom test suite
throughout development. Previous versions mostly got the first goal
wrong, but got the second goal right. For the time being, I'm
focussing on index size, on the assumption that I'll be able to
demonstrate a nice improvement in throughput or latency later. I can
get the main TPC-C order_line pkey about 7% smaller after an initial
bulk load with the new logic added to get the first goal (note that
the benefits with a fresh CREATE INDEX are close to zero). The index
is significantly smaller, even though the internal page index tuples
can themselves never be any smaller due to alignment -- this is all
about not restricting what can go on each leaf page by too much. 7% is
not as dramatic as the "get tired" case, which saw something like a
50% decrease in bloat for one pathological case, but it's still
clearly well worth having. The order_line primary key is the largest
TPC-C index, and I'm merely doing a standard bulk load to get this 7%
shrinkage. The TPC-C order_line primary key happens to be kind of
adversarial or pathological to B-Tree space management in general, but
it's still fairly realistic.

For the first goal, page splits now weigh what I've called the
"distance" between tuples, with a view to getting the most
discriminating split point -- the leaf split point that maximizes the
effectiveness of suffix truncation, within a range of acceptable split
points (acceptable from the point of view of not implying a lopsided
page split). This is based on probing IndexTuple contents naively when
deciding on a split point, without regard for the underlying
opclass/types. We mostly just use char integer comparisons to probe,
on the assumption that that's a good enough proxy for using real
insertion scankey comparisons (only actual truncation goes to those
lengths, since that's a strict matter of correctness). This distance
business might be considered a bit iffy by some, so I want to get
early feedback. This new "distance" code clearly needs more work, but
I felt that I'd gone too long without posting a new version.

For the second goal, I've added a new macro that can be enabled for
debugging purposes. This has the implementation sort heap TIDs in ASC
order, rather than DESC order. This nicely demonstrates how my two
goals for v4 are fairly independent; uncommenting "#define
BTREE_ASC_HEAP_TID" will cause a huge regression with cases where many
duplicates are inserted, but won't regress things like the TPC-C
indexes. (Note that BTREE_ASC_HEAP_TID will break the regression
tests, though in a benign way can safely be ignored.)

Open items:

* Do more traditional benchmarking.

* Add pg_upgrade support.

* Simplify _bt_findsplitloc() logic.

[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
--
Peter Geoghegan

Вложения

v4-0001-Make-all-nbtree-index-tuples-have-unique-keys.patch

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

20 сентября 2018 г., 00:23:11

Attached is v5, which significantly simplifies the _bt_findsplitloc()
logic. It's now a great deal easier to follow. It would be helpful if
someone could do code-level review of the overhauled
_bt_findsplitloc(). That's the most important part of the patch. It
involves relatively subjective trade-offs around total effort spent
during a page split, space utilization, and avoiding "false sharing"
(I call the situation where a range of duplicate values straddle two
leaf pages unnecessarily "false sharing", since it necessitates that
subsequent index scans visit two index scans rather than just one,
even when that's avoidable.)

This version has slightly improved performance, especially for cases
where an index gets bloated without any garbage being generated. With
the UK land registry data [1], an index on (county, city, locality) is
shrunk by just over 18% by the new logic (I recall that it was shrunk
by ~15% in an earlier version). In concrete terms, it goes from being
1.288 GiB on master to being 1.054 GiB with v5 of the patch. This is
mostly because the patch intelligently packs together duplicate-filled
pages tightly (in particular, it avoids "getting tired"), but also
because it makes pivots less restrictive about where leaf tuples can
go. I still manage to shrink the largest TPC-C and TPC-H indexes by at
least 5% following an initial load performed by successive INSERTs.
Those are unique indexes, so the benefits are certainly not limited to
cases involving many duplicates.

3 modes
-------

My new approach is to teach _bt_findsplitloc() 3 distinct modes of
operation: Regular/default mode, many duplicates mode, and single
value mode. The higher level split code always asks for a default mode
call to _bt_findsplitloc(), so that's always where we start. For leaf
page splits, _bt_findsplitloc() will occasionally call itself
recursively in either many duplicates mode or single value mode. This
happens when the default strategy doesn't work out.

* Default mode almost does what we do already, but remembers the top n
candidate split points, sorted by the delta between left and right
post-split free space, rather than just looking for the overall lowest
delta split point.

Then, we go through a second pass over the temp array of "acceptable"
split points, that considers the needs of suffix truncation.

* Many duplicates mode is used when we fail to find a "distinguishing"
split point in regular mode, but have determined that it's possible to
get one if a new, exhaustive search is performed.

We go to great lengths to avoid having to append a heap TID to the new
left page high key -- that's what I mean by "distinguishing". We're
particularly concerned with false sharing by subsequent point lookup
index scans here.

* Single value mode is used when we see that even many duplicates mode
would be futile, as the leaf page is already *entirely* full of
logical duplicates.

Single value mode isn't exhaustive, since there is clearly nothing to
exhaustively search for. Instead, it packs together as many tuples as
possible on the right side of the split. Since heap TIDs sort in
descending order, this is very much like a "leftmost" split that tries
to free most of the space on the left side, and pack most of the page
contents on the right side. Except that it's leftmost, and in
particular is leftmost among pages full of logical duplicates (as
opposed to being leftmost/rightmost among pages on an entire level of
the tree, as with the traditional rightmost 90:10 split thing).

Other changes
-------------

* I now explicitly use fillfactor in the manner of a rightmost split
to get the single value mode behavior.

I call these types of splits (rightmost and single value mode splits)
"weighted" splits in the patch. This is much more consistent with our
existing conventions than my previous approach.

* Improved approached to inexpensively determining how effective
suffix truncation will be for a given candidate split point.

I no longer naively probe the contents of index tuples to do char
comparisons.  Instead, I use a tuple descriptor to get offsets to each
attribute in each tuple in turn, then calling to datumIsEqual() to
determine if they're equal. This is almost as good as a full scan key
comparison. This actually seems to be a bit faster, and also takes
care of INCLUDE indexes without special care (no need to worry about
probing non-key attributes, and reaching a faulty conclusion about
which split point helps with suffix truncation).

I still haven't managed to add pg_upgrade support, but that's my next
step. I am more or less happy with the substance of the patch in v5,
and feel that I can now work backwards towards figuring out the best
way to deal with on-disk compatibility. It shouldn't be too hard --
most of the effort will involve coming up with a good test suite.

[1] https://wiki.postgresql.org/wiki/Sample_Databases
-- 
Peter Geoghegan

On Wed, Sep 19, 2018 at 11:23 AM Peter Geoghegan <pg@bowt.ie> wrote:
> 3 modes
> -------
>
> My new approach is to teach _bt_findsplitloc() 3 distinct modes of
> operation: Regular/default mode, many duplicates mode, and single
> value mode.

I think that I'll have to add a fourth mode, since I came up with
another strategy that is really effective though totally complementary
to the other 3 -- "multiple insertion point" mode. Credit goes to
Kevin Grittner for pointing out that this technique exists about 2
years ago [1]. The general idea is to pick a split point just after
the insertion point of the new item (the incoming tuple that prompted
a page split) when it looks like there are localized monotonically
increasing ranges. This is like a rightmost 90:10 page split, except
the insertion point is not at the rightmost page on the level -- it's
rightmost within some local grouping of values.

This makes the two largest TPC-C indexes *much* smaller. Previously,
they were shrunk by a little over 5% by using the new generic
strategy, a win that now seems like small potatoes. With this new
mode, TPC-C's order_line primary key, which is the largest index of
all, is ~45% smaller following a standard initial bulk load at
scalefactor 50. It shrinks from 99,085 blocks (774.10 MiB) to 55,020
blocks (429.84 MiB). It's actually slightly smaller than it would be
after a fresh REINDEX with the new strategy. We see almost as big a
win with the second largest TPC-C index, the stock table's primary key
-- it's ~40% smaller.

Here is the definition of the biggest index, the order line primary key index:

pg@tpcc[3666]=# \d order_line_pkey
Index "public.order_line_pkey"
Column │ Type │ Key? │ Definition
───────────┼─────────┼──────┼────────────
ol_w_id │ integer │ yes │ ol_w_id
ol_d_id │ integer │ yes │ ol_d_id
ol_o_id │ integer │ yes │ ol_o_id
ol_number │ integer │ yes │ ol_number
primary key, btree, for table "public.order_line"

The new strategy/mode works very well because we see monotonically
increasing inserts on ol_number (an order's item number), but those
are grouped by order. It's kind of an adversarial case for our
existing implementation, and yet it seems like it's probably a fairly
common scenario in the real world.

Obviously these are very significant improvements. They really exceed
my initial expectations for the patch. TPC-C is generally considered
to be by far the most influential database benchmark of all time, and
this is something that we need to pay more attention to. My sense is
that the TPC-C benchmark is deliberately designed to almost require
that the system under test have this "multiple insertion point" B-Tree
optimization, suffix truncation, etc. This is exactly the same index
that we've seen reports of out of control bloat on when people run
TPC-C over hours or days [2].

My next task is to find heuristics to make the new page split
mode/strategy kick in when it's likely to help, but not kick in when
it isn't (when we want something close to a generic 50:50 page split).
These heuristics should look similar to what I've already done to get
cases with lots of duplicates to behave sensibly. Anyone have any
ideas on how to do this? I might end up inferring a "multiple
insertion point" case from the fact that there are multiple
pass-by-value attributes for the index, with the new/incoming tuple
having distinct-to-immediate-left-tuple attribute values for the last
column, but not the first few. It also occurs to me to consider the
fragmentation of the page as a guide, though I'm less sure about that.
I'll probably need to experiment with a variety of datasets before I
settle on something that looks good. Forcing the new strategy without
considering any of this actually works surprisingly well on cases
where you'd think it wouldn't, since a 50:50 page split is already
something of a guess about where future insertions will end up.

[1] https://postgr.es/m/CACjxUsN5fV0kV=YirXwA0S7LqoOJuy7soPtipDhUCemhgwoVFg@mail.gmail.com
[2] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/
--
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Eisentraut

Дата:

28 сентября 2018 г., 20:50:33

On 19/09/2018 20:23, Peter Geoghegan wrote:
> Attached is v5,

So.  I don't know much about the btree code, so don't believe anything I
say.

I was very interested in the bloat test case that you posted on
2018-07-09 and I tried to understand it more.  The current method for
inserting a duplicate value into a btree is going to the leftmost point
for that value and then move right until we find some space or we get
"tired" of searching, in which case just make some space right there.
The problem is that it's tricky to decide when to stop searching, and
there are scenarios when we stop too soon and repeatedly miss all the
good free space to the right, leading to bloat even though the index is
perhaps quite empty.

I tried playing with the getting-tired factor (it could plausibly be a
reloption), but that wasn't very successful.  You can use that to
postpone the bloat, but you won't stop it, and performance becomes terrible.

You propose to address this by appending the tid to the index key, so
each key, even if its "payload" is a duplicate value, is unique and has
a unique place, so we never have to do this "tiresome" search.  This
makes a lot of sense, and the results in the bloat test you posted are
impressive and reproducible.

I tried a silly alternative approach by placing a new duplicate key in a
random location.  This should be equivalent since tids are effectively
random.  I didn't quite get this to fully work yet, but at least it
doesn't blow up, and it gets the same regression test ordering
differences for pg_depend scans that you are trying to paper over. ;-)

As far as the code is concerned, I agree with Andrey Lepikhov that one
more abstraction layer that somehow combines the scankey and the tid or
some combination like that would be useful, instead of passing the tid
as a separate argument everywhere.

I think it might help this patch move along if it were split up a bit,
for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
 That way, it would also be easier to test out each piece separately.
For example, how much space does suffix truncation save in what
scenario, are there any performance regressions, etc.  In the last few
versions, the patches have still been growing significantly in size and
functionality, and most of the supposed benefits are not readily visible
in tests.

And of course we need to think about how to handle upgrades, but you
have already started a separate discussion about that.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

29 сентября 2018 г., 00:08:07

On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> So.  I don't know much about the btree code, so don't believe anything I
> say.

I think that showing up and reviewing this patch makes you somewhat of
an expert, by default. There just isn't enough expertise in this area.

> I was very interested in the bloat test case that you posted on
> 2018-07-09 and I tried to understand it more.

Up until recently, I thought that I would justify the patch primarily
as a project to make B-Trees less bloated when there are many
duplicates, with maybe as many as a dozen or more secondary benefits.
That's what I thought it would say in the release notes, even though
the patch was always a broader strategic thing. Now I think that the
TPC-C multiple insert point bloat issue might be the primary headline
benefit, though.

I hate to add more complexity to get it to work well, but just look at
how much smaller the indexes are following an initial bulk load (bulk
insertions) using my working copy of the patch:

Master

customer_pkey: 75 MB
district_pkey: 40 kB
idx_customer_name: 107 MB
item_pkey: 2216 kB
new_order_pkey: 22 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 60 MB
oorder_pkey: 78 MB
order_line_pkey: 774 MB
stock_pkey: 181 MB
warehouse_pkey: 24 kB

Patch

customer_pkey: 50 MB
district_pkey: 40 kB
idx_customer_name: 105 MB
item_pkey: 2216 kB
new_order_pkey: 12 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 61 MB
oorder_pkey: 42 MB
order_line_pkey: 429 MB
stock_pkey: 111 MB
warehouse_pkey: 24 kB

All of the indexes used by oltpbench to do TPC-C are listed, so you're
seeing the full picture for TPC-C bulk loading here (actually, there
is another index that has an identical definition to
oorder_o_w_id_o_d_id_o_c_id_o_id_key for some reason, which is omitted
as redundant). As you can see, all the largest indexes are
*significantly* smaller, with the exception of
oorder_o_w_id_o_d_id_o_c_id_o_id_key. You won't be able to see this
improvement until I post the next version, though, since this is a
brand new development. Note that VACUUM hasn't been run at all, and
doesn't need to be run, as there are no dead tuples. Note also that
this has *nothing* to do with getting tired -- almost all of these
indexes are unique indexes.

Note that I'm also testing TPC-E and TPC-H in a very similar way,
which have both been improved noticeably, but to a degree that's much
less compelling than what we see with TPC-C. They have "getting tired"
cases that benefit quite a bit, but those are the minority.

Have you ever used HammerDB? I got this data from oltpbench, but I
think that HammerDB might be the way to go with TPC-C testing
Postgres.

> You propose to address this by appending the tid to the index key, so
> each key, even if its "payload" is a duplicate value, is unique and has
> a unique place, so we never have to do this "tiresome" search.This
> makes a lot of sense, and the results in the bloat test you posted are
> impressive and reproducible.

Thanks.

> I tried a silly alternative approach by placing a new duplicate key in a
> random location.  This should be equivalent since tids are effectively
> random.

You're never going to get any other approach to work remotely as well,
because while the TIDs may seem to be random in some sense, they have
various properties that are very useful from a high level, data life
cycle point of view. For insertions of duplicates, heap TID has
temporal locality --  you are only going to dirty one or two leaf
pages, rather than potentially dirtying dozens or hundreds.
Furthermore, heap TID is generally strongly correlated with primary
key values, so VACUUM can be much much more effective at killing
duplicates in low cardinality secondary indexes when there are DELETEs
with a range predicate on the primary key. This is a lot more
realistic than the 2018-07-09 test case, but it still could make as
big of a difference.

>  I didn't quite get this to fully work yet, but at least it
> doesn't blow up, and it gets the same regression test ordering
> differences for pg_depend scans that you are trying to paper over. ;-)

FWIW, I actually just added to the papering over, rather than creating
a new problem. There are plenty of instances of "\set VERBOSITY terse"
in the regression tests already, for the same reason. If you run the
regression tests with ignore_system_indexes=on, there are very similar
failures [1].

> As far as the code is concerned, I agree with Andrey Lepikhov that one
> more abstraction layer that somehow combines the scankey and the tid or
> some combination like that would be useful, instead of passing the tid
> as a separate argument everywhere.

I've already drafted this in my working copy. It is a clear
improvement. You can expect it in the next version.

> I think it might help this patch move along if it were split up a bit,
> for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
> That way, it would also be easier to test out each piece separately.
> For example, how much space does suffix truncation save in what
> scenario, are there any performance regressions, etc.

I'll do my best. I don't think I can sensibly split out suffix
truncation from the TID stuff -- those seem truly inseparable, since
my mental model for suffix truncation breaks without fully unique
keys. I can break out all the cleverness around choosing a split point
into its own patch, though -- _bt_findsplitloc() has only been changed
to give weight to several factors that become important. It's the
"brain" of the optimization, where 90% of the complexity actually
lives.

Removing the _bt_findsplitloc() changes will make the performance of
the other stuff pretty poor, and in particular will totally remove the
benefit for cases that "become tired" on the master branch. That could
be slightly interesting, I suppose.

> In the last few
> versions, the patches have still been growing significantly in size and
> functionality, and most of the supposed benefits are not readily visible
> in tests.

I admit that this patch has continued to evolve up until this week,
despite the fact that I thought it would be a lot more settled by now.
It has actually become simpler in recent months, though. And, I think
that the results justify the iterative approach I've taken. This stuff
is inherently very subtle, and I've had to spend a lot of time paying
attention to tiny regressions across a fairly wide variety of test
cases.

> And of course we need to think about how to handle upgrades, but you
> have already started a separate discussion about that.

Right.

[1] https://postgr.es/m/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-MvcM_pHN_w@mail.gmail.com
--
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Andrey Lepikhov

Дата:

29 сентября 2018 г., 11:58:00

28.09.2018 23:08, Peter Geoghegan wrote:
> On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> I think it might help this patch move along if it were split up a bit,
>> for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
>> That way, it would also be easier to test out each piece separately.
>> For example, how much space does suffix truncation save in what
>> scenario, are there any performance regressions, etc.
> 
> I'll do my best. I don't think I can sensibly split out suffix
> truncation from the TID stuff -- those seem truly inseparable, since
> my mental model for suffix truncation breaks without fully unique
> keys. I can break out all the cleverness around choosing a split point
> into its own patch, though -- _bt_findsplitloc() has only been changed
> to give weight to several factors that become important. It's the
> "brain" of the optimization, where 90% of the complexity actually
> lives.
> 
> Removing the _bt_findsplitloc() changes will make the performance of
> the other stuff pretty poor, and in particular will totally remove the
> benefit for cases that "become tired" on the master branch. That could
> be slightly interesting, I suppose.

I am reviewing this patch too. And join to Peter Eisentraut opinion 
about splitting the patch into a hierarchy of two or three patches: 
"functional" - tid stuff and "optimizational" - suffix truncation & 
splitting. My reasons are simplification of code review, investigation 
and benchmarking.
Now benchmarking is not clear. Possible performance degradation from TID 
ordering interfere with positive effects from the optimizations in 
non-trivial way.

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

01 октября 2018 г., 03:33:15

On Fri, Sep 28, 2018 at 10:58 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> I am reviewing this patch too. And join to Peter Eisentraut opinion
> about splitting the patch into a hierarchy of two or three patches:
> "functional" - tid stuff and "optimizational" - suffix truncation &
> splitting. My reasons are simplification of code review, investigation
> and benchmarking.

As I mentioned to Peter, I don't think that I can split out the heap
TID stuff from the suffix truncation stuff. At least not without
making the patch even more complicated, for no benefit. I will split
out the "brain" of the patch (the _bt_findsplitloc() stuff, which
decides on a split point using sophisticated rules) from the "brawn"
(the actually changes to how index scans work, including the heap TID
stuff, as well as the code for actually physically performing suffix
truncation). The brain of the patch is where most of the complexity
is, as well as most of the code. The brawn of the patch is _totally
unusable_ without intelligence around split points, but I'll split
things up along those lines anyway. Doing so should make the whole
design a little easier to see follow.

> Now benchmarking is not clear. Possible performance degradation from TID
> ordering interfere with positive effects from the optimizations in
> non-trivial way.

Is there any evidence of a regression in the last 2 versions? I've
been using pgbench, which didn't show any. That's not a sympathetic
case for the patch, though it would be nice to confirm if there was
some small improvement there. I've seen contradictory results (slight
improvements and slight regressions), but that was with a much earlier
version, so it just isn't relevant now. pgbench is mostly interesting
as a thing that we want to avoid regressing.

Once I post the next version, it would be great if somebody could use
HammerDB's OLTP test, which seems like the best fair use
implementation of TPC-C that's available. I would like to make that
the "this is why you should care, even if you happen to not believe in
the patch's strategic importance" benchmark. TPC-C is clearly the most
influential database benchmark ever, so I think that that's a fair
request. (See the TPC-C commentary at
https://www.hammerdb.com/docs/ch03s02.html, for example.)

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

04 октября 2018 г., 05:39:24

On Sun, Sep 30, 2018 at 2:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Now benchmarking is not clear. Possible performance degradation from TID
> > ordering interfere with positive effects from the optimizations in
> > non-trivial way.
>
> Is there any evidence of a regression in the last 2 versions?

I did find a pretty clear regression, though only with writes to
unique indexes. Attached is v6, which fixes the issue. More on that
below.

v6 also:

* Adds a new-to-v6 "insert at new item's insertion point"
optimization, which is broken out into its own commit.

This *greatly* improves the index bloat situation with the TPC-C
benchmark in particular, even before the benchmark starts (just with
the initial bulk load). See the relevant commit message for full
details, or a couple of my previous mails on this thread. I will
provide my own TPC-C test data + test case to any reviewer that wants
to see this for themselves. It shouldn't be hard to verify the
improvement in raw index size with any TPC-C implementation, though.
Please make an off-list request if you're interested. The raw dump is
1.8GB.

The exact details of when this new optimization kick in and how it
works are tentative. They should really be debated. Reviewers should
try to think of edge cases in which my "heap TID adjacency" approach
could make the optimization kick in when it shouldn't -- cases where
it causes bloat rather than preventing it. I couldn't find any such
regressions, but this code was written very recently.

I should also look into using HammerDB to do a real TPC-C benchmark,
and really put the patch to the test...anybody have experience with
it?

* Generally groups everything into a relatively manageable series of
cumulative improvements, starting with the infrastructure required to
physically truncate tuples correctly, without any of the smarts around
selecting a split point.

The base patch is useless on its own, since it's just necessary to
have the split point selection smarts to see a consistent benefit.
Reviewers shouldn't waste their time doing any real benchmarking with
just the first patch applied.

* Adds a lot of new information to the nbtree README, about the
high-level thought process behind the design, including citing the
classic paper that this patch was primarily inspired by.

* Adds a new, dedicated insertion scan key struct --
BTScanInsert[Data]. This is passed around to a number of different
routines (_btsearch(), _bt_binsrch(), _bt_compare(), etc). This was
suggested by Andrey, and also requested by Peter Eisentraut.

While this BTScanInsert work started out as straightforward
refactoring, it actually led to my discovering and fixing the
regression I mentioned. Previously, I passed a lower bound on a binary
search to _bt_binsrch() within _bt_findinsertloc(). This wasn't nearly
as effective as what the master branch does for unique indexes at the
same point -- it usually manages to reuse a result from an earlier
_bt_binsrch() as the offset for the new tuple, since it has no need to
worry about the new tuple's position *among duplicates* on the page.
In earlier versions of my patch, most of the work of a second binary
search took place, despite being redundant and unnecessary. This
happened for every new insertion into a non-unique index -- I could
easily measure the problem with a simple serial test case. I can see
no regression there against master now, though.

My fix for the regression involves including some mutable state in the
new BTScanInsert struct (within v6-0001-*patch), to explicitly
remember and restore some internal details across two binary searches
against the same leaf page. We now remember a useful lower *and* upper
bound within bt_binsrch(), which is what is truly required to fix the
regression. While there is still a second call to _bt_binsrch() within
_bt_findinsertloc() for unique indexes, it will do no comparisons in
the common case where there are no existing dead duplicate tuples in
the unique index. This means that the number of _bt_compare() calls we
get in this _bt_findinsertloc() unique index path is the same as the
master branch in almost all cases (I instrumented the regression tests
to make sure of this). I also think that having BTScanInsert will ease
things around pg_upgrade support, something that remains an open item.
Changes in this area seem to make everything clearer -- the signature
of _bt_findinsertloc() seemed a bit jumbled to me.

Aside: I think that this BTScanInsert mutable state idea could be
pushed even further in the future. "Dynamic prefix truncation" could
be implemented by taking a similar approach when descending composite
indexes for an index scan (doesn't have to be a unique index). We can
observe that earlier attributes must all be equal to our own scankey's
values once we descend the tree and pass between a pair of pivot
tuples where a common prefix (some number of leading attributes) is
fully equal. It's safe to just not bother comparing these prefix
attributes on lower levels, because we can reason about their values
transitively; _bt_compare() can be told to always skip the first
attribute or two during later/lower-in-the-tree binary searches. This
idea will not be implemented for Postgres v12 by me, though.

--
Peter Geoghegan

Shared_buffers is 10gb iirc. The server has 32gb of memory. Yes, 'public' is the patch case. Sorry for not mentioning it initially.

--
Peter Geoghegan
(Sent from my phone)

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

19 октября 2018 г., 00:10:02

On Thu, Oct 18, 2018 at 1:44 PM Andres Freund <andres@anarazel.de> wrote:
> What kind of backend_flush_after values where you trying?
> backend_flush_after=0 obviously is the default, so I'm not clear on
> that.   How large is the database here, and how high is shared_buffers

I *was* trying backend_flush_after=512kB, but it's
backend_flush_after=0 in the benchmark I posted. See the
"postgres*settings" files.

On the master branch, things looked like this after the last run:

pg@tpcc_oltpbench[15547]=# \dt+
                      List of relations
 Schema │    Name    │ Type  │ Owner │   Size   │ Description
────────┼────────────┼───────┼───────┼──────────┼─────────────
 public │ customer   │ table │ pg    │ 4757 MB  │
 public │ district   │ table │ pg    │ 5240 kB  │
 public │ history    │ table │ pg    │ 1442 MB  │
 public │ item       │ table │ pg    │ 10192 kB │
 public │ new_order  │ table │ pg    │ 140 MB   │
 public │ oorder     │ table │ pg    │ 1185 MB  │
 public │ order_line │ table │ pg    │ 19 GB    │
 public │ stock      │ table │ pg    │ 9008 MB  │
 public │ warehouse  │ table │ pg    │ 4216 kB  │
(9 rows)

pg@tpcc_oltpbench[15547]=# \di+
                                         List of relations
 Schema │                 Name                 │ Type  │ Owner │
Table    │  Size   │ Description
────────┼──────────────────────────────────────┼───────┼───────┼────────────┼─────────┼─────────────
 public │ customer_pkey                        │ index │ pg    │
customer   │ 367 MB  │
 public │ district_pkey                        │ index │ pg    │
district   │ 600 kB  │
 public │ idx_customer_name                    │ index │ pg    │
customer   │ 564 MB  │
 public │ idx_order                            │ index │ pg    │
oorder     │ 715 MB  │
 public │ item_pkey                            │ index │ pg    │ item
     │ 2208 kB │
 public │ new_order_pkey                       │ index │ pg    │
new_order  │ 188 MB  │
 public │ oorder_o_w_id_o_d_id_o_c_id_o_id_key │ index │ pg    │
oorder     │ 715 MB  │
 public │ oorder_pkey                          │ index │ pg    │
oorder     │ 958 MB  │
 public │ order_line_pkey                      │ index │ pg    │
order_line │ 9624 MB │
 public │ stock_pkey                           │ index │ pg    │ stock
     │ 904 MB  │
 public │ warehouse_pkey                       │ index │ pg    │
warehouse  │ 56 kB   │
(11 rows)

> Is it possible that there's new / prolonged cases where a buffer is read
> from disk after the patch? Because that might require doing *write* IO
> when evicting the previous contents of the victim buffer, and obviously
> that can take longer if you're running with backend_flush_after > 0.

Yes, I suppose that that's possible, because the buffer
popularity/usage_count will be affected in ways that cannot easily be
predicted. However, I'm not running with "backend_flush_after > 0"
here -- that was before.

> I wonder if it'd make sense to hack up a patch that logs when evicting a
> buffer while already holding another lwlock. That shouldn't be too hard.

I'll look into this.

Thanks
-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

20 октября 2018 г., 07:51:11

On Thu, Oct 18, 2018 at 1:44 PM Andres Freund <andres@anarazel.de> wrote:
> I wonder if it'd make sense to hack up a patch that logs when evicting a
> buffer while already holding another lwlock. That shouldn't be too hard.

I tried this. It looks like we're calling FlushBuffer() with more than
a single LWLock held (not just the single buffer lock) somewhat *less*
with the patch. This is a positive sign for the patch, but also means
that I'm no closer to figuring out what's going on.

I tested a case with a 1GB shared_buffers + a TPC-C database sized at
about 10GB. I didn't want the extra LOG instrumentation to influence
the outcome.

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Andrey Lepikhov

Дата:

23 октября 2018 г., 13:35:21

On 19.10.2018 0:54, Peter Geoghegan wrote:
> I would welcome any theories as to what could be the problem here. I'm
> think that this is fixable, since the picture for the patch is very
> positive, provided you only focus on bgwriter/checkpoint activity and
> on-disk sizes. It seems likely that there is a very specific gap in my
> understanding of how the patch affects buffer cleaning.

I have same problem with background heap & index cleaner (based on your 
patch). In this case the bottleneck is WAL-record which I need to write 
for each cleaned block and locks which are held during the WAL-record 
writing process.
Maybe you will do a test without writing any data to disk?

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

23 октября 2018 г., 16:09:54

On Tue, Oct 23, 2018 at 11:35 AM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> I have same problem with background heap & index cleaner (based on your
> patch). In this case the bottleneck is WAL-record which I need to write
> for each cleaned block and locks which are held during the WAL-record
> writing process.

Part of the problem here is that v6 uses up to 25 candidate split
points, even during regularly calls to _bt_findsplitloc(). That was
based on some synthetic test-cases. I've found that I can get most of
the benefit in index size with far fewer spilt points, though. The
extra work done with an exclusive buffer lock held will be
considerably reduced in v7. I'll probably post that in a couple of
weeks, since I'm in Europe for pgConf.EU. I don't fully understand the
problems here, but even still I know that what you were testing wasn't
very well optimized for write-heavy workloads. It would be especially
bad with pgbench, since there isn't much opportunity to reduce the
size of indexes there.

> Maybe you will do a test without writing any data to disk?

Yeah, I should test that on its own. I'm particularly interested in
TPC-C, because it's a particularly good target for my patch. I can
find a way of only executing the read TPC-C queries, to see where they
are on their own. TPC-C is particularly write-heavy, especially
compared to the much more recent though less influential TPC-E
benchmark.

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Andrey Lepikhov

Дата:

02 ноября 2018 г., 13:06:32

I do the code review.
Now, it is first patch - v6-0001... dedicated to a logical duplicates 
ordering.

Documentation is full and clear. All non-trivial logic is commented 
accurately.

Patch applies cleanly on top of current master. Regression tests passed 
and my "Retail Indextuple deletion" use cases works without mistakes.
But I have two comments on the code.
New BTScanInsert structure reduces parameters list of many functions and 
look fine. But it contains some optimization part ('restorebinsrch' 
field et al.). It is used very locally in the code - 
_bt_findinsertloc()->_bt_binsrch() routines calling. May be you localize 
this logic into separate struct, which will passed to _bt_binsrch() as 
pointer. Another routines may pass NULL value to this routine. It is may 
simplify usability of the struct.

Due to the optimization the _bt_binsrch() size has grown twice. May be 
you move this to some service routine?


-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

03 ноября 2018 г., 03:00:47

On Fri, Nov 2, 2018 at 3:06 AM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> Documentation is full and clear. All non-trivial logic is commented
> accurately.

Glad you think so.

I had the opportunity to discuss this patch at length with Heikki
during pgConf.EU. I don't want to speak on his behalf, but I will say
that he seemed to understand all aspects of the patch series, and
seemed generally well disposed towards the high level design. The
high-level design is the most important aspect -- B-Trees can be
optimized in many ways, all at once, and we must be sure to come up
with something that enables most or all of them. I really care about
the long term perspective.

That conversation with Heikki eventually turned into a conversation
about reimplementing GIN using the nbtree code, which is actually
related to my patch series (sorted on heap TID is the first step to
optional run length encoding for duplicates). Heikki seemed to think
that we can throw out a lot of the optimizations within GIN, and add a
few new ones to nbtree, while still coming out ahead. This made the
general nbtree-as-GIN idea (which we've been talking about casually
for years) seem a lot more realistic to me. Anyway, he requested that
I support this long term goal by getting rid of the DESC TID sort
order thing -- that breaks GIN-style TID compression. It also
increases the WAL volume unnecessarily when a page is split that
contains all duplicates.

The DESC heap TID sort order thing probably needs to go. I'll probably
have to go fix the regression test failures that occur when ASC heap
TID order is used. (Technically those failures are a pre-existing
problem, a problem that I mask by using DESC order...which is weird.
The problem is masked in the master branch by accidental behaviors
around nbtree duplicates, which is something that deserves to die.
DESC order is closer to the accidental current behavior.)

> Patch applies cleanly on top of current master. Regression tests passed
> and my "Retail Indextuple deletion" use cases works without mistakes.

Cool.

> New BTScanInsert structure reduces parameters list of many functions and
> look fine. But it contains some optimization part ('restorebinsrch'
> field et al.). It is used very locally in the code -
> _bt_findinsertloc()->_bt_binsrch() routines calling. May be you localize
> this logic into separate struct, which will passed to _bt_binsrch() as
> pointer. Another routines may pass NULL value to this routine. It is may
> simplify usability of the struct.

Hmm. I see your point. I did it that way because the knowledge of
having cached an upper and lower bound for a binary search of a leaf
page needs to last for a relatively long time. I'll look into it
again, though.

> Due to the optimization the _bt_binsrch() size has grown twice. May be
> you move this to some service routine?

Maybe. There are some tricky details that seem to work against it.
I'll see if it's possible to polish that some more, though.

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Andrey Lepikhov

Дата:

04 ноября 2018 г., 06:52:35

On 03.11.2018 5:00, Peter Geoghegan wrote:
> The DESC heap TID sort order thing probably needs to go. I'll probably
> have to go fix the regression test failures that occur when ASC heap
> TID order is used. (Technically those failures are a pre-existing
> problem, a problem that I mask by using DESC order...which is weird.
> The problem is masked in the master branch by accidental behaviors
> around nbtree duplicates, which is something that deserves to die.
> DESC order is closer to the accidental current behavior.)

I applied your patches at top of master. After tests corrections 
(related to TID ordering in index relations DROP...CASCADE operation) 
'make check-world' passed successfully many times.
In the case of 'create view' regression test - 'drop cascades to 62 
other objects' problem - I verify an Álvaro Herrera hypothesis [1] and 
it is true. You can verify it by tracking the 
object_address_present_add_flags() routine return value.
Some doubts, however, may be regarding the 'triggers' test.
May you specify the test failures do you mean?

[1] 
https://www.postgresql.org/message-id/20180504022601.fflymidf7eoencb2%40alvherre.pgsql

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

04 ноября 2018 г., 07:26:21

On Fri, Nov 2, 2018 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I had the opportunity to discuss this patch at length with Heikki
> during pgConf.EU.

> The DESC heap TID sort order thing probably needs to go. I'll probably
> have to go fix the regression test failures that occur when ASC heap
> TID order is used.

I've found that TPC-C testing with ASC heap TID order fixes the
regression that I've been concerned about these past few weeks. Making
this change leaves the patch a little bit faster than the master
branch for TPC-C, while still leaving TPC-C indexes about as small as
they were with v6 of the patch (i.e. much smaller). I now get about a
1% improvement in transaction throughput, an improvement that seems
fairly consistent. It seems likely that the next revision of the patch
series will be an unambiguous across the board win for performance. I
think that I come out ahead with ASC heap TID order because that has
the effect of reducing the volume of WAL generated by page splits.
Page splits are already optimized for splitting right, not left.

I should thank Heikki for pointing me in the right direction here.

--
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

04 ноября 2018 г., 07:31:46

On Sat, Nov 3, 2018 at 8:52 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> I applied your patches at top of master. After tests corrections
> (related to TID ordering in index relations DROP...CASCADE operation)
> 'make check-world' passed successfully many times.
> In the case of 'create view' regression test - 'drop cascades to 62
> other objects' problem - I verify an Álvaro Herrera hypothesis [1] and
> it is true. You can verify it by tracking the
> object_address_present_add_flags() routine return value.

I'll have to go and fix the problem directly, so that ASC sort order
can be used.

> Some doubts, however, may be regarding the 'triggers' test.
> May you specify the test failures do you mean?

Not sure what you mean. The order of items that are listed in the
DETAIL for a cascading DROP can have an "implementation defined"
order. I think that this is an example of the more general problem --
what you call the 'drop cascades to 62 other objects' problem is a
more specific subproblem, or, if you prefer, a more specific symptom
of the same problem.

Since I'm going to have to fix the problem head-on, I'll have to study
it in detail anyway.

--
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Andrey Lepikhov

Дата:

04 ноября 2018 г., 19:21:08

On 04.11.2018 9:31, Peter Geoghegan wrote:
> On Sat, Nov 3, 2018 at 8:52 PM Andrey Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> I applied your patches at top of master. After tests corrections
>> (related to TID ordering in index relations DROP...CASCADE operation)
>> 'make check-world' passed successfully many times.
>> In the case of 'create view' regression test - 'drop cascades to 62
>> other objects' problem - I verify an Álvaro Herrera hypothesis [1] and
>> it is true. You can verify it by tracking the
>> object_address_present_add_flags() routine return value.
> 
> I'll have to go and fix the problem directly, so that ASC sort order
> can be used.
> 
>> Some doubts, however, may be regarding the 'triggers' test.
>> May you specify the test failures do you mean?
> 
> Not sure what you mean. The order of items that are listed in the
> DETAIL for a cascading DROP can have an "implementation defined"
> order. I think that this is an example of the more general problem --
> what you call the 'drop cascades to 62 other objects' problem is a
> more specific subproblem, or, if you prefer, a more specific symptom
> of the same problem.

I mean that your code have not any problems that I can detect by 
regression tests and by the retail index tuple deletion patch.
Difference in amount of dropped objects is not a problem. It is caused 
by pos 2293 - 'else if (thisobj->objectSubId == 0)' - at the file 
catalog/dependency.c and it is legal behavior: column row object deleted 
without any report because we already decided to drop its whole table.

Also, I checked the triggers test. Difference in the ERROR message 
'cannot drop trigger trg1' is caused by different order of tuples in the 
relation with the dependDependerIndexId relid. It is legal behavior and 
we can simply replace test results.

May be you know any another problems of the patch?

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

04 ноября 2018 г., 21:58:19

On Sun, Nov 4, 2018 at 8:21 AM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> I mean that your code have not any problems that I can detect by
> regression tests and by the retail index tuple deletion patch.
> Difference in amount of dropped objects is not a problem. It is caused
> by pos 2293 - 'else if (thisobj->objectSubId == 0)' - at the file
> catalog/dependency.c and it is legal behavior: column row object deleted
> without any report because we already decided to drop its whole table.

The behavior implied by using ASC heap TID order is always "legal",
but it may cause a regression in certain functionality -- something
that an ordinary user might complain about. There were some changes
when DESC heap TID order is used too, of course, but those were safe
to ignore (it seemed like nobody could ever care). It might have been
okay to just use DESC order, but since it now seems like I must use
ASC heap TID order for performance reasons, I have to tackle a couple
of these issues head-on (e.g.  'cannot drop trigger trg1').

> Also, I checked the triggers test. Difference in the ERROR message
> 'cannot drop trigger trg1' is caused by different order of tuples in the
> relation with the dependDependerIndexId relid. It is legal behavior and
> we can simply replace test results.

Let's look at this specific "trg1" case:

"""
 create table trigpart (a int, b int) partition by range (a);
 create table trigpart1 partition of trigpart for values from (0) to (1000);
 create trigger trg1 after insert on trigpart for each row execute
procedure trigger_nothing();
 ...
 drop trigger trg1 on trigpart1; -- fail
-ERROR:  cannot drop trigger trg1 on table trigpart1 because trigger
trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart1 because table
trigpart1 requires it
+HINT:  You can drop table trigpart1 instead.
"""

The original hint suggests "you need to drop the object on the
partition parent instead of its child", which is useful. The new hint
suggests "instead of dropping the trigger on the partition child,
maybe drop the child itself!". That's almost an insult to the user.

Now, I suppose that I could claim that it's not my responsibility to
fix this, since we get the useful behavior only due to accidental
implementation details. I'm not going to take that position, though. I
think that I am obliged to follow both the letter and the spirit of
the law. I'm almost certain that this regression test was written
because somebody specifically cared about getting the original, useful
message. The underlying assumptions may have been a bit shaky, but we
all know how common it is for software to evolve to depend on
implementation-defined details. We've all written code that does it,
but hopefully it didn't hurt us much because we also wrote regression
tests that exercised the useful behavior.

> May be you know any another problems of the patch?

Just the lack of pg_upgrade support. That is progressing nicely,
though. I'll probably have that part in the next revision of the
patch. I've found what looks like a workable approach, though I need
to work on a testing strategy for pg_upgrade.

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

13 ноября 2018 г., 04:47:45

On Sun, Nov 4, 2018 at 10:58 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Just the lack of pg_upgrade support.

Attached is v7 of the patch series. Changes:

* Pre-pg_upgrade indexes (indexes of an earlier BTREE_VERSION) are now
supported. Using pg_upgrade will be seamless to users. "Getting tired"
returns, for the benefit of old indexes that regularly have lots of
duplicates inserted.

Notably, the new/proposed version of btree (BTREE_VERSION 4) cannot be
upgraded on-the-fly -- we're changing more than the contents of the
metapage, so that won't work. Version 2 -> version 3 upgrades can
still take place dynamically/on-the-fly. It you want to upgrade to
version 4, you'll need to REINDEX. The performance of the patch with
pg_upgrade'd indexes has been validated. There doesn't seem to be any
regressions.

amcheck checks both the old invariants, and the new/stricter/L&Y
invariants. Which set are checked depends on the btree version of the
index undergoing verification.

* ASC heap TID order is now used -- not DESC order, as before. This
fixed all performance regressions that I'm aware of, and seems quite a
lot more elegant overall.

I believe that the patch series is now an unambiguous, across the
board win for performance. I could see about a 1% increase in
transaction throughput with my own TPC-C tests, while the big drop in
the size of indexes was preserved. pgbench testing also showed as much
as a 3.5% increase in transaction throughput in some cases with
non-uniform distributions. Thanks for the suggestion, Heikki!

Unfortunately, and as predicted, this change created a new problem
that I need to fix directly: it makes certain diagnostic messages that
accidentally depend on a certain pg_depend scan order say something
different, and less useful (though still technically correct). I'll
tackle that problem over on the dedicated thread I started [1]. (For
now, I include a separate patch to paper over questionable regression
test changes in a controlled way:
v7-0005-Temporarily-paper-over-problematic-regress-output.patch.)

* New optimization that has index scans avoid visiting the next page
by checking the high key -- this is broken out into its own commit
(v7-0002-Weigh-suffix-truncation-when-choosing-a-split-poi.patch).

This is related to an optimization that has been around for years --
we're now using the high key, rather than using a normal (non-pivot)
index tuple. High keys are much more likely to indicate that the scan
doesn't need to visit the next page with the earlier patches in the
patch series applied, since the new logic for choosing a split point
favors a high key with earlier differences. It's pretty easy to take
advantage of that. With a composite index, or a secondary index, it's
particularly likely that we can avoid visiting the next leaf page. In
other words, now that we're being smarter about future locality of
access during page splits, we should take full advantage during index
scans.

The v7-0001-Make-nbtree-indexes-have-unique-keys-in-tuples.patch
commit uses a _bt_lowest_scantid() sentinel value to avoid
unnecessarily visiting a page to the left of the page we actually
ought to go to directly during a descent of a B-Tree -- that
optimization was around in all earlier versions of the patch series.
It seems natural to also have this new-to-v7 optimization. It avoids
unnecessarily going right once we reach the leaf level, so it "does
the same thing on the right side" -- the two optimizations mirror each
other. If you don't get what I mean by that, then imagine a secondary
index where each value appears a few hundred times. Literally every
simple lookup query will either benefit from the first optimization on
the way down the tree, or from the second optimization towards the end
of the scan. (The page split logic ought to pack large groups of
duplicates together, ideally confining them to one leaf page.)

Andrey: the BTScanInsert struct still has the restorebinsrch stuff
(mutable binary search optimization state) in v7. It seemed to make
sense to keep it there, because I think that we'll be able to add
similar optimizations in the future, that use similar mutable state.
See my remarks on "dynamic prefix truncation" [2]. I think that that
could be very helpful with skip scans, for example, so we'll probably
end up adding it before too long. I hope you don't feel too strongly
about it.

[1] https://postgr.es/m/CAH2-Wzkypv1R+teZrr71U23J578NnTBt2X8+Y=Odr4pOdW1rXg@mail.gmail.com
[2] https://postgr.es/m/CAH2-WzkpKeZJrXvR_p7VSY1b-s85E3gHyTbZQzR0BkJ5LrWF_A@mail.gmail.com
--
Peter Geoghegan

Вложения

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

25 ноября 2018 г., 02:13:42

Attached is v8 of the patch series, which has some relatively minor changes:

* A new commit adds an artificial tie-breaker column to pg_depend
indexes, comprehensively solving the issues with regression test
instability. This is the only really notable change.

* Clean-up of how the design in described in the nbtree README, and
elsewhere. I want to make it clear that we're now more or less using
the Lehman and Yao design. I re-read the Lehman and Yao paper to make
sure that the patch acknowledges what Lehman and Yao say to expect, at
least in cases that seemed to matter.

* Stricter verification by contrib/amcheck. Not likely to catch a case
that wouldn't have been caught by previous revisions, but should make
the design a bit clearer to somebody following L&Y.

* Tweaks to how _bt_findsplitloc() accumulates candidate split points.
We're less aggressive in choosing a smaller tuple during an internal
page split in this revision.

The overall impact of the pg_depend change is that required regression
test output changes are *far* less numerous than they were in v7.
There are now only trivial differences in the output order of items.
And, there are very few diagnostic message changes overall -- we see
exactly 5 changes now, rather than dozens. Importantly, there is no
longer any question about whether I could make diagnostic messages
less useful to users, because the existing behavior for
findDependentObjects() is retained. This is an independent
improvement, since it fixes an independent problem with test
flappiness that we've been papering-over for some time [2] -- I make
the required order actually-deterministic, removing heap TID ordering
as a factor that can cause seemingly-random regression test failures
on slow/overloaded buildfarm animals.

Robert Haas remarked that he thought that the pg_depend index
tie-breaker commit's approach is acceptable [1] -- see the other
thread that Robert weighed in on for all the gory details. The patch's
draft commit message may also be interesting. Note that adding a new
column turns out to have *zero* storage overhead, because we only ever
end up filling up space that was already getting lost to alignment.

The pg_depend thing is clearly a kludge. It's ugly, though in no small
part because it acknowledges the existing reality of how
findDependentObjects() already depends on scan order. I'm optimistic
that I'll be able to push this groundwork commit before too long; it
doesn't hinge on whether or not the nbtree patches are any good.

[1] https://postgr.es/m/CA+TgmoYNeFxdPimiXGL=tCiCXN8zWosUFxUfyDBaTd2VAg-D9w@mail.gmail.com
[2] https://postgr.es/m/11852.1501610262%40sss.pgh.pa.us
--
Peter Geoghegan

Вложения

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Dmitry Dolgov

Дата:

01 декабря 2018 г., 15:11:01

> On Sun, Nov 25, 2018 at 12:14 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> Attached is v8 of the patch series, which has some relatively minor changes:

Thank you for working on this patch,

Just for the information, cfbot says there are problems on windows:

src/backend/catalog/pg_depend.c(33): error C2065: 'INT32_MAX' :
undeclared identifier

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

02 декабря 2018 г., 05:16:23

On Sat, Dec 1, 2018 at 4:10 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Just for the information, cfbot says there are problems on windows:
>
> src/backend/catalog/pg_depend.c(33): error C2065: 'INT32_MAX' :
> undeclared identifier

Thanks. Looks like I should have used PG_INT32_MAX.

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

04 декабря 2018 г., 06:10:37

On Sat, Dec 1, 2018 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Thanks. Looks like I should have used PG_INT32_MAX.

Attached is v9, which does things that way. There are no interesting
changes, though I have set things up so that a later patch in the
series can add "dynamic prefix truncation" -- I do not include any
such patch in v9, though. I'm going to start a new thread on that
topic, and include the patch there, since it's largely unrelated to
this work, and in any case still isn't in scope for Postgres 12 (the
patch is still experimental, for reasons that are of general
interest). If nothing else, Andrey and Peter E. will probably get a
better idea of why I thought that an insertion scan key was a good
place to put mutable state if they go read that other thread -- there
really was a bigger picture to setting things up that way.

--
Peter Geoghegan

On Fri, Dec 28, 2018 at 3:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > I'm envisioning that you have an array, with one element for each item
> > on the page (including the tuple we're inserting, which isn't really on
> > the page yet). In the first pass, you count up from left to right,
> > filling the array. Next, you compute the complete penalties, starting
> > from the middle, walking outwards.

> Ah, right. I think I see what you mean now.

> Leave it with me. I'll need to think about this some more.

Attached is v10 of the patch series, which has many changes based on
your feedback. However, I didn't end up refactoring _bt_findsplitloc()
in the way you described, because it seemed hard to balance all of the
concerns there. I still have an open mind on this question, and
recognize the merit in what you suggested. Perhaps it's possible to
reach a compromise here.

I did refactor the _bt_findsplitloc() stuff to make the division of
work clearer, though -- I think that you'll find that to be a clear
improvement, even though it's less than what you asked for. I also
moved all of the _bt_findsplitloc() stuff (old and new) into its own
.c file, nbtsplicloc.c, as you suggested.

Other significant changes
=========================

* Creates a new commit that changes routines like _bt_search() and
_bt_binsrch() to use a dedicated insertion scankey struct, per request
from Heikki.

* As I mentioned in passing, many other small changes to comments, the
nbtree README, and the commit messages based on your (Heikki's) first
round of review.

* v10 generalizes the previous _bt_lowest_scantid() logic for adding a
tie-breaker on equal pivot tuples during a descent of a B-Tree.

The new code works with any truncated attribute, not just a truncated
heap TID (I removed _bt_lowest_scantid() entirely). This also allowed
me to remove a couple of places that previously opted in to
_bt_lowest_scantid(), since the new approach can work without anybody
explicitly opting in. As a bonus, the new approach makes the patch
faster, since remaining queries where we unnecessarily follow an
equal-though-truncated downlink are fixed (it's usually only the heap
TID that's truncated when we can do this, but not always).

The idea behind this new generalized approach is to recognize that
minus infinity is an artificial/sentinel value that doesn't appear in
real keys (it only appears in pivot tuples). The majority of callers
(all callers aside from VACUUM's leaf page deletion code) can
therefore go to the right of a pivot that has all-equal attributes, if
and only if:

1. The pivot has at least one truncated/minus infinity attribute *and*

2. The number of attributes matches the scankey.

In other words, we tweak the comparison logic to add a new
tie-breaker. There is no change to the on-disk structures compared to
v9 of the patch -- I've only made index scans able to take advantage
of minus infinity values in *all* cases.

If this explanation is confusing to somebody less experienced with
nbtree than Heikki: consider the way we descend *between* the values
on internal pages, rather than expecting exact matches. _bt_binsrch()
behaves slightly differently when doing a binary search on an internal
page already: equality actually means "go left" when descending the
tree (though it doesn't work like that on leaf pages, where insertion
scankeys almost always search for a >= match). We want to "go right"
instead in cases where it's clear that tuples of interest to our scan
can only be in that child page (we're rarely searching for a minus
infinity value, since that doesn't appear in real tuples). (Note that
this optimization has nothing to do with "moving right" to recover
from concurrent page splits -- we would have relied on code like
_bt_findsplicloc() and _bt_readpage() to move right once we reach the
leaf level when we didn't have this optimization, but that code isn't
concerned with recovering from concurrent page splits.)

Minor changes
=============

* Addresses Heikki's concerns about locking the metapage more
frequently in a general way. Comments are added to nbtpage.c, and
updated in a number of places that already talk about the same risk.

The master branch seems to be doing much the same thing in similar
situations already (e.g. during a root page split, when we need to
finish an interrupted page split but don't have a usable
parent/ancestor page stack). Importantly, the patch does not change
the dependency graph.

* Small changes to user docs where existing descriptions of things
seem to be made inaccurate by the patch.

Benchmarking
============

I have also recently been doing a lot of automated benchmarking. Here
are results of a BenchmarkSQL benchmark (plus various instrumentation)
as a bz2 archive:

https://drive.google.com/file/d/1RVJUzMtMNDi4USg0-Yo56LNcRItbFg1Q/view?usp=sharing

I completed on my home server last night, against v10 of the patch
series. Note that there were 4 runs for each case (master case +
public/patch case), with each run lasting 2 hours (so the benchmark
took over 8 hours once you include bulk loading time). There were 400
"warehouses" (this is similar to pgbench's scale factor), and 16
terminals/clients. This left the database 110GB+ in size on a server
with 32GB of memory and a fast consumer grade SSD. Autovacuum was
tuned to perform aggressive cleanup of bloat. All the settings used
are available in the bz2 archive (there are "settings" output files,
too).

Summary
-------

See the html "report" files for a quick visual indication of how the
tests progresses. BenchmarkSQL uses R to produce useful graphs, which
is quite convenient. (I have automated a lot of this with my own ugly
shellscript.)

We see a small but consistent increase in transaction throughput here,
as well as a small but consistent decrease in average latency for each
class of transaction. There is also a large and consistent decrease in
the on-disk size of indexes, especially if you just consider the
number of internal pages (diff the "balance" files to see what I
mean). Note that the performance is expected to degrade across runs,
since the database is populated once, at the start, and has more data
added over time; the important thing is that run n on master be
compared to run n on public/patch. Note also that I use my own fork of
BenchmarkSQL that does its CREATE INDEX before initial bulk loading,
not after [1]. It'll take longer to see problems on Postgres master if
the initial bulk load does CREATE INDEX after BenchmarkSQL workers
populate tables (we only need INSERTs to see significant index bloat).
Avoiding pristine indexes at the start of the benchmark makes the
problems on the master branch apparent sooner.

The benchmark results also include things like pg_statio* +
pg_stat_bgwriter view output (reset between test runs), which gives
some insight into what's going on. Checkpoints tend to write out a few
more dirty buffers with the patch, while there is a much larger drop
in the number of buffers written out by backends. There are probably
workloads where we'd see a much larger increase in transaction
throughput -- TPC-C happens to access index pages with significant
locality, and happens to be very write-heavy, especially compared to
the more modern (though less influential) TPC-E benchmark. Plus, the
TPC-C workload isn't at all helped by the fact that the patch will
never "get tired", even though that's the most notable improvement
overall.

[1] https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations
--
Peter Geoghegan

Вложения

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

24 января 2019 г., 04:44:41

On Tue, Jan 8, 2019 at 4:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v10 of the patch series, which has many changes based on
> your feedback. However, I didn't end up refactoring _bt_findsplitloc()
> in the way you described, because it seemed hard to balance all of the
> concerns there. I still have an open mind on this question, and
> recognize the merit in what you suggested. Perhaps it's possible to
> reach a compromise here.

> * Addresses Heikki's concerns about locking the metapage more
> frequently in a general way. Comments are added to nbtpage.c, and
> updated in a number of places that already talk about the same risk.

Attached is v11 of the patch, which removes the comments mentioned
here, and instead finds a way to not do new things with buffer locks.

Changes
=======

* We simply avoid holding buffer locks while accessing the metapage.
(Of course, the old root page split stuff still does this -- it has
worked that way forever.)

* We also avoid calling index_getprocinfo() with any buffer lock held.
We'll still call support function 1 with a buffer lock held to
truncate, but that's not new -- *any* insertion will do that.

For some reason I got stuck on the idea that we need to use a
scankey's own values within _bt_truncate()/_bt_keep_natts() by
generating a new insertion scankey every time. We now simply ignore
those values, and call the comparator with pairs of tuples that each
come from the page directly. Usually, we'll just reuse the insertion
scankey that we were using for the insertion anyway (we no longer
build our own scankey for truncation). Other times, we'll build an
empty insertion scankey (one that has the function pointer and so on,
but no values). The only downside is that I cannot have an assertion
that calls _bt_compare() to make sure we truncated correctly there and
then, since a dedicated insertion scankey is no longer conveniently
available.

I feel rather silly for not having gone this way from the beginning,
because the new approach is quite obviously simpler and safer.
nbtsort.c now gets a reusable, valueless insertion scankey that it
uses for both truncation and for setting up a merge of the two spools
for unique index builds. This approach allows me to remove
_bt_mkscankey_nodata() altogether -- callers build a "nodata"
insertion scankey with empty values by passing _bt_mkscankey() a NULL
tuple. That's equivalent to having an insertion scankey built from an
all-attributes-truncated tuple. IOW, the patch now makes the "nodata"
stuff a degenerate case of building a scankey from a
truncated-attributes tuple. tuplesort.c also uses the new BTScanInsert
struct. There is no longer any case where there in an insertion
scankey that isn't accessed using the BTScanInsert struct.

* No more pg_depend tie-breaker column commit. That was an ugly hack,
that I'm glad to be rid of -- many thanks to Tom for working through a
number of test instability issues that affected the patch. I do still
need to paper-over one remaining regression test issue/bug that the
patch happens to unmask, pending Tom fixing it directly [1]. This
papering-over is broken out into its own commit
("v11-0002-Paper-over-DEPENDENCY_INTERNAL_AUTO-bug-failures.patch"). I
expect that Tom will fix the bug before too long, at which point the
temporary work around can just be reverted from your local tree.

Outlook
=======

I feel that this version is pretty close to being commitable, since
everything about the design is settled. It completely avoids saying
anything new about buffer locking protocols, LWLock deadlock safety,
etc. VACUUM and crash recovery are also unchanged, so subtle bugs
should at least not be too hard to reproduce when observed once. It's
pretty complementary code: the new logic for picking a split point
builds a list of candidate split points using the old technique, with
a second pass to choose the best one for suffix truncation among the
accumulated list. Hard to see how that could introduce an invalid
split point choice.

I take the risk of introducing new corruption bugs very seriously.
contrib/amcheck now verifies all aspects of the new on-disk
representation. The stricter Lehman & Yao style invariant ("the
subtree S is described by Ki < v <= Ki + 1 ...") allows amcheck to be
stricter in what it will accept (e.g., heap TID needs to be in order
among logical duplicates, we always expect to see a representation of
the number of pivot tuple attributes, and we expect the high key to be
strictly greater than items on internal pages).

Review
======

It would be very helpful if a reviewer such as Heikki or Alexander
could take a look at the patch once more. I suggest that they look at
the following points in the patch:

*  The minusinfkey stuff, which is explained within _bt_compare(), and
within nbtree.h header comments. Page deletion by VACUUM is the only
_bt_search() caller that sets minusinfkey to true (though older
versions of btree and amcheck also set minusinfkey to true).

* Does the value of BTREE_SINGLEVAL_FILLFACTOR make sense? Am I being
a little too aggressive there, possibly hurting workloads where HOT
pruning occurs periodically? Sane duplicate handling is the most
compelling improvement that the patch makes, but I may still have been
a bit too aggressive in packing pages full of duplicates so tightly. I
figured that that was the closest thing to the previous behavior
that's still reasonable.

* Does the _bt_splitatnewitem() criteria for deciding if we should
split at the point the new tuple is positioned at miss some subtlety?
It's important that splitting at the new item when we shouldn't
doesn't happen, or hardly ever happens -- it should be
*self-limiting*. This was tested using BenchmarkSQL/TPC-C [2] -- TPC-C
has a workload where this particular enhancement makes indexes a lot
smaller.

* There was also testing of index bloat following bulk insertions,
based on my own custom test suite. Data and indexes were taken from
TPC-C tables, TPC-H tables, TPC-E tables, UK land registry data [3],
and the Mouse Genome Database Project (Postgres schema + indexes) [4].
This takes almost an hour to run on my development machine, though the
most important tests finish in less than 5 minutes. I can provide
access to all or some of these tests, if reviewers are interested and
are willing to download several gigabytes of sample data that I'll
provide privately.

[1] https://postgr.es/m/19220.1547767251@sss.pgh.pa.us
[2] https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations
[3] https://wiki.postgresql.org/wiki/Sample_Databases
[4] http://www.informatics.jax.org/software.shtml
--
Peter Geoghegan

Вложения

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Heikki Linnakangas

Дата:

28 января 2019 г., 18:31:56

On 09/01/2019 02:47, Peter Geoghegan wrote:
> On Fri, Dec 28, 2018 at 3:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> I'm envisioning that you have an array, with one element for each item
>>> on the page (including the tuple we're inserting, which isn't really on
>>> the page yet). In the first pass, you count up from left to right,
>>> filling the array. Next, you compute the complete penalties, starting
>>> from the middle, walking outwards.
> 
>> Ah, right. I think I see what you mean now.
> 
>> Leave it with me. I'll need to think about this some more.
> 
> Attached is v10 of the patch series, which has many changes based on
> your feedback. However, I didn't end up refactoring _bt_findsplitloc()
> in the way you described, because it seemed hard to balance all of the
> concerns there. I still have an open mind on this question, andAt a 
> recognize the merit in what you suggested. Perhaps it's possible to
> reach a compromise here.

I spent some time first trying to understand the current algorithm, and 
then rewriting it in a way that I find easier to understand. I came up 
with the attached. I think it optimizes for the same goals as your 
patch, but the approach  is quite different. At a very high level, I 
believe the goals can be described as:

1. Find out how much suffix truncation is possible, i.e. how many key 
columns can be truncated away, in the best case, among all possible ways 
to split the page.

2. Among all the splits that achieve that optimum suffix truncation, 
find the one with smallest "delta".

For performance reasons, it doesn't actually do it in that order. It's 
more like this:

1. First, scan all split positions, recording the 'leftfree' and 
'rightfree' at every valid split position. The array of possible splits 
is kept in order by offset number. (This scans through all items, but 
the math is simple, so it's pretty fast)

2. Compute the optimum suffix truncation, by comparing the leftmost and 
rightmost keys, among all the possible split positions.

3. Split the array of possible splits in half, and process both halves 
recursively. The recursive process "zooms in" to the place where we'd 
expect to find the best candidate, but will ultimately scan through all 
split candidates, if no "good enough" match is found.

I've only been testing this on leaf splits. I didn't understand how the 
penalty worked for internal pages in your patch. In this version, the 
same algorithm is used for leaf and internal pages. I'm sure this still 
has bugs in it, and could use some polishing, but I think this will be 
more readable way of doing it.

What have you been using to test this? I wrote the attached little test 
extension, to explore what _bt_findsplitloc() decides in different 
scenarios. It's pretty rough, but that's what I've been using while 
hacking on this. It prints output like this:

postgres=# select test_split();
NOTICE:  test 1:
left    2/358: 1 0
left  358/358: 1 356
right   1/ 51: 1 357
right  51/ 51: 1 407  SPLIT TUPLE
split ratio: 12/87

NOTICE:  test 2:
left    2/358: 0 0
left  358/358: 356 356
right   1/ 51: 357 357
right  51/ 51: 407 407  SPLIT TUPLE
split ratio: 12/87

NOTICE:  test 3:
left    2/358: 0 0
left  320/358: 10 10  SPLIT TUPLE
left  358/358: 48 48
right   1/ 51: 49 49
right  51/ 51: 99 99
split ratio: 12/87

NOTICE:  test 4:
left    2/309: 1 100
left  309/309: 1 407  SPLIT TUPLE
right   1/100: 2 0
right 100/100: 2 99
split ratio: 24/75

Each test consists of creating a temp table with one index, and 
inserting rows in a certain pattern, until the root page splits. It then 
prints the first and last tuples on both pages, after the split, as well 
as the tuple that caused the split. I don't know if this is useful to 
anyone but myself, but I thought I'd share it.

- Heikki

On Tue, Feb 5, 2019 at 4:50 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Heikki and I had the opportunity to talk about this recently. We found
> an easy way forward. I believe that the nbtsplitloc.c algorithm itself
> is fine -- the code will need to be refactored, though.

Attached v12 does not include this change, though I have every
intention of doing the refactoring described for v13. The
nbtsplitloc.c/split algorithm refactoring would necessitate
revalidating the patch's performance, though, which didn't seem worth
blocking on. Besides, there was bit rot that needed to be fixed.

Notable improvements in v12:

* No more papering-over regression test differences caused by
pg_depend issues, thanks to recent work by Tom (today's commit
1d92a0c9).

* I simplified the code added to _bt_binsrch() to deal with saving and
restoring binary search bounds for _bt_check_unique()-caller
insertions (this is from first/"Refactor nbtree insertion scankeys"
patch). I also improved matters within _bt_check_unique() itself: the
early "break" there (based on reaching the known strict upper bound
from cache binary search) works in terms of the existing
_bt_check_unique() loop invariant.

This even allowed me to add a new assertion that makes sure that
breaking out of the loop early is correct -- we call _bt_isequal() for
next item on assert-enabled builds when we break having reached strict
upper bound established by initial binary search. In other words,
_bt_check_unique() ends up doing the same number of _bt_isequal()
calls as it did on the master branch, provided assertions are enabled.

* I've restored regression test coverage that the patch previously
inadvertently took away. Suffix truncation made deliberately-tall
B-Tree indexes from the regression tests much shorter, making the
tests fail to test the code paths the tests originally targeted. I
needed to find ways to "defeat" suffix truncation so I still ended up
with a fairly tall tree that hit various code paths.

I think that we went from having 8 levels in btree_tall_idx (i.e.
ridiculously many) to having only a single root page when I first
caught the problem! Now btree_tall_idx only has 3 levels, which is all
we really need. Even multi-level page deletion didn't have any
coverage in previous versions. I used gcov to specifically verify that
we have good multi-level page deletion coverage. I also used gcov to
make sure that we have coverage of the v11 "cache rightmost block"
optimization, since I noticed that that was missing (though present on
the master branch) -- that's actually all that the btree_tall_idx
tests in the patch, since multi-level page deletion is covered by a
covering-indexes-era test. Finally, I made sure that we have coverage
of fast root splits. In general, I preserved the original intent
behind the existing tests, all of which I was fairly familiar with
from previous projects.

* I've added a new "relocate" bt_index_parent_check()/amcheck option,
broken out in a separate commit. This new option makes verification
relocate each and every leaf page tuple, starting from the root each
time. This means that there will be at least one piece of code that
specifically relies on "every tuple should have a unique key" from the
start, which seems like a good idea.

This enhancement to amcheck allows me to detect various forms of
corruption that no other existing verification option would catch. In
particular, I can catch various very subtle "cross-cousin
inconsistencies" that require that we verify a page using its
grandparent rather than its parent [1] (existing checks catch some but
not all "cousin problem" corruption). Simply put, this amcheck
enhancement allows me to detect corruption of the least significant
byte in a key value in the root page -- that kind of corruption will
cause index scans to miss only a small number of tuples at the leaf
level. Maybe this scenario isn't realistic, but I'd rather not take
any chances.

* I rethought the "single value mode" fillfactor, which I've been
suspicious of for a while now. It's now 96, down from 99.

Micro-benchmarks involving concurrent sessions inserting into a low
cardinality index led me to the conclusion that 99 was aggressively
high. It was not that hard to get excessive page splits with these
microbenchmarks, since insertions with monotonically increasing heap
TIDs arrived a bit out of order with a lot of concurrency. 99 worked a
bit better than 96 with only one session, but significantly worse with
concurrent sessions. I still think that it's a good idea to be more
aggressive than default leaf fillfactor, but reducing "single value
mode" fillfactor to 90 (or whatever the user set general leaf
fillfactor to) wouldn't be so bad.

[1] http://subs.emis.de/LNI/Proceedings/Proceedings144/32.pdf
--
Peter Geoghegan

Вложения

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

14 февраля 2019 г., 09:47:03

On Mon, Feb 11, 2019 at 12:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Notable improvements in v12:

I've been benchmarking v12, once again using a slightly modified
BenchmarkSQL that doesn't do up-front CREATE INDEX builds [1], since
the problems with index bloat don't take so long to manifest
themselves when the indexes are inserted into incrementally from the
very beginning. This benchmarking process took over 20 hours, with a
database that started off at about 90GB (700 TPC-C/BenchmarkSQL
warehouses were used). That easily exceeded available main memory on
my test server, which was 32GB. This is a pretty I/O bound workload,
and a fairly write-heavy one at that. I used a Samsung 970 PRO 512GB,
NVMe PCIe M.2 2280 SSD for both pg_wal and the default and only
tablespace.

Importantly, I figured out that I should disable both hash joins and
merge joins with BenchmarkSQL, in order to force all joins to be
nested loop joins. Otherwise, the "stock level" transaction eventually
starts to use a hash join, even though that's about 10x slower than a
nestloop join (~4ms vs. ~40ms on this machine) -- the hash join
produces a lot of noise without really testing anything. It usually
takes a couple of hours before we start to get obviously-bad plans,
but it also usually takes about that long until the patch series
starts to noticeably overtake the master branch. I don't think that
TPC-C will ever benefit from using a hash join or a merge join, since
it's supposed to be a pure OLTP benchmark, and is a benchmark that
MySQL is known to do at least respectably-well on.

This is the first benchmark I've published that was considerably I/O
bound. There are significant improvements in performance across the
board, on every measure, though it takes several hours for that to
really show. The benchmark was not rate-limited. 16
clients/"terminals" are used throughout. There were 5 runs for master
and 5 for patch, interlaced, each lasting 2 hours. Initialization
occurred once, so it's expected that both databases will gradually get
larger across runs.

Summary (appears in same order as the execution of each run) -- each
run is 2 hours, so 20 hours total excluding initial load time (2 hours
* 5 runs for master + 2 hours * 5 runs for patch):

Run 1 -- master: Measured tpmTOTAL = 90063.79, Measured tpmC
(NewOrders) = 39172.37
Run 1 -- patch: Measured tpmTOTAL = 90922.63, Measured tpmC
(NewOrders) = 39530.2

Run 2 -- master: Measured tpmTOTAL = 77091.63, Measured tpmC
(NewOrders) = 33530.66
Run 2 -- patch: Measured tpmTOTAL = 83905.48, Measured tpmC
(NewOrders) = 36508.38    <-- 8.8% increase in tpmTOTAL/throughput

Run 3 -- master: Measured tpmTOTAL = 71224.25, Measured tpmC
(NewOrders) = 30949.24
Run 3 -- patch: Measured tpmTOTAL = 78268.29, Measured tpmC
(NewOrders) = 34021.98   <-- 9.8% increase in tpmTOTAL/throughput

Run 4 -- master: Measured tpmTOTAL = 71671.96, Measured tpmC
(NewOrders) = 31163.29
Run 4 -- patch: Measured tpmTOTAL = 73097.42, Measured tpmC
(NewOrders) = 31793.99

Run 5 -- master: Measured tpmTOTAL = 66503.38, Measured tpmC
(NewOrders) = 28908.8
Run 5 -- patch: Measured tpmTOTAL = 71072.3, Measured tpmC (NewOrders)
= 30885.56  <-- 6.9% increase in tpmTOTAL/throughput

There were *also* significant reductions in transaction latency for
the patch -- see the full html reports in the provided tar archive for
full details (URL provided below). The html reports have nice SVG
graphs, generated by BenchmarkSQL using R -- one for transaction
throughput, and another for transaction latency. The overall picture
is that the patched version starts out ahead, and has a much more
gradual decline as the database becomes larger and more bloated.

Note also that the statistics collector stats show a *big* reduction
in blocks read into shared_buffers for the duration of these runs. For
example, here is what pg_stat_database shows for run 3 (I reset the
stats between runs):

master: blks_read = 78,412,640, blks_hit = 4,022,619,556
patch: blks_read = 70,033,583, blks_hit = 4,505,308,517  <-- 10.7%
reduction in blks_read/logical I/O

This suggests an indirect benefit, likely related to how buffers are
evicted in each case. pg_stat_bgwriter indicates that more buffers are
written out during checkpoints, while fewer are written out by
backends. I won't speculate further on what all of this means right
now, though.

You can find the raw details for blks_read for each and every run in
the full tar archive. It is available for download from:

https://drive.google.com/file/d/1kN4fDmh1a9jtOj8URPrnGYAmuMPmcZax/view?usp=sharing

There are also dumps of the other pg_stat* views at the end of each
run, logs for each run, etc. There's more information than anybody
else is likely to find interesting.

If anyone needs help in recreating this benchmark, then I'd be happy
to assist in that. The is a shell script (zsh) included in the tar
archive, although that will need to be changed a bit to point to the
correct installations and so on. Independent validation of the
performance of the patch series on this and other benchmarks is very
welcome.

[1] https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations
--
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

26 февраля 2019 г., 07:31:17

On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I spent some time first trying to understand the current algorithm, and
> then rewriting it in a way that I find easier to understand. I came up
> with the attached. I think it optimizes for the same goals as your
> patch, but the approach  is quite different.

Attached is v13 of the patch series, which significantly refactors
nbtsplitloc.c to implement the algorithm using the approach from your
prototype posted on January 28 -- I now take a "top down" approach
that materializes all legal split points up-front, as opposed to the
initial "bottom up" approach that used recursion, and weighed
everything (balance of free space, suffix truncation, etc) all at
once. Some of the code is directly lifted from your prototype, so
there is now a question about whether or not you should be listed as a
co-author. (I think that you should be credited as a secondary author
of the nbtsplitloc.c patch, and as a secondary author in the release
notes for the feature as a whole, which I imagine will be rolled into
one item there.)

I always knew that a "top down" approach would be simpler, but I
underestimated how much better it would be overall, and how manageable
the downsides are -- the added cycles are not actually noticeable when
compared to the master branch, even during microbenchmarks. Thanks for
suggesting this approach!

I don't even need to simulate recursion with a loop or a goto;
everything is structured as a linear series of steps now. There are
still the same modes as before, though; the algorithm is essentially
unchanged. All of my tests show that it's at least as effective as v12
was in terms of how effective the final _bt_findsplitloc() results are
in reducing index size. The new approach will make more sophisticated
suffix truncation costing much easier to implement in a future
release, when suffix truncation is taught to truncate *within*
individual datums/attributes (e.g. generate the text string "new m"
given a split point between "new jersey" and "new york", by using some
new opclass infrastructure). "Top down" also makes the implementation
of the "split after new item" optimization safer and simpler -- we
already have all split points conveniently available, so we can seek
out an exact match instead of interpolating where it ought appear
later using a dynamic fillfactor. We can back out of the "split after
new item" optimization in the event of the *precise* split point we
want to use not being legal. That shouldn't be necessary, and isn't
necessary in practice, but it seems like a good idea be defensive with
something so delicate as this.

I'm using qsort() to sort the candidate split points array. I'm not
trying to do something clever to avoid the up-front effort of sorting
everything, even though we could probably get away with that much of
the time (e.g. by doing a top-N sort in default mode). Testing has
shown that using an inlined qsort() routine in the style of
tuplesort.c would make my serial test cases/microbenchmarks faster,
without adding much complexity. We're already competitive with the
master branch even without this microoptimization, so I've put that
off for now. What do you think of the idea of specializing an
inlineable qsort() for nbtsplitloc.c?

Performance is at least as good as v12 with a more relevant workload,
such as BenchmarkSQL. Transaction throughput is 5% - 10% greater in my
most recent tests (benchmarks for v13 specifically).

Unlike in your prototype, v13 makes the array for holding candidate
split points into a single big allocation that is always exactly
BLCKSZ. The idea is that palloc() can thereby recycle the big
_bt_findsplitloc() allocation within _bt_split(). palloc() considers
8KiB to be the upper limit on the size of individual blocks it
manages, and we'll always go on to palloc(BLCKSZ) through the
_bt_split() call to PageGetTempPage(). In a sense, we're not even
allocating memory that we weren't allocating already. (Not sure that
this really matters, but it is easy to do it that way.)

Other changes from your prototype:

*  I found a more efficient representation than a pair of raw
IndexTuple pointers for each candidate split. Actually, I use the same
old representation (firstoldonright + newitemonleft) in each split,
and provide routines to work backwards from that to get the lastleft
and firstright tuples. This approach is far more space efficient, and
space efficiency matters when you've allocating space for hundreds of
items in a critical path like this.

* You seemed to refactor _bt_checksplitloc() in passing, making it not
do the newitemisfirstonright thing. I changed that back. Did I miss
something that you intended here?

* Fixed a bug in the loop that adds split points. Your refactoring
made the main loop responsible for new item space handling, as just
mentioned, but it didn't create a split where the new item is first on
the page, and the split puts the new item on the left page on its own,
on all existing items on the new right page. I couldn't prove that
this caused failures to find a legal split, but it still seemed like a
bug.

In general, I think that we should generate our initial list of split
points in exactly the same manner as we do so already. The only
difference as far as split legality/feasibility goes is that we
pessimistically assume that suffix truncation will have to add a heap
TID in all cases. I don't see any advantage to going further than
that.

Changes to my own code since v12:

* Simplified "Add "split after new tuple" optimization" commit, and
made it more consistent with associated code. This is something that
was made a lot easier by the new approach. It would be great to hear
what you think of this.

* Removed subtly wrong assertion in nbtpage.c, concerning VACUUM's
page deletion. Even a page that is about to be deleted can be filled
up again and split when we release and reacquire a lock, however
unlikely that may be.

* Rename _bt_checksplitloc() to _bt_recordsplit(). I think that it
makes more sense to make that about recording a split point, rather
than about making sure a split point is legal. It still does that, but
perhaps 99%+ of calls to _bt_recordsplit()/_bt_checksplitloc() result
in the split being deemed legal, so the new name makes much more
sense.

--
Peter Geoghegan

On Sun, Mar 3, 2019 at 5:41 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Great, looks much simpler now, indeed! Now I finally understand the
> algorithm.

Glad to hear it. Thanks for the additional review!

Attached is v14, which has changes based on your feedback. This
includes changes based on your more recent feedback on
v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch, though
I'll respond to those points directly in a later email.

v14 also changes the logic by which we decide if alternative strategy
should be used to use leftmost and rightmost splits for the entire
page, rather than accessing the page directly. We always handle the
newitem-at-end edge case correctly now, since the new "top down"
approach has all legal splits close at hand. This is more elegant,
more obviously correct, and even more effective, at least in some
cases -- it's another example of why "top down" is the superior
approach for nbtsplitloc.c. This made my "UK land registry data" index
have about 2.5% fewer leaf pages than with v13, which is small but
significant.

Separately, I made most of the new nbtsplitloc.c functions use a
FindSplitData argument in v14, which simplifies their signatures quite
a bit.

> What would be the worst case scenario for this? Splitting a page that
> has as many tuples as possible, I guess, so maybe inserting into a table
> with a single-column index, with 32k BLCKSZ. Have you done performance
> testing on something like that?

I'll test that (added to my project TODO list), though it's not
obvious that that's the worst case. Page splits will be less frequent,
and have better choices about where to split.

> Rounding up the allocation to BLCKSZ seems like a premature
> optimization. Even if it saved some cycles, I don't think it's worth the
> trouble of having to explain all that in the comment.

Removed that optimization.

> I think you could change the curdelta, leftfree, and rightfree fields in
> SplitPoint to int16, to make the array smaller.

Added this alternative optimization to replace the BLCKSZ allocation
thing. I even found a way to get the array elements down to 8 bytes,
but that made the code noticeably slower with "many duplicates"
splits, so I didn't end up doing that (I used bitfields, plus the same
pragmas that we use to make sure that item pointers are packed).

> > * You seemed to refactor _bt_checksplitloc() in passing, making it not
> > do the newitemisfirstonright thing. I changed that back. Did I miss
> > something that you intended here?
>
> My patch treated the new item the same as all the old items, in
> _bt_checksplitloc(), so it didn't need newitemisonright. You still need
> it with your approach.

I would feel better about it if we stuck to the same method for
calculating if a split point is legal as before (the only difference
being that we pessimistically add heap TID overhead to new high key on
leaf level). That seems safer to me.

> > Changes to my own code since v12:
> >
> > * Simplified "Add "split after new tuple" optimization" commit, and
> > made it more consistent with associated code. This is something that
> > was made a lot easier by the new approach. It would be great to hear
> > what you think of this.
>
> I looked at it very briefly. Yeah, it's pretty simple now. Nice!

I can understand why it might be difficult to express an opinion on
the heuristics themselves. The specific cut-off points (e.g. details
of what "heap TID adjacency" actually means) are not that easy to
defend with a theoretical justification, though they have been
carefully tested. I think it's worth comparing the "split after new
tuple" optimization to the traditional leaf fillfactor of 90, which is
a very similar situation. Why should it be 90? Why not 85, or 95? Why
is it okay to assume that the rightmost page shouldn't be split 50/50?

The answers to all of these questions about the well established idea
of a leaf fillfactor boil down to this: it's very likely to be correct
on average, and when it isn't correct the problem is self-limiting,
and has an infinitesimally small chance of continually recurring
(unless you imagine an *adversarial* case). Similarly, it doesn't
matter if these new heuristics get it wrong once every 1000 splits (a
very pessimistic estimate), because even then those will cancel each
other out in the long run. It is necessary to take a holistic view of
things. We're talking about an optimization that makes the two largest
TPC-C indexes over 40% smaller -- I can hold my nose if I must in
order to get that benefit. There were also a couple of indexes in the
real-world mouse genome database that this made much smaller, so this
will clearly help in the real world.

Besides all this, the "split after new tuple" optimization fixes an
existing worst case, rather than being an optimization, at least in my
mind. It's not supposed to be possible to have leaf pages that are all
only 50% full without any deletes, and yet we allow it to happen in
this one weird case. Even completely random insertions result in 65% -
70% average space utilization, so the existing worst case really is
exceptional. We are forced to take a holistic view, and infer
something about the pattern of insertions over time, even though
holistic is a dirty word.

> > (New header comment block over _bt_findsplitloc())
>
> This is pretty good, but I think some copy-editing can make this even
> better

I've restored the old structure of the _bt_findsplitloc() header comments.

> The explanation of how the high key for the left page is formed (5th
> paragraph), seems out-of-place here, because the high key is not formed
> here.

Moved that to _bt_split(), which seems like a good compromise.

> Somewhere in the 1st or 2nd paragraph, perhaps we should mention that
> the function effectively uses a different fillfactor in some other
> scenarios too, not only when it's the rightmost page.

Done.

> >       state.maxsplits = maxoff + 2;
> >       state.splits = palloc(Max(BLCKSZ, sizeof(SplitPoint) * state.maxsplits));
> >       state.nsplits = 0;
>
> I wouldn't be that paranoid. The code that populates the array is pretty
> straightforward.

Done that way. But are you sure? Some of the attempts to create a new
split point are bound to fail, because they try to put everything
(including new item) on one size of the split. I'll leave the
assertion there.

> >        * Still, it's typical for almost all calls to _bt_recordsplit to
> >        * determine that the split is legal, and therefore enter it into the
> >        * candidate split point array for later consideration.
> >        */
>
> Suggestion: Remove the "Still" word. The observation that typically all
> split points are legal is valid, but it seems unrelated to the first
> paragraph. (Do we need to mention it at all?)

Removed the second paragraph entirely.

> >       /*
> >        * If the new item goes as the last item, record the split point that
> >        * leaves all the old items on the left page, and the new item on the
> >        * right page.  This is required because a split that leaves the new item
> >        * as the firstoldonright won't have been reached within the loop.  We
> >        * always record every possible split point.
> >        */
>
> Suggestion: Remove the last sentence.

Agreed. Removed.

> ISTM that figuring out which "mode" we want to operate in is actually
> the *primary* purpose of _bt_perfect_penalty(). We only really use the
> penalty as an optimization that we pass on to _bt_bestsplitloc(). So I'd
> suggest changing the function name to something like _bt_choose_mode(),
> and have secondmode be the primary return value from it, with
> perfectpenalty being the secondary result through a pointer.

I renamed _bt_perfect_penalty() to _bt_strategy(), since I agree that
its primary purpose is to decide on a strategy (which is what I'm now
calling a mode, per your request a bit further down). It still returns
perfectpenalty, though, since that seemed more natural to me, probably
because its style matches the style of caller/_bt_findsplitloc().
perfectpenalty isn't a mere optimization -- it is important to prevent
many duplicates mode from going overboard with suffix truncation. It
does more than just save _bt_bestsplitloc() cycles, which I've tried
to make clearer in v14.

> It doesn't really choose the mode, either, though. At least after the
> next "Add split after new tuple optimization" patch. The caller has a
> big part in choosing what to do. So maybe split _bt_perfect_penalty into
> two functions: _bt_perfect_penalty(), which just computes the perfect
> penalty, and _bt_analyze_split_interval(), which determines how many
> duplicates there are in the top-N split points.

Hmm. I didn't create a _bt_analyze_split_interval(), because now
_bt_perfect_penalty()/_bt_strategy() is responsible for setting the
perfect penalty in all cases. It was a mistake for me to move some
perfect penalty stuff for alternative modes/strategies out to the
caller in v13. In v14, we never make _bt_findsplitloc() change its
perfect penalty -- it only changes its split interval, based on the
strategy/mode, possibly after sorting. Let me know what you think of
this.

> BTW, I like the word "strategy", like you called it in the comment on
> SplitPoint struct, better than "mode".

I've adopted that terminology in v14 -- it's always "strategy", never "mode".

> How about removing the "usemult" variable, and just check if
> fillfactormult == 0.5?

I need to use "usemult" to determine if the "split after new tuple"
optimization should apply leaf fillfactor, or should instead split at
the exact point after the newly inserted tuple -- it's very natural to
have a single bool flag for that. It's seems simpler to continue to
use "usemult" for everything, and not distinguish "split after new
tuple" as a special case later on. (Besides, the master branch already
uses a bool for this, even though it handles far fewer things.)

> >       /*
> >        * There are a much smaller number of candidate split points when
> >        * splitting an internal page, so we can afford to be exhaustive.  Only
> >        * give up when pivot that will be inserted into parent is as small as
> >        * possible.
> >        */
> >       if (!state->is_leaf)
> >               return MAXALIGN(sizeof(IndexTupleData) + 1);
>
> Why are there fewer candidate split points on an internal page?

The comment should say that there is typically a much smaller split
interval (this used to be controlled by limiting the size of the array
initially -- should have updated this for v13 of the patch). I believe
that you understand that, and are interested in why the split interval
itself is different on internal pages. Or why we are more conservative
with internal pages in general. Assuming that's what you meant, here
is my answer:

The "Prefix B-Tree" paper establishes the idea that there are
different split intervals for leaf pages and internal pages (which it
calls branch pages). We care about different things in each case. With
leaf pages, we care about choosing the split point that allows us to
create the smallest possible pivot tuple as our secondary goal
(primary goal is balancing space). With internal pages, we care about
choosing the smallest tuple to insert into parent of internal page
(often the root) as our secondary goal, but don't care about
truncation, because _bt_split() won't truncate new pivot. That's why
the definition of "penalty" varies according to whether we're
splitting an internal page or a leaf page. Clearly the idea of having
separate split intervals is well established, and makes sense.

It's fair to ask if I'm being too conservative (or not conservative
enough) with split interval in either case. Unfortunately, the Prefix
B-Tree paper never seems to give practical advice about how to come up
with an interval. They say:

"We have not analyzed the influence of sigma L [leaf interval] or
sigma B [branch/internal interval] on the performance of the trees. We
expect such an analysis to be quite involved and difficult. We are
quite confident, however, that small split intervals improve
performance considerably. Sets of keys that arise in practical
applications are often far from random, and clusters of similar keys
differing only in the last few letters (e.g. plural forms) are quite
common."

I am aware of another, not-very-notable paper that tries to to impose
some theory here, but doesn't really help much [1]. Anyway, I've found
that I was too conservative with split interval for internal pages. It
pays to make internal interval that higher than leaf interval, because
internal pages cover a much bigger portion of the key space than leaf
pages, which will tend to get filled up one way or another. Leaf pages
cover a tight part of the key space, in contrast. In v14, I've
increased internal page to 18, a big increase from 3, and twice what
it is for leaf splits (still 9 -- no change there). This mostly isn't
that different to 3, since there usually are pivot tuples that are all
the same size anyway. However, with cases where suffix truncation
makes pivot tuples a lot smaller (e.g. UK land registry test case),
this makes the items in the root a lot smaller on average, and even
makes the whole index smaller. My entire test suite has a few cases
that are noticeably improved by this change, and no cases that are
hurt.

I'm going to have to revalidate the performance of long-running
benchmarks with this change, so this should be considered provisional.
I think that it will probably be kept, though. Not expecting it to
noticeably impact either BenchmarkSQL or pgbench benchmarks.

[1] https://shareok.org/bitstream/handle/11244/16442/Thesis-1983-T747e.pdf?sequence=1
--
Peter Geoghegan

Вложения

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

05 марта 2019 г., 00:59:13

On Sun, Mar 3, 2019 at 10:02 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Some comments on
> v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below.
> Mostly about code comments. In general, I think a round of copy-editing
> the comments, to use simpler language, would do good. The actual code
> changes look good to me.

I'm delighted that the code looks good to you, and makes sense
overall. I worked hard to make the patch a natural adjunct to the
existing code, which wasn't easy.

> Seems confusing to first say assertively that "*bufptr contains the page
> that the new tuple unambiguously belongs to", and then immediately go on
> to list a whole bunch of exceptions. Maybe just remove "unambiguously".

This is fixed in v14 of the patch series.

> This happens very seldom, because you only get an incomplete split if
> you crash in the middle of a page split, and that should be very rare. I
> don't think we need to list more fine-grained conditions here, that just
> confuses the reader.

Fixed in v14.

> > /*
> >  *    _bt_useduplicatepage() -- Settle for this page of duplicates?

> So, this function is only used for legacy pg_upgraded indexes. The
> comment implies that, but doesn't actually say it.

I made that more explicit in v14.

> > /*
> >  * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
> >  * and non-pivot tuples, despite differences in how heap TID is represented.

> > #define BTreeTupleGetHeapTID(itup) ...

I fixed up the comments above BTreeTupleGetHeapTID() significantly.

> The comment claims that "all pivot tuples must be as of BTREE_VERSION
> 4". I thought that all internal tuples are called pivot tuples, even on
> version 3.

In my mind, "pivot tuple" is a term that describes any tuple that
contains a separator key, which could apply to any nbtree version.
It's useful to have a distinct term (to not just say "separator key
tuple") because Lehman and Yao think of separator keys as separate and
distinct from downlinks. Internal page splits actually split *between*
a separator key and a downlink. So nbtree internal page splits must
split "inside a pivot tuple", leaving its separator on the left hand
side (new high key), and its downlink on the right hand side (new
minus infinity tuple).

Pivot tuples may contain a separator key and a downlink, just a
downlink, or just a separator key (sometimes this is implicit, and the
block number is garbage). I am particular about the terminology
because the pivot tuple vs. downlink vs. separator key thing causes a
lot of confusion, particularly when you're using Lehman and Yao (or
Lanin and Shasha) to understand how things work in Postgres.

We wan't to have a broad term that refers to the tuples that describe
the keyspace (pivot tuples), since it's often helpful to refer to them
collectively, without seeming to contradict Lehman and Yao.

> I think what this means to say is that this macro is only
> used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only
> have a heap TID in BTREE_VERSION 4 indexes.

My high level approach to pg_upgrade/versioning is for index scans to
*pretend* that every nbtree index (even on v2 and v3) has a heap
attribute that actually makes the keys unique. The difference is that
v4 gets to use a scantid, and actually rely on the sort order of heap
TIDs, whereas pg_upgrade'd indexes "are not allowed to look at the
heap attribute", and must never provide a scantid (they also cannot
use the !minusinfkey optimization, but this is only an optimization
that v4 indexes don't truly need). They always do the right thing
(move left) on otherwise-equal pivot tuples, since they have no
scantid.

That's why _bt_compare() can use BTreeTupleGetHeapTID() without caring
about the version the index uses. It might be NULL for a pivot tuple
in a v3 index, even though we imagine/pretend that it should have a
value set. But that doesn't matter, because higher level code knows
that !heapkeyspace indexes should never get a scantid (_bt_compare()
does Assert() that they got that detail right, though). We "have no
reason to peak", because we don't have a scantid, so index scans work
essentially the same way, regardless of the version in use.

There are a few specific cross-version things that we need think about
outside of making sure that there is never a scantid (and !minusinfkey
optimization is unused) in < v4 indexes, but these are all related to
unique indexes. "Pretending that all indexes have a heap TID" is a
very useful mental model. Nothing really changes, even though you
might guess that changing the classic "Subtree S is described by Ki <
v <= Ki+1" invariant would need to break code in
_bt_binsrch()/_bt_compare(). Just pretend that the classic invariant
was there since the Berkeley days, and don't do anything that breaks
the useful illusion on versions before v4.

> This macro (and many others in nbtree.h) is quite complicated. A static
> inline function might be easier to read.

I agree that the macros are complicated, but that seems to be because
the rules are complicated. I'd rather leave the macros in place, and
improve the commentary on the rules.

> 'xlmeta.version' is set incorrectly.

Oops. Fixed in v14.

> I find this comment difficult to read. I suggest rewriting it to:
>
> /*
>   * The current Btree version is 4. That's what you'll get when you create
>   * a new index.

I used your wording for this in v14, almost verbatim.

> Now that the index tuple format becomes more complicated, I feel that
> there should be some kind of an overview explaining the format. All the
> information is there, in the comments in nbtree.h, but you have to piece
> together all the details to get the overall picture. I wrote this to
> keep my head straight:

v14 uses your diagrams in nbtree.h, and expands some existing
discussion of INCLUDE indexes/non-key attributes/tuple format. Let me
know what you think.

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Heikki Linnakangas

Дата:

05 марта 2019 г., 14:37:08

I'm looking at the first patch in the series now. I'd suggest that you 
commit that very soon. It's useful on its own, and seems pretty much 
ready to be committed already. I don't think it will be much affected by 
whatever changes we make to the later patches, anymore.

I did some copy-editing of the code comments, see attached patch which 
applies on top of v14-0001-Refactor-nbtree-insertion-scankeys.patch. 
Mostly, to use more Plain English: use active voice instead of passive, 
split long sentences, avoid difficult words.

I also had a few comments and questions on some details. I added them in 
the same patch, marked with "HEIKKI:". Please take a look.

- Heikki

Вложения

v14-0001-Refactor-nbtree-insertion-scankeys-HEIKKI-comments.patch

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

05 марта 2019 г., 23:03:17

On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm looking at the first patch in the series now. I'd suggest that you
> commit that very soon. It's useful on its own, and seems pretty much
> ready to be committed already. I don't think it will be much affected by
> whatever changes we make to the later patches, anymore.

I agree that the parts covered by the first patch in the series are
very unlikely to need changes, but I hesitate to commit it weeks ahead
of the other patches. Some of the things that make _bt_findinsertloc()
fast are missing for v3 indexes. The "consider secondary factors
during nbtree splits" patch actually more than compensates for that
with v3 indexes, at least in some cases, but the first patch applied
on its own will slightly regress performance. At least, I benchmarked
the first patch on its own several months ago and noticed a small
regression at the time, though I don't have the exact details at hand.
It might have been an invalid result, because I wasn't particularly
thorough at the time.

We do make some gains in the first patch  (the _bt_check_unique()
thing), but we also check the high key more than we need to within
_bt_findinsertloc() for non-unique indexes. Plus, the microvacuuming
thing isn't as streamlined.

It's a lot of work to validate and revalidate the performance of a
patch like this, and I'd rather commit the first three patches within
a couple of days of each other (I can validate v3 indexes and v4
indexes separately, though). We can put off the other patches for
longer, and treat them as independent. I guess I'd also push the final
amcheck patch following the first three -- no point in holding back on
that. Then we'd be left with "Add "split after new tuple"
optimization", and "Add high key "continuescan" optimization" as
independent improvements that can be pushed at the last minute of the
final CF.

> I also had a few comments and questions on some details. I added them in
> the same patch, marked with "HEIKKI:". Please take a look.

Will respond now. Any point that I haven't responding to directly has
been accepted.

> +HEIKKI: 'checkingunique' is a local variable in the function. Seems a bit
> +weird to talk about it in the function comment. I didn't understand what
> +the point of adding this sentence was, so I removed it.

Maybe there is no point in the comment you reference here, but I like
the idea of "checkingunique", because that symbol name is a common
thread between a number of functions that coordinate with each other.
It's not just a local variable in one function.

> @@ -588,6 +592,17 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
>             if (P_RIGHTMOST(opaque))
>                 break;
>             highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
> +
> +           /*
> +            * HEIKKI: This assertion might fire if the user-defined opclass
> +            * is broken. It's just an assertion, so maybe that's ok. With a
> +            * broken opclass, it's obviously "garbage in, garbage out", but
> +            * we should try to behave sanely anyway. I don't remember what
> +            * our general policy on that is; should we assert, elog(ERROR),
> +            * or continue silently in that case? An elog(ERROR) or
> +            * elog(WARNING) would feel best to me, but I don't remember what
> +            * we usually do.
> +            */
>             Assert(highkeycmp <= 0);
>             if (highkeycmp != 0)
>                 break;

We don't really have a general policy on it. However, I don't have any
sympathy for the idea of trying to solider on with a corrupt index. I
also don't think that it's worth making this a "can't happen" error.
Like many of my assertions, this assertion is intended to document an
invariant. I don't actually anticipate that it could ever really fail.

> +Should we mention explicitly that this binary-search reuse is only applicable
> +if unique checks were performed? It's kind of implied by the fact that it's
> +_bt_check_unique() that saves the state, but perhaps we should be more clear
> +about it.

I guess so.

> +What is a "garbage duplicate"? Same as a "dead duplicate"?

Yes.

> +The last sentence, about garbage duplicates, seems really vague. Why do we
> +ever do any comparisons that are not strictly necessary? Perhaps it's best to
> +just remove that last sentence.

Okay -- will remove.

> +
> +HEIKKI: I don't buy the argument that microvacuuming has to happen here. You
> +could easily imagine a separate function that does microvacuuming, and resets
> +(or even updates) the binary-search cache in the insertion key. I agree this
> +is a convenient place to do it, though.

It wasn't supposed to be a water-tight argument. I'll just say that
it's convenient.

> +/* HEIKKI:
> +Do we need 'checkunique' as an argument? If unique checks were not
> +performed, the insertion key will simply not have saved state.
> +*/

We need it in the next patch in the series, because it's also useful
for optimizing away the high key check with non-unique indexes. We
know that _bt_moveright() was called at the leaf level, with scantid
filled in, so there is no question of needing to move right within
_bt_findinsertloc() (provided it's a heapkeyspace index).

Actually, we even need it in the first patch: we only restore a binary
search because we know that there is something to restore, and must
ask for it to be restored explicitly (anything else seems unsafe).
Maybe we can't restore it because it's not a unique index, or maybe we
can't restore it because we microvacuumed, or moved right to get free
space. I don't think that it'll be helpful to make _bt_findinsertloc()
pretend that it doesn't know exactly where the binary search bounds
come from -- it already knows plenty about unique indexes
specifically, and about how it may have to invalidate the bounds. The
whole way that it couples buffer locks is only useful for unique
indexes, so it already knows *plenty* about unique indexes
specifically.

I actually like the idea of making certain insertion scan key mutable
state relating to search bounds hidden in the case of "dynamic prefix
truncation" [1]. Doesn't seem to make sense here, though.

> +   /* HEIKKI: I liked this comment that we used to have here, before this patch: */
> +   /*----------
> +    * If we will need to split the page to put the item on this page,
> +    * check whether we can put the tuple somewhere to the right,
> +    * instead.  Keep scanning right until we

> +   /* HEIKKI: Maybe it's not relevant with the later patches, but at least
> +    * with just this first patch, it's still valid. I noticed that the
> +    * comment is now in _bt_useduplicatepage, it seems a bit out-of-place
> +    * there. */

I don't think it matters, because I don't think that the first patch
can be justified as an independent piece of work. I like the idea of
breaking up the patch series, because it makes it all easier to
understand, but the first three patches are kind of intertwined.

> +HEIKKI: In some scenarios, if the BTP_HAS_GARBAGE flag is falsely set, we would
> +try to microvacuum the page twice: first in _bt_useduplicatepage, and second
> +time here. That's because _bt_vacuum_one_page() doesn't clear the flag, if
> +there are in fact no LP_DEAD items. That's probably insignificant and not worth
> +worrying about, but I thought I'd mention it.

Right. It's also true that all future insertions will reach
_bt_vacuum_one_page() and do the same again, until there either is
garbage, or until the page splits.

> -    * rightmost page case), all the items on the right half will be user data
> -    * (there is no existing high key that needs to be relocated to the new
> -    * right page).
> +    * rightmost page case), all the items on the right half will be user
> +    * data.
> +    *
> +HEIKKI: I don't think the comment change you made here was needed or
> +helpful, so I reverted it.

I thought it added something when you're looking at it from a
WAL-logging point of view. But I can live without this.

> - * starting a regular index scan some can be omitted.  The array is used as a
> + * starting a regular index scan, some can be omitted.  The array is used as a
>   * flexible array member, though it's sized in a way that makes it possible to
>   * use stack allocations.  See nbtree/README for full details.
> +
> +HEIKKI: I don't see anything in the README about stack allocations. What
> +exactly does the README reference refer to? No code seems to actually allocate
> +this in the stack, so we don't really need that.

The README discusses insertion scankeys in general, though. I think
that you read it that way because you're focussed on my changes, and
not because it actually implies that the README talks about the stack
thing specifically. (But I can change it if you like.)

There is a stack allocation in _bt_first(). This was once just another
dynamic allocation, that called _bt_mkscankey(), but that regressed
nested loop joins, so I had to make it work the same way as before. I
noticed this about six months ago, because there was a clear impact on
the TPC-C "Stock level" transaction, which is now sometimes twice as
fast with the patch series. Note also that commit d961a568, from 2005,
changed the _bt_first() code to use a stack allocation. Besides,
sticking to a stack allocation makes the changes to _bt_first()
simpler, even though it has to duplicate a few things from
_bt_mkscankey().

I could get you a v15 that integrates your changes pretty quickly, but
I'll hold off on that for at least a few days. I have a feeling that
you'll have more feedback for me to work through before too long.

[1] https://postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
--
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Robert Haas

Дата:

07 марта 2019 г., 00:37:11

On Tue, Mar 5, 2019 at 3:03 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I agree that the parts covered by the first patch in the series are
> very unlikely to need changes, but I hesitate to commit it weeks ahead
> of the other patches.

I know I'm stating the obvious here, but we don't have many weeks left
at this point.  I have not reviewed any code, but I have been
following this thread and I'd really like to see this work go into
PostgreSQL 12, assuming it's in good enough shape.  It sounds like
really good stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

07 марта 2019 г., 02:42:32

On Wed, Mar 6, 2019 at 1:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I know I'm stating the obvious here, but we don't have many weeks left
> at this point.  I have not reviewed any code, but I have been
> following this thread and I'd really like to see this work go into
> PostgreSQL 12, assuming it's in good enough shape.  It sounds like
> really good stuff.

Thanks!

Barring any objections, I plan to commit the first 3 patches (plus the
amcheck "relocate" patch) within 7 - 10 days (that's almost
everything). Heikki hasn't reviewed 'Add high key "continuescan"
optimization' yet, and it seems like he should take a look at that
before I proceed with it. But that seems like the least controversial
enhancement within the entire patch series, so I'm not very worried
about it.

I'm currently working on v15, which has comment-only revisions
requested by Heikki. I expect to continue to work with him to make
sure that he is happy with the presentation. I'll also need to
revalidate the performance of the patch series following recent minor
changes to the logic for choosing a split point. That can take days.
This is why I don't want to commit the first patch without committing
at least the first three all at once -- it increases the amount of
performance validation work I'll have to do considerably. (I have to
consider both v4 and v3 indexes already, which seems like enough
work.)

Two of the later patches (one of which I plan to push as part of the
first batch of commits) use heuristics to decide where to split the
page. As a Postgres contributor, I have learned to avoid inventing
heuristics, so this automatically makes me a bit uneasy. However, I
don't feel so bad about it here, on reflection. The on-disk size of
the TPC-C indexes are reduced by 35% with the 'Add "split after new
tuple" optimization' patch (I think that the entire database is
usually about 12% smaller). There simply isn't a fundamentally better
way to get the same benefit, and I'm sure that nobody will argue that
we should just accept the fact that the most influential database
benchmark of all time has a big index bloat problem with Postgres.
That would be crazy.

That said, it's not impossible that somebody will shout at me because
my heuristics made their index bloated. I can't see how that could
happen, but I am prepared. I can always adjust the heuristics when new
information comes to light. I have fairly thorough test cases that
should allow me to do this without regressing anything else. This is a
risk that can be managed sensibly.

There is no gnawing ambiguity about the on-disk changes laid down in
the second patch (nor the first patch), though. Making on-disk changes
is always a bit scary, but making the keys unique is clearly a big
improvement architecturally, as it brings nbtree closer to the Lehman
& Yao design without breaking anything for v3 indexes (v3 indexes
simply aren't allowed to use a heap TID in their scankey). Unique keys
also allow amcheck to relocate every tuple in the index from the root
page, using the same code path as regular index scans. We'll be
relying on the uniqueness of keys within amcheck from the beginning,
before anybody teaches nbtree to perform retail index tuple deletion.

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Heikki Linnakangas

Дата:

07 марта 2019 г., 09:15:47

On 06/03/2019 04:03, Peter Geoghegan wrote:
> On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I'm looking at the first patch in the series now. I'd suggest that you
>> commit that very soon. It's useful on its own, and seems pretty much
>> ready to be committed already. I don't think it will be much affected by
>> whatever changes we make to the later patches, anymore.

After staring at the first patch for bit longer, a few things started to 
bother me:

* The new struct is called BTScanInsert, but it's used for searches, 
too. It makes sense when you read the README, which explains the 
difference between "search scan keys" and "insertion scan keys", but now 
that we have a separate struct for this, perhaps we call insertion scan 
keys with a less confusing name. I don't know what to suggest, though. 
"Positioning key"?

* We store the binary search bounds in BTScanInsertData, but they're 
only used during insertions.

* The binary search bounds are specific for a particular buffer. But 
that buffer is passed around separately from the bounds. It seems easy 
to have them go out of sync, so that you try to use the cached bounds 
for a different page. The savebinsrch and restorebinsrch is used to deal 
with that, but it is pretty complicated.

I came up with the attached (against master), which addresses the 2nd 
and 3rd points. I added a whole new BTInsertStateData struct, to hold 
the binary search bounds. BTScanInsert now only holds the 'scankeys' 
array, and the 'nextkey' flag. The new BTInsertStateData struct also 
holds the current buffer we're considering to insert to, and a 
'bounds_valid' flag to indicate if the saved bounds are valid for the 
current buffer. That way, it's more straightforward to clear the 
'bounds_valid' flag whenever we move right.

I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary 
search like _bt_binsrch does, but the bounds caching is only done in 
_bt_binsrch_insert. Seems more clear to have separate functions for them 
now, even though there's some duplication.

>> +/* HEIKKI:
>> +Do we need 'checkunique' as an argument? If unique checks were not
>> +performed, the insertion key will simply not have saved state.
>> +*/
> 
> We need it in the next patch in the series, because it's also useful
> for optimizing away the high key check with non-unique indexes. We
> know that _bt_moveright() was called at the leaf level, with scantid
> filled in, so there is no question of needing to move right within
> _bt_findinsertloc() (provided it's a heapkeyspace index).

Hmm. Perhaps it would be to move the call to _bt_binsrch (or 
_bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So 
that _bt_findinsertloc would only be responsible for finding the correct 
page to insert to. So the overall code, after patch #2, would be like:

/*
  * Do the insertion. First move right to find the correct page to
  * insert to, if necessary. If we're inserting to a non-unique index,
  * _bt_search() already did this when it checked if a move to the
  * right was required for leaf page.  Insertion scankey's scantid
  * would have been filled out at the time. On a unique index, the
  * current buffer is the first buffer containing duplicates, however,
  * so we may need to move right to the correct location for this
  * tuple.
  */
if (checkingunique || itup_key->heapkeyspace)
    _bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup, 
newitemoff, false);

Does this make sense?

> Actually, we even need it in the first patch: we only restore a binary
> search because we know that there is something to restore, and must
> ask for it to be restored explicitly (anything else seems unsafe).
> Maybe we can't restore it because it's not a unique index, or maybe we
> can't restore it because we microvacuumed, or moved right to get free
> space. I don't think that it'll be helpful to make _bt_findinsertloc()
> pretend that it doesn't know exactly where the binary search bounds
> come from -- it already knows plenty about unique indexes
> specifically, and about how it may have to invalidate the bounds. The
> whole way that it couples buffer locks is only useful for unique
> indexes, so it already knows *plenty* about unique indexes
> specifically.

The attached new version simplifies this, IMHO. The bounds and the 
current buffer go together in the same struct, so it's easier to keep 
track whether the bounds are valid or not.

>> - * starting a regular index scan some can be omitted.  The array is used as a
>> + * starting a regular index scan, some can be omitted.  The array is used as a
>>    * flexible array member, though it's sized in a way that makes it possible to
>>    * use stack allocations.  See nbtree/README for full details.
>> +
>> +HEIKKI: I don't see anything in the README about stack allocations. What
>> +exactly does the README reference refer to? No code seems to actually allocate
>> +this in the stack, so we don't really need that.
> 
> The README discusses insertion scankeys in general, though. I think
> that you read it that way because you're focussed on my changes, and
> not because it actually implies that the README talks about the stack
> thing specifically. (But I can change it if you like.)
> 
> There is a stack allocation in _bt_first(). This was once just another
> dynamic allocation, that called _bt_mkscankey(), but that regressed
> nested loop joins, so I had to make it work the same way as before.

Ah, gotcha, I missed that.

- Heikki

Вложения

v14-heikki-2-0001-Refactor-nbtree-insertion-scankeys.patch

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

07 марта 2019 г., 09:54:48

On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> After staring at the first patch for bit longer, a few things started to
> bother me:
>
> * The new struct is called BTScanInsert, but it's used for searches,
> too. It makes sense when you read the README, which explains the
> difference between "search scan keys" and "insertion scan keys", but now
> that we have a separate struct for this, perhaps we call insertion scan
> keys with a less confusing name. I don't know what to suggest, though.
> "Positioning key"?

I think that insertion scan key is fine. It's been called that for
almost twenty years. It's not like it's an intuitive concept that
could be conveyed easily if only we came up with a new, pithy name.

> * We store the binary search bounds in BTScanInsertData, but they're
> only used during insertions.
>
> * The binary search bounds are specific for a particular buffer. But
> that buffer is passed around separately from the bounds. It seems easy
> to have them go out of sync, so that you try to use the cached bounds
> for a different page. The savebinsrch and restorebinsrch is used to deal
> with that, but it is pretty complicated.

That might be an improvement, but I do think that using mutable state
in the insertion scankey, to restrict a search is an idea that could
work well in at least one other way. That really isn't a once-off
thing, even though it looks that way.

> I came up with the attached (against master), which addresses the 2nd
> and 3rd points. I added a whole new BTInsertStateData struct, to hold
> the binary search bounds. BTScanInsert now only holds the 'scankeys'
> array, and the 'nextkey' flag.

It will also have to store heapkeyspace, of course. And minusinfkey.
BTW, I would like to hear what you think of the idea of minusinfkey
(and the !minusinfkey optimization) specifically.

> The new BTInsertStateData struct also
> holds the current buffer we're considering to insert to, and a
> 'bounds_valid' flag to indicate if the saved bounds are valid for the
> current buffer. That way, it's more straightforward to clear the
> 'bounds_valid' flag whenever we move right.

I'm not sure that that's an improvement. Moving right should be very
rare with my patch. gcov shows that we never move right here anymore
with the regression tests, or within _bt_check_unique() -- not once.
For a second, I thought that you forgot to invalidate the bounds_valid
flag, because you didn't pass it directly, by value to
_bt_useduplicatepage().

> I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
> search like _bt_binsrch does, but the bounds caching is only done in
> _bt_binsrch_insert. Seems more clear to have separate functions for them
> now, even though there's some duplication.

I'll have to think about that some more. Having a separate
_bt_binsrch_insert() may be worth it, but I'll need to do some
profiling.

> Hmm. Perhaps it would be to move the call to _bt_binsrch (or
> _bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So
> that _bt_findinsertloc would only be responsible for finding the correct
> page to insert to. So the overall code, after patch #2, would be like:

Maybe, but as I said it's not like _bt_findinsertloc() doesn't know
all about unique indexes already. This is pointed out in a comment in
_bt_doinsert(), even. I guess that it might have to be changed to say
_bt_findinsertpage() instead, with your new approach.

> /*
>   * Do the insertion. First move right to find the correct page to
>   * insert to, if necessary. If we're inserting to a non-unique index,
>   * _bt_search() already did this when it checked if a move to the
>   * right was required for leaf page.  Insertion scankey's scantid
>   * would have been filled out at the time. On a unique index, the
>   * current buffer is the first buffer containing duplicates, however,
>   * so we may need to move right to the correct location for this
>   * tuple.
>   */
> if (checkingunique || itup_key->heapkeyspace)
>         _bt_findinsertpage(rel, &insertstate, stack, heapRel);
>
> newitemoff = _bt_binsrch_insert(rel, &insertstate);
>
> _bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
> newitemoff, false);
>
> Does this make sense?

I guess you're saying this because you noticed that the for (;;) loop
in _bt_findinsertloc() doesn't do that much in many cases, because of
the fastpath.

I suppose that this could be an improvement, provided all the
assertions that verify that the work "_bt_findinsertpage()" would have
done if called was in fact unnecessary. (e.g., check the high
key/rightmost-ness)

> The attached new version simplifies this, IMHO. The bounds and the
> current buffer go together in the same struct, so it's easier to keep
> track whether the bounds are valid or not.

I'll look into integrating this with my current draft v15 tomorrow.
Need to sleep on it.

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

07 марта 2019 г., 10:23:15

On Wed, Mar 6, 2019 at 10:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
> It will also have to store heapkeyspace, of course. And minusinfkey.
> BTW, I would like to hear what you think of the idea of minusinfkey
> (and the !minusinfkey optimization) specifically.

> I'm not sure that that's an improvement. Moving right should be very
> rare with my patch. gcov shows that we never move right here anymore
> with the regression tests, or within _bt_check_unique() -- not once.
> For a second, I thought that you forgot to invalidate the bounds_valid
> flag, because you didn't pass it directly, by value to
> _bt_useduplicatepage().

BTW, the !minusinfkey optimization is why we literally never move
right within _bt_findinsertloc() while the regression tests run. We
always land on the correct leaf page to begin with. (It works with
unique index insertions, where scantid is NULL when we descend the
tree.)

In general, there are two good reasons for us to move right:

* There was a concurrent page split (or page deletion), and we just
missed the downlink in the parent, and need to recover.

* We omit some columns from our scan key (at least scantid), and there
are perhaps dozens of matches -- this is not relevant to
_bt_doinsert() code.

The single value strategy used by nbtsplitloc.c does a good job of
making it unlikely that _bt_check_unique()-wise duplicates will cross
leaf pages, so there will almost always be one leaf page to visit.
And, the !minusinfkey optimization ensures that the only reason we'll
move right is because of a concurrent page split, within
_bt_moveright().

The buffer lock coupling move to the right that _bt_findinsertloc()
does should be considered an edge case with all of these measures, at
least with v4 indexes.

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Heikki Linnakangas

Дата:

07 марта 2019 г., 10:41:37

On 07/03/2019 14:54, Peter Geoghegan wrote:
> On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> After staring at the first patch for bit longer, a few things started to
>> bother me:
>>
>> * The new struct is called BTScanInsert, but it's used for searches,
>> too. It makes sense when you read the README, which explains the
>> difference between "search scan keys" and "insertion scan keys", but now
>> that we have a separate struct for this, perhaps we call insertion scan
>> keys with a less confusing name. I don't know what to suggest, though.
>> "Positioning key"?
> 
> I think that insertion scan key is fine. It's been called that for
> almost twenty years. It's not like it's an intuitive concept that
> could be conveyed easily if only we came up with a new, pithy name.

Yeah. It's been like that forever, but I must confess I hadn't paid any 
attention to it, until now. I had not understood that text in the README 
explaining the difference between search and insertion scan keys, before 
looking at this patch. Not sure I ever read it with any thought. Now 
that I understand it, I don't like the "insertion scan key" name.

> BTW, I would like to hear what you think of the idea of minusinfkey
> (and the !minusinfkey optimization) specifically.

I don't understand it :-(. I guess that's valuable feedback on its own. 
I'll spend more time reading the code around that, but meanwhile, if you 
can think of a simpler way to explain it in the comments, that'd be good.

>> The new BTInsertStateData struct also
>> holds the current buffer we're considering to insert to, and a
>> 'bounds_valid' flag to indicate if the saved bounds are valid for the
>> current buffer. That way, it's more straightforward to clear the
>> 'bounds_valid' flag whenever we move right.
> 
> I'm not sure that that's an improvement. Moving right should be very
> rare with my patch. gcov shows that we never move right here anymore
> with the regression tests, or within _bt_check_unique() -- not once.

I haven't given performance much thought, really. I don't expect this 
method to be any slower, but the point of the refactoring is to make the 
code easier to understand.

>> /*
>>    * Do the insertion. First move right to find the correct page to
>>    * insert to, if necessary. If we're inserting to a non-unique index,
>>    * _bt_search() already did this when it checked if a move to the
>>    * right was required for leaf page.  Insertion scankey's scantid
>>    * would have been filled out at the time. On a unique index, the
>>    * current buffer is the first buffer containing duplicates, however,
>>    * so we may need to move right to the correct location for this
>>    * tuple.
>>    */
>> if (checkingunique || itup_key->heapkeyspace)
>>          _bt_findinsertpage(rel, &insertstate, stack, heapRel);
>>
>> newitemoff = _bt_binsrch_insert(rel, &insertstate);
>>
>> _bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
>> newitemoff, false);
>>
>> Does this make sense?
> 
> I guess you're saying this because you noticed that the for (;;) loop
> in _bt_findinsertloc() doesn't do that much in many cases, because of
> the fastpath.

The idea is that _bt_findinsertpage() would not need to know whether the 
unique checks were performed or not. I'd like to encapsulate all the 
information about the "insert position we're considering" in the 
BTInsertStateData struct. Passing 'checkingunique' as a separate 
argument violates that, because when it's set, the key means something 
slightly different.

Hmm. Actually, with patch #2, _bt_findinsertloc() could look at whether 
'scantid' is set, instead of 'checkingunique'. That would seem better. 
If it looks like 'checkingunique', it's making the assumption that if 
unique checks were not performed, then we are already positioned on the 
correct page, according to the heap TID. But looking at 'scantid' seems 
like a more direct way of getting the same information. And then we 
won't need to pass the 'checkingunique' flag as an "out-of-band" argument.

So I'm specifically suggesting that we replace this, in _bt_findinsertloc:

        if (!checkingunique && itup_key->heapkeyspace)
            break;

With this:

        if (itup_key->scantid)
            break;

And remove the 'checkingunique' argument from _bt_findinsertloc.

- Heikki

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Heikki Linnakangas

Дата:

07 марта 2019 г., 10:58:58

On 07/03/2019 15:41, Heikki Linnakangas wrote:
> On 07/03/2019 14:54, Peter Geoghegan wrote:
>> On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> After staring at the first patch for bit longer, a few things started to
>>> bother me:
>>>
>>> * The new struct is called BTScanInsert, but it's used for searches,
>>> too. It makes sense when you read the README, which explains the
>>> difference between "search scan keys" and "insertion scan keys", but now
>>> that we have a separate struct for this, perhaps we call insertion scan
>>> keys with a less confusing name. I don't know what to suggest, though.
>>> "Positioning key"?
>>
>> I think that insertion scan key is fine. It's been called that for
>> almost twenty years. It's not like it's an intuitive concept that
>> could be conveyed easily if only we came up with a new, pithy name.
> 
> Yeah. It's been like that forever, but I must confess I hadn't paid any
> attention to it, until now. I had not understood that text in the README
> explaining the difference between search and insertion scan keys, before
> looking at this patch. Not sure I ever read it with any thought. Now
> that I understand it, I don't like the "insertion scan key" name.
> 
>> BTW, I would like to hear what you think of the idea of minusinfkey
>> (and the !minusinfkey optimization) specifically.
> 
> I don't understand it :-(. I guess that's valuable feedback on its own.
> I'll spend more time reading the code around that, but meanwhile, if you
> can think of a simpler way to explain it in the comments, that'd be good.
> 
>>> The new BTInsertStateData struct also
>>> holds the current buffer we're considering to insert to, and a
>>> 'bounds_valid' flag to indicate if the saved bounds are valid for the
>>> current buffer. That way, it's more straightforward to clear the
>>> 'bounds_valid' flag whenever we move right.
>>
>> I'm not sure that that's an improvement. Moving right should be very
>> rare with my patch. gcov shows that we never move right here anymore
>> with the regression tests, or within _bt_check_unique() -- not once.
> 
> I haven't given performance much thought, really. I don't expect this
> method to be any slower, but the point of the refactoring is to make the
> code easier to understand.
> 
>>> /*
>>>     * Do the insertion. First move right to find the correct page to
>>>     * insert to, if necessary. If we're inserting to a non-unique index,
>>>     * _bt_search() already did this when it checked if a move to the
>>>     * right was required for leaf page.  Insertion scankey's scantid
>>>     * would have been filled out at the time. On a unique index, the
>>>     * current buffer is the first buffer containing duplicates, however,
>>>     * so we may need to move right to the correct location for this
>>>     * tuple.
>>>     */
>>> if (checkingunique || itup_key->heapkeyspace)
>>>           _bt_findinsertpage(rel, &insertstate, stack, heapRel);
>>>
>>> newitemoff = _bt_binsrch_insert(rel, &insertstate);
>>>
>>> _bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
>>> newitemoff, false);
>>>
>>> Does this make sense?
>>
>> I guess you're saying this because you noticed that the for (;;) loop
>> in _bt_findinsertloc() doesn't do that much in many cases, because of
>> the fastpath.
> 
> The idea is that _bt_findinsertpage() would not need to know whether the
> unique checks were performed or not. I'd like to encapsulate all the
> information about the "insert position we're considering" in the
> BTInsertStateData struct. Passing 'checkingunique' as a separate
> argument violates that, because when it's set, the key means something
> slightly different.
> 
> Hmm. Actually, with patch #2, _bt_findinsertloc() could look at whether
> 'scantid' is set, instead of 'checkingunique'. That would seem better.
> If it looks like 'checkingunique', it's making the assumption that if
> unique checks were not performed, then we are already positioned on the
> correct page, according to the heap TID. But looking at 'scantid' seems
> like a more direct way of getting the same information. And then we
> won't need to pass the 'checkingunique' flag as an "out-of-band" argument.
> 
> So I'm specifically suggesting that we replace this, in _bt_findinsertloc:
> 
>         if (!checkingunique && itup_key->heapkeyspace)
>             break;
> 
> With this:
> 
>         if (itup_key->scantid)
>             break;
> 
> And remove the 'checkingunique' argument from _bt_findinsertloc.

Ah, scratch that. By the time we call _bt_findinsertloc(), scantid has 
already been restored, even if it was not set originally when we did 
_bt_search.

My dislike here is that passing 'checkingunique' as a separate argument 
acts like a "modifier", changing slightly the meaning of the insertion 
scan key. If it's not set, we know we're positioned on the correct page. 
Otherwise, we might not be. And it's a pretty indirect way of saying 
that, as it also depends 'heapkeyspace'. Perhaps add a flag to 
BTInsertStateData, to indicate the same thing more explicitly. Something 
like "bool is_final_insertion_page; /* when set, no need to move right */".

- Heikki

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Heikki Linnakangas

Дата:

07 марта 2019 г., 11:14:23

On 05/03/2019 05:16, Peter Geoghegan wrote:
> Attached is v14, which has changes based on your feedback. 
As a quick check of the backwards-compatibility code, i.e. 
!heapkeyspace, I hacked _bt_initmetapage to force the version number to 
3, and ran the regression tests. It failed an assertion in the 
'create_index' test:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f2943f9a535 in __GI_abort () at abort.c:79
#2  0x00005622c7d9d6b4 in ExceptionalCondition 
(conditionName=0x5622c7e4cbe8 "!(_bt_check_natts(rel, key->heapkeyspace, 
page, offnum))", errorType=0x5622c7e4c62a "FailedAssertion",
     fileName=0x5622c7e4c734 "nbtsearch.c", lineNumber=511) at assert.c:54
#3  0x00005622c78627fb in _bt_compare (rel=0x5622c85afbe0, 
key=0x7ffd7a996db0, page=0x7f293d433780 "", offnum=2) at nbtsearch.c:511
#4  0x00005622c7862640 in _bt_binsrch (rel=0x5622c85afbe0, 
key=0x7ffd7a996db0, buf=4622) at nbtsearch.c:432
#5  0x00005622c7861ec9 in _bt_search (rel=0x5622c85afbe0, 
key=0x7ffd7a996db0, bufP=0x7ffd7a9976d4, access=1, 
snapshot=0x5622c8353740) at nbtsearch.c:142
#6  0x00005622c7863a44 in _bt_first (scan=0x5622c841e828, 
dir=ForwardScanDirection) at nbtsearch.c:1183
#7  0x00005622c785f8b0 in btgettuple (scan=0x5622c841e828, 
dir=ForwardScanDirection) at nbtree.c:245
#8  0x00005622c78522e3 in index_getnext_tid (scan=0x5622c841e828, 
direction=ForwardScanDirection) at indexam.c:542
#9  0x00005622c7a67784 in IndexOnlyNext (node=0x5622c83ad280) at 
nodeIndexonlyscan.c:120
#10 0x00005622c7a438d5 in ExecScanFetch (node=0x5622c83ad280, 
accessMtd=0x5622c7a67254 <IndexOnlyNext>, recheckMtd=0x5622c7a67bc9 
<IndexOnlyRecheck>) at execScan.c:95
#11 0x00005622c7a4394a in ExecScan (node=0x5622c83ad280, 
accessMtd=0x5622c7a67254 <IndexOnlyNext>, recheckMtd=0x5622c7a67bc9 
<IndexOnlyRecheck>) at execScan.c:145
#12 0x00005622c7a67c73 in ExecIndexOnlyScan (pstate=0x5622c83ad280) at 
nodeIndexonlyscan.c:322
#13 0x00005622c7a41814 in ExecProcNodeFirst (node=0x5622c83ad280) at 
execProcnode.c:445
#14 0x00005622c7a501a5 in ExecProcNode (node=0x5622c83ad280) at 
../../../src/include/executor/executor.h:231
#15 0x00005622c7a50693 in fetch_input_tuple (aggstate=0x5622c83acdd0) at 
nodeAgg.c:406
#16 0x00005622c7a529d9 in agg_retrieve_direct (aggstate=0x5622c83acdd0) 
at nodeAgg.c:1737
#17 0x00005622c7a525a9 in ExecAgg (pstate=0x5622c83acdd0) at nodeAgg.c:1552
#18 0x00005622c7a41814 in ExecProcNodeFirst (node=0x5622c83acdd0) at 
execProcnode.c:445
#19 0x00005622c7a3621d in ExecProcNode (node=0x5622c83acdd0) at 
../../../src/include/executor/executor.h:231
#20 0x00005622c7a38bd9 in ExecutePlan (estate=0x5622c83acb78, 
planstate=0x5622c83acdd0, use_parallel_mode=false, operation=CMD_SELECT, 
sendTuples=true, numberTuples=0,
     direction=ForwardScanDirection, dest=0x5622c8462088, 
execute_once=true) at execMain.c:1645
#21 0x00005622c7a36872 in standard_ExecutorRun 
(queryDesc=0x5622c83a9eb8, direction=ForwardScanDirection, count=0, 
execute_once=true) at execMain.c:363
#22 0x00005622c7a36696 in ExecutorRun (queryDesc=0x5622c83a9eb8, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
execMain.c:307
#23 0x00005622c7c357dc in PortalRunSelect (portal=0x5622c8336778, 
forward=true, count=0, dest=0x5622c8462088) at pquery.c:929
#24 0x00005622c7c3546f in PortalRun (portal=0x5622c8336778, 
count=9223372036854775807, isTopLevel=true, run_once=true, 
dest=0x5622c8462088, altdest=0x5622c8462088,
     completionTag=0x7ffd7a997d50 "") at pquery.c:770
#25 0x00005622c7c2f029 in exec_simple_query (query_string=0x5622c82cf508 
"SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 
IS NULL;") at postgres.c:1215
#26 0x00005622c7c3369a in PostgresMain (argc=1, argv=0x5622c82faee0, 
dbname=0x5622c82fac50 "regression", username=0x5622c82c81e8 "heikki") at 
postgres.c:4256
#27 0x00005622c7b8bcf2 in BackendRun (port=0x5622c82f3d80) at 
postmaster.c:4378
#28 0x00005622c7b8b45b in BackendStartup (port=0x5622c82f3d80) at 
postmaster.c:4069
#29 0x00005622c7b87633 in ServerLoop () at postmaster.c:1699
#30 0x00005622c7b86e61 in PostmasterMain (argc=3, argv=0x5622c82c6160) 
at postmaster.c:1372
#31 0x00005622c7aa9925 in main (argc=3, argv=0x5622c82c6160) at main.c:228

I haven't investigated any deeper, but apparently something's broken. 
This was with patch v14, without any further changes.

- Heikki

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

07 марта 2019 г., 11:23:59

On Thu, Mar 7, 2019 at 12:14 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I haven't investigated any deeper, but apparently something's broken.
> This was with patch v14, without any further changes.

Try it with my patch -- attached.

I think that you missed that the INCLUDE indexes thing within
nbtsort.c needs to be changed back.

-- 
Peter Geoghegan

Вложения

0008-DEBUG-Force-version-3-artificially.patch

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

07 марта 2019 г., 11:37:43

On Wed, Mar 6, 2019 at 11:41 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > BTW, I would like to hear what you think of the idea of minusinfkey
> > (and the !minusinfkey optimization) specifically.
>
> I don't understand it :-(. I guess that's valuable feedback on its own.
> I'll spend more time reading the code around that, but meanwhile, if you
> can think of a simpler way to explain it in the comments, that'd be good.

Here is another way of explaining it:

When I drew you that picture while we were in Lisbon, I mentioned to
you that the patch sometimes used a sentinel scantid value that was
greater than minus infinity, but less than any real scantid. This
could be used to force an otherwise-equal-to-pivot search to go left
rather than uselessly going right. I explained this about 30 minutes
in, when I was drawing you a picture.

Well, that sentinel heap TID thing doesn't exist any more, because it
was replaced by the !minusinfkey optimization, which is a
*generalization* of the same idea, which extends it to all columns
(not just the heap TID column). That way, you never have to go to two
pages just because you searched for a value that happened to be at the
"right at the edge" of a leaf page.

Page deletion wants to assume that truncated attributes from the high
key of the page being deleted have actual negative infinity values --
negative infinity is a value, just like any other, albeit one that can
only appear in pivot tuples. This is simulated by VACUUM using
"minusinfkey = true". We go left in the parent, not right, and land on
the correct leaf page. Technically we don't compare the negative
infinity values in the pivot to the negative infinity values in the
scankey, but we return 0 just as if we had, and found them equal.
Similarly, v3 indexes specify "minusinfkey = true" in all cases,
because they always want to go left -- just like in old Postgres
versions. They don't have negative infinity values (matches can be on
either side of the all-equal pivot, so they must go left).

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

07 марта 2019 г., 11:40:36

On Thu, Mar 7, 2019 at 12:37 AM Peter Geoghegan <pg@bowt.ie> wrote:
> When I drew you that picture while we were in Lisbon, I mentioned to
> you that the patch sometimes used a sentinel scantid value that was
> greater than minus infinity, but less than any real scantid. This
> could be used to force an otherwise-equal-to-pivot search to go left
> rather than uselessly going right. I explained this about 30 minutes
> in, when I was drawing you a picture.

I meant the opposite: it could be used to go right, instead of going
left when descending the tree and unnecessarily moving right on the
leaf level.

As I said, moving right on the leaf level (rather than during the
descent) should only happen when it's necessary, such as when there is
a concurrent page split. It shouldn't happen reliably when searching
for the same value, unless there really are matches across multiple
leaf pages, and that's just what we have to do.

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

07 марта 2019 г., 12:06:24

On Wed, Mar 6, 2019 at 11:41 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I don't understand it :-(. I guess that's valuable feedback on its own.
> I'll spend more time reading the code around that, but meanwhile, if you
> can think of a simpler way to explain it in the comments, that'd be good.

One more thing on this: If you force bitmap index scans (by disabling
index-only scans and index scans with the "enable_" GUCs), then you
get EXPLAIN (ANALYZE, BUFFERS) instrumentation for the index alone
(and the heap, separately). No visibility map accesses, which obscure
the same numbers for a similar index-only scan.

You can then observe that most searches of a single value will touch
the bare minimum number of index pages. For example, if there are 3
levels in the index, you should access only 3 index pages total,
unless there are literally hundreds of matches, and cannot avoid
storing them on more than one leaf page. You'll see that the scan
touches the minimum possible number of index pages, because of:

* Many duplicates strategy. (Not single value strategy, which I
incorrectly mentioned in relation to this earlier.)

* The !minusinfykey optimization, which ensures that we go to the
right of an otherwise-equal pivot tuple in an internal page, rather
than left.

* The "continuescan" high key patch, which ensures that the scan
doesn't go to the right from the first leaf page to try to find even
more matches. The high key on the same leaf page will indicate that
the scan is over, without actually visiting the sibling. (Again, I'm
assuming that your search is for a single value.)

-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

08 марта 2019 г., 07:22:54

On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I came up with the attached (against master), which addresses the 2nd
> and 3rd points. I added a whole new BTInsertStateData struct, to hold
> the binary search bounds. BTScanInsert now only holds the 'scankeys'
> array, and the 'nextkey' flag. The new BTInsertStateData struct also
> holds the current buffer we're considering to insert to, and a
> 'bounds_valid' flag to indicate if the saved bounds are valid for the
> current buffer. That way, it's more straightforward to clear the
> 'bounds_valid' flag whenever we move right.
>
> I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
> search like _bt_binsrch does, but the bounds caching is only done in
> _bt_binsrch_insert. Seems more clear to have separate functions for them
> now, even though there's some duplication.

Attached is v15, which does not yet integrate these changes. However,
it does integrate earlier feedback that you posted for v14. I also
cleaned up some comments within nbtsplitloc.c.

I would like to work through these other items with you
(_bt_binsrch_insert() and so on), but I think that it would be helpful
if you made an effort to understand the minusinfkey stuff first. I
spent a lot of time improving the explanation of that within
_bt_compare(). It's important.

The !minusinfkey optimization is more than just a "nice to have".
Suffix truncation makes pivot tuples less restrictive about what can
go on each page, but that might actually hurt performance if we're not
also careful to descend directly to the leaf page where matches will
first appear (rather than descending to a page to its left). If we
needlessly descend to a page that's to the left of the leaf page we
really ought to go straight to, then there are cases that are
regressed rather than helped -- especially cases where splits use the
"many duplicates" strategy. You continue to get correct answers when
the !minusinfkey optimization is ripped out, but it seems almost
essential that we include it. While it's true that we've always had to
move left like this, it's also true that suffix truncation will make
it happen much more frequently. It would happen (without the
!minusinfkey optimization) most often where suffix truncation makes
pivot tuples smallest.

Once you grok the minusinfkey stuff, then we'll be in a better
position to work through the feedback about _bt_binsrch_insert() and
so on, I think. You may lack all of the context of how the second
patch goes on to use the new insertion scan key struct, so it will
probably make life easier if we're both on the same page. (Pun very
much intended.)

Thanks again!
-- 
Peter Geoghegan

On Sun, Mar 10, 2019 at 1:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
> > > they too have a heap TID attribute. nbtsearch.c code is not allowed to
> > > rely on its value, though, and must use
> > > minusinfkey/searching_for_pivot_tuple semantics (relying on its value
> > > being minus infinity is still relying on its value being something).
> >
> > Yeah. I find that's a complicated way to think about it. My mental model
> > is that v4 indexes store heap TIDs, and every tuple is unique thanks to
> > that. But on v3, we don't store heap TIDs, and duplicates are possible.
>
> I'll try it that way, then.

Attached is v16, which does it that way instead. There are simpler
comments, still located within _bt_compare(). These are based on your
suggested wording, with some changes. I think that I prefer it this
way too. Please let me know what you think.

Other changes:

* nbtsplitloc.c failed to consider the full range of values in the
split interval when deciding perfect penalty. It considered from the
middle to the left or right edge, rather than from the left edge to
the right edge. This didn't seem to really effect the quality of its
decisions very much, but it was still wrong. This is fixed by a new
function that determines the left and right edges of the split
interval -- _bt_interval_edges().

* We now record the smallest observed tuple during our pass over the
page to record split points. This is used by internal page splits, to
get a more useful "perfect penalty", saving cycles in the common case
where there isn't much variability in the size of tuples on the page
being split. The same field is used within the "split after new item"
optimization as a further crosscheck -- it's now impossible to fool it
into thinking that the page has equisized tuples.

The regression that I mentioned earlier isn't in pgbench type
workloads (even when the distribution is something more interesting
that the uniform distribution default). It is only in workloads with
lots of page splits and lots of index churn, where we get most of the
benefit of the patch, but also where the costs are most apparent.
Hopefully it can be fixed, but if not I'm inclined to think that it's
a price worth paying. This certainly still needs further analysis and
discussion, though. This revision of the patch does not attempt to
address that problem in any way.

-- 
Peter Geoghegan

Вложения

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

От

Peter Geoghegan

Дата:

12 марта 2019 г., 05:47:29

On Sun, Mar 10, 2019 at 5:17 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The regression that I mentioned earlier isn't in pgbench type
> workloads (even when the distribution is something more interesting
> that the uniform distribution default). It is only in workloads with
> lots of page splits and lots of index churn, where we get most of the
> benefit of the patch, but also where the costs are most apparent.
> Hopefully it can be fixed, but if not I'm inclined to think that it's
> a price worth paying. This certainly still needs further analysis and
> discussion, though. This revision of the patch does not attempt to
> address that problem in any way.

I believe that I've figured out what's going on here.

At first, I thought that this regression was due to the cycles that
have been added to page splits, but that doesn't seem to be the case
at all. Nothing that I did to make page splits faster helped (e.g.
temporarily go back to doing them "bottom up" made no difference). CPU
utilization was consistently slightly *higher* with the master branch
(patch spent slightly more CPU time idle). I now believe that the
problem is with LWLock/buffer lock contention on index pages, and that
that's an inherent cost with a minority of write-heavy high contention
workloads. A cost that we should just accept.

Making the orderline primary key about 40% smaller increases
contention when BenchmarkSQL is run with this particular
configuration. The latency for the NEW_ORDER transaction went from
~4ms average on master to ~5ms average with the patch, while the
latency for other types of transactions was either unchanged or
improved. It's noticeable, but not that noticeable. This needs to be
put in context. The final transactions per minute for this
configuration was 250,000, with a total of only 100 warehouses. What
this boils down to is that the throughput per warehouse is about 8000%
of the maximum valid NOPM specified by the TPC-C spec [1]. In other
words, the database is too small relative to the machine, by a huge
amount, making the result totally and utterly invalid if you go on
what the TPC-C spec says. This exaggerates the LWLock/buffer lock
contention on index pages.

TPC-C is supposed to simulate a real use case with a plausible
configuration, but the details here are totally unrealistic. For
example, there are 3 million customers here (there are always 30k
customers per warehouse). 250k TPM means that there were about 112k
new orders per minute. It's hard to imagine a population of 3 million
customers making 112k orders per minute. That's over 20 million orders
in the first 3 hour long run that I got these numbers from. Each of
these orders has an average of about 10 line items. These people must
be very busy, and must have an awful lot of storage space in their
homes! (There are various other factors here, such as skew, and the
details will never be completely realistic anyway, but you take my
point. TPC-C is *designed* to be a realistic distillation of a real
use case, going so far as to require usable GUI input terminals when
evaluating a formal benchmark submission.)

The benchmark that I posted in mid-February [2] (which showed better
performance across the board) was much closer to what the TPC-C spec
requires -- that was only ~400% of maximum valid NOPM (the
BenchmarkSQL html reports will tell you this if you download the
archive I posted), and had 2,000 warehouses. TPC-C is *supposed* to be
I/O bound, and I/O bound workloads are what the patch helps with the
most. The general idea with TPC-C's NOPM is that you're required to
increase the number of warehouses as throughput increases. This stops
you from getting an unrealistically favorable result by churning
through a small amount of data, from the same few warehouses.

The only benchmark that I ran that actually satisfied TPC-C's NOPM
requirements had a total of 7,000 warehouses, and was almost a full
terabyte in size on the master branch. This was run on an i3.4xlarge
high I/O AWS ec2 instance. That was substantially I/O bound, and had
an improvement in throughput that was very similar to the mid-February
results which came from my home server -- we see a ~7.5% increase in
transaction throughput after a few hours. I attach a graph of block
device reads/writes for the second 4 hour run for this same 7,000
warehouse benchmark (master and patch). This shows a substantial
reduction in I/O according to OS-level instrumentation. (Note that the
same FS/logical block device was used for both WAL and database
files.)

In conclusion: I think that this regression is a cost worth accepting.
The regression in throughput is relatively small (2% - 3%), and the
NEW_ORDER transaction seems to be the only problem (NEW_ORDER happens
to be used for 45% of all transactions with TPC-C, and inserts into
the largest index by far, without reading much). The patch overtakes
master after a few hours anyway -- the patch will still win after
about 6 hours, once the database gets big enough, despite all the
contention. As I said, I think that we see a regression *because* the
indexes are much smaller, not in spite of the fact that they're
smaller. The TPC-C/BenchmarkSQL indexes never fail to be about 40%
smaller than they are on master, no matter the details, even after
many hours.

I'm not seeing the problem when pgbench is run with a small scale
factor but with a high client count. pgbench doesn't have the benefit
of much smaller indexes, so it also doesn't bear any cost when
contention is ramped up. The pgbench_accounts primary key (which is by
far the largest index) is *precisely* the same size as it is on
master, though the other indexes do seem to be a lot smaller. They
were already tiny, though. OTOH, the TPC-C NEW_ORDER transaction does
a lot of straight inserts, localized by warehouse, with skewed access.

[1] https://youtu.be/qYeRHK6oq7g?t=1340
[2] https://www.postgresql.org/message-id/CAH2-WzmsK-1qVR8xC86DXv8U0cHwfPcuH6hhA740fCeEu3XsVg@mail.gmail.com

--
Peter Geoghegan

On Tue, Mar 12, 2019 at 11:40 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think it's pretty clear that we have to view that as acceptable. I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.

I found this analysis of bloat in the production database of Gitlab in
January 2019 fascinating:

https://about.gitlab.com/handbook/engineering/infrastructure/blueprint/201901-postgres-bloat/

They determined that their tables consisted of about 2% bloat, whereas
indexes were 51% bloat (determined by running VACUUM FULL, and
measuring how much smaller indexes and tables were afterwards). That
in itself may not be that telling. What is telling is the index bloat
disproportionately affects certain kinds of indexes. As they put it,
"Indexes that do not serve a primary key constraint make up 95% of the
overall index bloat". In other words, the vast majority of all bloat
occurs within non-unique indexes, with most remaining bloat in unique
indexes.

One factor that could be relevant is that unique indexes get a lot
more opportunistic LP_DEAD killing. Unique indexes don't rely on the
similar-but-distinct kill_prior_tuple optimization -- a lot more
tuples can be killed within _bt_check_unique() than with
kill_prior_tuple in realistic cases. That's probably not really that
big a factor, though. It seems almost certain that "getting tired" is
the single biggest problem.

The blog post drills down further, and cites the examples of several
*extremely* bloated indexes on a single-column, which is obviously low
cardinality. This includes an index on a boolean field, and an index
on an enum-like text field. In my experience, having many indexes like
that is very common in real world applications, though not at all
common in popular benchmarks (with the exception of TPC-E).

It also looks like they may benefit from the "split after new item"
optimization, at least among the few unique indexes that were very
bloated, such as merge_requests_pkey:

https://gitlab.com/snippets/1812014

Gitlab is open source, so it should be possible to confirm my theory
about the "split after new item" optimization (I am certain about
"getting tired", though). Their schema is defined here:

https://gitlab.com/gitlab-org/gitlab-ce/blob/master/db/schema.rb

I don't have time to confirm all this right now, but I am pretty
confident that they have both problems. And almost as confident that
they'd notice substantial benefits from this patch series.
--
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Heikki Linnakangas

Дата:

14 марта 2019 г., 12:20:31

On 13/03/2019 03:28, Peter Geoghegan wrote:
> On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
>> search like _bt_binsrch does, but the bounds caching is only done in
>> _bt_binsrch_insert. Seems more clear to have separate functions for them
>> now, even though there's some duplication.
> 
>> /*
>>    * Do the insertion. First move right to find the correct page to
>>    * insert to, if necessary. If we're inserting to a non-unique index,
>>    * _bt_search() already did this when it checked if a move to the
>>    * right was required for leaf page.  Insertion scankey's scantid
>>    * would have been filled out at the time. On a unique index, the
>>    * current buffer is the first buffer containing duplicates, however,
>>    * so we may need to move right to the correct location for this
>>    * tuple.
>>    */
>> if (checkingunique || itup_key->heapkeyspace)
>>          _bt_findinsertpage(rel, &insertstate, stack, heapRel);
>>
>> newitemoff = _bt_binsrch_insert(rel, &insertstate);
> 
>> The attached new version simplifies this, IMHO. The bounds and the
>> current buffer go together in the same struct, so it's easier to keep
>> track whether the bounds are valid or not.
> 
> Now that you have a full understanding of how the negative infinity
> sentinel values work, and how page deletion's leaf page search and
> !heapkeyspace indexes need to be considered, I think that we should
> come back to this _bt_binsrch()/_bt_findsplitloc() stuff. My sense is
> that you now have a full understanding of all the subtleties of the
> patch, including those that that affect unique index insertion. That
> will make it much easier to talk about these unresolved questions.
> 
> My current sense is that it isn't useful to store the current buffer
> alongside the binary search bounds/hint. It'll hardly ever need to be
> invalidated, because we'll hardly ever have to move right within
> _bt_findsplitloc() when doing unique index insertion (as I said
> before, the regression tests *never* have to do this according to
> gcov).

It doesn't matter how often it happens, the code still needs to deal 
with it. So let's try to make it as readable as possible.

> We're talking about a very specific set of conditions here, so
> I'd like something that's lightweight and specialized. I agree that
> the savebinsrch/restorebinsrch fields are a bit ugly, though. I can't
> think of anything that's better offhand. Perhaps you can suggest
> something that is both lightweight, and an improvement on
> savebinsrch/restorebinsrch.

Well, IMHO holding the buffer and the bounds in the new struct is more 
clean than the savebinsrc/restorebinsrch flags. That's exactly why I 
suggested it. I don't know what else to suggest. I haven't done any 
benchmarking, but I doubt there's any measurable difference.

> I'm of the opinion that having a separate _bt_binsrch_insert() does
> not make anything clearer. Actually, I think that saving the bounds
> within the original _bt_binsrch() makes the design of that function
> clearer, not less clear. It's all quite confusing at the moment, given
> the rightmost/!leaf/page empty special cases. Seeing how the bounds
> are reused (or not reused) outside of _bt_binsrch() helps with that.

Ok. I think having some code duplication is better than one function 
that tries to do many things, but I'm not wedded to that.

- Heikki

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Heikki Linnakangas

Дата:

14 марта 2019 г., 14:00:25

On 13/03/2019 03:28, Peter Geoghegan wrote:
> It would be great if you could take a look at the 'Add high key
> "continuescan" optimization' patch, which is the only one you haven't
> commented on so far (excluding the amcheck "relocate" patch, which is
> less important). I can put that one off for a while after the first 3
> go in. I will also put off the "split after new item" commit for at
> least a week or two. I'm sure that the idea behind the "continuescan"
> patch will now seem pretty obvious to you -- it's just taking
> advantage of the high key when an index scan on the leaf level (which
> uses a search style scankey, not an insertion style scankey) looks
> like it may have to move to the next leaf page, but we'd like to avoid
> it where possible. Checking the high key there is now much more likely
> to result in the index scan not going to the next page, since we're
> more careful when considering a leaf split point these days. The high
> key often looks like the items on the page to the right, not the items
> on the same page.

Oh yeah, that makes perfect sense. I wonder why we haven't done it like 
that before? The new page split logic makes it more likely to help, but 
even without that, I don't see any downside.

I find it a bit confusing, that the logic is now split between 
_bt_checkkeys() and _bt_readpage(). For a forward scan, _bt_readpage() 
does the high-key check, but the corresponding "first-key" check in a 
backward scan is done in _bt_checkkeys(). I'd suggest moving the logic 
completely to _bt_readpage(), so that it's in one place. With that, 
_bt_checkkeys() can always check the keys as it's told, without looking 
at the LP_DEAD flag. Like the attached.

- Heikki

Вложения

v16-heikki-0001-Add-high-key-continuescan-optimization.patch

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

14 марта 2019 г., 21:51:03

On Thu, Mar 14, 2019 at 4:00 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Oh yeah, that makes perfect sense. I wonder why we haven't done it like
> that before? The new page split logic makes it more likely to help, but
> even without that, I don't see any downside.

The only downside is that we spend a few extra cycles, and that might
not work out. This optimization would have always worked, though. The
new page split logic clearly makes checking the high key in the
"continuescan" path an easy win.

> I find it a bit confusing, that the logic is now split between
> _bt_checkkeys() and _bt_readpage(). For a forward scan, _bt_readpage()
> does the high-key check, but the corresponding "first-key" check in a
> backward scan is done in _bt_checkkeys(). I'd suggest moving the logic
> completely to _bt_readpage(), so that it's in one place. With that,
> _bt_checkkeys() can always check the keys as it's told, without looking
> at the LP_DEAD flag. Like the attached.

I'm convinced. I'd like to go a bit further, and also pass tupnatts to
_bt_checkkeys().  That makes it closer to the similar
_bt_check_rowcompare() function that _bt_checkkeys() must sometimes
call. It also allows us to only call BTreeTupleGetNAtts() for the high
key, while passes down a generic, loop-invariant
IndexRelationGetNumberOfAttributes() value for non-pivot tuples.

I'll do it that way in the next revision.

Thanks
-- 
Peter Geoghegan

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Peter Geoghegan

Дата:

16 марта 2019 г., 07:16:23

On Thu, Mar 14, 2019 at 2:21 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> It doesn't matter how often it happens, the code still needs to deal
> with it. So let's try to make it as readable as possible.

> Well, IMHO holding the buffer and the bounds in the new struct is more
> clean than the savebinsrc/restorebinsrch flags. That's exactly why I
> suggested it. I don't know what else to suggest. I haven't done any
> benchmarking, but I doubt there's any measurable difference.

Fair enough. Attached is v17, which does it using the approach taken
in your earlier prototype. I even came around to your view on
_bt_binsrch_insert() -- I kept that part, too. Note, however, that I
still pass checkingunique to _bt_findinsertloc(), because that's a
distinct condition to whether or not bounds were cached (they happen
to be the same thing right now, but I don't want to assume that).

This revision also integrates your approach to the "continuescan"
optimization patch, with the small tweak I mentioned yesterday (we
also pass ntupatts). I also prefer this approach.

I plan on committing the first few patches early next week, barring
any objections, or any performance problems noticed during an
additional, final round of performance validation. I won't expect
feedback from you until Monday at the earliest. It would be nice if
you could take a look at the amcheck "relocate" patch. My intention is
to push patches up to and including the amcheck "relocate" patch on
the same day (I'll leave a few hours between the first two patches, to
confirm that the first patch doesn't break the buildfarm).

BTW, my multi-day, large BenchmarkSQL benchmark continues, with some
interesting results. The first round of 12 hour long runs showed the
patch nearly 6% ahead in terms of transaction throughput, with a
database that's almost 1 terabyte. The second round, which completed
yesterday and reuses the database initialized for the first round
showed that the patch had 10.7% higher throughput. That's a new record
for the patch. I'm going to leave this benchmark running for a few
more days, at least until it stops being interesting. I wonder how
long it will be before the master branch throughput stops declining
relative to throughput with the patched version. I expect that the
master branch will reach "index bloat saturation point" sooner or
later. Indexes in the patch's data directory continue to get larger,
as expected, but the amount of bloat accumulated over time is barely
noticeable (i.e. the pages are packed tight with tuples, which barely
declines over time).

This version of the patch series has attributions/credits at the end
of the commit messages. I have listed you as a secondary author on a
couple of the patches, where code was lifted from your feedback
patches. Let me know if you think that I have it right.

Thanks
-- 
Peter Geoghegan

On Sat, Mar 16, 2019 at 1:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > Actually, how about "rootsearch", or "rootdescend"? You're supposed to
> > hyphenate "re-find", and so it doesn't really work as a function
> > argument name.
>
> Works for me.

Attached is v18 of patch series, which calls the new verification
option "rootdescend" verification.

As previously stated, I intend to commit the first 4 patches (up to
and including this amcheck "rootdescend" patch) during the workday
tomorrow, Pacific time.

Other changes:

* Further consolidation of the nbtree.h comments from second patch, so
that the on-disk representation overview that you requested a while
back has all the details. A couple of these were moved from macro
comments also in nbtree.h, and were missed earlier.

* Tweaks to comments on _bt_binsrch_insert() and its callers.
Streamlined to reflect the fact that it doesn't need to talk so much
about cases that only apply to internal pages. Explicitly stated
requirements for caller.

* Made _bt_binsrch_insert() set InvalidOffsetNumber for bounds in case
were valid bounds cannot be established initially. This seemed like a
good idea.

* A few more defensive assertion were added to nbtinsert.c (also
related to _bt_binsrch_insert()).

Thanks
-- 
Peter Geoghegan

Вложения

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

От

Heikki Linnakangas

Дата:

18 марта 2019 г., 14:59:45

On 18/03/2019 02:59, Peter Geoghegan wrote:
> On Sat, Mar 16, 2019 at 1:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Actually, how about "rootsearch", or "rootdescend"? You're supposed to
>>> hyphenate "re-find", and so it doesn't really work as a function
>>> argument name.
>>
>> Works for me.
> 
> Attached is v18 of patch series, which calls the new verification
> option "rootdescend" verification.

Thanks!

I'm getting a regression failure from the 'create_table' test with this:

> --- /home/heikki/git-sandbox/postgresql/src/test/regress/expected/create_table.out      2019-03-11 14:41:41.382759197
+0200
> +++ /home/heikki/git-sandbox/postgresql/src/test/regress/results/create_table.out       2019-03-18 13:49:49.480249055
+0200
> @@ -413,18 +413,17 @@
>         c text,
>         d text
>  ) PARTITION BY RANGE (a oid_ops, plusone(b), c collate "default", d collate "C");
> +ERROR:  function plusone(integer) does not exist
> +HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

Are you seeing that?

Looking at the patches 1 and 2 again:

I'm still not totally happy with the program flow and all the conditions 
in _bt_findsplitloc(). I have a hard time following which codepaths are 
taken when. I refactored that, so that there is a separate copy of the 
loop for V3 and V4 indexes. So, when the code used to be something like 
this:

_bt_findsplitloc(...)
{
     ...

     /* move right, if needed */
     for(;;)
     {
         /*
          * various conditions for when to stop. Different conditions
          * apply depending on whether it's a V3 or V4 index.
          */
     }

     ...
}

it is now:

_bt_findsplitloc(...)
{
     ...

     if (heapkeyspace)
     {
         /*
          * If 'checkingunique', move right to the correct page.
          */
         for (;;)
         {
             ...
         }
     }
     else
     {
         /*
          * Move right, until we find a page with enough space or "get
          * tired"
          */
         for (;;)
         {
             ...
         }
     }

     ...
}

I think this makes the logic easier to understand. Although there is 
some commonality, the conditions for when to move right on a V3 vs V4 
index are quite different, so it seems better to handle them separately. 
There is some code duplication, but it's not too bad. I moved the common 
code to step to the next page to a new helper function, _bt_stepright(), 
which actually seems like a good idea in any case.

See attached patches with those changes, plus some minor comment 
kibitzing. It's still failing the 'create_table' regression test, though.

- Heikki

PS. The commit message of the first patch needs updating, now that 
BTInsertState is different from BTScanInsert.

On Fri, Mar 22, 2019 at 2:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Mar 21, 2019 at 10:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > I'll likely push the remaining two patches on Sunday or Monday.
>
> I noticed that if I initidb and run "make installcheck" with and
> without the "split after new tuple" optimization patch, the largest
> system catalog indexes shrink quite noticeably:

I pushed this final patch a week ago, as commit f21668f3, concluding
work on integrating the patch series.

I have some closing thoughts that I would like to close out the
project on. I was casually discussing this project over IM with Robert
the other day. I was asked a question I'd often asked myself about the
"split after new item" heuristics: What if you're wrong? What if some
"black swan" type workload fools your heuristics into bloating an
index uncontrollably?

I gave an answer to his question that may have seemed kind of
inscrutable. My intuition about the worst case for the heuristics is
based on its similarity to the worst case for quicksort. Any
real-world instance of quicksort going quadratic is essentially a case
where we *consistently* do the wrong thing when selecting a pivot. A
random pivot selection will still perform reasonably well, because
we'll still choose the median pivot on average. A malicious actor will
always be able to fool any quicksort implementation into going
quadratic [1] in certain circumstances. We're defending against
Murphy, not Machiavelli, though, so that's okay.

I think that I can produce a more tangible argument than this, though.
Attached patch removes every heuristic that limits the application of
the "split after new item" optimization (it doesn't force the
optimization in the case of rightmost splits, or in the case where the
new item happens to be first on the page, since caller isn't prepared
for that). This is an attempt to come up with a wildly exaggerated
worst case. Nevertheless, the consequences are not actually all that
bad. Summary:

* The "UK land registry" test case that I leaned on a lot for the
patch has a final index that's about 1% larger. However, it was about
16% smaller compared to Postgres without the patch, so this is not a
problem.

* Most of the TPC-C indexes are actually slightly smaller, because we
didn't quite go as far as we could have (TPC-C strongly rewards this
optimization). 8 out of the 10 indexes are either smaller or
unchanged. The customer name index is about 28% larger, though. The
oorder table index is also about 28% larger.

* TPC-E never benefits from the "split after new item" optimization,
and yet the picture isn't so bad here either. The holding history PK
is about 40% bigger, which is quite bad, and the biggest regression
overall. However, in other affected cases indexes are about 15%
larger, which is not that bad.

Also attached are the regressions from my test suite in the form of
diff files -- these are the full details of the regression, just in
case that's interesting to somebody.

This isn't the final word. I'm not asking anybody to accept with total
certainty that there can never be a "black swan" workload that the
heuristics consistently mishandle, leading to pathological
performance. However, I think it's fair to say that the risk of that
happening has been managed well. The attached test patch literally
removes any restraint on applying the optimization, and yet we
arguably do no worse than Postgres 11 would overall.

Once again, I would like to thank my collaborators for all their help,
especially Heikki.

[1] https://www.cs.dartmouth.edu/~doug/mdmspe.pdf
-- 
Peter Geoghegan

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Making all nbtree entries unique by having heap TIDs participate in comparisons

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения