Discussion: Enabling B-Tree deduplication by default

Enabling B-Tree deduplication by default

From:
Peter Geoghegan
Date:
There are some outstanding questions about how B-Tree deduplication
[1] should be configured, and whether or not it should be enabled by
default. I'm starting this new thread in the hopes of generating
discussion on these high level questions.

The commit message of the latest version of the patch has a reasonable
summary of its overall design, that might be worth reviewing before
reading on:

https://www.postgresql.org/message-id/CAH2-Wz=Tr6mxMsKRmv_=9-05_O9QWqOzQ8GweRV2DXS6+Y38QQ@mail.gmail.com

(If you want to go deeper than that, check out the nbtree README changes.)

B-Tree deduplication comes in two basic varieties:

* Unique index deduplication.

* Non-unique index deduplication.

Each variety works in essentially the same way. However, they differ
in how and when deduplication is actually applied. We can infer quite
a lot about versioning within a unique index, and we know that version
churn is the only problem that we should try to address -- it's not
really about space efficiency, since we know that there won't be any
duplicates in the long run. Also, workloads that have a lot of version
churn in unique indexes are generally quite reliant on LP_DEAD bit
setting within _bt_check_unique() (this is not to be confused with the
similar kill_prior_tuple LP_DEAD bit optimization) -- we need to be
sensitive to that.

Despite the fact that there are many more similarities than
differences here, I would like to present each variety as a different
thing to users (actually, I don't really want users to have to think
about unique index deduplication at all).

I believe that the best path forward for users is to make
deduplication in non-unique indexes a user-visible feature with
documented knobs (a GUC and an index storage parameter), while leaving
unique index deduplication as an internal thing that is barely
documented in the sgml user docs (I'll have a paragraph about it in
the B-Tree internals chapter). Unique index deduplication won't care
about the GUC, and will only trigger a deduplication pass when the
incoming tuple is definitely a duplicate of an existing tuple (this
signals version churn cheaply but reliably). The index storage
parameter will be respected with unique indexes, but only as a
debugging option -- our presumption is that nobody will want to
disable deduplication in unique indexes, since leaving it on has
virtually no downside (because we know exactly when to trigger it, and
when not to). Unique index deduplication is an internal thing that
users benefit from, but aren't really aware of, much like
opportunistic deletion of LP_DEAD items. (A secondary benefit of this
approach is that we don't have to have an awkward section in the
documentation that explains why deduplication in unique indexes isn't
actually an oxymoron.)
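
To make the proposed interface concrete, here is a rough sketch of how
the two knobs might be exercised from SQL. This is only an illustration:
deduplicate_btree_items is the GUC name proposed below, deduplicate_items
stands in for the index storage parameter (the name used later in this
thread), and foo / foo_val_idx are made-up names.

-- Proposed GUC; only affects non-unique indexes:
SET deduplicate_btree_items = off;

-- Proposed index storage parameter; respected by unique indexes too,
-- but there it is really just a debugging option:
CREATE INDEX foo_val_idx ON foo (val) WITH (deduplicate_items = off);
ALTER INDEX foo_val_idx SET (deduplicate_items = on);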

Thoughts on presenting unique index deduplication as a separate
internal-only optimization that doesn't care about the GUC?

I also want to talk about a related but separate topic. I propose that
the deduplicate_btree_items GUC (which only affects non-unique
indexes) be turned on by default in Postgres 13. This can be reviewed
at the end of the beta period, just like it was with parallelism and
with JIT. Note that this means that we'd opportunistically perform a
deduplication pass at the point that we usually have to split the
page, even though in general there is no reason to think that that
will work out. (We have no better way of applying deduplication than
"try it and see", unless it's a unique index that has the version
churn trigger heuristic that I just described.)

Append-only tables with multiple non-unique indexes that happen to
have only unique key values will pay a cost without seeing a benefit.
I believe that I saw a 3% throughput cost when I assessed this using
something called the "insert benchmark" [2] a couple of months ago
(this is maintained by Mark Callaghan, the Facebook MyRocks guy). I
need to do more work on quantifying the cost with a recent version of
the patch, especially the cost of only having one LP_DEAD bit for an
entire posting list tuple, but in general I think that this is likely
to be worth it. Consider these points in favor of enabling
deduplication by default:

* This insert-only workload is something that Postgres does
particularly well with -- append-only tables are reported to have
greater throughput in Postgres than in other comparable RDBMSs.
Presumably this is because we don't need UNDO logs, and don't do index
key locking in the same way (apparently even Oracle has locks in
indexes that protect logical transaction state). We are paying a small
cost by enabling deduplication by default, but it is paid in an area
where we already do particularly well.

* Deduplication can be turned off at any time. nbtree posting list
tuples are really not like the ones in GIN, in that we're not
maintaining them over time. A deduplication pass fundamentally works
by generating an alternative physical representation for the same
logical contents -- the resulting posting list tuples work in almost
the same way as the original tuples.

While it's true that a posting list split will be needed on occasion,
posting list splits work in a way that allows almost all code to
pretend that the incoming item never overlapped with an existing
posting list in the first place (we conceptually "rewrite" an incoming
tuple insert to not overlap). So this apparent exception isn't much of
an exception.

* Posting list tuples do not need to be decompressed, so reads pay no penalty.

* Even if all keys are unique, in practice UPDATEs will tend to
generate duplicates. Even with a HOT safe workload. Deduplicating
physical tuples for the logical row can help in non-unique indexes
just as much as in unique indexes.

* Only the kill_prior_tuple optimization can set LP_DEAD bits within a
non-unique index. There is evidence that kill_prior_tuple is much less
valuable than the _bt_check_unique() stuff -- it took years before
anybody noticed that 9.5 had significantly regressed the
kill_prior_tuple optimization [3].

Setting LP_DEAD bits is useful primarily because it prevents page
splits, but we always prefer removing tuples whose LP_DEAD bits are set
over performing deduplication, and a deduplication pass can only occur
when the only alternative is to split the page. So, workloads that pay
a cost due to not being able to set LP_DEAD bits in a granular fashion
will still almost always do fewer splits anyway -- which is what really
mattered all along. (Not having to visit the heap due to the LP_DEAD
bit being set in some index tuple is also nice, but standbys never get
that benefit, so it's hard to see it as more than a nice-to-have.)

* The benefits are huge with a workload consisting of several indexes
and not many HOT updates. As measured by index size, transaction
throughput, query latency, and the ongoing cost of vacuuming -- you
name it.

This is an area where we're not so competitive with other RDBMSs
currently. In general, it seems impossible to target workloads like
this without accepting some wasted cycles. Low cardinality indexes are
usually indexes that follow a power law distribution, where some leaf
pages consist of unique values, while others consist of only a single
value -- it's very hard to imagine an algorithm that works much better
than always attempting deduplication at the point where it looks like
we have to split the page. I'll go into an example of a workload like this, where
deduplication does very well.

Recent versions of the patch manage about a 60% increase in
transaction throughput with pgbench, scale 1000, a skewed
pgbench_accounts UPDATE + SELECT (:aid) distribution, and two "extra"
indexes on pgbench_accounts (an index on "abalance", and another on
"bid"). This was with 16 clients, over 4 hours on my home workstation,
which is nothing special. The extra indexes are consistently about 3x
smaller, and pgbench_branches_pkey and pgbench_tellers_pkey are about
4x - 5x smaller. This is fairly representative of some workloads,
especially workloads that we're known to not do so well on.
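
For reference, the two extra indexes from that test can be recreated as
shown below (this assumes a standard "pgbench -i -s 1000" initialization;
the skewed UPDATE + SELECT transaction itself comes from a custom pgbench
script that isn't reproduced here):

-- Extra indexes on pgbench_accounts used in the benchmark described above:
CREATE INDEX ON pgbench_accounts (abalance);
CREATE INDEX ON pgbench_accounts (bid);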

It's certainly possible for index size differences to be more extreme
than this. When I create an index on pgbench_accounts.filler, the
difference is much bigger still. The filler index is about 11x larger
on master compared with a patched Postgres, since the keys from the
filler column happen to be very big. This ~11x difference is
consistent and robust. While that workload is kind of silly, the fact
that the key size of an indexed column is vitally important with
UPDATEs that have no logical need to change the indexed column is very
counterintuitive. In my experience, indexes are often defined on large
text keys with a lot of redundancy, even if that is kind of a slipshod
design -- that's the reality with most real world applications.
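
One way to reproduce the size comparison is a simple relation size query
along these lines, run against both master and a patched build after the
workload has been running for a while (index names assume the default
naming convention):

-- The ~11x "filler" case described above:
CREATE INDEX ON pgbench_accounts (filler);

-- Compare index sizes:
SELECT relname, pg_size_pretty(pg_relation_size(oid))
FROM pg_class
WHERE relkind = 'i' AND relname LIKE 'pgbench_accounts%'
ORDER BY pg_relation_size(oid) DESC;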

Thoughts on enabling deduplication by default (by having the
deduplicate_btree_items GUC default to 'on')?

[1] https://commitfest.postgresql.org/24/2202/
[2] https://github.com/mdcallag/mytools/blob/master/bench/ibench/iibench.py
[3]
https://www.postgresql.org/message-id/flat/CAH2-Wz%3DSfAKVMv1x9Jh19EJ8am8TZn9f-yECipS9HrrRqSswnA%40mail.gmail.com#b20ead9675225f12b6a80e53e19eed9d
-- 
Peter Geoghegan



Re: Enabling B-Tree deduplication by default

From:
Robert Haas
Date:
On Wed, Jan 15, 2020 at 6:38 PM Peter Geoghegan <pg@bowt.ie> wrote:
> There are some outstanding questions about how B-Tree deduplication
> [1] should be configured, and whether or not it should be enabled by
> default. I'm starting this new thread in the hopes of generating
> discussion on these high level questions.

It seems like the issue here is that you're pretty confident that
deduplication will be a win for unique indexes, but not so confident
that this will be true for non-unique indexes. I don't know that I
understand why.

It does seem odd to me to treat them differently, but it's possible
that this is a reflection of my own lack of understanding. What do
other database systems do?

I wonder whether we could avoid the downside of having only one
LP_DEAD bit per line pointer by having a bit per TID within the
compressed tuples themselves. I assume you already thought about that,
though.

What are the characteristics of this system if you have an index that
is not declared as UNIQUE but actually happens to be UNIQUE?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Enabling B-Tree deduplication by default

From:
Peter Geoghegan
Date:
On Thu, Jan 16, 2020 at 10:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 15, 2020 at 6:38 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > There are some outstanding questions about how B-Tree deduplication
> > [1] should be configured, and whether or not it should be enabled by
> > default. I'm starting this new thread in the hopes of generating
> > discussion on these high level questions.
>
> It seems like the issue here is that you're pretty confident that
> deduplication will be a win for unique indexes, but not so confident
> that this will be true for non-unique indexes.

Right.

> I don't know that I understand why.

The main reason that I am confident about unique indexes is that we
only do a deduplication pass in a unique index when we observe that
the incoming tuple (the one that might end up splitting the page) is a
duplicate of some existing tuple. Checking that much is virtually
free, since we already have the information close at hand today (we
cache the _bt_check_unique() binary search bounds for reuse within
_bt_findinsertloc() today). This seems to be an excellent heuristic,
since we really only want to target unique index leaf pages where all
or almost all insertions must be duplicates caused by non-HOT updates
-- this category includes all the pgbench indexes, and includes all of
the unique indexes in TPC-C. Whereas with non-unique indexes, we
aren't specifically targeting version churn (though it will help with
that too).
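
To make the version churn scenario concrete, here is a small,
self-contained illustration (all names are made up, and pageinspect is
only used to peek at the leaf page -- it has nothing to do with the
patch). An UPDATE that cannot use HOT leaves behind multiple index
entries for the same key, even in a unique index:

CREATE EXTENSION IF NOT EXISTS pageinspect;

CREATE TABLE churn (id int PRIMARY KEY, val int);
CREATE INDEX ON churn (val);     -- updating val is now a non-HOT update
INSERT INTO churn SELECT g, 0 FROM generate_series(1, 100) g;
UPDATE churn SET val = val + 1;  -- every row gets a new version

-- Block 0 is the metapage; block 1 is the only leaf page here. Until
-- the old versions are cleaned up, each id appears twice in the
-- "unique" index, once per heap TID:
SELECT itemoffset, ctid, data
FROM bt_page_items('churn_pkey', 1)
LIMIT 6;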

Granted, the fact that the incoming tuple happens to be a duplicate is
not a sure sign that the index is in this informal "suitable for
deduplication" category of mine. The incoming duplicate could just be
a once off. Even still, it's extremely unlikely to matter -- a failed
deduplication pass really isn't that expensive anyway, since it takes
place just before we split the page (we'll need the page in L1 cache
anyway). If we consistently attempt deduplication in a unique index,
then we're virtually guaranteed to consistently benefit from it.

In general, the way that deduplication is only considered at the point
where we'd otherwise have to split the page buys *a lot*. The idea of
delaying page splits by doing something like load balancing or
compression in a lazy fashion has a long history -- it was not my
idea. I'm not talking about the LP_DEAD bit set deletion stuff here --
this goes back to the 1970s.

> It does seem odd to me to treat them differently, but it's possible
> that this is a reflection of my own lack of understanding. What do
> other database systems do?

Other database systems treat unique indexes very differently, albeit
in a way that we're not really in a position to take too much away
from -- other than the general fact that unique indexes can be thought
of as very different things.

In general, the unique indexes in other systems are expected to be
unique in every sense, even during an "UPDATE foo SET unique_key =
unique_key + 1" style query. Index tuples are slightly smaller in a
unique index compared to an equivalent non-unique index in the case of
one such system. Also, that same system has something called a "unique
index scan" that can only be used with a unique index (and only when
all columns appear in the query qual).

> I wonder whether we could avoid the downside of having only one
> LP_DEAD bit per line pointer by having a bit per TID within the
> compressed tuples themselves. I assume you already thought about that,
> though.

So far, this lack of LP_DEAD bit granularity is only a theoretical
problem. I haven't been able to demonstrate it in any meaningful way.
Setting LP_DEAD bits is bound to be correlated, and we
only deduplicate to avoid a page split.

Just last night I tried a variant pgbench workload with a tiny
accounts table, an extremely skewed Zipf distribution, and lots of
clients relative to the size of the machine. I used a non-unique index
instead of a unique index, since that is likely to be where the patch
was weakest (no _bt_check_unique() LP_DEAD bit setting that way). The
patch still came out ahead of the master branch by about 3%. It's very
hard to prove that there is no real downside to having only one
LP_DEAD bit per posting list tuple, since absence of evidence isn't
evidence of absence. I believe that it's much easier to make the
argument that it's okay to have only one LP_DEAD bit per posting list
within unique indexes specifically, though (because we understand that
there can be no duplicates in the long run there).

Throughout this work, and the v12 B-Tree work, I consistently made
conservative decisions about space accounting in code like
nbtsplitloc.c (the new nbtdedup.c code has to think about space in
about the same way). My intuition is that space accounting is one area
where we really ought to be conservative, since it's so hard to test.
That's the main reason why I find the idea of having LP_DEAD bits
within posting list tuples unappealing, whatever the benefits may be
-- it adds complexity in the one area that I really don't want to add
complexity.

> What are the characteristics of this system if you have an index that
> is not declared as UNIQUE but actually happens to be UNIQUE?

I believe that the only interesting characteristic is that it is
append-only, and has no reads. The variant of the insert benchmark
that does updates and deletes will still come out ahead, because then
version churn comes in to play -- just like with the pgbench unique
indexes (even though these aren't unique indexes).

-- 
Peter Geoghegan



Re: Enabling B-Tree deduplication by default

From:
Peter Geoghegan
Date:
On Thu, Jan 16, 2020 at 12:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > It does seem odd to me to treat them differently, but it's possible
> > that this is a reflection of my own lack of understanding. What do
> > other database systems do?
>
> Other database systems treat unique indexes very differently, albeit
> in a way that we're not really in a position to take too much away
> from -- other than the general fact that unique indexes can be thought
> of as very different things.

I should point out here that I've just posted v31 of the patch, which
changes things for unique indexes. Our strategy during deduplication
is now the same for unique indexes, since the original,
super-incremental approach doesn't seem to make sense anymore. Further
optimization work in the patch eliminated problems that made this
approach seem like it might be worthwhile.

Note, however, that v31 changes nothing about how we think about
deduplication in unique indexes in general, nor how it is presented to
users. There are still special criteria around how deduplication is
*triggered* in unique indexes. We continue to trigger a deduplication
pass based on seeing a duplicate within _bt_check_unique() +
_bt_findinsertloc() -- otherwise we never attempt deduplication in a
unique index (same as before). Plus the GUC still doesn't affect
unique indexes, unique index deduplication still isn't really
documented in the user docs (it just gets a passing mention in B-Tree
internals section), etc. This seems like the right way to go, since
deduplication in unique indexes can only make sense on leaf pages
where most or all new items are duplicates of existing items, a
situation that is already easy to detect.

It wouldn't be that bad if we always attempted deduplication in a
unique index, but it's easy to only do it when we're pretty confident
we'll get a benefit -- why not save a few cycles?

--
Peter Geoghegan



Re: Enabling B-Tree deduplication by default

From:
Robert Haas
Date:
On Thu, Jan 16, 2020 at 3:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The main reason that I am confident about unique indexes is that we
> only do a deduplication pass in a unique index when we observe that
> the incoming tuple (the one that might end up splitting the page) is a
> duplicate of some existing tuple. Checking that much is virtually
> free, since we already have the information close at hand today (we
> cache the _bt_check_unique() binary search bounds for reuse within
> _bt_findinsertloc() today). This seems to be an excellent heuristic,
> since we really only want to target unique index leaf pages where all
> or almost all insertions must be duplicates caused by non-HOT updates
> -- this category includes all the pgbench indexes, and includes all of
> the unique indexes in TPC-C. Whereas with non-unique indexes, we
> aren't specifically targeting version churn (though it will help with
> that too).

This (and the rest of the explanation) don't really address my
concern. I understand that deduplicating in lieu of splitting a page
in a unique index is highly likely to be a win. What I don't
understand is why it shouldn't just be a win, period. Not splitting a
page seems like it has a big upside regardless of whether the index is
unique -- and in fact, the upside could be a lot bigger for a
non-unique index. If the coarse-grained LP_DEAD thing is the problem,
then I can grasp that issue, but you don't seem very worried about
that.

Generally, I think it's a bad idea to give the user an "emergency off
switch" and then sometimes ignore it. If the feature seems to be
generally beneficial, but you're worried that there might be
regressions in obscure cases, then turn it on by default, and give the
user the ability to forcibly turn it off. But don't give them the
opportunity to forcibly turn it off sometimes. Nobody's going to run
around setting a reloption just for fun -- they're going to do it
because they hit a problem.

I guess I'm also saying here that a reloption seems like a much better
idea than a GUC. I don't see much reason to believe that a system-wide
setting will be useful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Enabling B-Tree deduplication by default

From:
Peter Geoghegan
Date:
On Wed, Jan 29, 2020 at 6:56 AM Robert Haas <robertmhaas@gmail.com> wrote:
> This (and the rest of the explanation) don't really address my
> concern. I understand that deduplicating in lieu of splitting a page
> in a unique index is highly likely to be a win. What I don't
> understand is why it shouldn't just be a win, period. Not splitting a
> page seems like it has a big upside regardless of whether the index is
> unique -- and in fact, the upside could be a lot bigger for a
> non-unique index. If the coarse-grained LP_DEAD thing is the problem,
> then I can grasp that issue, but you don't seem very worried about
> that.

You're right that I'm not worried about the coarse-grained LP_DEAD
thing here. What I'm concerned about is cases where we attempt
deduplication, but it doesn't work out because there are no duplicates
-- that means we waste some cycles. Or cases where we manage to delay
a split, but only for a very short period of time -- in theory it
would be preferable to just accept the page split up front. However,
in practice we can't make these distinctions, since it would hinge
upon predicting the future, and we don't have a good heuristic. The
fact that a deduplication pass barely manages to prevent an immediate
page split isn't a useful proxy for how likely it is that the page
will split in any timeframe. We might have prevented it from happening
for another 2 milliseconds, or we might have prevented it forever.
It's totally workload dependent.

The good news is that these extra cycles aren't very noticeable even
with a workload where deduplication doesn't help at all (e.g. with
several indexes on an append-only table, and few or no duplicates). The
cycles are generally a fixed cost. Furthermore, it seems to be
possible to virtually avoid the problem in the case of unique indexes
by applying the incoming-item-is-duplicate heuristic. Maybe I am
worrying over nothing.

> Generally, I think it's a bad idea to give the user an "emergency off
> switch" and then sometimes ignore it. If the feature seems to be
> generally beneficial, but you're worried that there might be
> regressions in obscure cases, then turn it on by default, and give the
> user the ability to forcibly turn it off. But don't give the the
> opportunity to forcibly turn it off sometimes. Nobody's going to run
> around setting a reloption just for fun -- they're going to do it
> because they hit a problem.

Actually, we do. There is both a reloption and a GUC. The GUC only
works with non-unique indexes, where the extra cost I describe might
be an issue (it can at least be demonstrated in a benchmark). The
reloption works with both unique and non-unique indexes. It will be
useful for turning off deduplication selectively in non-unique
indexes. In the case of unique indexes, it can be thought of as a
debugging thing (though we really don't want users to think about
deduplication in unique indexes at all).

I'm really having a hard time imagining or demonstrating any downside
with unique indexes, given the heuristic, so ISTM that turning off
deduplication really is just a debugging thing there.

My general assumption is that 99%+ of users will want to use
deduplication everywhere. I am concerned about the remaining ~1% of
users who might have a workload that is slightly regressed by
deduplication. Even this small minority of users will still want to
use deduplication in unique indexes. Plus we don't really want to talk
about deduplication in unique indexes to users, since it'll probably
confuse them. That's another reason to treat each case differently.

Again, maybe I'm making an excessively thin distinction. I really want
to be able to enable the feature everywhere, while also not getting
even one complaint about it. Perhaps that's just not a realistic or
useful goal.

-- 
Peter Geoghegan



Re: Enabling B-Tree deduplication by default

From:
Robert Haas
Date:
On Wed, Jan 29, 2020 at 1:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The good news is that these extra cycles aren't very noticeable even
> with a workload where deduplication doesn't help at all (e.g. with
> several indexes on an append-only table, and few or no duplicates). The
> cycles are generally a fixed cost. Furthermore, it seems to be
> possible to virtually avoid the problem in the case of unique indexes
> by applying the incoming-item-is-duplicate heuristic. Maybe I am
> worrying over nothing.

Yeah, maybe. I'm tempted to advocate for dropping the GUC and keeping
the reloption. If the worst case is a 3% regression and you expect
that to be rare, I don't think a GUC is really worth it, especially
given that the proposed semantics seem somewhat confusing. The
reloption can be used in a pinch to protect against either bugs or
performance regressions, whichever may occur, and it doesn't seem like
you need a second mechanism.

> Again, maybe I'm making an excessively thin distinction. I really want
> to be able to enable the feature everywhere, while also not getting
> even one complaint about it. Perhaps that's just not a realistic or
> useful goal.

One thing that you could do is try to learn whether deduplication (I
really don't like that name, but here we are) seems to be working for
a given index, perhaps even in a given session. For instance, suppose
you keep track of what happened the last ten times the current session
attempted deduplication within a given index. Store the state in the
relcache. If all of the last ten tries were failures, then only try
1/4 of the time thereafter. If you have a success, go back to trying
every time. That's pretty crude, but it would might be good enough to
blunt the downsides to the point where you can stop worrying.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Enabling B-Tree deduplication by default

From:
Peter Geoghegan
Date:
On Wed, Jan 29, 2020 at 10:41 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah, maybe. I'm tempted to advocate for dropping the GUC and keeping
> the reloption. If the worst case is a 3% regression and you expect
> that to be rare, I don't think a GUC is really worth it, especially
> given that the proposed semantics seem somewhat confusing. The
> reloption can be used in a pinch to protect against either bugs or
> performance regressions, whichever may occur, and it doesn't seem like
> you need a second mechanism.

I like the idea of getting rid of the GUC.

I should stop talking about it for now, and go back to reassessing the
extent of the regression in highly unsympathetic cases. The patch has
become faster in a couple of different ways since I last looked at
this question, and it's entirely possible that the regression is even
smaller than it was before.

> One thing that you could do is try to learn whether deduplication (I
> really don't like that name, but here we are) seems to be working for
> a given index, perhaps even in a given session. For instance, suppose
> you keep track of what happened the last ten times the current session
> attempted deduplication within a given index. Store the state in the
> relcache.

It's tempting to try to reason about the state of an index over time
like this, but I don't think that it's ever going to work well.
Imagine a unique index where 50% of all values are NULLs, on an
append-only table. Actually, let's say it's a non-unique index with
unique integers, and NULL values for the remaining 50% of rows -- that
way we don't get the benefit of the incoming-item-is-duplicate
heuristic.

There will be two contradictory tendencies within this particular
index. We might end up ping-ponging between the two behaviors. It seems
better to just accept a small performance hit.

Sometimes we can add heuristics to things like the split point
location choice logic (nbtsplitloc.c) that behave as if we were
keeping track of how things progress across successive, related page
splits in the same index -- though we don't actually keep track of
anything. I'm thinking of things like Postgres v12's "split after new
tuple" optimization, which makes the TPC-C indexes so much smaller. We
can do these things statelessly and safely only because the heuristics
have little to lose and much to gain. To a lesser extent, it's okay
because the heuristics are self-limiting -- we can only make an
incorrect inference about what to do because we were unlucky, but
there is no reason to think that we'll consistently make the wrong
choice. It feels a little bit like quicksort to me.

-- 
Peter Geoghegan



Re: Enabling B-Tree deduplication by default

From:
Robert Haas
Date:
On Wed, Jan 29, 2020 at 2:50 PM Peter Geoghegan <pg@bowt.ie> wrote:
> It's tempting to try to reason about the state of an index over time
> like this, but I don't think that it's ever going to work well.
> Imagine a unique index where 50% of all values are NULLs, on an
> append-only table. Actually, let's say it's a non-unique index with
> unique integers, and NULL values for the remaining 50% of rows -- that
> way we don't get the benefit of the incoming-item-is-duplicate
> heuristic.

I mean, if you guess wrong and deduplicate less frequently, you are no
worse off than today.

But it depends, too, on the magnitude. If a gain is both large and
probable and a loss is both small and improbable, then accepting a
bit of slowdown when it happens may be the right call.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Enabling B-Tree deduplication by default

From:
Peter Geoghegan
Date:
On Wed, Jan 29, 2020 at 11:50 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I should stop talking about it for now, and go back to reassessing the
> extent of the regression in highly unsympathetic cases. The patch has
> become faster in a couple of different ways since I last looked at
> this question, and it's entirely possible that the regression is even
> smaller than it was before.

I revisited the insert benchmark as a way of assessing the extent of
the regression from deduplication with a very unsympathetic case.
Background:

https://smalldatum.blogspot.com/2017/06/insert-benchmark-in-memory-intel-nuc.html

https://github.com/mdcallag/mytools/blob/master/bench/ibench/iibench.py

This workload consists of serial inserts into a table with a primary
key, plus three additional non-unique indexes. A low concurrency
benchmark seemed more likely to be regressed by the patch, so that's
what I focussed on. The indexes used have very few duplicates, and so
don't benefit from deduplication at all, with the exception of the
purchases_index_marketsegment index, which is a bit smaller (see log
files for precise details). The table (which is confusingly named
"purchases_index") had a total of three non-unique indexes, plus a
standard serial primary key. We insert 100,000,000 rows in total,
which takes under 30 minutes in each case. There are no reads, and no
updates or deletes.

There is a regression that is just shy of 2% here, as measured in
insert benchmark "rows/sec" -- this metric goes from 62190.0 rows/sec
on master to 60986.2 rows/sec with the patch. I think that
this is an acceptable price to pay for the benefits -- this is a small
regression for a particularly unfavorable case. Also, I suspect that
this result is still quite a bit better than what you'd get with
either InnoDB or MyRocks on the same hardware (these systems were the
original targets of the insert benchmark, which was only recently
ported over to Postgres). At least, Mark Callaghan reports getting
only about 40k rows/sec inserted in 2017 with roughly comparable
hardware and test conditions (we're both running with
synchronous_commit=off, or the equivalent). We're paying a small cost
in an area where Postgres can afford to take a hit, in order to gain a
much larger benefit in an area where Postgres is much less
competitive.

I attach detailed output from runs for both master and patch.

The shell script that I used to run the benchmark is as follows:

#!/bin/sh
psql -c "create database test;"

cd $HOME/code/mytools/bench/ibench
python2 iibench.py --dbms=postgres --setup | tee iibench-output.log
python2 iibench.py --dbms=postgres --max_rows=100000000 | tee -a iibench-output.log
psql -d test -c "SELECT pg_relation_size(oid),
pg_size_pretty(pg_relation_size(oid)),
relname FROM pg_class WHERE relnamespace = 'public'::regnamespace
ORDER BY 1 DESC LIMIT 15;" | tee -a iibench-output.log

-- 
Peter Geoghegan

Attachments

Re: Enabling B-Tree deduplication by default

From:
Robert Haas
Date:
On Thu, Jan 30, 2020 at 1:45 AM Peter Geoghegan <pg@bowt.ie> wrote:
> There is a regression that is just shy of 2% here, as measured in
> insert benchmark "rows/sec" -- this metric goes from "62190.0"
> rows/sec on master to "60986.2 rows/sec" with the patch. I think that
> this is an acceptable price to pay for the benefits -- this is a small
> regression for a particularly unfavorable case. Also, I suspect that
> this result is still quite a bit better than what you'd get with
> either InnoDB or MyRocks on the same hardware (these systems were the
> original targets of the insert benchmark, which was only recently
> ported over to Postgres). At least, Mark Callaghan reports getting
> only about 40k rows/sec inserted in 2017 with roughly comparable
> hardware and test conditions (we're both running with
> synchronous_commit=off, or the equivalent). We're paying a small cost
> in an area where Postgres can afford to take a hit, in order to gain a
> much larger benefit in an area where Postgres is much less
> competitive.

How do things look in a more sympathetic case?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Enabling B-Tree deduplication by default

From:
Peter Geoghegan
Date:
On Thu, Jan 30, 2020 at 9:36 AM Robert Haas <robertmhaas@gmail.com> wrote:
> How do things look in a more sympathetic case?

I prefer to think of the patch as being about improving the stability
and predictability of Postgres with certain workloads, rather than
being about overall throughput. Postgres has an ongoing need to VACUUM
indexes, so making indexes smaller is generally more compelling than
it would be with another system. That said, there are certainly quite
a few cases that have big improvements in throughput and latency.

As I mentioned in my opening e-mail, there is a 60% increase in
transaction throughput when there are 3 extra indexes on the
pgbench_accounts table, at scale 1000 (so every column has an index).
That's a little bit silly, but I think that the extreme cases are
interesting. More recently, I found a 19% increase in throughput for a
similar pgbench workload at scale 5000, with only one extra index, on
pgbench_accounts.abalance (so lots of non-HOT updates). I think that
there were 12 + 16 clients in both cases. We can reliably keep (say)
the extra index on pgbench_accounts.abalance 3x smaller with the
patch, even though there is constant update churn. The difference in
index size between master and patch doesn't depend on having pristine
indexes. There is also about a 15% reduction in transaction latency in
these cases.

We usually manage to keep pgbench_accounts_pkey a lot smaller -- it
depends on the exact distribution of values. Skew that isn't all
concentrated in one part of the index (e.g. because we hash the value
generated by the pgbench PRNG) works best when it comes to controlling
pgbench_accounts_pkey bloat. I have seen plenty of cases where it was
about 50% - 95% smaller after several hours. OTOH, a less favorable
distribution of update values will tend to overwhelm the patch's
ability to soak up extra bloat in pgbench_accounts_pkey, though in a
way that is less abrupt. Deduplication of pgbench_accounts_pkey never
seems to have any downside.

You can even have cases like the insert benchmark that still come out
ahead -- despite having no reads or updates. This will tend to happen
when all non-unique indexes are on lower cardinality columns. Maybe 5
- 10 tuples for each distinct key value on average. I would even say
that this is the common case. If I look at the join benchmark data, or
the mouse genome database, or the TPC-E schema, then the patch tends
to leave non-unique indexes a lot smaller than they'd be on master, by
enough to pay for the cycles of deduplication and then some. The patch
makes all indexes taken together (including all unique indexes) about
60% of their original size with the join benchmark database and with
the mouse genome database.  Also, all of the larger TPC-E non-unique
indexes are at least low cardinality enough to be somewhat smaller. If
you can make indexes 2x - 3x smaller, then even inserts will be a lot
faster.

Back in 2008, Jignesh Shah reported that the TPC-E trade table's indexes
were the source of a lot of problems:

https://www.slideshare.net/jkshah/postgresql-and-benchmarks (See slide 19)

All of the three non-unique indexes on the trade table are only about
30% of their original size with deduplication (e.g. i_t_st_id goes
from 1853 MB to only 571 MB). I haven't been able to run the DBT-5
implementation of TPC-E, since it has severely bitrot, but I imagine
that deduplication would help a lot. I did manage to get DBT-5 to produce
initial test data, and have been in touch with Mark Wong about it. That's how
I know that all three extra indexes are 30% of their original size.

--
Peter Geoghegan



Re: Enabling B-Tree deduplication by default

From:
Peter Geoghegan
Date:
On Thu, Jan 30, 2020 at 11:16 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I prefer to think of the patch as being about improving the stability
> and predictability of Postgres with certain workloads, rather than
> being about overall throughput. Postgres has an ongoing need to VACUUM
> indexes, so making indexes smaller is generally more compelling than
> it would be with another system. That said, there are certainly quite
> a few cases that have big improvements in throughput and latency.

I also reran TPC-C/benchmarksql with the patch (v30). TPC-C has hardly
any non-unique indexes, which is a little unrealistic. I found that
the patch was up to 7% faster in the first few hours, since it can
control the bloat from certain non-HOT updates. This isn't a
particularly relevant workload, since almost all UPDATEs don't affect
indexed columns. The incoming-item-is-duplicate heuristic works well
with TPC-C, so there is probably hardly any possible downside there.

I think that I should commit the patch without the GUC tentatively.
Just have the storage parameter, so that everyone gets the
optimization without asking for it. We can then review the decision to
enable deduplication generally after the feature has been in the tree
for several months.

There is no need to make a final decision about whether or not the
optimization gets enabled before committing the patch.

-- 
Peter Geoghegan



Re: Enabling B-Tree deduplication by default

From:
Robert Haas
Date:
On Thu, Jan 30, 2020 at 2:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jan 30, 2020 at 11:16 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > I prefer to think of the patch as being about improving the stability
> > and predictability of Postgres with certain workloads, rather than
> > being about overall throughput. Postgres has an ongoing need to VACUUM
> > indexes, so making indexes smaller is generally more compelling than
> > it would be with another system. That said, there are certainly quite
> > a few cases that have big improvements in throughput and latency.
>
> I also reran TPC-C/benchmarksql with the patch (v30). TPC-C has hardly
> any non-unique indexes, which is a little unrealistic. I found that
> the patch was up to 7% faster in the first few hours, since it can
> control the bloat from certain non-HOT updates. This isn't a
> particularly relevant workload, since almost all UPDATEs don't affect
> indexed columns. The incoming-item-is-duplicate heuristic works well
> with TPC-C, so there is probably hardly any possible downside there.
>
> I think that I should commit the patch without the GUC tentatively.
> Just have the storage parameter, so that everyone gets the
> optimization without asking for it. We can then review the decision to
> enable deduplication generally after the feature has been in the tree
> for several months.
>
> There is no need to make a final decision about whether or not the
> optimization gets enabled before committing the patch.

That seems reasonable.

I suspect that you're right that the worst-case downside is not big
enough to really be a problem given all the upsides. But the advantage
of getting things committed is that we can find out what users think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Enabling B-Tree deduplication by default

From:
Peter Geoghegan
Date:
On Thu, Jan 30, 2020 at 12:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
> That seems reasonable.

My approach to showing the downsides of the patch wasn't particularly
obvious, or easy to come up with. I could have contrived a case like
the insert benchmark, but with more low cardinality non-unique
indexes. That would also have the effect of increasing the memory
bandwidth/latency bottleneck, which was the big bottleneck already.
It's not clear whether that makes the patch look worse or better. You
end up executing more instructions that go to waste, but with a
workload where the CPU stalls on memory even more than the original
insert benchmark.

OTOH, it's possible to contrive a case that makes the patch look
better than the master branch to an extreme extent. Just keep adding
low cardinality columns that each get an index, on a table that gets
many non-HOT updates. The effect isn't even linear, because VACUUM has
a harder time keeping up as you add columns/indexes, making the
bloat situation worse, in turn making it harder for VACUUM to keep up.
For bonus points, make sure that the tuples are nice and wide -- that
also "amplifies" bloat in a non-intuitive way (which is an effect that
is also ameliorated by the patch).

> I suspect that you're right that the worst-case downside is not big
> enough to really be a problem given all the upsides. But the advantage
> of getting things committed is that we can find out what users think.

It's certainly impossible to predict everything. On the upside, I
suspect that the patch makes VACUUM easier to tune with certain real
world workloads, though that is hard to prove.

I've always disliked the way that autovacuum gets triggered by fairly
generic criteria. Timeliness can matter a lot when it comes to index
bloat, but that isn't taken into account. I think that the patch will
tend to bring B-Tree indexes closer to heap tables in terms of their
overall sensitivity to how frequently VACUUM runs.

-- 
Peter Geoghegan



Re: Enabling B-Tree deduplication by default

From:
Peter Geoghegan
Date:
On Thu, Jan 30, 2020 at 2:13 PM Peter Geoghegan <pg@bowt.ie> wrote:
> My approach to showing the downsides of the patch wasn't particularly
> obvious, or easy to come up with. I could have contrived a case like
> the insert benchmark, but with more low cardinality non-unique
> indexes.

Sorry. I meant with more *high* cardinality indexes. An exaggerated
version of the original insert benchmark.

-- 
Peter Geoghegan



Re: Enabling B-Tree deduplication by default

From:
Peter Geoghegan
Date:
On Thu, Jan 30, 2020 at 11:40 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I think that I should commit the patch without the GUC tentatively.
> Just have the storage parameter, so that everyone gets the
> optimization without asking for it. We can then review the decision to
> enable deduplication generally after the feature has been in the tree
> for several months.

This is how things work in the committed patch (commit 0d861bbb):
There is a B-Tree storage parameter named deduplicate_items, which is
enabled by default. In general, users will get deduplication unless
they opt out, including in unique indexes (though note that we're more
selective about triggering a deduplication patch in unique indexes
[1]).

> There is no need to make a final decision about whether or not the
> optimization gets enabled before committing the patch.

It's now time to make a final decision on this. Does anyone have any
reason to believe that leaving deduplication enabled by default is the
wrong way to go?

Note that using deduplication isn't strictly better than not using
deduplication for all indexes in all workloads; that's why it's
possible to disable the optimization. This thread has lots of
information about the reasons why enabling deduplication by default
seems appropriate, all of which still apply. Note that there have been
no bug reports involving deduplication since it was committed on
February 26th, with the exception of some minor issues that I reported
and fixed.

The view of the RMT is that the feature should remain enabled by
default (i.e. no changes are required). Of course, I am a member of
the RMT this year, as well as one of the authors of the patch. I am
hardly an impartial voice here. I believe that that did not sway the
decision making process of the RMT in this instance. If there are no
objections in the next week or so, then I'll close out the relevant
open item.

[1] https://www.postgresql.org/docs/devel/btree-implementation.html#BTREE-DEDUPLICATION
-- See "Tip"
-- 
Peter Geoghegan



Re: Enabling B-Tree deduplication by default

From:
Peter Geoghegan
Date:
On Thu, Jun 25, 2020 at 4:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
> It's now time to make a final decision on this. Does anyone have any
> reason to believe that leaving deduplication enabled by default is the
> wrong way to go?

I marked the open item resolved just now -- B-Tree deduplication will
remain enabled by default in Postgres 13.

-- 
Peter Geoghegan



Re: Enabling B-Tree deduplication by default

From:
Bruce Momjian
Date:
On Thu, Jul  2, 2020 at 02:59:47PM -0700, Peter Geoghegan wrote:
> On Thu, Jun 25, 2020 at 4:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > It's now time to make a final decision on this. Does anyone have any
> > reason to believe that leaving deduplication enabled by default is the
> > wrong way to go?
> 
> I marked the open item resolved just now -- B-Tree deduplication will
> remain enabled by default in Postgres 13.

+1

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee