Enabling B-Tree deduplication by default

From: Peter Geoghegan
Subject: Enabling B-Tree deduplication by default
Date:
Msg-id: CAH2-Wzn4x896BYyvTONEQCXHV=QsPU-qa1qib-x=JHXK-m-cEQ@mail.gmail.com
Replies: Re: Enabling B-Tree deduplication by default
List: pgsql-hackers
There are some outstanding questions about how B-Tree deduplication
[1] should be configured, and whether or not it should be enabled by
default. I'm starting this new thread in the hopes of generating
discussion on these high level questions.

The commit message of the latest version of the patch has a reasonable
summary of its overall design, which might be worth reviewing before
reading on:

https://www.postgresql.org/message-id/CAH2-Wz=Tr6mxMsKRmv_=9-05_O9QWqOzQ8GweRV2DXS6+Y38QQ@mail.gmail.com

(If you want to go deeper than that, check out the nbtree README changes.)

B-Tree deduplication comes in two basic varieties:

* Unique index deduplication.

* Non-unique index deduplication.

Each variety works in essentially the same way. However, they differ
in how and when deduplication is actually applied. We can infer quite
a lot about versioning within a unique index, and we know that version
churn is the only problem that we should try to address -- it's not
really about space efficiency, since we know that there won't be any
duplicates in the long run. Also, workloads that have a lot of version
churn in unique indexes are generally quite reliant on LP_DEAD bit
setting within _bt_check_unique() (this is not to be confused with the
similar kill_prior_tuple LP_DEAD bit optimization) -- we need to be
sensitive to that.

Despite the fact that there are many more similarities than
differences here, I would like to present each variety as a different
thing to users (actually, I don't really want users to have to think
about unique index deduplication at all).

I believe that the best path forward for users is to make
deduplication in non-unique indexes a user-visible feature with
documented knobs (a GUC and an index storage parameter), while leaving
unique index deduplication as an internal thing that is barely
documented in the sgml user docs (I'll have a paragraph about it in
the B-Tree internals chapter). Unique index deduplication won't care
about the GUC, and will only trigger a deduplication pass when the
incoming tuple is definitely a duplicate of an existing tuple (this
signals version churn cheaply but reliably). The index storage
parameter will be respected with unique indexes, but only as a
debugging option -- our presumption is that nobody will want to
disable deduplication in unique indexes, since leaving it on has
virtually no downside (because we know exactly when to trigger it, and
when not to). Unique index deduplication is an internal thing that
users benefit from, but aren't really aware of, much like
opportunistic deletion of LP_DEAD items. (A secondary benefit of this
approach is that we don't have to have an awkward section in the
documentation that explains why deduplication in unique indexes isn't
actually an oxymoron.)
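
To make the proposed policy concrete, here is a minimal illustrative
sketch of the triggering logic described above. This is not the
patch's actual code -- the function and parameter names are invented
for the example:

#include <stdbool.h>

/*
 * Illustrative sketch only -- not the patch's actual code.  All names
 * here are made up for the example.
 */
static bool
should_attempt_dedup_pass(bool is_unique_index,
                          bool incoming_is_duplicate,  /* definitely a dup of
                                                        * an existing tuple? */
                          bool dedup_guc_enabled,      /* deduplicate_btree_items */
                          bool dedup_reloption)        /* per-index storage
                                                        * parameter */
{
    if (is_unique_index)
    {
        /*
         * Unique indexes ignore the GUC.  Only deduplicate when the
         * incoming tuple is definitely a duplicate of an existing
         * tuple, which signals version churn cheaply but reliably.
         * The storage parameter is still respected, but only as a
         * debugging option.
         */
        return dedup_reloption && incoming_is_duplicate;
    }

    /*
     * Non-unique indexes: no heuristic beats "try it and see" at the
     * point where the only alternative is to split the page.
     */
    return dedup_guc_enabled && dedup_reloption;
}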

Thoughts on presenting unique index deduplication as a separate
internal-only optimization that doesn't care about the GUC?

I also want to talk about a related but separate topic. I propose that
the deduplicate_btree_items GUC (which only affects non-unique
indexes) be turned on by default in Postgres 13. This can be reviewed
at the end of the beta period, just like it was with parallelism and
with JIT. Note that this means that we'd opportunistically perform a
deduplication pass at the point that we usually have to split the
page, even though in general there is no reason to think that that
will work out. (We have no better way of applying deduplication than
"try it and see", unless it's a unique index that has the version
churn trigger heuristic that I just described.)
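
Putting "the point that we usually have to split the page" another
way, here is a self-contained toy sketch of the order of attempts when
an incoming tuple does not fit on its target leaf page. Reclaiming
LP_DEAD-marked items always comes first (more on that below), a
deduplication pass comes next, and an actual page split is the last
resort. Again, the names are made up; this is not the real nbtree
code:

#include <stdbool.h>

typedef enum
{
    TOY_DELETE_LP_DEAD,     /* cheapest: reclaim LP_DEAD-marked items */
    TOY_DEDUP_PASS,         /* next: merge duplicates into posting lists */
    TOY_PAGE_SPLIT          /* last resort: split the page */
} ToyPageFullAction;

/* Toy decision function, for illustration only */
static ToyPageFullAction
toy_choose_action(bool page_has_lp_dead_items,
                  bool dedup_enabled,       /* GUC/reloption, or the unique
                                             * index duplicate trigger */
                  bool already_tried_dedup)
{
    if (page_has_lp_dead_items)
        return TOY_DELETE_LP_DEAD;
    if (dedup_enabled && !already_tried_dedup)
        return TOY_DEDUP_PASS;
    return TOY_PAGE_SPLIT;
}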

Append-only tables with multiple non-unique indexes that happen to
have only unique key values will pay a cost without seeing a benefit.
I believe that I saw a 3% throughput cost when I assessed this using
something called the "insert benchmark" [2] a couple of months ago
(this is maintained by Mark Callaghan, the Facebook MyRocks guy). I
need to do more work on quantifying the cost with a recent version of
the patch, especially the cost of only having one LP_DEAD bit for an
entire posting list tuple, but in general I think that this is likely
to be worth it. Consider these points in favor of enabling
deduplication by default:

* This insert-only workload is something that Postgres does
particularly well with -- append-only tables are reported to have
greater throughput in Postgres than in other comparable RDBMSs.
Presumably this is because we don't need UNDO logs, and don't do index
key locking in the same way (apparently even Oracle has locks in
indexes that protect logical transaction state). We are paying a small
cost by enabling deduplication by default, but it is paid in an area
where we already do particularly well.

* Deduplication can be turned off at any time. nbtree posting list
tuples are really not like the ones in GIN, in that we're not
maintaining them over time. A deduplication pass fundamentally works
by generating an alternative physical representation for the same
logical contents -- the resulting posting list tuples work in almost
the same way as the original tuples. (A toy sketch of the posting list
representation appears just before the closing question below.)

While it's true that a posting list split will be needed on occasion,
posting list splits work in a way that allows almost all code to
pretend that the incoming item never overlapped with an existing
posting list in the first place (we conceptually "rewrite" an incoming
tuple insert to not overlap). So this apparent exception isn't much of
an exception.

* Posting list tuples do not need to be decompressed, so reads pay no penalty.

* Even if all keys are unique, in practice UPDATEs will tend to
generate duplicates, even with a HOT-safe workload. Deduplicating the
physical tuples for a logical row can help in non-unique indexes
just as much as in unique indexes.

* Only the kill_prior_tuple optimization can set LP_DEAD bits within a
non-unique index. There is evidence that kill_prior_tuple is much less
valuable than the _bt_check_unique() stuff -- it took years before
anybody noticed that 9.5 had significantly regressed the
kill_prior_tuple optimization [3].

Setting LP_DEAD bits is useful primarily because it prevents page
splits, but we always prefer clearing LP_DEAD-set tuples over
deduplication, and a deduplication pass can only occur when the
only alternative is to split the page. So, workloads that pay a cost
due to not being able to set LP_DEAD bits in a granular fashion will
still almost always do fewer splits anyway -- which is what really
mattered all along. (Not having to visit the heap due to the LP_DEAD
bit being set in some index tuple is also nice, but standbys never get
that benefit, so it's hard to see it as more than a nice-to-have.)

* The benefits are huge with a workload consisting of several indexes
and not many HOT updates, as measured by index size, transaction
throughput, query latency, and the ongoing cost of vacuuming -- you
name it.

This is an area where we're not so competitive with other RDBMSs
currently. In general, it seems impossible to target workloads like
this without accepting some wasted cycles. Low cardinality indexes
usually follow a power law distribution, where some leaf pages consist
only of unique values, while others consist of only a single value --
it's very hard to imagine an algorithm that works much better than
always trying deduplication at the point that it looks like we have to
split the
page. I'll go into an example of a workload like this, where
deduplication does very well.

Recent versions of the patch manage about a 60% increase in
transaction throughput with pgbench, scale 1000, a skewed
pgbench_accounts UPDATE + SELECT (:aid) distribution, and two "extra"
indexes on pgbench_accounts (an index on "abalance", and another on
"bid"). This was with 16 clients, over 4 hours on my home workstation,
which is nothing special. The extra indexes are consistently about 3x
smaller, and pgbench_branches_pkey and pgbench_tellers_pkey are about
4x - 5x smaller. This is fairly representative of some workloads,
especially workloads that we're known to not do so well on.

It's certainly possible for index size differences to be more extreme
than this. When I create an index on pgbench_accounts.filler, the
difference is much bigger still. The filler index is about 11x larger
on master compared with a patched Postgres, since the keys from the
filler column happen to be very big. This ~11x difference is
consistent and robust. While that workload is kind of silly, the fact
that the key size of an indexed column matters so much for UPDATEs
that have no logical need to change that column is very
counterintuitive. In my experience, indexes are often defined on large
text keys with a lot of redundancy, even if that is kind of a slipshod
design -- that's the reality with most real world applications.
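
Before wrapping up, here is the toy sketch of the posting list
representation that I mentioned above. It is only a conceptual model
of the "before" and "after" physical representations -- it is not the
real on-disk IndexTuple layout:

#include <stdint.h>

/* Toy model only -- not the real on-disk layout */
typedef struct ToyHeapTid
{
    uint32_t block;             /* heap block number */
    uint16_t offset;            /* item offset within that block */
} ToyHeapTid;

/* Before a deduplication pass: one index tuple per (key, heap TID) pair */
typedef struct ToyIndexTuple
{
    int64_t    key;
    ToyHeapTid htid;
} ToyIndexTuple;

/*
 * After a deduplication pass: the same logical contents, but duplicates
 * of the same key are merged into a single "posting list" tuple that
 * stores the key once, plus a sorted array of heap TIDs.
 */
typedef struct ToyPostingListTuple
{
    int64_t    key;
    int        ntids;
    ToyHeapTid htids[];         /* sorted array of heap TIDs */
} ToyPostingListTuple;

The point is simply that a posting list tuple is an alternative
physical representation of the same (key, heap TID) pairs, which is
why the optimization can be abandoned at any time.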

Thoughts on enabling deduplication by default (by having the
deduplicate_btree_items GUC default to 'on')?

[1] https://commitfest.postgresql.org/24/2202/
[2] https://github.com/mdcallag/mytools/blob/master/bench/ibench/iibench.py
[3]
https://www.postgresql.org/message-id/flat/CAH2-Wz%3DSfAKVMv1x9Jh19EJ8am8TZn9f-yECipS9HrrRqSswnA%40mail.gmail.com#b20ead9675225f12b6a80e53e19eed9d
-- 
Peter Geoghegan


