Обсуждение: IoT/sensor data and B-Tree page splits

Поиск
Список
Период
Сортировка

IoT/sensor data and B-Tree page splits

От
Peter Geoghegan
Дата:
The well known rightmost page split optimization (where we apply leaf
page fill factor) usually does a great job of maximizing space
utilization with indexes on a single auto-incrementing column or
timestamp column, by packing the left half of the rightmost page full.
Splitting the rightmost page predicts further splits of the new
rightmost page. If we are inserting ever-increasing values into the
index, then we win by packing the left half full -- we get very high
space utilization. If we're wrong to assume that another rightmost
page split is sure to follow soon, then we lose on space utilization,
though only by a tiny amount. We may occasionally lose out on space
utilization because our assumption that the rightmost page is special
was wrong. But, since it isn't special, we only inappropriately share
space unevenly once in a blue moon, which doesn't matter at all.

This seems to me to be fundamentally based on the assumption that you
either have random insertions or sequential insertions, without any
gray area between the two. Commit f21668f328c more or less generalized
this long established optimization to work with cases where it makes
sense to split at the rightmost page of a *grouping* within the index,
rather than the *entire* index, but that is almost unrelated to what I
want to bring up now. My concern right now is sensor data and IoT
data, which seem to call the "random or sequential" assumption into
question. Such data often consists of timestamps from a large number
of low cost devices -- event data that arrives *approximately* in
order. This is more or less the problem that the TimescaleDB extension
targets, so it seems likely that a fair number of users care about
getting it right, even if they don't know it.

I happened to notice that the largest TPC-E index has this exact
problem (it was the largest back when VMware ran TPC-E on Postgres
[1], which they specifically complained about at the time). The
"trade_history" table's primary key has tuples inserted
almost-but-not-quite in order. Sometimes packing the left half of the
rightmost page 90% full works out, because the remaining space is
enough to absorb all future insertions that arrive "slightly late"
(though barely). More often it turns out that that isn't enough space,
and the almost-rightmost page is split a second time, which is very
inefficient in lots of ways, even compared to the case where we reduce
fill factor to 50.

I have found that reducing trade_history_pkey's fill factor to about
70 increases leaf page space utilization, leaving it at about 85% (it
was under 50% with a fill factor of 90). I tend to doubt that this is
a practical tuning strategy in the real world, though, since varying
TPC-E scale alters which exact setting is effective. Ideally,
nbtpslitloc.c could intelligently adapt to slightly out of order
insertions in order to maximize space utilization -- it could
*dynamically* tune fill factor in response to ongoing observations
about page splits. Importantly, this would allow nbtree to avoid
unnecessarily splitting the same page twice in a row, having packed it
almost full on the first split -- this situation seems truly
pathological to me (think about the WAL overhead).

ISTM that adding such an optimization would require maintaining
dedicated book keeping metadata, which doesn't seem particularly
appealing. It might have acceptable overhead within a single backend.
I'm thinking of something like the RelationGetTargetBlock() stuff. Has
anybody else thought about this? What would a robust algorithm look
like?

Perhaps this is a problem that isn't worth solving right now, but it
is definitely a real problem.

[1] https://www.postgresql.org/message-id/66CE997FB523C04E9749452273184C6C137CB88A86@exch-mbx-113.vmware.com
--
Peter Geoghegan



Re: IoT/sensor data and B-Tree page splits

От
Arcadiy Ivanov
Дата:
On 8/26/19 6:48 PM, Peter Geoghegan wrote:
> Such data often consists of timestamps from a large number
> of low cost devices -- event data that arrives *approximately* in
> order. This is more or less the problem that the TimescaleDB extension
> targets, so it seems likely that a fair number of users care about
> getting it right, even if they don't know it.

This problem is not limited to IoT but to RT financial transaction 
ingestion as well.
I found BRIN indices to work exceptionally well for that, while B-tree 
taking enormous amounts of space with no performance difference or win 
going to BRIN.
The situation gets even worse when B-tree index is subjected to 
identical tuples which often happens when you have an avalanche of 
timestamps that are within less than 1ms of each other (frequent TS 
rounding resolution).

-- 
Arcadiy Ivanov
arcadiy@gmail.com | @arcivanov | https://ivanov.biz
https://github.com/arcivanov




Re: IoT/sensor data and B-Tree page splits

От
Peter Geoghegan
Дата:
On Mon, Aug 26, 2019 at 4:29 PM Arcadiy Ivanov <arcadiy@gmail.com> wrote:
> This problem is not limited to IoT but to RT financial transaction
> ingestion as well.

Not surprising, since the TPC-E benchmark models a financial trading
application. Perhaps it exhibits this behavior because it is actually
representative of a real trading application.

Note that pg_stats.correlation is 1.0 for the leading indexed column
(in the trade_history PK index), indicating *perfect* correlation.
It's not perfectly correlated when you look at it under a microscope,
though.

> I found BRIN indices to work exceptionally well for that, while B-tree
> taking enormous amounts of space with no performance difference or win
> going to BRIN.

That won't work with the TPC-E example, though, since it's a primary key index.

> The situation gets even worse when B-tree index is subjected to
> identical tuples which often happens when you have an avalanche of
> timestamps that are within less than 1ms of each other (frequent TS
> rounding resolution).

The good news there is that that will almost certainly be a lot better
in Postgres 12. TPC-E also has a number of very low cardinality
indexes, despite being an OLTP benchmark. Some of these indexes are
also listed in the 2012 problem report I linked to. Those same indexes
will be a lot smaller on Postgres 12. It should also generate a lot
less WAL compared to previous versions. (Plus we may get dynamic
B-Tree deduplication in Postgres 13, which would improve matters
further with low cardinality indexes.)

-- 
Peter Geoghegan



Re: IoT/sensor data and B-Tree page splits

От
Arcadiy Ivanov
Дата:
On 8/26/19 7:49 PM, Peter Geoghegan wrote:
> Not surprising, since the TPC-E benchmark models a financial trading
> application.

> The good news there is that that will almost certainly be a lot better
> in Postgres 12. TPC-E also has a number of very low cardinality
> indexes, despite being an OLTP benchmark.


But apart from TPC-E and having to perform to it, is there any practical 
real world usefulness in trying to have a B-tree index on TS-based data 
just to have a PK on it, as opposed to having a BRIN on a TS field and 
calling it a day?

-- 
Arcadiy Ivanov
arcadiy@gmail.com | @arcivanov | https://ivanov.biz
https://github.com/arcivanov




Re: IoT/sensor data and B-Tree page splits

От
Peter Geoghegan
Дата:
On Mon, Aug 26, 2019 at 4:59 PM Arcadiy Ivanov <arcadiy@gmail.com> wrote:
> But apart from TPC-E and having to perform to it, is there any practical
> real world usefulness in trying to have a B-tree index on TS-based data
> just to have a PK on it, as opposed to having a BRIN on a TS field and
> calling it a day?

The index in question isn't an index on a timestamp. Its leading
column is an integer that is almost monotonically increasing, but not
quite.

If a BRIN index works for you, then that's probably what you should
use. I don't think that that's relevant, though. BRIN indexes have
clear disadvantages, including the fact that you have to know that
your data is amenable to BRIN indexing in order to use a BRIN index.

-- 
Peter Geoghegan