Discussion: IoT/sensor data and B-Tree page splits
The well-known rightmost page split optimization (where we apply leaf page fill factor) usually does a great job of maximizing space utilization with indexes on a single auto-incrementing column or timestamp column, by packing the left half of the rightmost page full. Splitting the rightmost page predicts further splits of the new rightmost page. If we are inserting ever-increasing values into the index, then we win by packing the left half full -- we get very high space utilization. If we're wrong to assume that another rightmost page split is sure to follow soon, then we lose on space utilization, though only by a tiny amount: a page that merely looks like the rightmost page without actually being special is rare, so splitting its space unevenly once in a blue moon doesn't matter at all.

This seems to me to be fundamentally based on the assumption that you either have random insertions or sequential insertions, without any gray area between the two. Commit f21668f328c more or less generalized this long-established optimization to work with cases where it makes sense to split at the rightmost page of a *grouping* within the index, rather than the *entire* index, but that is almost unrelated to what I want to bring up now.

My concern right now is sensor data and IoT data, which seem to call the "random or sequential" assumption into question. Such data often consists of timestamps from a large number of low-cost devices -- event data that arrives *approximately* in order. This is more or less the problem that the TimescaleDB extension targets, so it seems likely that a fair number of users care about getting it right, even if they don't know it. I happened to notice that the largest TPC-E index has this exact problem (it was the largest back when VMware ran TPC-E on Postgres [1], which they specifically complained about at the time).
The "trade_history" table's primary key has tuples inserted almost-but-not-quite in order. Sometimes packing the left half of the rightmost page 90% full works out, because the remaining space is enough to absorb all future insertions that arrive "slightly late" (though barely). More often it turns out that that isn't enough space, and the almost-rightmost page is split a second time, which is very inefficient in lots of ways, even compared to the case where we reduce fill factor to 50.

I have found that reducing trade_history_pkey's fill factor to about 70 increases leaf page space utilization, leaving it at about 85% (it was under 50% with a fill factor of 90). I tend to doubt that this is a practical tuning strategy in the real world, though, since varying TPC-E scale alters which exact setting is effective.

Ideally, nbtsplitloc.c could intelligently adapt to slightly out-of-order insertions in order to maximize space utilization -- it could *dynamically* tune fill factor in response to ongoing observations about page splits. Importantly, this would allow nbtree to avoid unnecessarily splitting the same page twice in a row, having packed it almost full on the first split -- this situation seems truly pathological to me (think about the WAL overhead). ISTM that adding such an optimization would require maintaining dedicated bookkeeping metadata, which doesn't seem particularly appealing. It might have acceptable overhead within a single backend. I'm thinking of something like the RelationGetTargetBlock() stuff.

Has anybody else thought about this? What would a robust algorithm look like? Perhaps this is a problem that isn't worth solving right now, but it is definitely a real problem.

[1] https://www.postgresql.org/message-id/66CE997FB523C04E9749452273184C6C137CB88A86@exch-mbx-113.vmware.com

-- 
Peter Geoghegan
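For what it's worth, a minimal sketch of the manual tuning described above -- the index name comes from TPC-E, but the value of 70 only worked at the scale I tested; other scales would need other values:

```sql
-- Lower the leaf fill factor on the TPC-E trade_history primary key,
-- leaving headroom on each leaf page for "slightly late" insertions.
ALTER INDEX trade_history_pkey SET (fillfactor = 70);

-- The new setting only affects future page splits; rebuild the index
-- so existing leaf pages are repacked at the new fill factor.
REINDEX INDEX trade_history_pkey;
```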
On 8/26/19 6:48 PM, Peter Geoghegan wrote:
> Such data often consists of timestamps from a large number
> of low cost devices -- event data that arrives *approximately* in
> order. This is more or less the problem that the TimescaleDB extension
> targets, so it seems likely that a fair number of users care about
> getting it right, even if they don't know it.

This problem is not limited to IoT; it extends to RT financial transaction ingestion as well. I found BRIN indices to work exceptionally well for that, while B-tree indices take enormous amounts of space with no performance difference, or with the win going to BRIN.

The situation gets even worse when a B-tree index is subjected to identical tuples, which often happens when you have an avalanche of timestamps that are within less than 1ms of each other (a frequent TS rounding resolution).

-- 
Arcadiy Ivanov
arcadiy@gmail.com | @arcivanov | https://ivanov.biz
https://github.com/arcivanov
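A minimal sketch of the BRIN approach described above, with a hypothetical ingestion table (the table and column names are made up for illustration):

```sql
-- Hypothetical trade-ingestion table; BRIN works well here because rows
-- arrive in roughly timestamp order, so ts is well-correlated with heap order.
CREATE TABLE ticks (
    trade_id bigint,
    ts       timestamptz NOT NULL,
    price    numeric
);

-- A BRIN index stores one summary tuple (min/max ts) per range of heap
-- pages, so it stays tiny compared to a B-tree on the same column.
CREATE INDEX ticks_ts_brin ON ticks USING brin (ts);
```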
On Mon, Aug 26, 2019 at 4:29 PM Arcadiy Ivanov <arcadiy@gmail.com> wrote:
> This problem is not limited to IoT but to RT financial transaction
> ingestion as well.

Not surprising, since the TPC-E benchmark models a financial trading application. Perhaps it exhibits this behavior because it is actually representative of a real trading application.

Note that pg_stats.correlation is 1.0 for the leading indexed column (in the trade_history PK index), indicating *perfect* correlation. It's not perfectly correlated when you look at it under a microscope, though.

> I found BRIN indices to work exceptionally well for that, while B-tree
> taking enormous amounts of space with no performance difference or win
> going to BRIN.

That won't work with the TPC-E example, though, since it's a primary key index.

> The situation gets even worse when B-tree index is subjected to
> identical tuples which often happens when you have an avalanche of
> timestamps that are within less than 1ms of each other (frequent TS
> rounding resolution).

The good news there is that that will almost certainly be a lot better in Postgres 12. TPC-E also has a number of very low cardinality indexes, despite being an OLTP benchmark. Some of these indexes are also listed in the 2012 problem report I linked to. Those same indexes will be a lot smaller on Postgres 12. It should also generate a lot less WAL compared to previous versions. (Plus we may get dynamic B-Tree deduplication in Postgres 13, which would improve matters further with low cardinality indexes.)

-- 
Peter Geoghegan
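The correlation statistic mentioned above can be inspected directly from the catalogs -- a sketch, assuming the leading PK column is named th_t_id per the TPC-E schema (adjust the attname if your schema names it differently):

```sql
-- pg_stats exposes per-column planner statistics; correlation near 1.0
-- means heap order closely tracks the column's logical sort order.
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'trade_history'
  AND attname = 'th_t_id';   -- assumed leading PK column name
```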
On 8/26/19 7:49 PM, Peter Geoghegan wrote:
> Not surprising, since the TPC-E benchmark models a financial trading
> application.
> The good news there is that that will almost certainly be a lot better
> in Postgres 12. TPC-E also has a number of very low cardinality
> indexes, despite being an OLTP benchmark.

But apart from TPC-E and having to perform on it, is there any practical real-world usefulness in trying to have a B-tree index on TS-based data just to have a PK on it, as opposed to having a BRIN on a TS field and calling it a day?

-- 
Arcadiy Ivanov
arcadiy@gmail.com | @arcivanov | https://ivanov.biz
https://github.com/arcivanov
On Mon, Aug 26, 2019 at 4:59 PM Arcadiy Ivanov <arcadiy@gmail.com> wrote:
> But apart from TPC-E and having to perform to it, is there any practical
> real world usefulness in trying to have a B-tree index on TS-based data
> just to have a PK on it, as opposed to having a BRIN on a TS field and
> calling it a day?

The index in question isn't an index on a timestamp. Its leading column is an integer that is almost monotonically increasing, but not quite.

If a BRIN index works for you, then that's probably what you should use. I don't think that's relevant here, though. BRIN indexes have clear disadvantages, including the fact that you have to know that your data is amenable to BRIN indexing in order to use a BRIN index.

-- 
Peter Geoghegan