Discussion: Re: Expanding HOT updates for expression and partial indexes
On Jul 2 2025, at 2:10 pm, Greg Burd <greg@burd.me> wrote:
> The goal is to allow HOT updates under two new conditions:
> * when an indexed expression has not changed
> * when possible for a partial index
This is still true. :)
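For the partial-index condition, here's a minimal sketch (table and index names invented) of the kind of update that could newly qualify:
CREATE TABLE orders (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    status text,
    note   text
);
CREATE INDEX orders_open_idx ON orders (note) WHERE status = 'open';
INSERT INTO orders (status, note) VALUES ('done', 'n/a');
-- status appears only in the partial index's predicate; today that alone
-- blocks HOT, but since neither the old nor the new row satisfies
-- status = 'open', the index needs no new entry and the update could go HOT.
UPDATE orders SET status = 'archived' WHERE id = 1;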
This patch has languished for over two months at v17. Why? Primarily
due to feedback that although the idea had merit, there was a critical
flaw in the approach that made it a non-starter. The flaw was that I'd
been executing expressions while holding both the pin and a lock on the
buffer, which is not a great idea (self-deadlock, etc.). This was
pointed out to me (thanks Robert Haas!), so I needed to re-think my approach.
I put the patch aside for a while, then this past week at PGConf.dev/NYC
I heard interest from a few people (Jeff Davis, Nathan Bossart) who
encouraged me to move the code executing the expressions to just before
acquiring the lock but after pinning the buffer. The theory being that
my new code using the old/new tts to form and test the index tuples
resulting from executing expressions was using the resultsRelInfo struct
created during plan execution, not the information found on the page,
and so was safe without the lock.
This proved tricky because I had been using the modified_attrs and
expr_attrs as a test to avoid exercising expressions when unnecessary.
Calling HeapDetermineColumnsInfo() outside the buffer lock to get
modified_attrs proved to be a problem as it examines an oldtup that is
cobbled together from the elements on the page, requiring the lock I was
trying to avoid.
After reviewing how updates work in the executor, I discovered that
during execution the new tuple slot is populated with the information
from ExecBuildUpdateProjection() and the old tuple, and, most
importantly for this use case, that function creates a bitmap of the
modified columns (the columns specified in the update). This bitmap
isn't the same as the one produced by HeapDetermineColumnsInfo(): the
latter excludes attributes that turn out to be unchanged after testing
equality with the helper function heap_attr_equals(), whereas the
former includes attributes that appear in the update even if they hold
the same value as before. This, happily, is immaterial for the purposes
of my function ExecExprIndexesRequireUpdates(), which simply needs to
check whether the index tuples generated are unchanged. So I had all I
needed to run the checks ahead of acquiring the lock on the buffer.
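To make that distinction concrete, a tiny hypothetical example against the table t from the example further below:
-- y is assigned in the SET list, so the executor's bitmap from
-- ExecBuildUpdateProjection() includes it even though the stored value is
-- unchanged; HeapDetermineColumnsInfo() would test equality and exclude it.
UPDATE t SET y = y WHERE x = 1;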
So, this led to v18 (attached), which passes check-world, including a
number of new tests for the various corner cases related to HOT updates
for expressions.
There is much room for improvement, and your suggestions are welcome.
I'll find time to quantify the benefit of this patch for the targeted
use cases and to ensure that all other cases see no regressions.
I need to review the tests I've added to ensure that they are the
minimal set required, that they effectively communicate their purpose,
etc. For now, it's more shotgun than scalpel... I'll get to it.
I added a reloption "expression_checks" to disable this new code path.
Good idea or bad precedent?
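If it stays, usage would presumably follow the normal reloption syntax (a sketch, assuming a boolean reloption with the name above):
-- Opt a single table out of the new checks, then revert to the default.
ALTER TABLE t SET (expression_checks = off);
ALTER TABLE t RESET (expression_checks);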
In execIndexing I special-case IsolationIsSerializable(), and I can't
remember why now, but I do recall one isolation test failing... I'll
check on this and get back to the thread. Or maybe you know why that is.
From small things like naming to larger questions like hijacking the
modified columns bitmapset for use later in the heap, I need feedback
and guidance.
I'm using UpdateContext for the estate/resultRelInfo, and I've added
lockmode to it; these change the table AM API a tad. I'm open to better
ideas that accomplish the same.
I'd like to avoid building, then rebuilding, index tuples for these
expressions, but I can't think of a way to do that without a palloc(),
which is avoided today.
Example:
CREATE TABLE t (
x INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
y JSONB
);
CREATE INDEX t_age_idx ON t (((y->>'age')::int));
INSERT INTO t (y)
VALUES ('{"name": "John", "age": 30, "city": "New York"}');
-- Update an indexed field in the JSONB document, should not be HOT
UPDATE t
SET y = jsonb_set(y, '{age}', '31')
WHERE x = 1;
SELECT pg_stat_force_next_flush();
SELECT relname, n_tup_upd, n_tup_hot_upd FROM pg_stat_user_tables
WHERE relname = 't';
-- Update a non-indexed field in the JSONB document, new HOT update
UPDATE t
SET y = jsonb_set(y, '{city}', '"Boston"')
WHERE x = 1;
SELECT pg_stat_force_next_flush();
SELECT relname, n_tup_upd, n_tup_hot_upd FROM pg_stat_user_tables
WHERE relname = 't';
This patch is far from "commit ready", but it does accomplish the
$subject and pass tests.
I look forward to any/all constructive feedback.
best.
-greg
Attachments
On Tue, Oct 07, 2025 at 05:36:11PM -0400, Greg Burd wrote: > I put the patch aside for a while, then this past week at PGConf.dev/NYC > I heard interest from a few people (Jeff Davis, Nathan Bossart) who > encouraged me to move the code executing the expressions to just before > acquiring the lock but after pinning the buffer. The theory being that > my new code using the old/new tts to form and test the index tuples > resulting from executing expressions was using the resultsRelInfo struct > created during plan execution, not the information found on the page, > and so was safe without the lock. An open question (at least from me) is whether this is safe. I'm not familiar enough with this area of code yet to confidently determine that. > After reviewing how updates work in the executor, I discovered that > during execution the new tuple slot is populated with the information > from ExecBuildUpdateProjection() and the old tuple, but that most > importantly for this use case that function created a bitmap of the > modified columns (the columns specified in the update). This bitmap > isn't the same as the one produced by HeapDetermineColumnsInfo() as the > latter excludes attributes that are not changed after testing equality > with the helper function heap_attr_equals() where as the former will > include attributes that appear in the update but are the same value as > before. This, happily, is immaterial for the purposes of my function > ExecExprIndexesRequireUpdates() which simply needs to check to see if > index tuples generated are unchanged. So I had all I needed to run the > checks ahead of acquiring the lock on the buffer. Nice. > There is much room for improvement, and your suggestions are welcome. A general and predictable suggestion is to find ways to break this into smaller pieces. As-is, this patch would take me an enormous amount of time to review in any depth. If we can break off some smaller pieces that we can scrutinize and commit independently, we can start making forward progress sooner. The UpdateContext and reloption stuff are examples of things that might be possible to split into independent patches. > I'll find time to quantify the benefit of this patch for the targeted > use cases and to ensure that all other cases see no regressions. Looking forward to these results. This should also help us decide whether to set expression_checks by default. > I added a reloption "expression_checks" to disable this new code path. > Good idea or bad precedent? If there are cases where the added overhead outweighs the benefits (which seems like it must be true some of the time), then I think we must have a way to opt-out (or maybe even opt-in). In fact, I'd advise adding a GUC to complement the reloption so that users can configure it at higher levels. > In execIndexing I special case for IsolationIsSerializable() and I can't > remember why now but I do recall one isolation test failing... I'll > check on this and get back to the thread. Or maybe you know why that I didn't follow this. > I'd like not to build, then rebuild index tuples for these expressions > but I can't think of a way to do that without a palloc(), this is > avoided today. Is the avoidance of palloc() a strict rule? Is this discussed in the code anywhere? -- nathan
On Wed, 2025-10-08 at 15:48 -0500, Nathan Bossart wrote:
> > The theory being that
> > my new code using the old/new tts to form and test the index tuples
> > resulting from executing expressions was using the resultsRelInfo
> > struct
> > created during plan execution, not the information found on the
> > page,
> > and so was safe without the lock.
>
> An open question (at least from me) is whether this is safe. I'm not
> familiar enough with this area of code yet to confidently determine
> that.
The optimization requires that the expression evaluates to the same
thing on the old and new tuples. That determination doesn't have
anything to do with a lock on the buffer, so long as the old tuple
isn't pruned away or something. And clearly it won't be pruned, because
we're in the process of updating it, so we have a snapshot that can see
it.
There might be subtleties in other parts of the proposal, but the above
determination can be made safely without a buffer lock.
>
> > I added a reloption "expression_checks" to disable this new code
> > path.
> > Good idea or bad precedent?
>
> If there are cases where the added overhead outweighs the benefits
> (which
> seems like it must be true some of the time), then I think we must
> have a
> way to opt-out (or maybe even opt-in). In fact, I'd advise adding a
> GUC to
> complement the reloption so that users can configure it at higher
> levels.
I'll push back against this. For now I'm fine with developer options to
make testing easier, but we should find a way to make this work well
without tuning.
Regards,
Jeff Davis
On Tue, 2025-10-07 at 17:36 -0400, Greg Burd wrote:
> After reviewing how updates work in the executor, I discovered that
> during execution the new tuple slot is populated with the information
> from ExecBuildUpdateProjection() and the old tuple, but that most
> importantly for this use case that function created a bitmap of the
> modified columns (the columns specified in the update). This bitmap
> isn't the same as the one produced by HeapDetermineColumnsInfo() as
> the
> latter excludes attributes that are not changed after testing
> equality
> with the helper function heap_attr_equals() where as the former will
> include attributes that appear in the update but are the same value
> as
> before. This, happily, is immaterial for the purposes of my function
> ExecExprIndexesRequireUpdates() which simply needs to check to see if
> index tuples generated are unchanged. So I had all I needed to run
> the
> checks ahead of acquiring the lock on the buffer.
You're still calling ExecExprIndexesRequireUpdates() from within
heap_update(). Can't you do that inside of ExecUpdatePrologue() or
thereabouts?
Regards,
Jeff Davis
> On Oct 8, 2025, at 4:48 PM, Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Tue, Oct 07, 2025 at 05:36:11PM -0400, Greg Burd wrote: >> I put the patch aside for a while, then this past week at PGConf.dev/NYC >> I heard interest from a few people (Jeff Davis, Nathan Bossart) who >> encouraged me to move the code executing the expressions to just before >> acquiring the lock but after pinning the buffer. The theory being that >> my new code using the old/new tts to form and test the index tuples >> resulting from executing expressions was using the resultsRelInfo struct >> created during plan execution, not the information found on the page, >> and so was safe without the lock. Thanks for taking a look Nathan. > > An open question (at least from me) is whether this is safe. I'm not > familiar enough with this area of code yet to confidently determine that. My read is that it is safe because we're testing the content of two TupleTableSlots both formed in the executor. The function uses only that information and doesn't reference data on the page at all. >> After reviewing how updates work in the executor, I discovered that >> during execution the new tuple slot is populated with the information >> from ExecBuildUpdateProjection() and the old tuple, but that most >> importantly for this use case that function created a bitmap of the >> modified columns (the columns specified in the update). This bitmap >> isn't the same as the one produced by HeapDetermineColumnsInfo() as the >> latter excludes attributes that are not changed after testing equality >> with the helper function heap_attr_equals() where as the former will >> include attributes that appear in the update but are the same value as >> before. This, happily, is immaterial for the purposes of my function >> ExecExprIndexesRequireUpdates() which simply needs to check to see if >> index tuples generated are unchanged. So I had all I needed to run the >> checks ahead of acquiring the lock on the buffer. > > Nice. Handy indeed. I'm not at all a fan of increasing the size of a plan node but it's only by a little... and for a good cause. >> There is much room for improvement, and your suggestions are welcome. > > A general and predictable suggestion is to find ways to break this into > smaller pieces. As-is, this patch would take me an enormous amount of time > to review in any depth. If we can break off some smaller pieces that we > can scrutinize and commit independently, we can start making forward > progress sooner. The UpdateContext and reloption stuff are examples of > things that might be possible to split into independent patches. Fair, I'll try. >> I'll find time to quantify the benefit of this patch for the targeted >> use cases and to ensure that all other cases see no regressions. > > Looking forward to these results. This should also help us decide whether > to set expression_checks by default. In past my test results were very positive for cases where this helped avoid heap and index bloat and almost immeasurably small even for cases where we were doing the work to test but ultimately unable to take the HOT path. This will require new tests as the code has changed quite a bit. >> I added a reloption "expression_checks" to disable this new code path. >> Good idea or bad precedent? > > If there are cases where the added overhead outweighs the benefits (which > seems like it must be true some of the time), then I think we must have a > way to opt-out (or maybe even opt-in). 
In fact, I'd advise adding a GUC to > complement the reloption so that users can configure it at higher levels. This evolved from a GUC to a reloption and I'd rather it go away entirely. I hear your concern, but I've yet to measure a perceptable impact and I'll try hard to keep it that way as this matures. Assuming that's the case, I'd like to eliminate the potentially confusing tuning knob. >> In execIndexing I special case for IsolationIsSerializable() and I can't >> remember why now but I do recall one isolation test failing... I'll >> check on this and get back to the thread. Or maybe you know why that > > I didn't follow this. More later if/when I can reproduce it and understand it better myself. >> I'd like not to build, then rebuild index tuples for these expressions >> but I can't think of a way to do that without a palloc(), this is >> avoided today. > > Is the avoidance of palloc() a strict rule? Is this discussed in the code > anywhere? Not that I know of, just my paranoid self trying to avoid it on a path that didn't have it before. > -- > nathan best. -greg
> On Oct 9, 2025, at 3:08 PM, Jeff Davis <pgsql@j-davis.com> wrote: > > On Wed, 2025-10-08 at 15:48 -0500, Nathan Bossart wrote: >>> The theory being that >>> my new code using the old/new tts to form and test the index tuples >>> resulting from executing expressions was using the resultsRelInfo >>> struct >>> created during plan execution, not the information found on the >>> page, >>> and so was safe without the lock. >> >> An open question (at least from me) is whether this is safe. I'm not >> familiar enough with this area of code yet to confidently determine >> that. Hey Jeff, Thanks for the nudge at PGConf.dev in NYC and for the follow-up here. > The optimization requires that the expression evaluates to the same > thing on the old and new tuples. That determination doesn't have > anything to do with a lock on the buffer, so long as the old tuple > isn't pruned away or something. And clearly it won't be pruned, because > we're in the process of updating it, so we have a snapshot that can see > it. Right, I test that the expression on the index evaluates to the same value when forming an index tuple for old/new slots. > There might be subtleties in other parts of the proposal, but the above > determination can be made safely without a buffer lock. > >> >>> I added a reloption "expression_checks" to disable this new code >>> path. >>> Good idea or bad precedent? >> >> If there are cases where the added overhead outweighs the benefits >> (which >> seems like it must be true some of the time), then I think we must >> have a >> way to opt-out (or maybe even opt-in). In fact, I'd advise adding a >> GUC to >> complement the reloption so that users can configure it at higher >> levels. > > I'll push back against this. For now I'm fine with developer options to > make testing easier, but we should find a way to make this work well > without tuning. I'm aligned with this, the reloption evolved from a GUC and I'm more of the opinion that neither should exist and that the overhead of this be minimized and so require no tuning or consideration by the end user. best. -greg > Regards, > Jeff Davis
> On Oct 9, 2025, at 3:27 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Tue, 2025-10-07 at 17:36 -0400, Greg Burd wrote:
>> After reviewing how updates work in the executor, I discovered that
>> during execution the new tuple slot is populated with the information
>> from ExecBuildUpdateProjection() and the old tuple, but that most
>> importantly for this use case that function created a bitmap of the
>> modified columns (the columns specified in the update). This bitmap
>> isn't the same as the one produced by HeapDetermineColumnsInfo() as
>> the
>> latter excludes attributes that are not changed after testing
>> equality
>> with the helper function heap_attr_equals() where as the former will
>> include attributes that appear in the update but are the same value
>> as
>> before. This, happily, is immaterial for the purposes of my function
>> ExecExprIndexesRequireUpdates() which simply needs to check to see if
>> index tuples generated are unchanged. So I had all I needed to run
>> the
>> checks ahead of acquiring the lock on the buffer.
>
> You're still calling ExecExprIndexesRequireUpdates() from within
> heap_update(). Can't you do that inside of ExecUpdatePrologue() or
> thereabouts?
Hey Jeff,
I'm trying to knit this into the executor layer but that is tricky because
the concept of HOT is very heap-specific, so the executor should be
ignorant of the heap's specific needs (right?). Right now, I am considering
adding a step in ExecUpdatePrologue() just after opening the indexes.
The idea I'm toying with is to have a new function on all TupleTableSlots
that examines the before/after slots for an update and the set of updated
attributes and returns a Bitmapset of the changed attributes that overlap
with indexes and so should trigger index updates in ExecUpdateEpilogue().
That way for heap we'd have something like:
Bitmapset *
tts_heap_getidxattr(ResultRelInfo *info,
                    TupleTableSlot *updated,
                    TupleTableSlot *existing,
                    Bitmapset *updated_attrs)
{
    /*
     * Some combination of HeapDetermineColumnsInfo() and
     * ExecExprIndexesRequireUpdates(); returns the set of indexed
     * attrs that this update changed.
     */
}
So, attributes only referenced by expressions where the expression
produces the same value for the updated and existing slots would be
removed from the set.
Interestingly, summarizing indexes that don't overlap with changed
attributes won't be updated (and that's a good thing).
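A small sketch of that point using BRIN, the in-core summarizing AM (names invented):
CREATE TABLE events (
    id         bigint PRIMARY KEY,
    created_at timestamptz,
    payload    text
);
CREATE INDEX events_created_brin ON events USING brin (created_at);
CREATE INDEX events_payload_idx ON events (payload);
-- payload changed, so the btree needs a new entry (and the update likely
-- isn't HOT), but created_at is untouched, so the summarizing BRIN index
-- wouldn't need to be touched at all.
UPDATE events SET payload = 'retry' WHERE id = 7;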
The problem is that we're not yet accounting for what is about to
happen in ExecUpdateAct() when calling into heap_update(). That's where
the heap tries to fit the new tuple onto the same page. That might be
possible with large tuples thanks to TOAST; it's impossible to say
before getting into that function with the page locked.
So, for updates we include the modified_attrs in the UpdateContext,
which is available to heap_update(). If the heap code decides to go
HOT, great: unset all attributes in modified_attrs except any that are
only summarizing. If the heap can't go HOT, fine: add the indexed attrs
back into modified_attrs, which should trigger all indexes to be
updated.
This gets rid of the TU_UpdateIndexes enum and allows only modified
summarizing indexes to be updated on the HOT path. Two additional
benefits IMO.
at least, that's what I'm trying out now,
-greg
> Regards,
> Jeff Davis
On Tue, 2025-10-14 at 13:46 -0400, Greg Burd wrote:
> I'm trying to knit this into the executor layer but that is tricky
> because
> the concept of HOT is very heap-specific, so the executor should be
> ignorant of the heap's specific needs (right?).
It's wrong for the executor to say "do a HOT update" but it's OK for
the executor to say "this is the set of indexes that might have a new
key after the update". If that set is empty, then the heap can choose
to do a HOT update.
> Right now, I am considering
> adding a step in ExecUpdatePrologue() just after opening the indexes.
Seems like a reasonable place.
> The idea I'm toying with is to have a new function on all
> TupleTableSlots...
>
> That way for heap we'd have something like:
> Bitmapset *tts_heap_getidxattr(ResultRelInfo *info,
> TupleTableSlot *updated,
> TupleTableSlot *existing,
> Bitmapset *updated_attrs)
> {
> some combo of HeapDeterminColumnsInfo() and
> ExecExprIndexesRequireUpdates()
>
> returns the set of indexed attrs that this update changed
> }
Why is this a generic method for all slots? Do we need to reuse it
somewhere else? I would have expected just a static method in
nodeModifyTable.c that does just what's needed.
And to be precise, it's the set of indexed attrs where the update might
have created a new key, right? The whole point is that we don't care if
the indexed attr has been changed, so long as it doesn't create a new
index key.
> Interestingly, summarizing indexes that don't overlap with changed
> attributes won't be updated (and that's a good thing).
Nice.
> Problem is we're not yet accounting for what is about to happen in
> ExecUpdateAct() when calling into the heap_update(). That's where
> heap tries to fit the new tuple onto the same page. That might be
> possible with large tuples thanks to TOAST, it's impossible to say
> before getting into this function with the page locked.
I don't see why that's a problem. The executor can pass down the list
of indexed attrs that might have created new keys after the update,
then heap_update uses that information (along with other factors, like
if it fits on the same page) to determine whether to perform a HOT
update or not.
> So, for updates we include the modified_attrs in the UpdateContext
> which is available to heap_update().
It doesn't look like UpdateContext is currently available to
heap_update(). We might need to change the signature. But I think it's
fine to change the signature if it results in a cleaner design --
tableam extensions often need source changes when new major versions
are released.
> If the heap code decides to
> go HOT, great unset all attributes in the modified_attrs except any
> that are only summarizing. If the heap can't go HOT, fine, add
> the indexed attrs back into modified_attrs which should trigger all
> indexes to be updated.
IIUC, that sounds like a good plan.
> This gets rid of TU_UpdateIndexes enum and allows only modified
> summarizing indexes to be updated on the HOT path. Two additional
> benefits IMO.
I'm not sure that I understand, but I'll look at that after we sort out
some of the other details.
Regards,
Jeff Davis
Hello again. This idea started over a year ago for me while working on a project that used JSONB and had horrible "bloat" and update times that were not fantastic. The root cause: expression indexes prevent HOT updates, and all indexes on JSONB are, by definition, expression indexes. That's the backstory.

The idea for the solution came from a patch [1] that was applied [2], then later reverted [3], which basically evaluated the before/after tuple expressions in heap_update() at the point where newbuf == buffer, just before deciding whether to use_hot_update or not. When the evaluated expressions produced equal results using a binary comparison, the index didn't need to be updated. While this approach sorta worked, it was reverted for a few reasons; here's Tom's summary: "The problem here is that [the code that checks for equality] thinks that the type of the index column is identical to the type of the source datum for it, which is not true for any opclass making use of the opckeytype property." [4]

Still, expanding the domain of what might go HOT seems like a good goal to me, and it was at the heart of the issues I was facing on the project using JSONB, so I kept at it. The patches I've sent on this thread have evolved from that first idea. First they addressed the specific issue raised by Tom. Then they expanded to include partial indexes as well. This worked, and I helped ship a fork of Postgres that used this approach and solved customer issues with the JSONB use case. Customer experience was better: no unnecessary index updates when indexed data wasn't modified meant no more heap/index bloat and faster update times. Vacuum could finally keep up, and performance and storage overhead were much better.

For this patch set, v1 through v17 were essentially work that refined/reworked that original approach, but there was a serious flaw that I didn't fully appreciate until around v16/17: I was evaluating the expressions while holding a lock on the buffer page, which a) expands the time the lock is held, and b) opens the door to self-deadlock. No bueno.

Then there was v18, a quick work-around for that. I moved the call that invokes the executor to the beginning of heap_update(), before taking the lock on the page. To do this I had to find the set of updated attributes, which I discovered was available in the executor as the updated tuple is created. This was a viable fix, but it didn't really go far enough and was a bit hackish IMO.

Jeff Davis and others challenged me to move the work of identifying what's changed into the executor and clean it up. I'm a sucker for a challenge. More generally, this idea made sense: while Postgres has many places where logic is tightly coupled to the way the heap works, this didn't have to be one of them. To make this work I needed the update path in the executor to be interested in:

a) knowing what columns were specified in the UPDATE statement and those impacted by before/after triggers,
b) reducing that set to those attributes known to be both indexed and to have changed value,
c) finding which of those (and possibly other) attributes force new index updates.

Why? Well, that code already exists in a few places and in some cases is replicated: for (a) there is ExecGetAllUpdatedCols(), and for (b) HeapDetermineColumnsInfo() and index_unchanged_by_update().
An interesting thing to note is that HeapDetermineColumnsInfo() might return a set that includes columns not returned by ExecGetAllUpdatedCols(), because HeapDetermineColumnsInfo() iterates over all indexed attributes looking for changes, and that might find an indexed attribute that was changed by heap_modify_tuple() but isn't knowable by ExecGetAllUpdatedCols(). This happens in the tsvector code; see tsvector_op.c's tsvector_update_trigger(), where if (update_needed) it calls heap_modify_tuple_by_cols(). That column isn't known to ExecGetAllUpdatedCols().

HeapDetermineColumnsInfo() is also critical when modifying catalog tuples. Catalog tuples are modified using either Form/GETSTRUCT or values/nulls/replaces and heap_modify_tuple(), then calling into CatalogTupleUpdate(), which calls simple_heap_update(), which calls heap_update(), where we find HeapDetermineColumnsInfo(). The interesting thing here is that when modifying catalog tuples there is knowledge of which attributes are changed, but that knowledge isn't preserved and passed into CatalogTupleUpdate(); rather, it is re-discovered in HeapDetermineColumnsInfo(). That's how catalog tuples are able to take the HOT path: they re-use that same logic. There is a fix for that [5] too (and I really hope that lands in master ASAP), but that's not the subject of this thread.

HeapDetermineColumnsInfo() also helps inform a few other decisions in heap_update(), but these have to happen after taking the buffer lock and are very heap-specific, namely:

1. Do the replica identity key attributes overlap with the modified index attributes, or do they need to be stored externally? This is passed on to ExtractReplicaIdentity() to find out whether we must augment the WAL log.
2. Do any modified indexed attributes intersect with the primary key of the relation? If not, lower the lock mode to enable multixact to work.

HeapDetermineColumnsInfo() also takes a pragmatic approach to testing for equality when looking for modified indexed attributes: it uses datumIsEqual(), which boils down to a simple memcmp() of the before/after HeapTuple datums. This is fine in most cases, but it limits the scope of what can be HOT. Interestingly, this requirement of binary equality has leaked into other parts of the code, namely nbtree's deduplication of TIDs on page split; that code uses binary equality as well. An nbtree index with a case-insensitive collation must store both "A" and "a" despite those being type-equal, because they are not binary equivalent. More on this later.

At this point I had my sights set on HeapDetermineColumnsInfo(). I felt that what it was doing should move into the executor (well, as much of that work as possible), and outside of the buffer lock. This would also open the door to removing redundant code. My thought was that the table AM update API should have an additional argument, the "modified indexed attributes" or "mix_attrs", passed in.

So, here we are at the door of v19... let's begin.

0001 - Reorganize heap update logic

This is preparatory work for the larger goal. heap_update() serves two masters, CatalogTupleUpdate()/simple_heap_update() and heap_tuple_update(), and in reality they are different but need most of the same logic that happens at the start of heap_update(). This patch splits that logic out and moves it into heap_tuple_update() and simple_heap_update(). Functionally nothing changes.
That's the meat of this patch.

0002 - Track changed indexed columns in the executor during UPDATEs

This is the first core set of changes; it doesn't expand HOT updates, but it does restructure where HeapDetermineColumnsInfo()'s core work happens. A new function ExecCheckIndexedAttrsForChanges() in nodeModifyTable.c is now responsible for checking for changes between Datums in the old/new TupleTableSlots. This is different from before in that we're not checking the new HeapTuple Datums versus the HeapTuple we read from the buffer page while holding the lock on that page.

An update starts off by reading the existing tuple using the table AM. Then a new updated tuple is created as the set of changes to the old one. The new TupleTableSlot is the combination of the existing one we just read and the changes we just recorded. So, in the executor, before calling into the table AM's update function, we have a pin on the buffer and the before/after TupleTableSlots for this update. I've put the call to my new ExecCheckIndexedAttrsForChanges() function just before the call to table_tuple_update(), and I've added the "mix_attrs" to that call, which get passed on to the heap in heap_tuple_update() and then heap_update(), and all is well.

Why is this safe? The way I read heap_update(), it has always had code to deal with cases where the tuple is concurrently updated and to react accordingly, thanks to HeapTupleSatisfiesUpdate(), which remains where it was in heap_update(). Visibility checks happened when we first read the tuple to form the updated tuple, and later in heap_update() when we call HeapTupleSatisfiesVisibility() to check for transaction-snapshot mode RI updates. So, this new update path from the executor into the table AM seems to me to be okay and almost functionally equivalent.

But there is one big change to discuss before moving to the simple_heap_update() path. In nodeModifyTable.c, tts_attr_equal() replaces heap_attr_equals(), changing the test for equality when calling into heap_tuple_update(). In the past we used datumIsEqual(), essentially a binary comparison using memcmp(); now the comparison code in tts_attr_equal() uses the type-specific equality function when available and falls back to datumIsEqual() when not.

The other parts of HeapDetermineColumnsInfo() remain in the code, but they still happen within the simple_heap_update() and heap_tuple_update() code. That's where you'll find that, after the buffer is locked, we do (1) and (2) from above. This keeps the heap-specific work in the heap, but we've moved some work up into the executor and outside the buffer lock.

This accomplishes the goal of removing HeapDetermineColumnsInfo() from heap_update() on the path that uses the table AM API heap_tuple_update(), but it doesn't on the simple_heap_update() path; that remains the same as it was in the previous patch. Ideally, my patch to restructure how catalog tuples are updated [5] is committed and we can fully remove HeapDetermineColumnsInfo(), likely speeding up all catalog updates in the process. That's what motivated [5]; please take a look. It required a huge number of changes, so I thought it deserved a life/thread of its own.

Finally, there is the ExecSimpleRelationUpdate() path and slot_modify_data(). On this path we know which attributes are being updated, so we just check to see if they changed, intersect that with the set of indexed attributes, and we have our modified indexed attributes set to pass into simple_heap_update().
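To make the datumIsEqual()-versus-type-specific-equality change above concrete, here's a small sketch (assumes an ICU-enabled build; names invented) where the old and new values are type-equal but not binary-equal:
CREATE COLLATION ci (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
CREATE TABLE people (
    id   bigint PRIMARY KEY,
    name text COLLATE ci
);
CREATE INDEX people_name_idx ON people (name);
INSERT INTO people VALUES (1, 'alice');
-- 'alice' and 'Alice' are equal under the collation's equality operator,
-- but not byte-for-byte equal, so datumIsEqual() says "changed" while
-- tts_attr_equal()'s type-specific comparison says "unchanged".
UPDATE people SET name = 'Alice' WHERE id = 1;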
0003 - Replace index_unchanged_by_update with ri_ChangedIndexedCols

This patch removes the function index_unchanged_by_update() in execIndexing.c and simply re-uses the modified indexed attributes that we've stashed away in ResultRelInfo as ri_ChangedIndexedCols. This provides a hint when calling into the index AM's index_insert() function indicating whether the UPDATE was without logical change to the data. We've done that check; we don't need to do it again.

0004 - Enable HOT updates for expression and partial indexes

This finally gets us back to where this project started, but on much firmer ground than before because we're not going to self-deadlock. The idea has grown from a small function into something larger, but only out of necessity.

In this patch I add ExecWhichIndexesRequireUpdates() in execIndexing.c, which implements (c), finding the set of attributes that force new index updates. This set can be very different from the modified indexed attributes. We know that some attributes are not equal to their previous versions, but does that mean that the index that references that attribute needs a new index tuple? It may, or it may not. Here's the comment on that function that explains:

/*
 * ExecWhichIndexesRequireUpdates
 *
 * Determine which indexes need updating given modified indexed attributes.
 * This function is a companion to ExecCheckIndexedAttrsForChanges(). On the
 * surface, they appear similar but they are doing two very different things.
 *
 * For a standard index on a set of attributes this is the intersection of
 * the mix_attrs and the index attrs (key, expression, but not predicate).
 *
 * For expression indexes and indexes which implement the amcomparedatums()
 * index AM API we'll need to form index datums and compare each attribute
 * to see if any actually changed.
 *
 * For expression indexes the result of the expression might not change at
 * all; this is common with JSONB columns, which require expression indexes
 * and where it is commonplace to index a field within a document and have
 * updates that generally don't update that field.
 *
 * Partial indexes won't trigger index tuples when the old/new tuples are
 * both outside of the predicate range.
 *
 * For nbtree the amcomparedatums() API is critical as it requires that key
 * attributes are equal when they memcmp(), which might not be the case when
 * using type-specific comparison or factoring in collation, which might make
 * an index case insensitive.
 *
 * All of this is to say that the goal is for the executor to know, ahead of
 * calling into the table AM for the update and before calling into the index
 * AM for inserting new index tuples, which attributes at a minimum will
 * necessitate a new index tuple.
 * ...
 */

Whereas before we were comparing Datums in the table relation, now we're comparing Datums in the index relation. Index AMs are free to store what they want; we need to know whether what's changed and referenced by the index means that the index needs a new tuple or not. In the case of a JSONB expression index (or any expression index) the expression is evaluated when calling FormIndexDatum(). The result of the expression is the Datum in the values/isnull arrays. Then we need to compare them to see if they changed. This can be done using the same tts_attr_equal() function, but with the attribute descriptor of the index, not the table relation. In some cases that's not enough, for instance nbtree and its more stringent equality requirements.
For that reason (and another coming up in a second) we need a new optional index AM API, amcomparedatums(). Indexes implementing this function have the ability to compare two Datums for equality in whatever way they want. For nbtree, that's a binary comparison. The new case this also supports is indexes like GIN/RUM where an opclass can extract zero or more pieces from the attribute and form multiple index entries when needed. Those extracted pieces might have odd equality rules as well. This opens the door for index implementations to provide information that helps inform the heap when making HOT decisions, information that didn't exist before.

Take for example the Linux Foundation's DocumentDB [6] project [7], which aims to be an open source alternative to MongoDB built on top of PostgreSQL. One of the pieces of that project is its "extended RUM" index AM implementation [8]. This index extracts portions of the stored BSON documents and forms index keys from them. Here is an example:

CREATE INDEX documents_rum_index_14002
    ON documentdb_data.documents_14001
    USING documentdb_rum (document bson_rum_single_path_ops (path=a, iswildcard='true', tl='2699'))

An index on the "document" column which uses the "bson_rum_single_path_ops" opclass to extract the portion of the BSON document that matches "path=a". For this to potentially be stored by the heap as a HOT update, we need to know that what changed within that document didn't intersect with "path=a", and if it did, that the new value(s) were all equal to the old values. Equality in DocumentDB isn't what you think; it's quite odd and specific to BSON and rules defined by MongoDB, so it's important to allow the index AM to execute what's necessary for its use case.

For the more common JSONB use case we can now have the heap make an informed decision about HOT updates after evaluating the expression. For GIN/RUM/etc. implementations there is a path to HOT. For nbtree there is a way to maintain its requirements.

Is there a cost to all this? Yes, of course. There is net new work being done on some paths. FormIndexDatum() will be called more frequently and sometimes twice for the same thing. This could be improved; there might be a way to cache that information. But to test old against new we'll have to do that work at least twice.

Is there a benefit? Yes, of course. Some redundant code paths are gone, and in the end we've increased the number of cases where HOT updates are a possibility. This especially helps out users of JSONB, but not only them.

What's left undone?

* I need to check code coverage so that I might
* create tests covering all the new cases
* update the README.HOT documentation, wiki, etc.
* performance...

For performance I'd like to examine some worst cases, as in lots of indexes with a lot of new code to exercise where all but the last index would allow for a HOT update. That should represent the maximum amount of new overhead for this code. Then, the other side of the equation: how much does this help JSONB? I think that is something to measure in terms of TPS as well as "bloat" avoided and time spent vacuuming.

I also don't like the TU_UpdateIndexes enum; I think it's a leaky abstraction and really pointless now. I'd like to remove it in favor of the bitmap of attributes known to force index tuples to be inserted. Maybe I'll layer that into the next set.

In the end, this is a lot of work and I believe that it moves the ball forward.
I'll have more metrics on that soon, I hope, but I wanted to get the conversation re-started ASAP as we're late in the v19 cycle.

Finally, the "elephant" in the room (ha!) is PHOT [9]/WARM [10][11]. Yes, some of this work does help bring a solution that allows HOT updates when only updating a subset of indexes closer to reality; I'd be lying not to mention that here. It is a big part of my overall plan for the next year and (I hope) v20, and I need this work to get there. Specifically, PHOT is in part based on HeapDetermineColumnsInfo(), and I needed to make that more truthful for it to work.

I hope you see the value in this work and will partner with me to finalize it and get it into master.

best.

-greg

[1] https://www.postgresql.org/message-id/flat/4d9928ee-a9e6-15f9-9c82-5981f13ffca6%40postgrespro.ru
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c203d6cf8
[3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=05f84605dbeb9cf8279a157234b24bbb706c5256
[4] https://www.postgresql.org/message-id/2877.1541538838%40sss.pgh.pa.us
[5] https://www.postgresql.org/message-id/flat/2C5C8B8D-8B36-4547-88EB-BDCF9A7C8D94@greg.burd.me
[6] https://www.linuxfoundation.org/press/linux-foundation-welcomes-documentdb-to-advance-open-developer-first-nosql-innovation
[7] https://github.com/documentdb/documentdb
[8] https://github.com/documentdb/documentdb/tree/main/pg_documentdb_extended_rum
[9] https://www.postgresql.org/message-id/flat/2ECBBCA0-4D8D-4841-8872-4A5BBDC063D2%40amazon.com
[10] https://www.postgresql.org/message-id/flat/CABOikdMop5Rb_RnS2xFdAXMZGSqcJ-P-BY2ruMd%2BbuUkJ4iDPw%40mail.gmail.com
[11] https://www.postgresql.org/message-id/flat/CABOikdMNy6yowA%2BwTGK9RVd8iw%2BCzqHeQSGpW7Yka_4RSZ_LOQ%40mail.gmail.com
Attachments
On Nov 16 2025, at 1:53 pm, Greg Burd <greg@burd.me> wrote: > 0004 - Enable HOT updates for expression and partial indexes > > This finally gets us back to where this project started, but on much > more firm ground than before because we're not going to self-deadlock. > The idea has grown from a small function into something larger, but only > out of necessity. > > In this patch I add ExecWhichIndexesRequireUpdates() in execIndexing.c > which implements (c) finding the set of attributes that force new index > updates. This set can be very different from the modified indexed > attributes. We know that some attributes are not equal to their > previous versions, but does that mean that the index that references > that attribute needs a new index tuple? It may, or it may not. Here's > the comment on that function that explains: > > /* > * ExecWhichIndexesRequireUpdates > * > * Determine which indexes need updating given modified indexed attributes. > * This function is a companion to ExecCheckIndexedAttrsForChanges(). > On the > * surface, they appear similar but they are doing two very different things. > * > * For a standard index on a set of attributes this is the intersection of > * the mix_attrs and the index attrs (key, expression, but not predicate). > * > * For expression indexes and indexes which implement the amcomparedatums() > * index AM API we'll need to form index datum and compare each > attribute to > * see if any actually changed. > * > * For expression indexes the result of the expression might not change > at all, > * this is common with JSONB columns which require expression indexes > and where > * it is commonplace to index a field within a document and have > updates that > * generally don't update that field. > * > * Partial indexes won't trigger index tuples when the old/new tuples > are both > * outside of the predicate range. > * > * For nbtree the amcomparedatums() API is critical as it requires that key > * attributes are equal when they memcmp(), which might not be the case when > * using type-specific comparison or factoring in collation which might make > * an index case insensitive. > * > * All of this is to say that the goal is for the executor to know, > ahead of > * calling into the table AM for the update and before calling into the index > * AM for inserting new index tuples, which attributes at a minimum will > * necessitate a new index tuple. > * > ... > */ Attached are rebased (d5b4f3a6d4e) patches with the only changes happening in the last patch in the series. 0004 - Enable HOT updates for expression and partial indexes I was never happy with the dual functions ExecCheckIndexedAttrsForChanges() and ExecWhichIndexesRequireUpdates(), it felt like too much overhead and duplication of effort. While updating my tests, adding a few cases, I found that there was also a flaw in the logic. So, time to rewrite and combine them. What did I discover? Before the logic was to find the set of modified indexed attributes then review all the indexes for changed attributes using FormIndexDatum() and comparing before/after to see if expressions really changed the value to be indexed or not. The first pass didn't take into account expressions, the second did. So, an expression index over JSONB data wouldn't extract and test the field within the document, it was just comparing the entire document before/after using the jsonb comparison function, no bueno. This approach wraps both functions into one somewhat simplified function. 
The logic is basically: iterate over the indexes reviewing indexed attributes for changes. Along the way we call into the new index AM's comparison function when present; otherwise we find and use the proper type-specific comparison function for the datum. At the end of the function we have our Bitmapset of attributes that should trigger new index tuples.

> What's left undone?
>
> * I need to check code coverage so that I might

I did this and it was quite good. I'll do it again for this new series, but it's nice to see that the tests are exercising the vast majority of the code paths.

> * create tests covering all the new cases

I think the coverage is good, maybe even redundant or overly complex in places.

> * update the README.HOT documentation, wiki, etc.

Soon; I hope to have this approach solid and under review before solidifying the docs.

> * performance...

Still as yet unmeasured. I know that there is more work per-update to perform these checks, so some overhead, but I don't know if that overhead is more than before with HeapDetermineColumnsInfo() and index_unchanged_by_update(). Those two functions did essentially the same thing, only with binary comparison (datumIsEqual()). I need to measure that. What about doing all this work outside of the buffer lock in heap_update()? Surely that'll give back a bit or at least add to concurrency. Forming index tuples a few extra times and evaluating the expressions 3 times rather than 1 is going to hurt; I think I can come up with a way to cache the formed datum and use it later on, but is that worth it? For complex expressions, yes. Also, what about expressions that expect to be executed once... and now are executed 3x? That's what forced my update to the insert-conflict-specconflict.out test, but AFAICT there is no way to test whether an expression's value is going to change on update without exercising it once for the old tuple and once for the new tuple. Even if it were possible for an index to provide the key, it might have changed after the expression evaluation (as is the case in hash), so I don't think this is avoidable. Maybe that's reason enough to add a reloption to disable the expression evaluation piece of this, given that it might create a logic or performance regression? The flip side is the potential to use the HOT path, and that's a real savings.

One concerning thing is nbtree's assumption that key attributes for TIDs must use binary comparison for equality. This means that for our common case (heap/btree) there is more work per-update than before, which is why I need to measure. I could look into eliminating the nbtree requirement; I don't understand it too well as yet, but I believe that on page split there is an attempt to deduplicate TIDs into a TIDBitmap and the test for when that's possible is datumIsEqual(). If that were the same as in this new code, possibly even using tts_attr_equal(), then... I don't know, I'll have to investigate. Chime in here if you can educate me on this one. :)

best.

-greg
Attachments
On Nov 19 2025, at 1:00 pm, Greg Burd <greg@burd.me> wrote: > > On Nov 16 2025, at 1:53 pm, Greg Burd <greg@burd.me> wrote: > >> 0004 - Enable HOT updates for expression and partial indexes >> >> This finally gets us back to where this project started, but on much >> more firm ground than before because we're not going to >> self-deadlock. >> The idea has grown from a small function into something larger, but only >> out of necessity. >> >> In this patch I add ExecWhichIndexesRequireUpdates() in execIndexing.c >> which implements (c) finding the set of attributes that force new index >> updates. This set can be very different from the modified indexed >> attributes. We know that some attributes are not equal to their >> previous versions, but does that mean that the index that references >> that attribute needs a new index tuple? It may, or it may not. Here's >> the comment on that function that explains: >> >> /* >> * ExecWhichIndexesRequireUpdates >> * >> * Determine which indexes need updating given modified indexed attributes. >> * This function is a companion to ExecCheckIndexedAttrsForChanges(). >> On the >> * surface, they appear similar but they are doing two very different things. >> * >> * For a standard index on a set of attributes this is the >> intersection of >> * the mix_attrs and the index attrs (key, expression, but not predicate). >> * >> * For expression indexes and indexes which implement the amcomparedatums() >> * index AM API we'll need to form index datum and compare each >> attribute to >> * see if any actually changed. >> * >> * For expression indexes the result of the expression might not change >> at all, >> * this is common with JSONB columns which require expression indexes >> and where >> * it is commonplace to index a field within a document and have >> updates that >> * generally don't update that field. >> * >> * Partial indexes won't trigger index tuples when the old/new tuples >> are both >> * outside of the predicate range. >> * >> * For nbtree the amcomparedatums() API is critical as it requires >> that key >> * attributes are equal when they memcmp(), which might not be the >> case when >> * using type-specific comparison or factoring in collation which >> might make >> * an index case insensitive. >> * >> * All of this is to say that the goal is for the executor to know, >> ahead of >> * calling into the table AM for the update and before calling into >> the index >> * AM for inserting new index tuples, which attributes at a minimum will >> * necessitate a new index tuple. >> * >> ... >> */ > > Attached are rebased (d5b4f3a6d4e) patches with the only changes > happening in the last patch in the series. > > 0004 - Enable HOT updates for expression and partial indexes > > I was never happy with the dual functions > ExecCheckIndexedAttrsForChanges() and ExecWhichIndexesRequireUpdates(), > it felt like too much overhead and duplication of effort. While > updating my tests, adding a few cases, I found that there was also a > flaw in the logic. So, time to rewrite and combine them. > > What did I discover? Before the logic was to find the set of modified > indexed attributes then review all the indexes for changed attributes > using FormIndexDatum() and comparing before/after to see if expressions > really changed the value to be indexed or not. The first pass didn't > take into account expressions, the second did. 
So, an expression index > over JSONB data wouldn't extract and test the field within the document, > it was just comparing the entire document before/after using the jsonb > comparison function, no bueno. > > This approach wraps both functions into one somewhat simplified > function. The logic is basically, iterate over the indexes reviewing > indexed attributes for changes. Along the way we call into the new > index AM's comparison function when present, otherwise we find and use > the proper type-specific comparison function for the datum. At the end > of the function we have our Bitmapset of attributes that should trigger > new index tuples. > >> What's left undone? >> >> * I need to check code coverage so that I might > > I did this and it was quite good, I'll do it again for this new series > but it's nice to see that the tests are exercising the vast majority of > the code paths. > >> * create tests covering all the new cases > > I think the coverage is good, maybe even redundant or overly complex > in places. > >> * update the README.HOT documentation, wiki, etc. > > Soon, I hope to have this approach solid and under review before > solidifying the docs. > >> * performance... > > Still as yet unmeasured, I know that there is more work per-update to > perform these checks, so some overhead, but I don't know if that > overhead is more than before with HeapDetermineColumnsInfo() and > index_unchanged_by_update(). Those two functions did essentially the > same thing, only with binary comparison (datumIsEqual()). I need to > measure that. What about doing all this work outside of the buffer lock > in heap_update()? Surely that'll give back a bit or at least add to > concurrency. Forming index tuples a few extra times and evaluating the > expressions 3 times rather than 1 is going to hurt, I think I can come > up with a way to cache the formed datum and use it later on, but is that > worth it? Complex expressions, yes. Also, what about expressions that > expect to be executed once... and now are 3x? That's what forced my > update to the insert-conflict-specconflict.out test, but AFAICT there is > no way to test if an expression's value is going to change on update > without exercising it once for the old tuple and once for the new tuple. > Even if it were possible for an index to provide the key it might have > changed after the expression evaluation (as is the case in hash), so I > don't think this is avoidable. Maybe that's reason enough to add a > reloption to disable the expression evaluation piece of this? Given > that it might create a logic or performance regression. The flip side > is the potential to use the HOT path, that's a real savings. > > One concerning thing is that nbtree's assumption that key attributes for > TIDs must use binary comparison for equality. This means that for our > common case (heap/btree) there is more work per-update than before, > which is why I need to measure. I could look into eliminating the > nbtree requirement, I don't understand it too well as yet by I believe > that on page split there is an attempt to deduplicate TIDs into a > TIDBitmap and the test for when that's possible is datumIsEqual(). If > that were the same as in this new code, possibly evening using > tts_attr_equal(), then... I don't know, I'll have to investigate. Chime > in here if you can educate me on this one. :) > > best. > > -greg Doh! I forgot to commit the fixed regression test expected output before formatting the patch set, here it is. -greg
Attachments
On Wed, 19 Nov 2025 at 19:00, Greg Burd <greg@burd.me> wrote:
>
> Attached are rebased (d5b4f3a6d4e) patches with the only changes
> happening in the last patch in the series.
Here's a high-level review of the patchset, extending on what I shared
offline. I haven't looked too closely at the code changes.
Re: Perf testing
Apart from the workload we've discussed offline, there's another
workload to consider: Right now, we only really consider HOT when we
know there's space on the page. This patch, however, will front-load a
lot more checks before we have access to the page, and that will
~always impact update performance.
I'm a bit worried that the cost of off-page updates (when the page of
the old tuple can't fit the new tuple) will be too significantly
increased, especially considering that we have a default fillfactor of
100 -- if a table's tuples only grow, it's quite likely the table
frequently can't apply HOT regardless of the updated columns. So, a
workload that's tuned to only update tuples in a way that exercises
'can we HOT or not' code for already-full pages would be appreciated.
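The kind of workload I have in mind is something along these lines (a
rough sketch; the table, index, and sizes are just placeholders, and
you'd drive the UPDATE from pgbench or similar with many clients):

CREATE TABLE grow_tab (
    id      int PRIMARY KEY,
    doc     jsonb,
    payload text NOT NULL
) WITH (fillfactor = 100);
CREATE INDEX grow_tab_name_idx ON grow_tab ((doc->>'name'));
INSERT INTO grow_tab
    SELECT g, jsonb_build_object('name', 'user_' || g), repeat('x', 100)
    FROM generate_series(1, 100000) g;
-- The indexed expression never changes, but every update grows the tuple,
-- so the old page is usually too full for HOT anyway and the new pre-lock
-- checks are pure overhead on the non-HOT path.
UPDATE grow_tab SET payload = payload || 'xxxxxxxx'
 WHERE id = 1 + (random() * 99999)::int;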
A solution to the issue might be needed if we lose too much
performance on expensive checks; for example, passing the free space of
the old tuple's page to the checks and short-circuiting to the non-HOT
path if that space is too small for the new tuple.
0001:
I'm not sure I understand why the code is near completely duplicated
here. Maybe you can rename the resulting heap_update to
heap_update_ext, and keep a heap_update around which wraps this
heap_update_ext, allowing old callers to keep their signature until
they need to use _ext's features? Then you can introduce the
duplication if and when needed in later patches -- though I don't
expect much of this duplication to be strictly necessary; it may just
need some new helper functions.
I also see that for every update we're now copying, passing 5
bitmapsets individually and then freeing those bitmaps just a moment
later. I'd like to avoid that overhead and duplication if possible.
Maybe we can store these in an 'update context' struct, passed by
reference down to table_tuple_update() from the calling code, and then
onward to heap_update? That might then also be a prime candidate to
contain the EState * of ExecCheckIndexedAttrsForChanges.
0002:
This patch seems to have some formatting updates to changes you made
in 0001, without actually changing the code (e.g. at heap_update's
definition). When updating code, please put it in the expected
formatting in the same patch.
---
In the patch subject:
> For instance,
> indexes with collation information allowing more HOT updates when the
> index is specified to be case insensitive.
It is incorrect to assume that indexed "btree-equal" datums allow HOT
updates. The user can not be returned an old datum in index-only
scans, even if it's sorted the same as the new datum -- after all, a
function on the datum may return different results even if the datums
are otherwise equal. Think: count_uppercase(string). See also below at
"HOT, datum compare, etc".
---
With the addition of rd_indexedattr we now have 6 bitmaps in
RelationData, which generally only get accessed through
RelationGetIndexAttrBitmap by an enum value. Maybe it's now time to
bite the bullet and change that to a more general approach with
`Bitmapset *rd_bitmaps[NUM_INDEX_ATTR_BITMAP]`? That way, the hot
path in RelationGetIndexAttrBitmap would not depend on the compiler to
determine that the fast path can do simple offset arithmetic to get
the requested bitmap.
0003:
This looks like it's a cleanup patch for 0002, and doesn't have much
standing on its own. Maybe the changes for ri_ChangedIndexedCols can
all be moved into 0003? I think that gives this patch more weight and
standing.
0004:
This does two things:
1. Add index expression evaluation to the toolset to determine which
indexes were unchanged, and
2. Allow index access methods to say "this value has not changed"
even if the datum itself may have changed.
Could that be split up into two different patches?
(aside: 2 makes a lot of sense in some cases, like trgm indexes over
strings when no new trigrams are added/removed, so I really like the
idea behind this change)
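For example (needs pg_trgm; I picked strings that I expect share the
same trigram set, so verify before reusing this):

CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Two binary-distinct strings with (I believe) identical trigram sets, so
-- a trgm index would need no new entries if one replaced the other:
SELECT show_trgm('abcabc')    AS old_trigrams,
       show_trgm('abcabcabc') AS new_trigrams,
       show_trgm('abcabc') = show_trgm('abcabcabc') AS same_set;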
HOT, datum compare, etc.:
Note that for index-only scans on an index to return correct results,
you _must_ update the index (and thus, do a non-HOT update) whenever a
value changes its binary datum, even if the value has the same btree
sort location as the old value. Even for non-IOS-supporting indexes,
the index may need more information than what's used in btree
comparisons when it has a btree opclass.
As GiST and SP-GiST have IOS support, they also need to compare the image of
the datum and not use ordinary equality as defined in nbtree's compare
function: the value must be exactly equal to what the table AM
would've provided.
Primary example: Btree compares the `numeric` datums of `1.0` and
`1.00` as equal, but for a user there is an observable difference; the
following SQL must return `1.0` in every valid plan:
-- assuming: CREATE TABLE mytab (mynumeric numeric);
--           CREATE INDEX ON mytab (mynumeric);
BEGIN;
INSERT INTO mytab (mynumeric) VALUES ('1.00');
UPDATE mytab SET mynumeric = '1.0';
SELECT mynumeric FROM mytab;   -- must show 1.0, even via an index-only scan
COMMIT;
So, IMO, the default datum compare in ExecCheckIndexedAttrsForChanges
and friends should just use datumIsEqual, and not this new
tts_attr_equal.
Indexes without IOS support might be able to opt into using a more lax
datum comparator, but 1.) it should never be the default, as it'd be a
loaded footgun for IndexAM implementers, and 2.) should not depend on
another AM's understanding of attributes, as that is a very leaky
abstraction.
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)
On Nov 21 2025, at 10:25 am, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> On Wed, 19 Nov 2025 at 19:00, Greg Burd <greg@burd.me> wrote:
>>
>> Attached are rebased (d5b4f3a6d4e) patches with the only changes
>> happening in the last patch in the series.
>
> Here's a high-level review of the patchset, extending on what I shared
> offline. I haven't looked too closely at the code changes.
Matthias, thanks for spending a bit of time writing up your thoughts and
for chatting a bit with me before doing so. I really appreciate your
point of view.
> Re: Perf testing
>
> Apart from the workload we've discussed offline, there's another
> workload to consider: Right now, we only really consider HOT when we
> know there's space on the page.
Yes, and no. In master every UPDATE triggers a call into
HeapDetermineColumnsInfo() in heap_update(). That function's job is to
examine all the indexed attributes, checking each one for changes.
The set of indexed attributes is generated by combining a set of
Bitmapsets fetched (copied) from the relcache:
hot_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_HOT_BLOCKING);
sum_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_SUMMARIZED);
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_IDENTITY_KEY);
interesting_attrs = NULL;
interesting_attrs = bms_add_members(interesting_attrs, hot_attrs);
interesting_attrs = bms_add_members(interesting_attrs, sum_attrs);
interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
These "interesting attributes" are each checked comparing the newly
formed updated HeapTuple passed in from the executor against a HeapTuple
read from the page. This happens after taking the buffer lock on that
page. The comparison is done with heap_attr_equals(), which calls
datumIsEqual() and so boils down to a memcmp(). The function returns the
"modified_attrs".
This is the primary "work" that happens before considering HOT that I've
moved outside the buffer lock and into the executor. The rest of the
HOT decision is 1) will it fit (possibly after being TOASTed), and 2) do
the HOT-blocking attributes overlap with the modified_attrs? If they
do, unless they are all summarizing, you can't go HOT.
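For example (a sketch; the stats view below may lag until stats are
flushed, and the table is made up):

CREATE TABLE events (
    id         int PRIMARY KEY,
    created_at timestamptz,
    note       text
);
CREATE INDEX events_brin ON events USING brin (created_at);
INSERT INTO events VALUES (1, now(), 'x');
-- Only the BRIN column changes; BRIN is summarizing, so HOT is still possible.
UPDATE events SET created_at = now() WHERE id = 1;
-- The btree-indexed key changes; this can never be HOT.
UPDATE events SET id = 2 WHERE id = 1;
SELECT n_tup_upd, n_tup_hot_upd
  FROM pg_stat_user_tables WHERE relname = 'events';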
Net new work with my approach is only related to the more complicated
checks that have to be done when there are expressions, partial indexes,
or when an index implements the new index AM API for comparing datums.
In those cases there is a bit more work, especially when it comes to
expression indexes. Evaluating a partial index predicate twice rather
than once is a bit more overhead. Using type-specific comparison when
the index doesn't support index-only scans is a tad more. Overall, not
a whole lot of new work. Certainly a much more complex code path, and
maybe that'll show up in performance tests. I don't know yet.
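For reference, the sort of update all of this is aimed at looks
something like the following (a sketch, names made up):

CREATE TABLE docs (id int PRIMARY KEY, doc jsonb);
CREATE INDEX docs_name_idx ON docs ((doc->>'name'));
INSERT INTO docs VALUES (1, '{"name": "alice", "visits": 1}');
-- The whole jsonb datum changes, so today this is never HOT; the indexed
-- expression (doc->>'name') is unchanged though, which is what the new
-- checks are meant to notice and allow.
UPDATE docs SET doc = jsonb_set(doc, '{visits}', '2') WHERE id = 1;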
> This patch, however, will front-load a
> lot more checks before we have access to the page, and that will
> ~always impact update performance.
Disagree: 90% of that work happens today on that same path, just after
taking the buffer lock. Even when the HeapTuple can't fit into the page
that work still happens, so there's not much net new "wasted" effort.
Concurrency under load may be better because buffer locks will be held
for less time.
> I'm a bit worried that the cost of off-page updates (when the page of
> the old tuple can't fit the new tuple) will be too significantly
> increased, especially considering that we have a default fillfactor of
> 100 -- if a table's tuples only grow, it's quite likely the table
> frequently can't apply HOT regardless of the updated columns. So, a
> workload that's tuned to only update tuples in a way that exercises
> 'can we HOT or not' code for already-full pages would be appreciated.
> A solution to the issue might be needed if we lose too much
> performance on expensive checks; a solution like passing page space of
> the old tuple to the checks, and short-circuiting the non-HOT path if
> that page space is too small for the new tuple.
Here's the problem: not much can be known about the ultimate size the
HeapTuple will take, or whether the page can hold it, until after the
lock is taken.
Second, I feel that this signal would be specific to the heap, its
particular MVCC implementation, and its optimization (HOT). I really
wanted this solution to be non-heap-specific, but heap-enabling.
For me that meant that in the general case the executor should be
concerned with updating only the set of indexes that it must and no
more. So performing this work in the executor ahead of calling into the
table AM or index AM makes sense.
It is only due to heap's "special" model that we have other concerns
related to HOT. Sure, everyone/everything today uses heap so we should
pay attention to this, but I set out not to create yet another thing
that depends on heap's specific operational model. I think I've managed that.
> 0001:
>
> I'm not sure I understand why the code is near completely duplicated
> here. Maybe you can rename the resulting heap_update to
> heap_update_ext, and keep a heap_update around which wraps this
> heap_update_ext, allowing old callers to keep their signature until
> they need to use _ext's features? Then you can introduce the
> duplication if and when needed in later patches -- though I don't
> expect much of this duplication to be strictly necessary; it may just
> need some new helper functions.
This is just a split where I move the top portion of heap_update() into
the two paths that use it. Sure, I could have pulled that into another
function, but in this series the next step is to obliterate one half and
(if I get my other patch in cf-6221) then I can completely remove
HeapDetermineColumnsInfo() and vastly simplify simple_heap_update().
> I also see that for every update we're now copying, passing 5
> bitmapsets individually and then freeing those bitmaps just a moment
> later. I'd like to avoid that overhead and duplication if possible.
We do this today, nothing new here. They are passed by reference, not
by value, into heap_update().
> Maybe we can store these in an 'update context' struct, passed by
> reference down to table_tuple_update() from the calling code, and then
> onward to heap_update? That might then also be a prime candidate to
> contain the EState * of ExecCheckIndexedAttrsForChanges.
I've explored using UpdateContext before; not a bad idea, but again this
is just a setup commit. It could be cleaner on its own, but it doesn't
really take a step backward on any dimension.
> 0002:
>
> This patch seems to have some formatting updates to changes you made
> in 0001, without actually changing the code (e.g. at heap_update's
> definition). When updating code, please put it in the expected
> formatting in the same patch.
I'll find/fix those after v23 attached, sorry for the noise.
> ---
> In the patch subject:
>> For instance,
>> indexes with collation information allowing more HOT updates when the
>> index is specified to be case insensitive.
>
> It is incorrect to assume that indexed "btree-equal" datums allow HOT
> updates. The user can not be returned an old datum in index-only
> scans, even if it's sorted the same as the new datum -- after all, a
> function on the datum may return different results even if the datums
> are otherwise equal. Think: count_uppercase(string). See also below at
> "HOT, datum compare, etc".
Thanks for pointing out the oversight for index-only scans (IOS),
you're right that the code in v22 doesn't handle that correctly. I'll
fix that. I still think that indexes that don't support IOS can and
should use the type-specific equality checks. This opens the door to
HOT with custom types that have unusual equality rules (see BSON).
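A concrete stand-in for what I mean by type-specific equality (a sketch
using citext):

CREATE EXTENSION IF NOT EXISTS citext;
-- Type-specific equality calls these the same value ...
SELECT 'Foo'::citext = 'foo'::citext             AS type_equal;    -- true
-- ... but the stored datums are binary-distinct, which is the only thing
-- datumIsEqual() looks at.
SELECT 'Foo'::citext::text = 'foo'::citext::text AS binary_equal;  -- false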
> With the addition of rd_indexedattr we now have 6 bitmaps in
> RelationData, which generally only get accessed through
> RelationGetIndexAttrBitmap by an enum value. Maybe it's now time to
> bite the bullet and change that to a more general approach with
> `Bitmapset *rd_bitmaps[NUM_INDEX_ATTR_BITMAP]`? That way, the hot
> path in RelationGetIndexAttrBitmap would not depend on the compiler to
> determine that the fast path can do simple offset arithmetic to get
> the requested bitmap.
More than a few of those bitmaps in RelationData are purely for the HOT
tests, yes. It had occurred to me too; I'll review them, consider ways
to consolidate them, and see if I can shrink the set meaningfully.
> 0003:
> This looks like it's a cleanup patch for 0002, and doesn't have much
> standing on its own. Maybe the changes for ri_ChangedIndexedCols can
> all be moved into 0003? I think that gives this patch more weight and
> standing.
The goal in this patch was to show that we could eliminate the redundant
set work of index_unchanged_by_update() in execIndexing's update path.
Separating it out was done to make it more easily reviewed and to show
that the tests pass both before and after, demonstrating the change is safe.
> 0004:
> This does two things:
> 1. Add index expression evaluation to the toolset to determine which
> indexes were unchanged, and
> 2. Allow index access methods to say "this value has not changed"
> even if the datum itself may have changed.
Yes, that's what it does. :)
> Could that be split up into two different patches?
Maybe, I might be able to add the index AM piece first and then the
expression piece.
> (aside: 2 makes a lot of sense in some cases, like trgm indexes over
> strings when no new trigrams are added/removed, so I really like the
> idea behind this change).
Nice, I appreciate that.
> HOT, datum compare, etc.:
>
> Note that for index-only scans on an index to return correct results,
> you _must_ update the index (and thus, do a non-HOT update) whenever a
> value changes its binary datum, even if the value has the same btree
> sort location as the old value.
Yes, I get this now. I also wasn't testing the INCLUDE (non-key)
columns. I've fixed both of those in v23.
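For the record, the INCLUDE case I was missing is along these lines (a
sketch, names made up):

CREATE TABLE orders (id int, total numeric, note text);
CREATE UNIQUE INDEX orders_id_idx ON orders (id) INCLUDE (total);
INSERT INTO orders VALUES (1, 10.00, 'a');
-- The key (id) doesn't change, but the INCLUDE column changes what an
-- index-only scan on orders_id_idx must return, so this can't be HOT.
UPDATE orders SET total = 11.00 WHERE id = 1;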
> Even for non-IOS-supporting indexes,
> the index may need more information than what's used in btree
> comparisons when it has a btree opclass.
It's not always just the btree opclass for equality, but that is common.
> As SP/GIST have IOS support, they also need to compare the image of
> the datum and not use ordinary equality as defined in nbtree's compare
> function: the value must be exactly equal to what the table AM
> would've provided.
In v23 I've changed the logic to use datumIsEqual() for any index that
supports IOS and doesn't supply a custom amcomparedatums() function.
> Primary example: Btree compares the `numeric` datums of `1.0` and
> `1.00` as equal, but for a user there is an observable difference; the
> following SQL must return `1.0` in every valid plan:
>
> BEGIN;
> INSERT INTO mytab (mynumeric) VALUES ('1.00');
> UPDATE mytab SET mynumeric = '1.0';
> SELECT mynumeric FROM mytab;
I'll try to reproduce this and add a test if I can.
> So, IMO, the default datum compare in ExecCheckIndexedAttrsForChanges
> and friends should just use datumIsEqual, and not this new
> tts_attr_equal.
Sure, and that's the case in v23. tts_attr_equal() still has value in
other cases so it's not gone.
> Indexes without IOS support might be able to opt into using a more lax
> datum comparator, but 1.) it should never be the default, as it'd be a
> loaded footgun for IndexAM implementers, and 2.) should not depend on
> another AM's understanding of attributes, as that is a very leaky
> abstraction.
Agree, which is why the default for IOS indexes is datumIsEqual() now.
> Kind regards,
>
> Matthias van de Meent
> Databricks (https://www.databricks.com)
Thanks again for the time and renewed interest in the patch! I've also
added a lot more tests into the heap_hot_updates.sql regression suite,
likely too many, but for now it's good to be testing the corners.
v23 attached, changes are all in 0004, best.
-greg
Attachments
On Sat, 22 Nov 2025, 22:30 Greg Burd, <greg@burd.me> wrote:
> Thanks for pointing out the oversight for index-only scans (IOS),
> you're right that the code in v22 doesn't handle that correctly. I'll
> fix that. I still think that indexes that don't support IOS can and
> should use the type-specific equality checks. This opens the door to
> HOT with custom types that have unusual equality rules (see BSON).

Do you have specific examples why it would be safe to default to
"unusual equality rules" for generally any index's data ingestion
needs? Why must e.g. BSON always be compared with its special equality
test (and not datumIsEqual), why are IOS-less indexes in general never
going to distinguish between binary-distinct but btree-equal values,
and why is exact equality the special case here?

I understand that you want to maximize optimization for specific
workloads that you have in mind, but lacking evidence to the contrary
I am really not convinced that your workloads are sufficiently
generalizable that they can (and should) be the baseline for these new
HOT rules: I have not yet seen good arguments why we could relax
"datum equality" to "type equality" without potentially breaking
existing indexes.

HOT was implemented quite conservatively to make sure that there are
no issues where changed values are not reflected in indexes: each
indexed TID represents a specific and unchanging set of indexed
values, in both the key and non-key attributes of indexes. If a value
changes ever so slightly, that may be a cause for indexes to treat it
differently, and thus HOT must not be used.

Aside: The amsummarizing optimization gets around that check by
realizing the TID itself isn't really indexed, so the rules can be
relaxed around that, but it still needs to go through the effort to
update the summarizing indexes if the relevant attributes were ever so
slightly updated.

This patch right now wants to change these rules and the behaviour of
HOT in two ways:

1.) Instead of only testing attributes mentioned by indexed
expressions for changes, it wants to test the output of the indexed
expressions. I would consider this to be generally safe, as long as
the expressions comply with the rules we have for indexed expressions.
[Which, if not held, would break IOS and various other things, too, so
relying on these rules isn't new or special].

2.) Instead of datumIsEqual, it (by default) wants to do equality
checks as provided by the type's default btree opclass' = operator. I
have not seen evidence that this is safe. I have even explained with
an example that IOS will return distinctly wrong results if only
btree's = operator is used to determine if HOT can be applied, and
that doesn't even begin to cover the issues related to indexes that
may handle data differently from Btree.

I also don't want indexes that at some point in the future invent
support for IOS to return subtly incorrect results due to HOT checks
that depended on the output of a previous version's amcanreturn
output.
So, IMV, tts_attr_equal is just a rather expensive version of
datumIsEqual: the fast path cases would've been handled by
datumIsEqual at least as fast (without a switch() statement with 18
specific cases and a default branch); if there is no btree operator
it'll still default to btree compare, and if not, then if the slow
path uses correctly implemented compare operators (for HOT, and
potentially all other possible indexes), these would have an output
that is indistinguishable from datumIsEqual, with the only differences
being the address and performance of the called function and a lot of
added catalog lookups.

All together, I think it's best to remove the second component of the
changes to the HOT rules (changing the type of matching done for
indexed values with tts_attr_compare) from this patchset. If you
believe this should be added regardless, I think it's best discussed
separately in its own thread and patchset -- it should be relatively
easy to introduce in both current and future versions of this code,
and (if you're correct and this is safe) it would have some benefits
even when committed on its own.

Kind regards,

Matthias van de Meent