Обсуждение: Re: Expanding HOT updates for expression and partial indexes
On Jul 2 2025, at 2:10 pm, Greg Burd <greg@burd.me> wrote: > The goal is to allow HOT updates under two new conditions: > * when an indexed expression has not changed > * when possible for a partial index This is still true. :) This patch has languished for nearly 2+ months at v17. Why? Primarily due to feedback that although the idea had merit, there was a critical flaw in the approach that made it a non-starter. The flaw was that I'd been executing expressions while holding both the pin and a lock on the buffer, which is not a great idea (self dead lock, etc.). This was pointed out to me (thanks Robert Haas!) and so I needed to re-think my approach. I put the patch aside for a while, then this past week at PGConf.dev/NYC I heard interest from a few people (Jeff Davis, Nathan Bossart) who encouraged me to move the code executing the expressions to just before acquiring the lock but after pinning the buffer. The theory being that my new code using the old/new tts to form and test the index tuples resulting from executing expressions was using the resultsRelInfo struct created during plan execution, not the information found on the page, and so was safe without the lock. This proved tricky because I had been using the modified_attrs and expr_attrs as a test to avoid exercising expressions when unnecessary. Calling HeapDetermineColumnsInfo() outside the buffer lock to get modified_attrs proved to be a problem as it examines an oldtup that is cobbled together from the elements on the page, requiring the lock I was trying to avoid. After reviewing how updates work in the executor, I discovered that during execution the new tuple slot is populated with the information from ExecBuildUpdateProjection() and the old tuple, but that most importantly for this use case that function created a bitmap of the modified columns (the columns specified in the update). This bitmap isn't the same as the one produced by HeapDetermineColumnsInfo() as the latter excludes attributes that are not changed after testing equality with the helper function heap_attr_equals() where as the former will include attributes that appear in the update but are the same value as before. This, happily, is immaterial for the purposes of my function ExecExprIndexesRequireUpdates() which simply needs to check to see if index tuples generated are unchanged. So I had all I needed to run the checks ahead of acquiring the lock on the buffer. So, this led to v18 (attached), which passes test wold including a number of new tests for the various corner cases relative to HOT updates for expressions. There is much room for improvement, and your suggestions are welcome. I'll find time to quantify the benefit of this patch for the targeted use cases and to ensure that all other cases see no regressions. I need to review the tests I've added to ensure that they are the minimal set required, that they communicate effectively their purpose, etc. For now, it's more shotgun than scalpel, .... I'll get to it. I added a reloption "expression_checks" to disable this new code path. Good idea or bad precedent? In execIndexing I special case for IsolationIsSerializable() and I can't remember why now but I do recall one isolation test failing... I'll check on this and get back to the thread. Or maybe you know why that From small things like naming to larger questions like hijacking the modified columns bitmapset for use later in the heap, I need feedback and guidance. I'm using UpdateContext for estate/resultRelInfo and I've added to that lockmode, these change the table am API a tad, I'm open to better ideas that accomplish the same. I'd like not to build, then rebuild index tuples for these expressions but I can't think of a way to do that without a palloc(), this is avoided today. Example: CREATE TABLE t ( x INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, y JSONB ); CREATE INDEX t_age_idx ON t (((y->>'age')::int)); INSERT INTO t (y) VALUES ('{"name": "John", "age": 30, "city": "New York"}'); -- Update an indexed field in the JSONB document, should not be HOT UPDATE t SET y = jsonb_set(y, '{age}', '31') WHERE x = 1; SELECT pg_stat_force_next_flush(); SELECT relname, n_tup_upd, n_tup_hot_upd FROM pg_stat_user_tables WHERE relname = 't'; -- Update a non-indexed field in the JSONB document, new HOT update UPDATE t SET y = jsonb_set(y, '{city}', '"Boston"') WHERE x = 1; SELECT pg_stat_force_next_flush(); SELECT relname, n_tup_upd, n_tup_hot_upd FROM pg_stat_user_tables WHERE relname = 't'; This patch is far from "commit ready", but it does accomplish the $subject and pass tests. I look forward to any/all constructive feedback. best. -greg
Вложения
On Tue, Oct 07, 2025 at 05:36:11PM -0400, Greg Burd wrote: > I put the patch aside for a while, then this past week at PGConf.dev/NYC > I heard interest from a few people (Jeff Davis, Nathan Bossart) who > encouraged me to move the code executing the expressions to just before > acquiring the lock but after pinning the buffer. The theory being that > my new code using the old/new tts to form and test the index tuples > resulting from executing expressions was using the resultsRelInfo struct > created during plan execution, not the information found on the page, > and so was safe without the lock. An open question (at least from me) is whether this is safe. I'm not familiar enough with this area of code yet to confidently determine that. > After reviewing how updates work in the executor, I discovered that > during execution the new tuple slot is populated with the information > from ExecBuildUpdateProjection() and the old tuple, but that most > importantly for this use case that function created a bitmap of the > modified columns (the columns specified in the update). This bitmap > isn't the same as the one produced by HeapDetermineColumnsInfo() as the > latter excludes attributes that are not changed after testing equality > with the helper function heap_attr_equals() where as the former will > include attributes that appear in the update but are the same value as > before. This, happily, is immaterial for the purposes of my function > ExecExprIndexesRequireUpdates() which simply needs to check to see if > index tuples generated are unchanged. So I had all I needed to run the > checks ahead of acquiring the lock on the buffer. Nice. > There is much room for improvement, and your suggestions are welcome. A general and predictable suggestion is to find ways to break this into smaller pieces. As-is, this patch would take me an enormous amount of time to review in any depth. If we can break off some smaller pieces that we can scrutinize and commit independently, we can start making forward progress sooner. The UpdateContext and reloption stuff are examples of things that might be possible to split into independent patches. > I'll find time to quantify the benefit of this patch for the targeted > use cases and to ensure that all other cases see no regressions. Looking forward to these results. This should also help us decide whether to set expression_checks by default. > I added a reloption "expression_checks" to disable this new code path. > Good idea or bad precedent? If there are cases where the added overhead outweighs the benefits (which seems like it must be true some of the time), then I think we must have a way to opt-out (or maybe even opt-in). In fact, I'd advise adding a GUC to complement the reloption so that users can configure it at higher levels. > In execIndexing I special case for IsolationIsSerializable() and I can't > remember why now but I do recall one isolation test failing... I'll > check on this and get back to the thread. Or maybe you know why that I didn't follow this. > I'd like not to build, then rebuild index tuples for these expressions > but I can't think of a way to do that without a palloc(), this is > avoided today. Is the avoidance of palloc() a strict rule? Is this discussed in the code anywhere? -- nathan
On Wed, 2025-10-08 at 15:48 -0500, Nathan Bossart wrote: > > The theory being that > > my new code using the old/new tts to form and test the index tuples > > resulting from executing expressions was using the resultsRelInfo > > struct > > created during plan execution, not the information found on the > > page, > > and so was safe without the lock. > > An open question (at least from me) is whether this is safe. I'm not > familiar enough with this area of code yet to confidently determine > that. The optimization requires that the expression evaluates to the same thing on the old and new tuples. That determination doesn't have anything to do with a lock on the buffer, so long as the old tuple isn't pruned away or something. And clearly it won't be pruned, because we're in the process of updating it, so we have a snapshot that can see it. There might be subtleties in other parts of the proposal, but the above determination can be made safely without a buffer lock. > > > I added a reloption "expression_checks" to disable this new code > > path. > > Good idea or bad precedent? > > If there are cases where the added overhead outweighs the benefits > (which > seems like it must be true some of the time), then I think we must > have a > way to opt-out (or maybe even opt-in). In fact, I'd advise adding a > GUC to > complement the reloption so that users can configure it at higher > levels. I'll push back against this. For now I'm fine with developer options to make testing easier, but we should find a way to make this work well without tuning. Regards, Jeff Davis
On Tue, 2025-10-07 at 17:36 -0400, Greg Burd wrote: > After reviewing how updates work in the executor, I discovered that > during execution the new tuple slot is populated with the information > from ExecBuildUpdateProjection() and the old tuple, but that most > importantly for this use case that function created a bitmap of the > modified columns (the columns specified in the update). This bitmap > isn't the same as the one produced by HeapDetermineColumnsInfo() as > the > latter excludes attributes that are not changed after testing > equality > with the helper function heap_attr_equals() where as the former will > include attributes that appear in the update but are the same value > as > before. This, happily, is immaterial for the purposes of my function > ExecExprIndexesRequireUpdates() which simply needs to check to see if > index tuples generated are unchanged. So I had all I needed to run > the > checks ahead of acquiring the lock on the buffer. You're still calling ExecExprIndexesRequireUpdates() from within heap_update(). Can't you do that inside of ExecUpdatePrologue() or thereabouts? Regards, Jeff Davis
> On Oct 8, 2025, at 4:48 PM, Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Tue, Oct 07, 2025 at 05:36:11PM -0400, Greg Burd wrote: >> I put the patch aside for a while, then this past week at PGConf.dev/NYC >> I heard interest from a few people (Jeff Davis, Nathan Bossart) who >> encouraged me to move the code executing the expressions to just before >> acquiring the lock but after pinning the buffer. The theory being that >> my new code using the old/new tts to form and test the index tuples >> resulting from executing expressions was using the resultsRelInfo struct >> created during plan execution, not the information found on the page, >> and so was safe without the lock. Thanks for taking a look Nathan. > > An open question (at least from me) is whether this is safe. I'm not > familiar enough with this area of code yet to confidently determine that. My read is that it is safe because we're testing the content of two TupleTableSlots both formed in the executor. The function uses only that information and doesn't reference data on the page at all. >> After reviewing how updates work in the executor, I discovered that >> during execution the new tuple slot is populated with the information >> from ExecBuildUpdateProjection() and the old tuple, but that most >> importantly for this use case that function created a bitmap of the >> modified columns (the columns specified in the update). This bitmap >> isn't the same as the one produced by HeapDetermineColumnsInfo() as the >> latter excludes attributes that are not changed after testing equality >> with the helper function heap_attr_equals() where as the former will >> include attributes that appear in the update but are the same value as >> before. This, happily, is immaterial for the purposes of my function >> ExecExprIndexesRequireUpdates() which simply needs to check to see if >> index tuples generated are unchanged. So I had all I needed to run the >> checks ahead of acquiring the lock on the buffer. > > Nice. Handy indeed. I'm not at all a fan of increasing the size of a plan node but it's only by a little... and for a good cause. >> There is much room for improvement, and your suggestions are welcome. > > A general and predictable suggestion is to find ways to break this into > smaller pieces. As-is, this patch would take me an enormous amount of time > to review in any depth. If we can break off some smaller pieces that we > can scrutinize and commit independently, we can start making forward > progress sooner. The UpdateContext and reloption stuff are examples of > things that might be possible to split into independent patches. Fair, I'll try. >> I'll find time to quantify the benefit of this patch for the targeted >> use cases and to ensure that all other cases see no regressions. > > Looking forward to these results. This should also help us decide whether > to set expression_checks by default. In past my test results were very positive for cases where this helped avoid heap and index bloat and almost immeasurably small even for cases where we were doing the work to test but ultimately unable to take the HOT path. This will require new tests as the code has changed quite a bit. >> I added a reloption "expression_checks" to disable this new code path. >> Good idea or bad precedent? > > If there are cases where the added overhead outweighs the benefits (which > seems like it must be true some of the time), then I think we must have a > way to opt-out (or maybe even opt-in). In fact, I'd advise adding a GUC to > complement the reloption so that users can configure it at higher levels. This evolved from a GUC to a reloption and I'd rather it go away entirely. I hear your concern, but I've yet to measure a perceptable impact and I'll try hard to keep it that way as this matures. Assuming that's the case, I'd like to eliminate the potentially confusing tuning knob. >> In execIndexing I special case for IsolationIsSerializable() and I can't >> remember why now but I do recall one isolation test failing... I'll >> check on this and get back to the thread. Or maybe you know why that > > I didn't follow this. More later if/when I can reproduce it and understand it better myself. >> I'd like not to build, then rebuild index tuples for these expressions >> but I can't think of a way to do that without a palloc(), this is >> avoided today. > > Is the avoidance of palloc() a strict rule? Is this discussed in the code > anywhere? Not that I know of, just my paranoid self trying to avoid it on a path that didn't have it before. > -- > nathan best. -greg
> On Oct 9, 2025, at 3:08 PM, Jeff Davis <pgsql@j-davis.com> wrote: > > On Wed, 2025-10-08 at 15:48 -0500, Nathan Bossart wrote: >>> The theory being that >>> my new code using the old/new tts to form and test the index tuples >>> resulting from executing expressions was using the resultsRelInfo >>> struct >>> created during plan execution, not the information found on the >>> page, >>> and so was safe without the lock. >> >> An open question (at least from me) is whether this is safe. I'm not >> familiar enough with this area of code yet to confidently determine >> that. Hey Jeff, Thanks for the nudge at PGConf.dev in NYC and for the follow-up here. > The optimization requires that the expression evaluates to the same > thing on the old and new tuples. That determination doesn't have > anything to do with a lock on the buffer, so long as the old tuple > isn't pruned away or something. And clearly it won't be pruned, because > we're in the process of updating it, so we have a snapshot that can see > it. Right, I test that the expression on the index evaluates to the same value when forming an index tuple for old/new slots. > There might be subtleties in other parts of the proposal, but the above > determination can be made safely without a buffer lock. > >> >>> I added a reloption "expression_checks" to disable this new code >>> path. >>> Good idea or bad precedent? >> >> If there are cases where the added overhead outweighs the benefits >> (which >> seems like it must be true some of the time), then I think we must >> have a >> way to opt-out (or maybe even opt-in). In fact, I'd advise adding a >> GUC to >> complement the reloption so that users can configure it at higher >> levels. > > I'll push back against this. For now I'm fine with developer options to > make testing easier, but we should find a way to make this work well > without tuning. I'm aligned with this, the reloption evolved from a GUC and I'm more of the opinion that neither should exist and that the overhead of this be minimized and so require no tuning or consideration by the end user. best. -greg > Regards, > Jeff Davis
> On Oct 9, 2025, at 3:27 PM, Jeff Davis <pgsql@j-davis.com> wrote: > > On Tue, 2025-10-07 at 17:36 -0400, Greg Burd wrote: >> After reviewing how updates work in the executor, I discovered that >> during execution the new tuple slot is populated with the information >> from ExecBuildUpdateProjection() and the old tuple, but that most >> importantly for this use case that function created a bitmap of the >> modified columns (the columns specified in the update). This bitmap >> isn't the same as the one produced by HeapDetermineColumnsInfo() as >> the >> latter excludes attributes that are not changed after testing >> equality >> with the helper function heap_attr_equals() where as the former will >> include attributes that appear in the update but are the same value >> as >> before. This, happily, is immaterial for the purposes of my function >> ExecExprIndexesRequireUpdates() which simply needs to check to see if >> index tuples generated are unchanged. So I had all I needed to run >> the >> checks ahead of acquiring the lock on the buffer. > > You're still calling ExecExprIndexesRequireUpdates() from within > heap_update(). Can't you do that inside of ExecUpdatePrologue() or > thereabouts? Hey Jeff, I'm trying to knit this into the executor layer but that is tricky because the concept of HOT is very heap-specific, so the executor should be ignorant of the heap's specific needs (right?). Right now, I am considering adding a step in ExecUpdatePrologue() just after opening the indexes. The idea I'm toying with is to have a new function on all TupleTableSlots that examines the before/after slots for an update and the set of updated attributes and returns a Bitmapset of the changed attributes that overlap with indexes and so should trigger index updates in ExecUpdateEpilogue(). That way for heap we'd have something like: Bitmapset *tts_heap_getidxattr(ResultRelInfo *info, TupleTableSlot *updated, TupleTableSlot *existing, Bitmapset *updated_attrs) { some combo of HeapDeterminColumnsInfo() and ExecExprIndexesRequireUpdates() returns the set of indexed attrs that this update changed } So, attributes only referenced by expressions where the expression produces the same value for the updated and existing slots would be removed from the set. Interestingly, summarizing indexes that don't overlap with changed attributes won't be updated (and that's a good thing). Problem is we're not yet accounting for what is about to happen in ExecUpdateAct() when calling into the heap_update(). That's where heap tries to fit the new tuple onto the same page. That might be possible with large tuples thanks to TOAST, it's impossible to say before getting into this function with the page locked. So, for updates we include the modified_attrs in the UpdateContext which is available to heap_update(). If the heap code decides to go HOT, great unset all attributes in the modified_attrs except any that are only summarizing. If the heap can't go HOT, fine, add the indexed attrs back into modified_attrs which should trigger all indexes to be updated. This gets rid of TU_UpdateIndexes enum and allows only modified summarizing indexes to be updated on the HOT path. Two additional benefits IMO. at least, that's what I'm trying out now, -greg > Regards, > Jeff Davis
On Tue, 2025-10-14 at 13:46 -0400, Greg Burd wrote: > I'm trying to knit this into the executor layer but that is tricky > because > the concept of HOT is very heap-specific, so the executor should be > ignorant of the heap's specific needs (right?). It's wrong for the executor to say "do a HOT update" but it's OK for the executor to say "this is the set of indexes that might have a new key after the update". If that set is empty, then the heap can choose to do a HOT update. > Right now, I am considering > adding a step in ExecUpdatePrologue() just after opening the indexes. Seems like a reasonable place. > The idea I'm toying with is to have a new function on all > TupleTableSlots... > > That way for heap we'd have something like: > Bitmapset *tts_heap_getidxattr(ResultRelInfo *info, > TupleTableSlot *updated, > TupleTableSlot *existing, > Bitmapset *updated_attrs) > { > some combo of HeapDeterminColumnsInfo() and > ExecExprIndexesRequireUpdates() > > returns the set of indexed attrs that this update changed > } Why is this a generic method for all slots? Do we need to reuse it somewhere else? I would have expected just a static method in nodeModifyTable.c that does just what's needed. And to be precise, it's the set of indexed attrs where the update might have created a new key, right? The whole point is that we don't care if the indexed attr has been changed, so long as it doesn't create a new index key. > Interestingly, summarizing indexes that don't overlap with changed > attributes won't be updated (and that's a good thing). Nice. > Problem is we're not yet accounting for what is about to happen in > ExecUpdateAct() when calling into the heap_update(). That's where > heap tries to fit the new tuple onto the same page. That might be > possible with large tuples thanks to TOAST, it's impossible to say > before getting into this function with the page locked. I don't see why that's a problem. The executor can pass down the list of indexed attrs that might have created new keys after the update, then heap_update uses that information (along with other factors, like if it fits on the same page) to determine whether to perform a HOT update or not. > So, for updates we include the modified_attrs in the UpdateContext > which is available to heap_update(). It doesn't look like UpdateContext is currently available to heap_update(). We might need to change the signature. But I think it's fine to change the signature if it results in a cleaner design -- tableam extensions often need source changes when new major versions are released. > If the heap code decides to > go HOT, great unset all attributes in the modified_attrs except any > that are only summarizing. If the heap can't go HOT, fine, add > the indexed attrs back into modified_attrs which should trigger all > indexes to be updated. IIUC, that sounds like a good plan. > This gets rid of TU_UpdateIndexes enum and allows only modified > summarizing indexes to be updated on the HOT path. Two additional > benefits IMO. I'm not sure that I understand, but I'll look at that after we sort out some of the other details. Regards, Jeff Davis