Re: race condition in pg_class

From: Noah Misch
Subject: Re: race condition in pg_class
Date:
Msg-id: 20240512232923.aa.nmisch@google.com
In reply to: Re: race condition in pg_class  (Noah Misch <noah@leadboat.com>)
Responses: Re: race condition in pg_class
           Re: race condition in pg_class
List: pgsql-hackers
I'm attaching patches implementing the LockTuple() design.  It turns out we
don't just lose inplace updates.  We also overwrite unrelated tuples,
reproduced in inplace.spec.  Good starting points are README.tuplock and the
heap_inplace_update_scan() header comment.
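
For orientation, a caller of the new API looks roughly like this.  The
signatures below are simplified stand-ins for illustration, as is the
_finish() step; see the patches for the real ones:

#include "postgres.h"

#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/skey.h"
#include "access/stratnum.h"
#include "catalog/pg_class.h"
#include "utils/fmgroids.h"

/*
 * Illustrative caller: find a pg_class tuple, decide whether anything
 * changes, then finish or cancel.  The scan takes the heavyweight tuple
 * lock; cancel releases it without writing.
 */
static void
set_relhasindex(Relation pg_class_rel, Oid reloid, bool hasindex)
{
    ScanKeyData key;
    HeapTuple   tup;
    void       *state;          /* opaque scan state, simplified */

    ScanKeyInit(&key, Anum_pg_class_oid, BTEqualStrategyNumber,
                F_OIDEQ, ObjectIdGetDatum(reloid));
    heap_inplace_update_scan(pg_class_rel, ClassOidIndexId, true,
                             NULL, 1, &key, &tup, &state);

    if (((Form_pg_class) GETSTRUCT(tup))->relhasindex == hasindex)
        heap_inplace_update_cancel(state);      /* no change; just unlock */
    else
    {
        ((Form_pg_class) GETSTRUCT(tup))->relhasindex = hasindex;
        heap_inplace_update_finish(state, tup); /* write + WAL, then unlock */
    }
}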

On Wed, Nov 01, 2023 at 08:09:15PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> > On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > > We could perhaps make this work by using the existing tuple-lock
> > > infrastructure, rather than inventing new locktags (a choice that
> > > spills to a lot of places including clients that examine pg_locks).

> > > I'd also like to find a solution that fixes the case of a conflicting
> > > manual UPDATE (although certainly that's a stretch goal we may not be
> > > able to reach).

I implemented that; search for ri_needLockTagTuple.
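
In sketch form, that UPDATE-side change amounts to wrapping heap_update() in
the same heavyweight tuple lock.  The flag name is from the patches; the code
below is a simplified stand-in, including the lock-mode macro:

#include "postgres.h"

#include "nodes/execnodes.h"
#include "storage/itemptr.h"
#include "storage/lmgr.h"

/* Assumed to match the patches' lock mode for inplace-update serialization */
#define InplaceUpdateTupleLock  ExclusiveLock

/*
 * Sketch: a manual UPDATE of a catalog that receives inplace updates takes
 * the tuple lock an inplace updater would take, so the two kinds of writers
 * serialize by waiting.
 */
static void
lock_tuple_for_catalog_update(ResultRelInfo *rinfo, ItemPointer tid)
{
    if (rinfo->ri_needLockTagTuple)
        LockTuple(rinfo->ri_RelationDesc, tid, InplaceUpdateTupleLock);
    /* ... heap_update() runs here; UnlockTuple() after it returns ... */
}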

> - GRANT passes to heapam the fixed-size portion of the pre-modification tuple.
>   heap_update() compares those bytes to the oldtup in shared buffers to see if
>   an inplace update happened.  (HEAD could get the bytes from a new
>   heap_update() parameter, while back branches would need a different passing
>   approach.)

This could have been fine, but ...

> - LockTuple(), as seen in its attached prototype.  I like this least at the
>   moment, because it changes pg_locks content without having a clear advantage
>   over the previous option.

... I settled on the LockTuple() design for these reasons:

- Solves more conflicts by waiting, instead of by ERROR or by retry loops.
- Extensions wanting inplace updates don't have a big disadvantage over core
  code inplace updates.
- One could use this to stop "tuple concurrently updated" for pg_class rows,
  by using SearchSysCacheLocked1() for all pg_class DDL and making that
  function wait for any existing xmax like inplace_xmax_lock() does.  I don't
  expect to write that, but it's a nice option to have.
- pg_locks shows the new lock acquisitions.
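
To illustrate the third point, hypothetical pg_class DDL through
SearchSysCacheLocked1() could look like the following.  The function is from
the patches, but this caller and the return convention shown are just a
sketch:

#include "postgres.h"

#include "storage/lmgr.h"
#include "utils/rel.h"
#include "utils/syscache.h"

/* Assumed to match the patches' lock mode for inplace-update serialization */
#define InplaceUpdateTupleLock  ExclusiveLock

/*
 * Sketch: look up the pg_class row with its tuple lock held, update it
 * transactionally, then unlock.  With the lookup also waiting out any
 * existing xmax (as inplace_xmax_lock() does), "tuple concurrently
 * updated" would stop arising for pg_class rows.
 */
static void
alter_pg_class_example(Relation pg_class_rel, Oid reloid)
{
    HeapTuple   tup;

    tup = SearchSysCacheLocked1(RELOID, ObjectIdGetDatum(reloid));
    if (!HeapTupleIsValid(tup))
        elog(ERROR, "cache lookup failed for relation %u", reloid);

    /* ... ordinary transactional update of the row goes here ... */

    UnlockTuple(pg_class_rel, &tup->t_self, InplaceUpdateTupleLock);
    ReleaseSysCache(tup);
}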

Separable, nontrivial things not fixed in the attached patch stack:

- Inplace update uses transactional CacheInvalidateHeapTuple().  ROLLBACK of
  CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
  still seen in inplace-inval.spec.  CacheInvalidateRelmap() does this right.

- AtEOXact_Inval(true) is outside the RecordTransactionCommit() critical
  section, but it is critical.  We must not commit transactional DDL without
  other backends receiving an inval.  (When the inplace inval becomes
  nontransactional, it will face the same threat.)

- Trouble is possible, I bet, if the system crashes between the inplace-update
  memcpy() and XLogInsert().  See the new XXX comment below the memcpy().
  We might solve this by having inplace update set DELAY_CHKPT, write WAL, and
  only then memcpy() into the buffer; a sketch of that ordering follows this
  list.

- [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's
  xmin does not stop pruning, an MVCC scan in that backend can find zero
  tuples when one is live.  This is like what all backends got in the days of
  SnapshotNow catalog scans.  See the pgbench test suite addition.  (Perhaps
  the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM,
  setting that flag later and unsetting it earlier.)
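
For the DELAY_CHKPT item above, the ordering I have in mind is roughly the
following.  DELAY_CHKPT_START, the WAL calls, and the buffer calls are
existing APIs; record assembly is simplified to a single data chunk:

#include "postgres.h"

#include "access/heapam_xlog.h"
#include "access/rmgr.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/proc.h"

/*
 * Sketch: block checkpoints, write the WAL record first, and only then
 * memcpy() into the shared buffer, so a checkpoint can never capture the
 * new page contents without the WAL that redoes them.
 */
static void
inplace_wal_before_memcpy(Buffer buffer, Page page,
                          char *dst, const char *src, Size len)
{
    XLogRecPtr  recptr;

    MyProc->delayChkptFlags |= DELAY_CHKPT_START;

    START_CRIT_SECTION();
    XLogBeginInsert();
    XLogRegisterData((char *) src, len);
    recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);

    memcpy(dst, src, len);      /* buffer change happens after WAL insert */
    MarkBufferDirty(buffer);
    PageSetLSN(page, recptr);
    END_CRIT_SECTION();

    MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
}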

If you find decisions in this thread's patches tied to any of those, such that
I should not separate them, let's discuss that.  Topics in the patches that I
feel are most fruitful for debate:

- This makes inplace update block if the tuple has an updater.  It's like one
  GRANT blocking another, except an inplace updater won't get "ERROR:  tuple
  concurrently updated" like one of the GRANTs would.  I had implemented
  versions that avoided this blocking by mutating each tuple in the updated
  tuple chain.  That worked, but it had corner cases bad for maintainability,
  listed in the inplace_xmax_lock() header comment.  I'd rather accept the
  blocking, so hackers can rule out those corner cases.  A long-running GRANT
  already hurts VACUUM progress more just by keeping an XID running.

- Pre-checks could make heap_inplace_update_cancel() calls rarer.  Avoiding
  one of those avoids an exclusive buffer lock, and it avoids waiting on
  concurrent heap_update() if any.  We'd pre-check the syscache tuple.
  EventTriggerOnLogin() does it that way, because the code was already in that
  form.  I expect only vac_update_datfrozenxid() concludes !dirty often enough
  to matter.  I didn't bother with the optimization, but it would be simple;
  a sketch follows this list of topics.

- If any Citus extension user feels like porting its heap_inplace_update()
  call to this, I'd value hearing about your experience.

- I paid more than my usual attention to test coverage, considering the patch
  stack's intensity compared to most back-patch bug fixes.
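
Returning to the pre-check topic: a sketch of the simple version, using the
syscache copy to decide whether the locked path is needed at all
(illustration only; not part of the patches):

#include "postgres.h"

#include "access/htup_details.h"
#include "catalog/pg_database.h"
#include "utils/syscache.h"

/*
 * Sketch: peek at the syscache copy and skip the exclusive buffer lock
 * (and any wait on a concurrent heap_update()) when no field would change.
 */
static bool
datfrozenxid_would_change(Oid dboid, TransactionId newFrozenXid,
                          MultiXactId newMinMulti)
{
    HeapTuple   tup;
    Form_pg_database form;
    bool        dirty;

    tup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(dboid));
    if (!HeapTupleIsValid(tup))
        return true;            /* be conservative; take the locked path */
    form = (Form_pg_database) GETSTRUCT(tup);
    dirty = (form->datfrozenxid != newFrozenXid ||
             form->datminmxid != newMinMulti);
    ReleaseSysCache(tup);
    return dirty;
}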

I've kept all the above topics brief; feel free to ask for more details.

Thanks,
nm

