Re: snapshot too old issues, first around wraparound and then more.

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: snapshot too old issues, first around wraparound and then more.
Дата
Msg-id 20200403001713.3ukjhhqwscsgf2rl@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: snapshot too old issues, first around wraparound and then more.  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: snapshot too old issues, first around wraparound and then more.
Список pgsql-hackers
Hi,

On 2020-04-01 12:02:18 -0400, Robert Haas wrote:
> I have no objection to the idea that *if* the feature is hopelessly
> broken, it should be removed.

I don't think we have a real choice here at this point, at least for the
back branches.

Just about nothing around old_snapshot_threshold works correctly:

* There are basically no tests (see [1] I jsut sent, and also
  old_snapshot_threshold bypassing a lot of the relevant code).

* We don't detect errors after hot pruning (to allow that is a major
  point of the feature) when access is via any sort of index
  scans. Wrong query results.

* The time->xid mapping is is entirely broken. We don't prevent bloat
  for many multiples of old_snapshot_threshold (if above 1min).

  It's possible, but harder, to have this cause wrong query results.

* In read-mostly workloads it can trigger errors in sessions that are
  much younger than old_snapshot_threshold, if the transactionid is not
  advancing.

  I've not tried to reproduce, but I suspect this can also cause wrong
  query results. Because a later snapshot can have the same xmin as
  older transactions, it sure looks like we can end up with data from an
  older xmin getting removed, but the newer snapshot's whenTaken will
  prevent TestForOldSnapshot_impl() from raising an error.

* I am fairly sure that it can cause crashes (or even data corruption),
  because it assumes that DML never needs to check for old snapshots
  (with no meaningful justificiation). Leading to heap_update/delete to
  assume the page header is a tuple.

* There's obviously also the wraparound issue that made me start this
  thread initially.

Since this is a feature that can result in wrong query results (and
quite possibly crashes / data corruption), I don't think we can just
leave this unfixed.  But given the amount of code / infrastructure
changes required to get this into a working feature, I don't see how we
can unleash those changes onto the stable branches.

There's quite a few issues in here that require not just local bugfixes,
but some design changes too. And it's pretty clear that the feature
didn't go through enough review before getting committed. I see quite
some merit in removing the code in master, and having a potential
reimplementation go through a normal feature integration process.


I don't really know what to do here. Causing problems by neutering a
feature in the back branch *sucks*. While not quite as bad, removing a
feature without a replacement in a major release is pretty harsh
too. But I don't really see any other realistic path forward.


FWIW, I've now worked around the interdependency of s_t_o my snapshot
scalability patch (only took like 10 days). I have manually confirmed it
works with 0/1 minute thresholds.  I can make the tests pass unmodified
if I just add SetOldSnapshotThresholdTimestamp() calls when not
necessary (which obviously makes no sense).  Lead to some decent
improvements around pruning that are independent of s_t_o (with more
possibilities "opened").  But I still think we need to do something
here.

Greetings,

Andres Freund

[1] https://postgr.es/m/20200403001235.e6jfdll3gh2ygbuc%40alap3.anarazel.de



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: snapshot too old issues, first around wraparound and then more.
Следующее
От: Tomas Vondra
Дата:
Сообщение: Re: [PATCH] Incremental sort (was: PoC: Partial sort)