Discussion: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Melanie Plageman
Date:
Hi,

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than
GlobalVisState->maybe_needed, it will ERROR out when determining
whether or not to freeze the tuple with "cannot freeze committed
xmax".

In back branches starting with 14, failing to remove tuples older than
OldestXmin during pruning caused vacuum to infinitely loop in
lazy_scan_prune(), as investigated on this [1] thread.

On master, after 1ccc1e05ae removed the retry loop in
lazy_scan_prune() and stopped comparing tuples to OldestXmin, the hang
could no longer happen, but we can still attempt to freeze dead tuples
visibly killed before OldestXmin -- resulting in an ERROR.

Pruning may fail to remove dead tuples with xmax before OldestXmin if
the tuple is not considered removable by GlobalVisState.

For vacuum, the GlobalVisState is initially calculated at the
beginning of vacuuming the relation -- at the same time and with the
same value as VacuumCutoffs->OldestXmin.

A backend's GlobalVisState may be updated again when it is accessed if
a new snapshot is taken or if something caused ComputeXidHorizons() to
be called.

This can happen, for example, at the end of a round of index vacuuming
when GetOldestNonRemovableTransactionId() is called.

Normally this may result in GlobalVisState's horizon moving forward --
potentially allowing more dead tuples to be removed.

However, if a disconnected standby with a running transaction older
than VacuumCutoffs->OldestXmin reconnects to the primary after vacuum
initially calculates GlobalVisState and OldestXmin but before
GlobalVisState is updated, the value of GlobalVisState->maybe_needed
could go backwards.

If this happens in the middle of vacuum's first pruning and freezing
pass, it is possible that pruning/freezing could then encounter a
tuple whose xmax is younger than GlobalVisState->maybe_needed and
older than VacuumCutoffs->OldestXmin. heap_prune_satisfies_vacuum()
would deem the tuple HEAPTUPLE_RECENTLY_DEAD and would not remove it.
But the heap_pre_freeze_checks() would ERROR out with "cannot freeze
committed xmax". This check is to avoid freezing dead tuples.
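
Roughly, the failing check looks like this (a paraphrased sketch, not
the verbatim heapam.c code; the helper name is invented here for
illustration):

/*
 * Sketch of the sanity check behind the error above.  Freezing is about to
 * clear this tuple's xmax, which is only legal if the deleting transaction
 * aborted; a committed xmax means the tuple should already have been pruned
 * away as DEAD.
 */
static void
check_xmax_before_freeze_sketch(HeapTupleHeader htup)
{
    TransactionId xmax = HeapTupleHeaderGetRawXmax(htup);

    if (TransactionIdDidCommit(xmax))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg_internal("cannot freeze committed xmax %u", xmax)));
}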

We can fix this by always removing tuples considered dead before
VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
has a transaction that sees that tuple as alive, because it will
simply wait to replay the removal until it would be correct to do so
or recovery conflict handling will cancel the transaction that sees
the tuple as alive and allow replay to continue.
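
In code terms, the fix amounts to one extra criterion in
heap_prune_satisfies_vacuum(): a tuple that
HeapTupleSatisfiesVacuumHorizon() reports as HEAPTUPLE_RECENTLY_DEAD is
treated as HEAPTUPLE_DEAD once its xmax precedes OldestXmin. A
simplified sketch of the shape of the change (the attached patch is the
authoritative version):

static HTSV_Result
heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
{
    TransactionId dead_after;
    HTSV_Result res;

    res = HeapTupleSatisfiesVacuumHorizon(tup, buffer, &dead_after);

    if (res != HEAPTUPLE_RECENTLY_DEAD)
        return res;

    /*
     * New criterion: during VACUUM, anything deleted before OldestXmin is
     * removable, even if a (possibly retreated) GlobalVisState horizon
     * says otherwise.
     */
    if (prstate->cutoffs &&
        TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
        NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
        return HEAPTUPLE_DEAD;

    /* Otherwise fall back to the GlobalVisState-based test, as before. */
    if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
        res = HEAPTUPLE_DEAD;

    return res;
}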

Attached is the suggested fix for master plus a repro. I wrote it as a
recovery suite TAP test, but I am _not_ proposing we add it to the
ongoing test suite. It is, amongst other things, definitely prone to
flaking. I also had to use loads of data to force two index vacuuming
passes now that we have TIDStore, so it is a slow test.

If you want to run the repro with meson, you'll have to add
't/099_vacuum_hang.pl' to src/test/recovery/meson.build and then run it with:

meson test postgresql:recovery / recovery/099_vacuum_hang

If you use autotools, you can run it with:
make check PROVE_TESTS="t/099_vacuum_hang.pl"

The repro forces a round of index vacuuming after the standby
reconnects and before pruning a dead tuple whose xmax is older than
OldestXmin.

At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
calls GetOldestNonRemovableTransactionId(), thereby updating the
backend's GlobalVisState and moving maybe_needed backwards.

Then vacuum's first pass will continue with pruning and find our later
inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.

I make sure that the standby reconnects between vacuum_get_cutoffs()
and pruning because I have a cursor on the page keeping VACUUM FREEZE
from getting a cleanup lock.

See the repro for step-by-step explanations of how it works.

I have a modified version of this that repros the infinite loop on
14-16 with substantially less data. See it here [2]. Also, the repro
attached to this mail won't work on 14 and 15 because of changes to
background_psql.

- Melanie

[1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee
[2] https://www.postgresql.org/message-id/CAAKRu_Y_NJzF4-8gzTTeaOuUL3CcGoXPjXcAHbTTygT8AyVqag%40mail.gmail.com

Attachments

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Peter Geoghegan
Date:
On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> If vacuum fails to remove a tuple with xmax older than
> VacuumCutoffs->OldestXmin and younger than
> GlobalVisState->maybe_needed, it will ERROR out when determining
> whether or not to freeze the tuple with "cannot freeze committed
> xmax".
>
> In back branches starting with 14, failing to remove tuples older than
> OldestXmin during pruning caused vacuum to infinitely loop in
> lazy_scan_prune(), as investigated on this [1] thread.

This is a great summary.

> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
> has a transaction that sees that tuple as alive, because it will
> simply wait to replay the removal until it would be correct to do so
> or recovery conflict handling will cancel the transaction that sees
> the tuple as alive and allow replay to continue.

I think that this is the right general approach.

> The repro forces a round of index vacuuming after the standby
> reconnects and before pruning a dead tuple whose xmax is older than
> OldestXmin.
>
> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
> calls GetOldestNonRemovableTransactionId(), thereby updating the
> backend's GlobalVisState and moving maybe_needed backwards.

Right. I saw details exactly consistent with this when I used GDB
against a production instance.

I'm glad that you were able to come up with a repro that involves
exactly the same basic elements, including index page deletion.

--
Peter Geoghegan



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Alena Rybakina
Date:

Hi, Melanie! I'm glad to hear that you have found the root cause of the problem) Thank you for that!

On 21.06.2024 02:42, Melanie Plageman wrote:
> Hi,
>
> If vacuum fails to remove a tuple with xmax older than
> VacuumCutoffs->OldestXmin and younger than
> GlobalVisState->maybe_needed, it will ERROR out when determining
> whether or not to freeze the tuple with "cannot freeze committed
> xmax".
>
> In back branches starting with 14, failing to remove tuples older than
> OldestXmin during pruning caused vacuum to infinitely loop in
> lazy_scan_prune(), as investigated on this [1] thread.
>
> On master, after 1ccc1e05ae removed the retry loop in
> lazy_scan_prune() and stopped comparing tuples to OldestXmin, the hang
> could no longer happen, but we can still attempt to freeze dead tuples
> visibly killed before OldestXmin -- resulting in an ERROR.
>
> Pruning may fail to remove dead tuples with xmax before OldestXmin if
> the tuple is not considered removable by GlobalVisState.
>
> For vacuum, the GlobalVisState is initially calculated at the
> beginning of vacuuming the relation -- at the same time and with the
> same value as VacuumCutoffs->OldestXmin.
>
> A backend's GlobalVisState may be updated again when it is accessed if
> a new snapshot is taken or if something caused ComputeXidHorizons() to
> be called.
>
> This can happen, for example, at the end of a round of index vacuuming
> when GetOldestNonRemovableTransactionId() is called.
>
> Normally this may result in GlobalVisState's horizon moving forward --
> potentially allowing more dead tuples to be removed.
>
> However, if a disconnected standby with a running transaction older
> than VacuumCutoffs->OldestXmin reconnects to the primary after vacuum
> initially calculates GlobalVisState and OldestXmin but before
> GlobalVisState is updated, the value of GlobalVisState->maybe_needed
> could go backwards.
>
> If this happens in the middle of vacuum's first pruning and freezing
> pass, it is possible that pruning/freezing could then encounter a
> tuple whose xmax is younger than GlobalVisState->maybe_needed and
> older than VacuumCutoffs->OldestXmin. heap_prune_satisfies_vacuum()
> would deem the tuple HEAPTUPLE_RECENTLY_DEAD and would not remove it.
> But the heap_pre_freeze_checks() would ERROR out with "cannot freeze
> committed xmax". This check is to avoid freezing dead tuples.
>
> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
> has a transaction that sees that tuple as alive, because it will
> simply wait to replay the removal until it would be correct to do so
> or recovery conflict handling will cancel the transaction that sees
> the tuple as alive and allow replay to continue.

This is an interesting and difficult case) I noticed that when initializing the cluster, in my opinion, we provide excessive freezing. Initialization takes a long time, which can lead, for example, to longer test execution. I got rid of this by adding a check that OldestMxact is not FirstMultiXactId, and it works fine.

if (prstate->cutoffs &&
    TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
    prstate->cutoffs->OldestMxact != FirstMultiXactId &&
    NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
    return HEAPTUPLE_DEAD;

Can I keep it?

> Attached is the suggested fix for master plus a repro. I wrote it as a
> recovery suite TAP test, but I am _not_ proposing we add it to the
> ongoing test suite. It is, amongst other things, definitely prone to
> flaking. I also had to use loads of data to force two index vacuuming
> passes now that we have TIDStore, so it is a slow test.
>
> If you want to run the repro with meson, you'll have to add
> 't/099_vacuum_hang.pl' to src/test/recovery/meson.build and then run it with:
>
> meson test postgresql:recovery / recovery/099_vacuum_hang
>
> If you use autotools, you can run it with:
> make check PROVE_TESTS="t/099_vacuum_hang.pl"
>
> The repro forces a round of index vacuuming after the standby
> reconnects and before pruning a dead tuple whose xmax is older than
> OldestXmin.
>
> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
> calls GetOldestNonRemovableTransactionId(), thereby updating the
> backend's GlobalVisState and moving maybe_needed backwards.
>
> Then vacuum's first pass will continue with pruning and find our later
> inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
> maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.
>
> I make sure that the standby reconnects between vacuum_get_cutoffs()
> and pruning because I have a cursor on the page keeping VACUUM FREEZE
> from getting a cleanup lock.
>
> See the repro for step-by-step explanations of how it works.
>
> I have a modified version of this that repros the infinite loop on
> 14-16 with substantially less data. See it here [2]. Also, the repro
> attached to this mail won't work on 14 and 15 because of changes to
> background_psql.
>
> [1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee
> [2] https://www.postgresql.org/message-id/CAAKRu_Y_NJzF4-8gzTTeaOuUL3CcGoXPjXcAHbTTygT8AyVqag%40mail.gmail.com
To be honest, the meson test framework is new to me, but I can see its useful features. I think I will use it for testing my own features)

I couldn't understand why the replica is necessary here. Now I am digging into why I got similar behavior without a replica when I have only one instance. I'm still checking this in my test, but I believe this patch fixes the original problem because the symptoms were the same.

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Heikki Linnakangas
Date:
On 21/06/2024 03:02, Peter Geoghegan wrote:
> On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>> If vacuum fails to remove a tuple with xmax older than
>> VacuumCutoffs->OldestXmin and younger than
>> GlobalVisState->maybe_needed, it will ERROR out when determining
>> whether or not to freeze the tuple with "cannot freeze committed
>> xmax".
>>
>> In back branches starting with 14, failing to remove tuples older than
>> OldestXmin during pruning caused vacuum to infinitely loop in
>> lazy_scan_prune(), as investigated on this [1] thread.
> 
> This is a great summary.

+1

>> We can fix this by always removing tuples considered dead before
>> VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
>> has a transaction that sees that tuple as alive, because it will
>> simply wait to replay the removal until it would be correct to do so
>> or recovery conflict handling will cancel the transaction that sees
>> the tuple as alive and allow replay to continue.
> 
> I think that this is the right general approach.

+1

>> The repro forces a round of index vacuuming after the standby
>> reconnects and before pruning a dead tuple whose xmax is older than
>> OldestXmin.
>>
>> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
>> calls GetOldestNonRemovableTransactionId(), thereby updating the
>> backend's GlobalVisState and moving maybe_needed backwards.
> 
> Right. I saw details exactly consistent with this when I used GDB
> against a production instance.
> 
> I'm glad that you were able to come up with a repro that involves
> exactly the same basic elements, including index page deletion.

Would it be possible to make it robust so that we could always run it 
with "make check"? This seems like an important corner case to 
regression test.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Melanie Plageman
Date:
On Mon, Jun 24, 2024 at 4:10 AM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>
> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
> has a transaction that sees that tuple as alive, because it will
> simply wait to replay the removal until it would be correct to do so
> or recovery conflict handling will cancel the transaction that sees
> the tuple as alive and allow replay to continue.
>
> This is an interesting and difficult case) I noticed that when initializing the cluster, in my opinion, we provide
> excessive freezing. Initialization takes a long time, which can lead, for example, to longer test execution. I got rid
> of this by adding a check that OldestMxact is not FirstMultiXactId, and it works fine.
>
> if (prstate->cutoffs &&
> TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
> prstate->cutoffs->OldestMxact != FirstMultiXactId &&
> NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
>     return HEAPTUPLE_DEAD;
>
> Can I keep it?

This looks like an addition to the new criteria I added to
heap_prune_satisfies_vacuum(). Is that what you are suggesting? If so,
it looks like it would only return HEAPTUPLE_DEAD (and thus only
remove) a subset of the tuples my original criteria would remove. When
vacuum calculates OldestMxact as FirstMultiXactId, it would not remove
those tuples deleted before OldestXmin. It seems like OldestMxact will
equal FirstMultiXactID sometimes right after initdb and after
transaction ID wraparound. I'm not sure I totally understand the
criteria.

One thing I find confusing about this is that this would actually
remove less tuples than with my criteria -- which could lead to more
freezing. When vacuum calculates OldestMxact == FirstMultiXactID, we
would not remove tuples deleted before OldestXmin and thus return
HEAPTUPLE_RECENTLY_DEAD for those tuples. Then we would consider
freezing them. So, it seems like we would do more freezing by adding
this criteria.

Could you explain more about how the criteria you are suggesting
works? Are you saying it does less freezing than master or less
freezing than with my patch?

> Attached is the suggested fix for master plus a repro. I wrote it as a
> recovery suite TAP test, but I am _not_ proposing we add it to the
> ongoing test suite. It is, amongst other things, definitely prone to
> flaking. I also had to use loads of data to force two index vacuuming
> passes now that we have TIDStore, so it is a slow test.
-- snip --
> I have a modified version of this that repros the infinite loop on
> 14-16 with substantially less data. See it here [2]. Also, the repro
> attached to this mail won't work on 14 and 15 because of changes to
> background_psql.
>
> I couldn't understand why the replica is necessary here. Now I am digging into why I got similar behavior without a
> replica when I have only one instance. I'm still checking this in my test, but I believe this patch fixes the original
> problem because the symptoms were the same.

Did you get similar behavior on master or on back branches? Was the
behavior you observed the infinite loop or the error during
heap_prepare_freeze_tuple()?

In my examples, the replica is needed because something has to move
the horizon on the primary backwards. When a standby reconnects with
an older oldest running transaction ID than any of the running
transactions on the primary and the vacuuming backend recomputes its
RecentXmin, the horizon may move backwards when compared to the
horizon calculated at the beginning of the vacuum. Vacuum does not
recompute cutoffs->OldestXmin during vacuuming a relation but it may
recompute the values in the GlobalVisState it uses for pruning.

We knew of only one other way that the horizon could move backwards
which Matthias describes here [1]. However, this is thought to be its
own concurrency-related bug in the commit-abort path that should be
fixed -- as opposed to the standby reconnecting with an older oldest
running transaction ID which can be expected.

Do you know if you were seeing the effects of the scenario Matthias describes?

- Melanie

[1]
https://www.postgresql.org/message-id/CAEze2WjMTh4KS0%3DQEQB-Jq%2BtDLPR%2B0%2BzVBMfVwSPK5A%3DWZa95Q%40mail.gmail.com



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Melanie Plageman
Date:
On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 21/06/2024 03:02, Peter Geoghegan wrote:
> > On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> >
> >> The repro forces a round of index vacuuming after the standby
> >> reconnects and before pruning a dead tuple whose xmax is older than
> >> OldestXmin.
> >>
> >> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
> >> calls GetOldestNonRemovableTransactionId(), thereby updating the
> >> backend's GlobalVisState and moving maybe_needed backwards.
> >
> > Right. I saw details exactly consistent with this when I used GDB
> > against a production instance.
> >
> > I'm glad that you were able to come up with a repro that involves
> > exactly the same basic elements, including index page deletion.
>
> Would it be possible to make it robust so that we could always run it
> with "make check"? This seems like an important corner case to
> regression test.

I'd have to look into how to ensure I can stabilize some of the parts
that seem prone to flaking. I can probably stabilize the vacuum bit
with a query of pg_stat_activity making sure it is waiting to acquire
the cleanup lock.

I don't, however, see a good way around the large amount of data
required to trigger more than one round of index vacuuming. I could
generate the data more efficiently than I am doing here
(generate_series() in the from clause). Perhaps with a copy? I know it
is too slow now to go in an ongoing test, but I don't have an
intuition around how fast it would have to be to be acceptable. Is
there a set of additional tests that are slower that we don't always
run? I didn't follow how the wraparound test ended up, but that seems
like one that would have been slow.

- Melanie



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Tomas Vondra
Date:


On 6/24/24 16:53, Melanie Plageman wrote:
> On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> On 21/06/2024 03:02, Peter Geoghegan wrote:
>>> On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
>>> <melanieplageman@gmail.com> wrote:
>>>
>>>> The repro forces a round of index vacuuming after the standby
>>>> reconnects and before pruning a dead tuple whose xmax is older than
>>>> OldestXmin.
>>>>
>>>> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
>>>> calls GetOldestNonRemovableTransactionId(), thereby updating the
>>>> backend's GlobalVisState and moving maybe_needed backwards.
>>>
>>> Right. I saw details exactly consistent with this when I used GDB
>>> against a production instance.
>>>
>>> I'm glad that you were able to come up with a repro that involves
>>> exactly the same basic elements, including index page deletion.
>>
>> Would it be possible to make it robust so that we could always run it
>> with "make check"? This seems like an important corner case to
>> regression test.
> 
> I'd have to look into how to ensure I can stabilize some of the parts
> that seem prone to flaking. I can probably stabilize the vacuum bit
> with a query of pg_stat_activity making sure it is waiting to acquire
> the cleanup lock.
> 
> I don't, however, see a good way around the large amount of data
> required to trigger more than one round of index vacuuming. I could
> generate the data more efficiently than I am doing here
> (generate_series() in the from clause). Perhaps with a copy? I know it
> is too slow now to go in an ongoing test, but I don't have an
> intuition around how fast it would have to be to be acceptable. Is
> there a set of additional tests that are slower that we don't always
> run? I didn't follow how the wraparound test ended up, but that seems
> like one that would have been slow.
> 

I think it depends on what is the impact on the 'make check' duration.
If it could be added to one of the existing test groups, then it depends
on how long the slowest test in that group is. If the new test needs to
be in a separate group, it probably needs to be very fast.

But I was wondering how much time we are talking about, so I tried

creating a table, filling it with 300k rows => 250ms
creating an index => 180ms
delete 90% data => 200ms
vacuum t => 130ms

which with m_w_m=1MB does two rounds of index cleanup. That's ~760ms.
I'm not sure how much more stuff does the test need to do, but this
would be pretty reasonable, if we could add it to an existing group.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin.

I don't have a great feeling about this fix. It's not that I think
it's wrong. It's just that the underlying problem here is that we have
heap_page_prune_and_freeze() getting both GlobalVisState *vistest and
struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge
of deciding what gets pruned, but that doesn't actually work, because
as I pointed out in
http://postgr.es/m/CA+Tgmob1BtWcP6R5-toVHB5wqHasPTSR2TJkcDCutMzaUYBaHQ@mail.gmail.com
it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your
fix is to consider both variables, which again may be totally correct,
but wouldn't it be a lot better if we didn't have two variables
fighting for control of the same behavior?

(I'm not trying to be a nuisance here -- I think it's great that
you've done the work to pin this down and perhaps there is no better
fix than what you've proposed.)

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Peter Geoghegan
Date:
On Mon, Jun 24, 2024 at 11:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't have a great feeling about this fix. It's not that I think
> it's wrong. It's just that the underlying problem here is that we have
> heap_page_prune_and_freeze() getting both GlobalVisState *vistest and
> struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge
> of deciding what gets pruned, but that doesn't actually work, because
> as I pointed out in
> http://postgr.es/m/CA+Tgmob1BtWcP6R5-toVHB5wqHasPTSR2TJkcDCutMzaUYBaHQ@mail.gmail.com
> it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your
> fix is to consider both variables, which again may be totally correct,
> but wouldn't it be a lot better if we didn't have two variables
> fighting for control of the same behavior?

Why would it be better? It's to our advantage to have vistest prune
away extra tuples when possible. Andres placed a lot of emphasis on
that during the snapshot scalability work -- vistest can be updated on
the fly.

The problem here is that OldestXmin is supposed to be more
conservative than vistest, which it almost always is, except in this
one edge case. I don't think that plugging that hole changes the basic
fact that there is one source of truth about what *needs* to be
pruned. There is such a source of truth: OldestXmin.

--
Peter Geoghegan



On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The problem here is that OldestXmin is supposed to be more
> conservative than vistest, which it almost always is, except in this
> one edge case. I don't think that plugging that hole changes the basic
> fact that there is one source of truth about what *needs* to be
> pruned. There is such a source of truth: OldestXmin.

Well, another approach could be to make it so that OldestXmin actually
is always more conservative than vistest rather than almost always.

I agree with you that letting the pruning horizon move forward during
vacuum is desirable. I'm just wondering if having the vacuum code need
to know a second horizon is really the best way to address that.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Peter Geoghegan
Date:
On Mon, Jun 24, 2024 at 1:05 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > The problem here is that OldestXmin is supposed to be more
> > conservative than vistest, which it almost always is, except in this
> > one edge case. I don't think that plugging that hole changes the basic
> > fact that there is one source of truth about what *needs* to be
> > pruned. There is such a source of truth: OldestXmin.
>
> Well, another approach could be to make it so that OldestXmin actually
> is always more conservative than vistest rather than almost always.

If we did things like that then it would still be necessary to write a
patch like the one Melanie came up with, on the grounds that we'd
really need to be paranoid about having missed some subtlety. We might
as well just rely on the mechanism directly. I just don't think that
it makes much difference.

--
Peter Geoghegan



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Melanie Plageman
Date:
On Mon, Jun 24, 2024 at 1:05 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > The problem here is that OldestXmin is supposed to be more
> > conservative than vistest, which it almost always is, except in this
> > one edge case. I don't think that plugging that hole changes the basic
> > fact that there is one source of truth about what *needs* to be
> > pruned. There is such a source of truth: OldestXmin.
>
> Well, another approach could be to make it so that OldestXmin actually
> is always more conservative than vistest rather than almost always.

For the purposes of pruning, we are effectively always using the more
conservative of the two with this patch.

Are you more concerned about having a single horizon for pruning or
about having a horizon that does not move backwards after being
established at the beginning of vacuuming the relation?

Right now, in master, we do use a single horizon when determining what
is pruned -- that from GlobalVisState. OldestXmin is only used for
freezing and full page visibility determinations. Using a different
horizon for pruning by vacuum than freezing is what is causing the
error on master.

> I agree with you that letting the pruning horizon move forward during
> vacuum is desirable. I'm just wondering if having the vacuum code need
> to know a second horizon is really the best way to address that.

I was thinking about this some more and I realized I don't really get
why we think using GlobalVisState for pruning will let us remove more
tuples in the common case.

I had always thought it was because the vacuuming backend's
GlobalVisState will get updated periodically throughout vacuum and so,
assuming the oldest running transaction changes, our horizon for
vacuum would change. But, in writing this repro, it is actually quite
hard to get GlobalVisState to update. Our backend's RecentXmin needs
to have changed. And there aren't very many places where we take a new
snapshot after starting to vacuum a relation. One of those is at the
end of index vacuuming, but that can only affect the pruning horizon
if we have to do multiple rounds of index vacuuming. Is that really
the case we are thinking of when we say we want the pruning horizon to
move forward during vacuum?

- Melanie



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Peter Geoghegan
Date:
On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I had always thought it was because the vacuuming backend's
> GlobalVisState will get updated periodically throughout vacuum and so,
> assuming the oldest running transaction changes, our horizon for
> vacuum would change.

I believe that it's more of an aspirational thing at this point. That
it is currently aspirational (it happens to some extent but isn't ever
particularly useful) shouldn't change the analysis about how to fix
this bug.

> One of those is at the
> end of index vacuuming, but that can only affect the pruning horizon
> if we have to do multiple rounds of index vacuuming. Is that really
> the case we are thinking of when we say we want the pruning horizon to
> move forward during vacuum?

No, that's definitely not what we were thinking of. It's just an
accident that it's almost the only thing that'll do that.

--
Peter Geoghegan



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Melanie Plageman
Date:
On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> If vacuum fails to remove a tuple with xmax older than
> VacuumCutoffs->OldestXmin and younger than
> GlobalVisState->maybe_needed, it will ERROR out when determining
> whether or not to freeze the tuple with "cannot freeze committed
> xmax".

One thing I don't understand is why it is okay to freeze the xmax of a
dead tuple just because it is from an aborted update.
heap_prepare_freeze_tuple() is called on HEAPTUPLE_RECENTLY_DEAD
tuples with normal xmaxes (non-multis) so that it can freeze tuples
from aborted updates. The only case in which we freeze dead tuples
with a non-multi xmax is if the xmax is from before OldestXmin and is
also not committed (so from an aborted update). Freezing dead tuples
replaces their xmax with InvalidTransactionId -- which would make them
look alive. So, it makes sense we don't do this for dead tuples in the
common case. But why is it 1) okay and 2) desirable to freeze xmaxes
of tuples from aborted updates? Won't it make them look alive again?

- Melanie



On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Are you more concerned about having a single horizon for pruning or
> about having a horizon that does not move backwards after being
> established at the beginning of vacuuming the relation?

I'm not sure I understand. The most important thing here is fixing the
bug. But if we have a choice of how to fix the bug, I'd prefer to do
it by having the pruning code test one horizon that is always correct,
rather than (as I think the patch does) having it test against two
horizons as a way of covering possible discrepancies between
those values.

> Right now, in master, we do use a single horizon when determining what
> is pruned -- that from GlobalVisState. OldestXmin is only used for
> freezing and full page visibility determinations. Using a different
> horizon for pruning by vacuum than freezing is what is causing the
> error on master.

Agreed.

> I had always thought it was because the vacuuming backend's
> GlobalVisState will get updated periodically throughout vacuum and so,
> assuming the oldest running transaction changes, our horizon for
> vacuum would change. But, in writing this repro, it is actually quite
> hard to get GlobalVisState to update. Our backend's RecentXmin needs
> to have changed. And there aren't very many places where we take a new
> snapshot after starting to vacuum a relation. One of those is at the
> end of index vacuuming, but that can only affect the pruning horizon
> if we have to do multiple rounds of index vacuuming. Is that really
> the case we are thinking of when we say we want the pruning horizon to
> move forward during vacuum?

I thought the idea was that the GlobalVisTest stuff would force a
recalculation now and then, but maybe it doesn't actually do that?

Suppose process A begins a transaction, acquires an XID, and then goes
idle. Process B now begins a giant vacuum. At some point in the middle
of the vacuum, A ends the transaction. Are you saying that B's
GlobalVisTest never really notices that this has happened?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Peter Geoghegan
Date:
On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> One thing I don't understand is why it is okay to freeze the xmax of a
> dead tuple just because it is from an aborted update.

We don't do that with XID-based xmaxs. Though perhaps we should, since
we'll already prune-away the successor tuple, and so might as well go
one tiny step further and clear the xmax for the original tuple via
freezing/setting it InvalidTransactionId. Instead we just leave the
original tuple largely undisturbed, with its original xmax.

We do something like that with Multi-based xmax fields, though not
with the specific goal of cleaning up after aborts in mind (we can
also remove lockers that are no longer running, regardless of where
they are relative to OldestXmin, stuff like that). The actual goal
with that is to enforce MultiXactCutoff, independently of whether or
not their member XIDs are < FreezeLimit yet.

> The only case in which we freeze dead tuples
> with a non-multi xmax is if the xmax is from before OldestXmin and is
> also not committed (so from an aborted update).

Perhaps I misunderstand, but: we simply don't freeze DEAD (not
RECENTLY_DEAD) tuples in the first place, because we don't have to
(pruning removes them instead). It doesn't matter if they're DEAD due
to being from aborted transactions or DEAD due to being
deleted/updated by a transaction that committed (committed and <
OldestXmin).

The freezing related code paths in heapam.c don't particularly care
whether a tuple xmax is RECENTLY_DEAD or LIVE to HTSV + OldestXmin.
Just as long as it's not fully DEAD (then it should have been pruned).

--
Peter Geoghegan



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Peter Geoghegan
Date:
On Mon, Jun 24, 2024 at 4:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not sure I understand. The most important thing here is fixing the
> bug. But if we have a choice of how to fix the bug, I'd prefer to do
> it by having the pruning code test one horizon that is always correct,
> rather than (as I think the patch does) having it test against two
> horizons as a way of covering possible discrepancies between
> those values.

Your characterizing of OldestXmin + vistest as two horizons seems
pretty arbitrary to me. I know what you mean, of course, but it seems
like a distinction without a difference.

> I thought the idea was that the GlobalVisTest stuff would force a
> recalculation now and then, but maybe it doesn't actually do that?

It definitely can do that. Just not in a way that meaningfully
increases the number of heap tuples that we can recognize as DEAD and
remove. At least not currently.

> Suppose process A begins a transaction, acquires an XID, and then goes
> idle. Process B now begins a giant vacuum. At some point in the middle
> of the vacuum, A ends the transaction. Are you saying that B's
> GlobalVisTest never really notices that this has happened?

That's my understanding, yes. That is, vistest is approximately the
same thing as OldestXmin anyway. At least for now.

--
Peter Geoghegan



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Melanie Plageman
Date:
On Mon, Jun 24, 2024 at 4:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > One thing I don't understand is why it is okay to freeze the xmax of a
> > dead tuple just because it is from an aborted update.
>
> We don't do that with XID-based xmaxs. Though perhaps we should, since
> we'll already prune-away the successor tuple, and so might as well go
> one tiny step further and clear the xmax for the original tuple via
> freezing/setting it InvalidTransactionId. Instead we just leave the
> original tuple largely undisturbed, with its original xmax.

I thought that was the case too, but we call
heap_prepare_freeze_tuple() on HEAPTUPLE_RECENTLY_DEAD tuples and then

    else if (TransactionIdIsNormal(xid))
    {
        /* Raw xmax is normal XID */
        freeze_xmax = TransactionIdPrecedes(xid, cutoffs->OldestXmin);
    }

And then later we

    if (freeze_xmax)
        frz->xmax = InvalidTransactionId;

and then when we execute freezing the tuple in heap_execute_freeze_tuple()

    HeapTupleHeaderSetXmax(tuple, frz->xmax);

Which sets the xmax to InvalidTransactionId. Or am I missing something?

> > The only case in which we freeze dead tuples
> > with a non-multi xmax is if the xmax is from before OldestXmin and is
> > also not committed (so from an aborted update).
>
> Perhaps I misunderstand, but: we simply don't freeze DEAD (not
> RECENTLY_DEAD) tuples in the first place, because we don't have to
> (pruning removes them instead). It doesn't matter if they're DEAD due
> to being from aborted transactions or DEAD due to being
> deleted/updated by a transaction that committed (committed and <
> OldestXmin).

Right, I'm talking about HEAPTUPLE_RECENTLY_DEAD tuples.
HEAPTUPLE_DEAD tuples are pruned away. But we can't replace the xmax
of a tuple that has been deleted or updated by a transaction that
committed with InvalidTransactionId. And it seems like the code does
that? Why even call heap_prepare_freeze_tuple() on
HEAPTUPLE_RECENTLY_DEAD tuples? Is it mainly to handle MultiXact
freezing?

> The freezing related code paths in heapam.c don't particularly care
> whether a tuple xmax is RECENTLY_DEAD or LIVE to HTSV + OldestXmin.
> Just as long as it's not fully DEAD (then it should have been pruned).

But it just seems like we shouldn't freeze RECENTLY_DEAD either.

- Melanie



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Melanie Plageman
Date:
On Mon, Jun 24, 2024 at 4:51 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Jun 24, 2024 at 4:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I thought the idea was that the GlobalVisTest stuff would force a
> > recalculation now and then, but maybe it doesn't actually do that?
>
> It definitely can do that. Just not in a way that meaningfully
> increases the number of heap tuples that we can recognize as DEAD and
> remove. At least not currently.
>
> > Suppose process A begins a transaction, acquires an XID, and then goes
> > idle. Process B now begins a giant vacuum. At some point in the middle
> > of the vacuum, A ends the transaction. Are you saying that B's
> > GlobalVisTest never really notices that this has happened?
>
> That's my understanding, yes. That is, vistest is approximately the
> same thing as OldestXmin anyway. At least for now.

Exactly. Something has to cause this backend to update its view of the
horizon. At the end of index vacuuming,
GetOldestNonRemovableTransactionId() will explicitly
ComputeXidHorizons() which will update our backend's GlobalVisStates.
Otherwise, if our backend's RecentXmin is updated, by taking a new
snapshot, then we may update our GlobalVisStates. See
GlobalVisTestShouldUpdate() for the conditions under which we would
update our GlobalVisStates during the normal visibility checks
happening during pruning.
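
As a rough illustration (paraphrased and simplified, not the verbatim
procarray.c code), the cached state is only treated as stale once the
backend's RecentXmin differs from the value recorded at the last
accurate horizon computation:

/*
 * Simplified sketch of the "should we recompute?" condition described
 * above.  last_computed_xmin stands in for the static variable that
 * procarray.c keeps from the previous ComputeXidHorizons() call.
 */
static bool
global_vis_should_update_sketch(TransactionId last_computed_xmin)
{
    /* No accurate horizon has been computed in this backend yet. */
    if (!TransactionIdIsValid(last_computed_xmin))
        return true;

    /* Stale only once a newer snapshot has moved RecentXmin. */
    return RecentXmin != last_computed_xmin;
}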

Vacuum used to open indexes after calculating horizons before starting
its first pass. This led to a recomputation of the horizon. But, in
master, there aren't many obvious places where such a thing would be
happening.

- Melanie



On Mon, Jun 24, 2024 at 04:51:24PM -0400, Peter Geoghegan wrote:
> On Mon, Jun 24, 2024 at 4:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I'm not sure I understand. The most important thing here is fixing the
> > bug. But if we have a choice of how to fix the bug, I'd prefer to do
> > it by having the pruning code test one horizon that is always correct,
> > rather than (as I think the patch does) having it test against two
> > horizons as a way of covering possible discrepancies between
> > those values.
> 
> Your characterizing of OldestXmin + vistest as two horizons seems
> pretty arbitrary to me. I know what you mean, of course, but it seems
> like a distinction without a difference.

"Two horizons" matches how I model it.  If the two were _always_ indicating
the same notion of visibility, we wouldn't have this thread.

On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote:
> Right now, in master, we do use a single horizon when determining what
> is pruned -- that from GlobalVisState. OldestXmin is only used for
> freezing and full page visibility determinations. Using a different
> horizon for pruning by vacuum than freezing is what is causing the
> error on master.

Agreed, and I think using different sources for pruning and freezing is a
recipe for future bugs.  Fundamentally, both are about answering "is
snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?"
That's not to say this thread shall unify the two, but I suspect that's the
right long-term direction.



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Peter Geoghegan
Date:
On Mon, Jun 24, 2024 at 9:30 PM Noah Misch <noah@leadboat.com> wrote:
> On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote:
> > Right now, in master, we do use a single horizon when determining what
> > is pruned -- that from GlobalVisState. OldestXmin is only used for
> > freezing and full page visibility determinations. Using a different
> > horizon for pruning by vacuum than freezing is what is causing the
> > error on master.
>
> Agreed, and I think using different sources for pruning and freezing is a
> recipe for future bugs.  Fundamentally, both are about answering "is
> snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?"
> That's not to say this thread shall unify the two, but I suspect that's the
> right long-term direction.

What does it really mean to unify the two, though?

If the OldestXmin field was located in struct GlobalVisState (next to
definitely_needed and maybe_needed), but everything worked in
essentially the same way as it will with Melanie's patch in place,
would that count as unifying the two? Why or why not?

--
Peter Geoghegan



On Mon, Jun 24, 2024 at 09:49:53PM -0400, Peter Geoghegan wrote:
> On Mon, Jun 24, 2024 at 9:30 PM Noah Misch <noah@leadboat.com> wrote:
> > On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote:
> > > Right now, in master, we do use a single horizon when determining what
> > > is pruned -- that from GlobalVisState. OldestXmin is only used for
> > > freezing and full page visibility determinations. Using a different
> > > horizon for pruning by vacuum than freezing is what is causing the
> > > error on master.
> >
> > Agreed, and I think using different sources for pruning and freezing is a
> > recipe for future bugs.  Fundamentally, both are about answering "is
> > snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?"
> > That's not to say this thread shall unify the two, but I suspect that's the
> > right long-term direction.
> 
> What does it really mean to unify the two, though?
> 
> If the OldestXmin field was located in struct GlobalVisState (next to
> definitely_needed and maybe_needed), but everything worked in
> essentially the same way as it will with Melanie's patch in place,
> would that count as unifying the two? Why or why not?

To me, no, unification would mean removing the data redundancy.  Relocating
the redundant field and/or code that updates the redundant field certainly
could reduce the risk of bugs, so I'm not opposing everything short of
removing the data redundancy.  I'm just agreeing with the "prefer" from
https://postgr.es/m/CA+TgmoYzS_bkt_MrNxr5QrXDKfedmh4tStn8UBTTBXqv=3JTew@mail.gmail.com



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Andres Freund
Date:
Hi,

On 2024-06-24 16:35:50 -0400, Robert Haas wrote:
> On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > Are you more concerned about having a single horizon for pruning or
> > about having a horizon that does not move backwards after being
> > established at the beginning of vacuuming the relation?
> 
> I'm not sure I understand. The most important thing here is fixing the
> bug. But if we have a choice of how to fix the bug, I'd prefer to do
> it by having the pruning code test one horizon that is always correct,
> rather than (as I think the patch does) having it test against two
> horizons as a way of covering possible discrepancies between
> those values.

I think that's going in the wrong direction. We *want* to prune more
aggressively if we can (*), the necessary state is represented by the
vistest. That's a different thing than *having* to prune tuples beyond a
certain xmin (the cutoff determined by vacuum.c/vacuumlazy.c). The problem
we're having here is that the two states can get out of sync due to the
vistest "moving backwards", because of hot_standby_feedback (and perhaps also
an issue around aborts).

To prevent that we can
a) make sure that we always take the hard cutoff into account
b) prevent vistest from going backwards

(*) we really ought to become more aggressive, by removing intermediary row
    versions when they're not visible to anyone, but not yet old enough to be
    below the horizon. But that realistically will only be possible in *some*
    cases, e.g. when the predecessor row version is on the same page.



> > I had always thought it was because the vacuuming backend's
> > GlobalVisState will get updated periodically throughout vacuum and so,
> > assuming the oldest running transaction changes, our horizon for
> > vacuum would change. But, in writing this repro, it is actually quite
> > hard to get GlobalVisState to update. Our backend's RecentXmin needs
> > to have changed. And there aren't very many places where we take a new
> > snapshot after starting to vacuum a relation. One of those is at the
> > end of index vacuuming, but that can only affect the pruning horizon
> > if we have to do multiple rounds of index vacuuming. Is that really
> > the case we are thinking of when we say we want the pruning horizon to
> > move forward during vacuum?
> 
> I thought the idea was that the GlobalVisTest stuff would force a
> recalculation now and then, but maybe it doesn't actually do that?

It forces an accurate horizon to be determined the first time it would require
it to determine visibility. The "first time" is determined by RecentXmin not
having changed.

The main goal of the vistest stuff was to not have to determine an accurate
horizon in GetSnapshotData(). Determining an accurate horizon requires
accessing each backends ->xmin, which causes things to scale badly, as ->xmin
changes so frequently.

The cost of determining the accurate horizon is irrelevant for vacuums, but
it's not at all irrelevant for on-access pruning.


> Suppose process A begins a transaction, acquires an XID, and then goes
> idle. Process B now begins a giant vacuum. At some point in the middle
> of the vacuum, A ends the transaction. Are you saying that B's
> GlobalVisTest never really notices that this has happened?

Not at the moment, but we should add heuristics like that.


Greetings,

Andres Freund



On Tue, Jun 25, 2024 at 8:03 AM Andres Freund <andres@anarazel.de> wrote:
> I think that's going in the wrong direction. We *want* to prune more
> aggressively if we can (*), the necessary state is represented by the
> vistest. That's a different thing than *having* to prune tuples beyond a
> certain xmin (the cutoff determined by vacuum.c/vacuumlazy.c). The problem
> we're having here is that the two states can get out of sync due to the
> vistest "moving backwards", because of hot_standby_feedback (and perhaps also
> an issue around aborts).

I agree that we want to prune more aggressively if we can. I think
that fixing this by preventing vistest from going backward is
reasonable, and I like it better than what Melanie proposed, although
I like what Melanie proposed much better than not fixing it! I'm not
sure how to do that cleanly, but one of you may have an idea.

I do think that having a bunch of different XID values that function
as horizons and a vistest object that holds some more XID horizons
floating around in vacuum makes the code hard to understand. The
relationships between the various values are not well-documented. For
instance, the vistest has to be after vacrel->cutoffs.OldestXmin for
correctness, but I don't think there's a single comment anywhere
saying that; meanwhile, the comments for VacuumCutoffs say "OldestXmin
is the Xid below which tuples deleted by any xact (that committed)
should be considered DEAD, not just RECENTLY_DEAD." Surely the reader
can be forgiven for thinking that this is the cutoff that will
actually be used by pruning, but it isn't.

And more generally, it seems like a fairly big problem to me that
LVRelState directly stores NewRelfrozenXid; contains a VacuumCutoffs
object that stores relfrozenxid, OldestXmin, and FreezeLimit; and also
points to a GlobalVisState object that contains definitely_needed and
maybe_needed. That is six different XID cutoffs for one vacuum
operation. That's a lot. I can't describe how they're all different
from each other or what the necessary relationships between them are
off-hand, and I bet nobody else could either, at least until recently,
else we might not have this bug. I feel like if it were possible to
have fewer of them and still have things work, we'd be better off. I'm
not sure that's doable. But six seems like a lot.
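
For reference, here is roughly where those six values live, following
the struct and field names mentioned above (other members omitted;
treat this as a sketch, not the exact declarations):

/* Sketch of the XID cutoffs in play during one VACUUM of a relation. */
typedef struct LVRelState_sketch
{
    TransactionId NewRelfrozenXid;      /* new relfrozenxid to set at the end */

    struct
    {
        TransactionId relfrozenxid;     /* the table's relfrozenxid at start */
        TransactionId OldestXmin;       /* committed deletions before this are DEAD */
        TransactionId FreezeLimit;      /* XIDs older than this get frozen */
    }               cutoffs;            /* stand-in for struct VacuumCutoffs */

    GlobalVisState *vistest;            /* holds maybe_needed / definitely_needed */
} LVRelState_sketch;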

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Andres Freund
Date:
On 2024-06-25 08:42:02 -0400, Robert Haas wrote:
> On Tue, Jun 25, 2024 at 8:03 AM Andres Freund <andres@anarazel.de> wrote:
> > I think that's going in the wrong direction. We *want* to prune more
> > aggressively if we can (*), the necessary state is represented by the
> > vistest. That's a different thing than *having* to prune tuples beyond a
> > certain xmin (the cutoff determined by vacuum.c/vacuumlazy.c). The problem
> > we're having here is that the two states can get out of sync due to the
> > vistest "moving backwards", because of hot_standby_feedback (and perhaps also
> > an issue around aborts).
> 
> I agree that we want to prune more aggressively if we can. I think
> that fixing this by preventing vistest from going backward is
> reasonable, and I like it better than what Melanie proposed, although
> I like what Melanie proposed much better than not fixing it! I'm not
> sure how to do that cleanly, but one of you may have an idea.

It's not hard - but it has downsides. It'll mean that - outside of vacuum -
we'll much more often not react to horizons going backwards due to
hot_standby_feedback. Which means that hot_standby_feedback, when used without
slots, will prevent fewer conflicts.


> I do think that having a bunch of different XID values that function
> as horizons and a vistest object that holds some more XID horizons
> floating around in vacuum makes the code hard to understand. The
> relationships between the various values are not well-documented. For
> instance, the vistest has to be after vacrel->cutoffs.OldestXmin for
> correctness, but I don't think there's a single comment anywhere
> saying that;

It is somewhat documented:

 * Note: the approximate horizons (see definition of GlobalVisState) are
 * updated by the computations done here. That's currently required for
 * correctness and a small optimization. Without doing so it's possible that
 * heap vacuum's call to heap_page_prune_and_freeze() uses a more conservative
 * horizon than later when deciding which tuples can be removed - which the
 * code doesn't expect (breaking HOT).


> And more generally, it seems like a fairly big problem to me that
> LVRelState directly stores NewRelfrozenXid; contains a VacuumCutoffs
> object that stores relfrozenxid, OldestXmin, and FreezeLimit; and also
> points to a GlobalVisState object that contains definitely_needed and
> maybe_needed. That is six different XID cutoffs for one vacuum
> operation. That's a lot. I can't describe how they're all different
> from each other or what the necessary relationships between them are
> off-hand, and I bet nobody else could either, at least until recently,
> else we might not have this bug. I feel like if it were possible to
> have fewer of them and still have things work, we'd be better off. I'm
> not sure that's doable. But six seems like a lot.

Agreed. I don't think you can just unify things though, they actually are all
different for good, or at least decent, reasons.  I think improving the naming
alone could help a good bit though.

Greetings,

Andres Freund



On Tue, Jun 25, 2024 at 9:07 AM Andres Freund <andres@anarazel.de> wrote:
> It's not hard - but it has downsides. It'll mean that - outside of vacuum -
> we'll much more often not react to horizons going backwards due to
> hot_standby_feedback. Which means that hot_standby_feedback, when used without
> slots, will prevent fewer conflicts.

Can you explain this in more detail?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From:
Alena Rybakina
Date:
On 24.06.2024 17:37, Melanie Plageman wrote:
On Mon, Jun 24, 2024 at 4:10 AM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
We can fix this by always removing tuples considered dead before
VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
has a transaction that sees that tuple as alive, because it will
simply wait to replay the removal until it would be correct to do so
or recovery conflict handling will cancel the transaction that sees
the tuple as alive and allow replay to continue.

This is an interesting and difficult case) I noticed that when initializing the cluster, in my opinion, we provide excessive freezing. Initialization takes a long time, which can lead, for example, to longer test execution. I got rid of this by adding the OldestMxact checkbox is not FirstMultiXactId, and it works fine.

if (prstate->cutoffs &&
TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
prstate->cutoffs->OldestMxact != FirstMultiXactId &&
NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))    return HEAPTUPLE_DEAD;

Can I keep it?
This looks like an addition to the new criterion I added to
heap_prune_satisfies_vacuum(). Is that what you are suggesting? If so,
it looks like it would only return HEAPTUPLE_DEAD (and thus only
remove) a subset of the tuples my original criterion would remove.
When vacuum calculates OldestMxact as FirstMultiXactId, it would not
remove those tuples deleted before OldestXmin. It seems like
OldestMxact will equal FirstMultiXactId sometimes right after initdb
and after transaction ID wraparound. I'm not sure I totally understand
the criterion.

One thing I find confusing about this is that this would actually
remove fewer tuples than with my criterion -- which could lead to more
freezing. When vacuum calculates OldestMxact == FirstMultiXactId, we
would not remove tuples deleted before OldestXmin and thus would
return HEAPTUPLE_RECENTLY_DEAD for those tuples. Then we would
consider freezing them. So, it seems like we would do more freezing by
adding this criterion.

Could you explain more about how the criterion you are suggesting
works? Are you saying it does less freezing than master or less
freezing than with my patch?
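
For comparison, roughly how the two checks differ, going only by the
snippets posted in this thread (a sketch, not committed code):

/* Melanie's criterion (as posted upthread), roughly: */
if (prstate->cutoffs &&
    TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
    NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
    return HEAPTUPLE_DEAD;

/*
 * Alena's variant adds one clause.  When OldestMxact happens to equal
 * FirstMultiXactId (e.g. right after initdb), the whole condition becomes
 * false, so such tuples stay HEAPTUPLE_RECENTLY_DEAD and may then be
 * considered for freezing.
 */
if (prstate->cutoffs &&
    TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
    prstate->cutoffs->OldestMxact != FirstMultiXactId &&
    NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
    return HEAPTUPLE_DEAD;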


At first I noticed that, with this patch, vacuum fouls the nodes more often than before, and it seemed to me that initializing the cluster took more time with this patch than before, so I suggested adding this condition. After checking again, I found that the problem was with my laptop. So, sorry for the noise.

Attached is the suggested fix for master plus a repro. I wrote it as a
recovery suite TAP test, but I am _not_ proposing we add it to the
ongoing test suite. It is, amongst other things, definitely prone to
flaking. I also had to use loads of data to force two index vacuuming
passes now that we have TIDStore, so it is a slow test.
-- snip --
I have a modified version of this that repros the infinite loop on
14-16 with substantially less data. See it here [2]. Also, the repro
attached to this mail won't work on 14 and 15 because of changes to
background_psql.

I couldn't understand why the replica is necessary here. Now I am digging into why I got similar behavior without a replica, when I have only one instance. I'm still checking this in my test, but I believe this patch fixes the original problem because the symptoms were the same.
Did you get similar behavior on master or on back branches? Was the
behavior you observed the infinite loop or the error during
heap_prepare_freeze_tuple()?

In my examples, the replica is needed because something has to move
the horizon on the primary backwards. When a standby reconnects with
an older oldest running transaction ID than any of the running
transactions on the primary and the vacuuming backend recomputes its
RecentXmin, the horizon may move backwards when compared to the
horizon calculated at the beginning of the vacuum. Vacuum does not
recompute cutoffs->OldestXmin during vacuuming a relation but it may
recompute the values in the GlobalVisState it uses for pruning.
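
As a rough sketch of that ordering (function and field names from the
tree, the flow paraphrased rather than quoted):

/* At the start of vacuuming the relation: */
vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
    /* cutoffs.OldestXmin is computed here and never refreshed afterwards */
vacrel->vistest = GlobalVisTestFor(rel);
    /* points at shared GlobalVisState, whose horizons can be recomputed
     * later during the same vacuum */

/*
 * If a standby reconnects with an older oldest running XID before the
 * shared state is next recomputed, maybe_needed can end up older than
 * cutoffs.OldestXmin -- the window being discussed here.
 */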

We knew of only one other way that the horizon could move backwards
which Matthias describes here [1]. However, this is thought to be its
own concurrency-related bug in the commit-abort path that should be
fixed -- as opposed to the standby reconnecting with an older oldest
running transaction ID which can be expected.

Do you know if you were seeing the effects of the scenario Matthias describes?


[1] https://www.postgresql.org/message-id/CAEze2WjMTh4KS0%3DQEQB-Jq%2BtDLPR%2B0%2BzVBMfVwSPK5A%3DWZa95Q%40mail.gmail.com
I'm sorry, I need a little more time to figure this out. I will answer this question later.
-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

От
Melanie Plageman
Дата:
On Tue, Jun 25, 2024 at 10:31 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jun 25, 2024 at 9:07 AM Andres Freund <andres@anarazel.de> wrote:
> > It's not hard - but it has downsides. It'll mean that - outside of vacuum -
> > we'll much more often not react to horizons going backwards due to
> > hot_standby_feedback. Which means that hot_standby_feedback, when used without
> > slots, will prevent fewer conflicts.
>
> Can you explain this in more detail?

If we prevent GlobalVisState from moving backward, then we would less
frequently be pushing the horizon backward on the primary in response
to hot standby feedback. Then, the primary would do more things that
would not be safely replayable on the standby -- so the standby could
end up encountering more recovery conflicts.

- Melanie



On Tue, Jun 25, 2024 at 11:39 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Tue, Jun 25, 2024 at 10:31 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, Jun 25, 2024 at 9:07 AM Andres Freund <andres@anarazel.de> wrote:
> > > It's not hard - but it has downsides. It'll mean that - outside of vacuum -
> > > we'll much more often not react to horizons going backwards due to
> > > hot_standby_feedback. Which means that hot_standby_feedback, when used without
> > > slots, will prevent fewer conflicts.
> >
> > Can you explain this in more detail?
>
> If we prevent GlobalVisState from moving backward, then we would less
> frequently be pushing the horizon backward on the primary in response
> to hot standby feedback. Then, the primary would do more things that
> would not be safely replayable on the standby -- so the standby could
> end up encountering more recovery conflicts.

I don't get it. hot_standby_feedback only moves horizons backward on
the primary, AFAIK, when it first connects, or when it reconnects.
Which I guess could be frequent for some users with flaky networks,
but does that really rise to the level of "much more often"?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

От
Andres Freund
Дата:
Hi,

On 2024-06-25 12:31:11 -0400, Robert Haas wrote:
> On Tue, Jun 25, 2024 at 11:39 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > On Tue, Jun 25, 2024 at 10:31 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > On Tue, Jun 25, 2024 at 9:07 AM Andres Freund <andres@anarazel.de> wrote:
> > > > It's not hard - but it has downsides. It'll mean that - outside of vacuum -
> > > > we'll much more often not react to horizons going backwards due to
> > > > hot_standby_feedback. Which means that hot_standby_feedback, when used without
> > > > slots, will prevent fewer conflicts.
> > >
> > > Can you explain this in more detail?
> >
> > If we prevent GlobalVisState from moving backward, then we would less
> > frequently be pushing the horizon backward on the primary in response
> > to hot standby feedback. Then, the primary would do more things that
> > would not be safely replayable on the standby -- so the standby could
> > end up encountering more recovery conflicts.
> 
> I don't get it. hot_standby_feedback only moves horizons backward on
> the primary, AFAIK, when it first connects, or when it reconnects.
> Which I guess could be frequent for some users with flaky networks,
> but does that really rise to the level of "much more often"?

Well, the thing is that with the "prevent it from going backwards" approach,
once the horizon is set to something recent in a backend, it's "sticky".  If a
replica that is a bit behind, or that holds a long-lived snapshot, disconnects,
the vistest state will advance beyond where the replica needs it to be.  Even
if the standby later reconnects, the vistest in long-lived sessions will still
have the more advanced state. So all future pruning those backends do risks
removing rows the standby still deems visible, which causes recovery
conflicts.

I.e. you don't even need frequent disconnects; you just need one disconnect
and sessions that aren't short-lived.
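
Purely to illustrate what the "prevent it from going backwards" approach
would mean mechanically -- a hypothetical helper, not a patch from this
thread:

#include "access/transam.h"

/* Hypothetical: only ever let a backend's stored horizon advance. */
static inline void
advance_horizon_monotonically(FullTransactionId *stored,
                              FullTransactionId computed)
{
    /*
     * Once this backend has seen a newer horizon it keeps it, even if
     * hot_standby_feedback later asks for an older one.  That is why
     * long-lived sessions would keep pruning past what a reconnected
     * standby still needs.
     */
    if (FullTransactionIdFollows(computed, *stored))
        *stored = computed;
}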

That said, obviously there will be plenty setups where this won't cause an
issue. I don't really have a handle on how often it'd be a problem.


Greetings,

Andres Freund



On Tue, Jun 25, 2024 at 1:10 PM Andres Freund <andres@anarazel.de> wrote:
> That said, obviously there will be plenty setups where this won't cause an
> issue. I don't really have a handle on how often it'd be a problem.

Fair enough. Even if it's not super-common, it doesn't seem like a
great idea to regress such scenarios in the back-branches.

Is there any way that we could instead tweak things so that we adjust
the visibility test object itself? Like, can we have a GlobalVisTest API
where we can supply the OldestXmin from the VacuumCutoffs and have it
... do something useful with that?
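
For concreteness, something along these lines -- hypothetical signatures
only, no such functions exist today:

/* Hypothetical API, just to make the idea concrete: */
extern void GlobalVisTestSetOldestXmin(GlobalVisState *state,
                                       TransactionId oldest_xmin);
extern void GlobalVisTestResetOldestXmin(GlobalVisState *state);

/* ... which vacuum might use like this: */
GlobalVisTestSetOldestXmin(vacrel->vistest, vacrel->cutoffs.OldestXmin);
/* ... prune and freeze ... */
GlobalVisTestResetOldestXmin(vacrel->vistest);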

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

От
Andres Freund
Дата:
On 2024-06-25 14:35:00 -0400, Robert Haas wrote:
> Is there any way that we could instead tweak things so that we adjust
> the visibility test object itself? Like, can we have a GlobalVisTest API
> where we can supply the OldestXmin from the VacuumCutoffs and have it
> ... do something useful with that?

I doubt that's doable in the back branches. And even on HEAD, I don't think
it's a particularly attractive option - there's just a global vistest for each
of the types of objects with a specific horizon (they need to be updated
occasionally, e.g. when taking snapshots). So there's not really a spot to put
an associated OldestXmin. We could put it there and remove it at the end of
vacuum / in an exception handler, but that seems substantially worse.
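
Roughly what that looks like (heavily abridged from procarray.c, from
memory):

/* One process-wide GlobalVisState per class of relation: */
static GlobalVisState GlobalVisSharedRels;
static GlobalVisState GlobalVisCatalogRels;
static GlobalVisState GlobalVisDataRels;
static GlobalVisState GlobalVisTempRels;

GlobalVisState *
GlobalVisTestFor(Relation rel)
{
    /*
     * Heavily simplified: the real code picks one of the four based on
     * whether the relation is shared, a catalog, plain data, or temp.
     * The point is that callers get a pointer to shared, process-wide
     * state -- there is no per-vacuum copy to hang an OldestXmin off of.
     */
    return &GlobalVisDataRels;
}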



On Tue, Jun 25, 2024 at 4:41 PM Andres Freund <andres@anarazel.de> wrote:
> I doubt that's doable in the back branches. And even on HEAD, I don't think
> it's a particularly attractive option - there's just a global vistest for each
> of the types of objects with a specific horizon (they need to be updated
> occasionally, e.g. when taking snapshots). So there's not really a spot to put
> an associated OldestXmin. We could put it there and remove it at the end of
> vacuum / in an exception handler, but that seems substantially worse.

Oh, right: I forgot that the visibility test objects were just
pointers to global variables.

Well, I don't know. I guess that doesn't leave any real options but to
fix it as Melanie proposed. But I still don't like it very much. I
feel like having to test against two different thresholds in the
pruning code is surely a sign that we're doing something wrong.
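
For reference, the shape of the resulting check in
heap_prune_satisfies_vacuum() -- simplified from the proposal in this
thread, not the literal code:

/* Tuple came back HEAPTUPLE_RECENTLY_DEAD with a known dead_after XID. */
res = HEAPTUPLE_RECENTLY_DEAD;

/* Threshold 1: the cutoff vacuum computed once, up front. */
if (prstate->cutoffs &&
    TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
    NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
    res = HEAPTUPLE_DEAD;
/* Threshold 2: the shared GlobalVisState, which may have been refreshed. */
else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
    res = HEAPTUPLE_DEAD;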

--
Robert Haas
EDB: http://www.enterprisedb.com