Thread: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Date:
Hi,

The attached patch set eliminates xl_heap_visible, the WAL record
emitted when a block of the heap is set all-visible/frozen in the
visibility map. Instead, it includes the information needed to update
the VM in the WAL record already emitted by the operation modifying
the heap page.

Currently COPY FREEZE and vacuum are the only operations that set the
VM. So, this patch modifies the xl_heap_multi_insert and xl_heap_prune
records.

The result is a dramatic reduction in WAL volume for these operations.
I've included numbers below.

I also think that it makes more sense to include changes to the VM in
the same WAL record as the changes that rendered the page all-visible.
In some cases, we will only set the page all-visible, but that is in
the context of the operation on the heap page which discovered that it
was all-visible. Therefore, I find this to be a clarity as well as a
performance improvement.

This project is also the first step toward setting the VM on-access
for queries which do not modify the page. There are a few design
issues that must be sorted out for that project which I will detail
separately. Note that this patch set currently does not implement
setting the VM on-access.

The attached patch set isn't 100% polished. I think some of the
variable names and comments could use work, but I'd like to validate
the idea of doing this before doing a full polish. This is a summary
of what is in the set:

Patches:
0001 - 0002: cleanup
0003 - 0004: refactoring
0005: COPY FREEZE changes
0006: refactoring
0007: vacuum phase III changes
0008: vacuum phase I empty page changes
0009 - 0012: refactoring
0013: vacuum phase I normal page changes
0014: cleanup

Performance benefits of eliminating xl_heap_visible:

vacuum of table with index (DDL at bottom of email)
--
master -> patch
WAL bytes: 405346 -> 303088 = 25% reduction
WAL records: 6682 -> 4459 = 33% reduction

vacuum of table without index
--
master -> patch
WAL records: 4452 -> 2231 = 50% reduction
WAL bytes: 289016 -> 177978 = 38% reduction

COPY FREEZE of table without index
--
master -> patch
WAL records: 3672777 -> 1854589 = 50% reduction
WAL bytes: 841340339 -> 748545732  = 11% reduction (new pages need a
copy of the whole page)

table for vacuum example:
--
create table foo(a int, b numeric, c numeric) with (autovacuum_enabled= false);
insert into foo select i % 18, repeat('1', 400)::numeric, repeat('2',
400)::numeric from generate_series(1,40000)i;
-- don't make index for no-index case
create index on foo(a);
delete from foo where a = 1;
vacuum (verbose, process_toast false) foo;
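
For anyone who wants to collect comparable WAL numbers, a rough sketch is to
bracket the vacuum with WAL insert LSNs in psql; pg_walinspect can optionally
break the interval down per record type:

-- sketch: measure WAL generated by the vacuum above
select pg_current_wal_insert_lsn() as start_lsn \gset
vacuum (verbose, process_toast false) foo;
select pg_current_wal_insert_lsn() as end_lsn \gset
select pg_wal_lsn_diff(:'end_lsn', :'start_lsn') as wal_bytes;
-- per-record-type breakdown (requires the pg_walinspect extension):
-- select * from pg_get_wal_stats(:'start_lsn', :'end_lsn', true) where count > 0;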


copy freeze example:
--
-- create a data file
create table large(a int, b int) with (autovacuum_enabled = false,
fillfactor = 10);
insert into large SELECT generate_series(1,40000000)i, 1;
copy large to 'large.data';

-- example
BEGIN;
create table large(a int, b int) with (autovacuum_enabled = false,
fillfactor = 10);
COPY large FROM 'large.data' WITH (FREEZE);
COMMIT;
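
To confirm that the pages written by COPY FREEZE really are marked
all-visible/all-frozen without a vacuum, pg_visibility can be used (a sketch,
assuming the extension is installed):

create extension if not exists pg_visibility;
select * from pg_visibility_map_summary('large');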

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Date:
On Mon, Jun 23, 2025 at 4:25 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> The attached patch set eliminates xl_heap_visible, the WAL record
> emitted when a block of the heap is set all-visible/frozen in the
> visibility map. Instead, it includes the information needed to update
> the VM in the WAL record already emitted by the operation modifying
> the heap page.

Rebased in light of recent changes on master:

0001: cleanup
0002: preparatory work
0003: eliminate xl_heap_visible for COPY FREEZE
0004 - 0005: eliminate xl_heap_visible for vacuum's phase III
0006: eliminate xl_heap_visible for vacuum phase I empty pages
0007 - 0010: preparatory refactoring
0011: eliminate xl_heap_visible from vacuum phase I prune/freeze
0012: remove xl_heap_visible

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Date:
On Thu, Jun 26, 2025 at 6:04 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Rebased in light of recent changes on master:

This needed another rebase, and, in light of the discussion in [1],
I've also removed the patch to add heap wrappers for setting pages
all-visible.

More notably, the final patch (0012) in attached v3 allows on-access
pruning to set the VM.

To do this, it plumbs some information down from the executor to the
table scan about whether or not the table is modified by the query. We
don't want to set the VM only to clear it while scanning pages for an
UPDATE or while locking rows in a SELECT FOR UPDATE.

Because we only do on-access pruning when pd_prune_xid is valid, we
shouldn't need much of a heuristic for deciding when to set the VM
on-access -- but I've included one anyway: we only do it if we are
actually pruning or if the page is already dirty and no FPI would be
emitted.

You can see it in action with the following:

create extension pg_visibility;
create table foo (a int, b int) with (autovacuum_enabled=false, fillfactor=90);
insert into foo select generate_series(1,300), generate_series(1,300);
create index on foo (a);
update foo set b = 51 where b = 50;
select * from foo where a = 50;
select * from pg_visibility_map_summary('foo');

The SELECT will set a page all-visible in the VM.
In this patch set, on-access pruning is enabled for sequential scans
and the underlying heap relation in index scans and bitmap heap scans.
This example can exercise any of the three if you toggle
enable_indexscan and enable_bitmapscan appropriately.
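
For example, something along these lines (just a sketch -- the planner still
makes the final decision based on costs):

-- steer the SELECT above toward a plain index scan
set enable_seqscan = off;
set enable_bitmapscan = off;
set enable_indexscan = on;
-- or toward a bitmap heap scan
set enable_indexscan = off;
set enable_bitmapscan = on;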

From a performance perspective, if you run a trivial pgbench, you can
see far more all-visible pages set in the pgbench_[x] relations with
no noticeable overhead. But, I'm planning to do some performance
experiments to show how this affects our ability to choose index only
scan plans in realistic workloads.
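
(A rough way to eyeball the all-visible coverage of the pgbench tables after a
run -- a sketch using the pg_visibility extension:)

select c.relname, c.relpages, v.all_visible, v.all_frozen
from pg_class c, lateral pg_visibility_map_summary(c.oid::regclass) v
where c.relname like 'pgbench_%' and c.relkind = 'r';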

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Yj%3DyrL%2BgGGsqfYVQcYn7rDp6hDeoF1vN453JDp8dEY%2Bw%40mail.gmail.com

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Date:
On Wed, Jul 9, 2025 at 5:59 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Jun 26, 2025 at 6:04 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > Rebased in light of recent changes on master:
>
> This needed another rebase, and, in light of the discussion in [1],
> I've also removed the patch to add heap wrappers for setting pages
> all-visible.

Andrey Borodin made the excellent point off-list that I forgot to
remove the xl_heap_visible struct itself -- which is rather important
to a patch set purporting to eliminate xl_heap_visible! New version
attached.


- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Andrey Borodin
Date:

> On 12 Jul 2025, at 03:19, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> remove the xl_heap_visible struct

Same goes for VISIBILITYMAP_XLOG_CATALOG_REL and XLOG_HEAP2_VISIBLE. But please do not rush to remove them; perhaps I
will have a more exhaustive list later. The patch set is expected to be unpolished at this stage.
I just need to absorb all the effects to make a high-level evaluation of the patch set.

I'm still trying to grasp the connection of the first patch (with Assert(prstate->cutoffs)) to the other patches.

Also, I'd prefer "page is not marked all-visible but visibility map bit is set in relation" to emit XX001 for
monitoring reasons, but again, this is a small note; what I need is the broader picture.

So far I do not see any general problems in delegating the redo work from xl_heap_visible to other records. FWIW, I have observed
several cases of VM corruption that might be connected to the fact that we log VM changes independently of the data changes
that caused the VM to change. But I have no real evidence or understanding of what happened.


Best regards, Andrey Borodin.


Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Date:
On Sun, Jul 13, 2025 at 2:34 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 12 Jul 2025, at 03:19, Melanie Plageman <melanieplageman@gmail.com> wrote:
> >
> > remove the xl_heap_visible struct
>
> Same goes for VISIBILITYMAP_XLOG_CATALOG_REL and XLOG_HEAP2_VISIBLE. But please do not rush to remove them; perhaps I
> will have a more exhaustive list later. The patch set is expected to be unpolished at this stage.
> I just need to absorb all the effects to make a high-level evaluation of the patch set.

I actually did remove those if you check the last version posted. I
did notice there is one remaining comment referring to
XLOG_HEAP2_VISIBLE I missed somehow, but the actual enums/macros were
removed already.

> I'm still trying to grasp the connection of the first patch (with Assert(prstate->cutoffs)) to the other patches.

I added this because I noticed that it was used in that location without
validating that it was provided. The last patch in the set, which sets
the VM on-access, changes where cutoffs are used, so I noticed what I
felt was a missing assert in master while developing that patch.

> Also, I'd prefer "page is not marked all-visible but visibility map bit is set in relation" to emit XX001 for
> monitoring reasons, but again, this is a small note; what I need is the broader picture.

Could you clarify what you mean by this? Are you talking about the
string representation of the visibility map bits in the WAL record
representations in heapdesc.c?

- Melanie



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Andrey Borodin
Date:

> On 14 Jul 2025, at 00:15, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
>>
>> Also, I'd prefer "page is not marked all-visible but visibility map bit is set in relation" to emit XX001 for
>> monitoring reasons, but again, this is a small note; what I need is the broader picture.
>
> Could you clarify what you mean by this? Are you talking about the
> string representation of the visibility map bits in the WAL record
> representations in heapdesc.c?

This might be a bit off-topic for this thread, but as long as the patch touches that code we can look into this too.

If the VM all-visible bit is set while the page is not all-visible, an IndexOnlyScan will show incorrect results. I have
observed this inconsistency a few times in production.

Two persistent subsystems (the VM and the heap) contradict each other, which is why I think this is data corruption. Yes, we
can repair the VM by assuming the heap to be the source of truth in this case. But we must also emit the ERRCODE_DATA_CORRUPTED
XX001 code into the logs. In many cases this will alert an on-call SRE.

To do so I propose to replace elog(WARNING,...) with ereport(WARNING,(errcode(ERRCODE_DATA_CORRUPTED),..).


Best regards, Andrey Borodin.


Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Date:
Thanks for continuing to take a look, Andrey.

On Mon, Jul 14, 2025 at 2:37 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> This might be a bit off-topic for this thread, but as long as the patch touches that code we can look into this too.
>
> If the VM all-visible bit is set while the page is not all-visible, an IndexOnlyScan will show incorrect results. I have
> observed this inconsistency a few times in production.

That's very unfortunate. I wonder what could be causing this. Do you
suspect a bug in Postgres? Or something wrong with the disk, etc?

> Two persistent subsystems (the VM and the heap) contradict each other, which is why I think this is data corruption. Yes, we
> can repair the VM by assuming the heap to be the source of truth in this case. But we must also emit the ERRCODE_DATA_CORRUPTED
> XX001 code into the logs. In many cases this will alert an on-call SRE.
>
> To do so I propose to replace elog(WARNING,...) with ereport(WARNING,(errcode(ERRCODE_DATA_CORRUPTED),..).

Ah, you mean the warnings currently in lazy_scan_prune(). To me this
suggestion makes sense. I see at least one other example of
ERRCODE_DATA_CORRUPTED used at a level below ERROR.

I have attached a cleaned up and updated version of the patch set (it
doesn't yet include your suggested error message change).


What's new in this version
-----
In addition to general code, comment, and commit message improvements,
notable changes are as follows:

- I have used the GlobalVisState for determining if the whole page is
visible in a more natural way.

- I micro-benchmarked and identified some sources of regression in the
additional work SELECT queries would have to do to set the VM. There are
several new commits addressing these (for example, inlining several
functions and unsetting all-visible when we see a dead tuple if we
won't attempt freezing).

- Because heap_page_prune_and_freeze() was getting long, I added some
helper functions.


Performance impact of setting the VM on-access
-------
I found that with the patch set applied, we set many pages all-visible
in the VM on access, resulting in a higher overall number of pages set
all-visible, reducing load for vacuum, and dramatically decreasing
heap fetches by index-only scans.

I devised a simple benchmark -- 8 workers inserting 20 rows at a
time into a table with a few columns and updating a single row that
they just inserted. Another worker queries the table once per second
using an index.

After running the benchmark for a few minutes -- the table was
autovacuumed several times in both cases -- 15% more blocks were
all-visible at the end of the benchmark with the patch set applied.

And with my patch applied, index-only scans did far fewer heap
fetches. A SELECT count(*) of the table at the same point in the
benchmark did 10,000 heap fetches on master and 500 with the patch
applied (I used auto_explain to determine this).
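
(Another way to see the heap fetches, without auto_explain, is plain EXPLAIN
ANALYZE -- a sketch; the "Heap Fetches" line only appears if an index-only
scan is chosen:)

explain (analyze, buffers)
select count(*) from simple_table;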

With my patch applied, autovacuum workers write half as much WAL as on
master. Some of this is courtesy of other patches in the set which
eliminate separate WAL records for setting the page all-visible. But,
vacuum is also scanning fewer pages and dirtying fewer buffers because
they are being set all-visible on-access.

There are more details about the benchmark at the end of the email.


Setting pd_prune_xid on insert
------
The patch "Set-pd_prune_xid-on-insert.txt" can be applied as the last
patch in the set. It sets pd_prune_xid on insert (so pages filled by
COPY or insert can also be set all-visible in the VM before they are
vacuumed). I gave it a .txt extension because it currently fails
035_standby_logical_decoding due to a recovery conflict. I need to
investigate more to see if this is a bug in my patch set or elsewhere
in Postgres.

Besides the failing test, I have a feeling that my current heuristic
for whether or not to set the VM on-access is not quite right for
pages that have only been inserted to -- and if we get it wrong, we've
wasted those CPU cycles because we didn't otherwise need to prune the
page.


- Melanie


Benchmark
-------
psql -c "
DROP TABLE IF EXISTS simple_table;

CREATE TABLE simple_table (
    id SERIAL PRIMARY KEY,
    group_id INT NOT NULL,
    data TEXT,
    created_at TIMESTAMPTZ DEFAULT now()
);

create index on simple_table(group_id);
"

pgbench \
  --no-vacuum \
  --random-seed=0 \
  -c 8 \
  -j 8 \
  -M prepared \
  -T 200 \
  > "pgbench_run_summary_update_${version}" \
-f- <<EOF &
\set gid random(1,1000)

INSERT INTO simple_table (group_id, data)
  SELECT :gid, 'inserted'
  RETURNING id \gset

update simple_table set data = 'updated' where id = :id;

insert into simple_table (group_id, data)
  select :gid, 'inserted'
  from generate_series(1,20);
EOF
insert_pid=$!

pgbench \
  --no-vacuum \
  --random-seed=0 \
  -c 1 \
  -j 1 \
  --rate=1 \
  -M prepared \
  -T 200 \
  > "pgbench_run_summary_select_${version}" \
-f- <<EOF
\set gid random(1, 1000)
select max(created_at) from simple_table where group_id = :gid;
select count(*) from simple_table where group_id = :gid;
EOF

wait $insert_pid

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Date:
On Thu, Jul 31, 2025 at 6:58 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> The patch "Set-pd_prune_xid-on-insert.txt" can be applied as the last
> patch in the set. It sets pd_prune_xid on insert (so pages filled by
> COPY or insert can also be set all-visible in the VM before they are
> vacuumed). I gave it a .txt extension because it currently fails
> 035_standby_logical_decoding due to a recovery conflict. I need to
> investigate more to see if this is a bug in my patch set or elsewhere
> in Postgres.

I figured out that if we set the VM on-access, we need to enable
hot_standby_feedback in more places in 035_standby_logical_decoding.pl
to avoid recovery conflicts. I've done that in the attached updated
version 6. There are a few other issues in
035_standby_logical_decoding.pl that I reported here [1]. With these
changes, setting pd_prune_xid on insert passes tests. Whether or not
we want to do it (and what the heuristic should be for deciding when
to do it) is another question.

- Melanie

[1] https://www.postgresql.org/message-id/flat/CAAKRu_YO2mEm%3DZWZKPjTMU%3DgW5Y83_KMi_1cr51JwavH0ctd7w%40mail.gmail.com

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Kirill Reshke
Date:
On Sat, 2 Aug 2025 at 02:36, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Thu, Jul 31, 2025 at 6:58 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > The patch "Set-pd_prune_xid-on-insert.txt" can be applied as the last
> > patch in the set. It sets pd_prune_xid on insert (so pages filled by
> > COPY or insert can also be set all-visible in the VM before they are
> > vacuumed). I gave it a .txt extension because it currently fails
> > 035_standby_logical_decoding due to a recovery conflict. I need to
> > investigate more to see if this is a bug in my patch set or elsewhere
> > in Postgres.
>
> I figured out that if we set the VM on-access, we need to enable
> hot_standby_feedback in more places in 035_standby_logical_decoding.pl
> to avoid recovery conflicts. I've done that in the attached updated
> version 6. There are a few other issues in
> 035_standby_logical_decoding.pl that I reported here [1]. With these
> changes, setting pd_prune_xid on insert passes tests. Whether or not
> we want to do it (and what the heuristic should be for deciding when
> to do it) is another question.
>
> - Melanie
>
> [1] https://www.postgresql.org/message-id/flat/CAAKRu_YO2mEm%3DZWZKPjTMU%3DgW5Y83_KMi_1cr51JwavH0ctd7w%40mail.gmail.com

Hi!

Andrey told me off-list about this thread and I decided to take a look.

I tried to play with each patch in this patchset and find a
corruption, but I was unsuccessful. I will conduct further tests
later. I am not implying that I suspect this patchset causes any
corruption; I am merely attempting to verify it.

I also have a few comments and questions. Here is my (very limited)
review of 0001, because I believe that removing xl_heap_visible from
COPY FREEZE is a pure win, so this patch can be very beneficial by
itself.

visibilitymap_set_vmbyte is introduced in 0001 and removed in 0012.
This is strange to me; maybe we can avoid visibilitymap_set_vmbyte in
the first place?

In 0001:

1)
should we add "Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
LW_EXCLUSIVE));" in visibilitymap_set_vmbyte?

Also here  `Assert(visibilitymap_pin_ok(BufferGetBlockNumber(buffer),
vmbuffer));` can be beneficial:

>/*
>+ * If we're only adding already frozen rows to a previously empty
>+ * page, mark it as all-frozen and update the visibility map. We're
>+ * already holding a pin on the vmbuffer.
>+ */
>   else if (all_frozen_set)
>+ {
>    PageSetAllVisible(page);
>+ LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
>+ visibilitymap_set_vmbyte(relation,
>+ BufferGetBlockNumber(buffer),
>+ vmbuffer,
>+ VISIBILITYMAP_ALL_VISIBLE |
>+ VISIBILITYMAP_ALL_FROZEN);
>+ }

2)
in heap_xlog_multi_insert:

+
+ visibilitymap_pin(reln, blkno, &vmbuffer);
+ visibilitymap_set_vmbyte(....)

Do we need to pin vmbuffer here? Looks like
XLogReadBufferForRedoExtended already pins vmbuffer. I verified this
with CheckBufferIsPinnedOnce(vmbuffer) just before visibilitymap_pin
and COPY ... WITH (FREEZE true) test.

3)
>+
> +#ifdef TRACE_VISIBILITYMAP
> + elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
> +#endif

I can see this merely copy-pasted from visibilitymap_set, but maybe
display "flags" also?

4) visibilitymap_set receives an XLogRecPtr recptr parameter, which is
set to the WAL record LSN during recovery and to InvalidXLogRecPtr
otherwise. visibilitymap_set manages the VM page LSN based on this recptr
value (inside the function logic). visibilitymap_set_vmbyte behaves
the other way around and makes its caller responsible for page LSN management.
Maybe we should keep these two functions akin to each other?


--
Best regards,
Kirill Reshke



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Kirill Reshke
Date:
On Sat, 2 Aug 2025 at 02:36, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Thu, Jul 31, 2025 at 6:58 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > The patch "Set-pd_prune_xid-on-insert.txt" can be applied as the last
> > patch in the set. It sets pd_prune_xid on insert (so pages filled by
> > COPY or insert can also be set all-visible in the VM before they are
> > vacuumed). I gave it a .txt extension because it currently fails
> > 035_standby_logical_decoding due to a recovery conflict. I need to
> > investigate more to see if this is a bug in my patch set or elsewhere
> > in Postgres.
>
> I figured out that if we set the VM on-access, we need to enable
> hot_standby_feedback in more places in 035_standby_logical_decoding.pl
> to avoid recovery conflicts. I've done that in the attached updated
> version 6. There are a few other issues in
> 035_standby_logical_decoding.pl that I reported here [1]. With these
> changes, setting pd_prune_xid on insert passes tests. Whether or not
> we want to do it (and what the heuristic should be for deciding when
> to do it) is another question.
>
> - Melanie
>
> [1] https://www.postgresql.org/message-id/flat/CAAKRu_YO2mEm%3DZWZKPjTMU%3DgW5Y83_KMi_1cr51JwavH0ctd7w%40mail.gmail.com


0002 No comments from me. Looks straightforward.

Few comments on 0003.

1) This patch introduces XLHP_HAS_VMFLAGS. However it lacks some
helpful comments about this new status bit.

a) In heapam_xlog.h, in xl_heap_prune struct definition:


/*
* If XLHP_HAS_CONFLICT_HORIZON is set, the conflict horizon XID follows,
* unaligned
*/
+ /* If XLHP_HAS_VMFLAGS is set, the newly set visibility map bits follow,
unaligned */

b)

We could add a comment here on why we use a memcpy assignment, akin to /*
memcpy() because snapshot_conflict_horizon is stored unaligned */

+ /* Next are the optionally included vmflags. Copy them out for later use. */
+ if ((xlrec.flags & XLHP_HAS_VMFLAGS) != 0)
+ {
+ memcpy(&vmflags, maindataptr, sizeof(uint8));
+ maindataptr += sizeof(uint8);

2) Should we move conflict_xid = visibility_cutoff_xid; assignment
just after heap_page_is_all_visible_except_lpdead call in
lazy_vacuum_heap_page?

3) Looking at this diff, I do not comprehend one bit: how are we
protected from passing an already all-visible page to lazy_vacuum_heap_page? I
did not manage to reproduce such behaviour, though.

+ if ((vmflags & VISIBILITYMAP_VALID_BITS) != 0)
+ {
+ Assert(!PageIsAllVisible(page));
+ set_pd_all_vis = true;
+ LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
+ PageSetAllVisible(page);
+ visibilitymap_set_vmbyte(vacrel->rel,
+ blkno,
+



--
Best regards,
Kirill Reshke



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Kirill Reshke
Date:
On Sat, 2 Aug 2025 at 02:36, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Thu, Jul 31, 2025 at 6:58 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > The patch "Set-pd_prune_xid-on-insert.txt" can be applied as the last
> > patch in the set. It sets pd_prune_xid on insert (so pages filled by
> > COPY or insert can also be set all-visible in the VM before they are
> > vacuumed). I gave it a .txt extension because it currently fails
> > 035_standby_logical_decoding due to a recovery conflict. I need to
> > investigate more to see if this is a bug in my patch set or elsewhere
> > in Postgres.
>
> I figured out that if we set the VM on-access, we need to enable
> hot_standby_feedback in more places in 035_standby_logical_decoding.pl
> to avoid recovery conflicts. I've done that in the attached updated
> version 6. There are a few other issues in
> 035_standby_logical_decoding.pl that I reported here [1]. With these
> changes, setting pd_prune_xid on insert passes tests. Whether or not
> we want to do it (and what the heuristic should be for deciding when
> to do it) is another question.
>
> - Melanie
>
> [1] https://www.postgresql.org/message-id/flat/CAAKRu_YO2mEm%3DZWZKPjTMU%3DgW5Y83_KMi_1cr51JwavH0ctd7w%40mail.gmail.com

v6-0015:
I chose to verify whether this single modification would be beneficial
on the HEAD.

Benchmark I did:

```

\timing
CREATE TABLE zz(i int);
alter table zz set (autovacuum_enabled = false);
TRUNCATE zz;
copy zz from program 'yes 2 | head -n 180000000';
copy zz from program 'yes 2 | head -n 180000000';

delete from zz where (REPLACE(REPLACE(ctid::text, '(', '{'), ')',
'}')::int[])[2] = 7 ;

VACUUM FREEZE zz;
```

And I checked the perf top footprint for the last statement (the vacuum). My
detailed results are attached. It is a HEAD vs. HEAD+v6-0015 benchmark.

TL;DR: the function inlining is indeed beneficial -- the TransactionIdPrecedes
function disappears from the perf top footprint -- though the query runtime does
not change much. So, while not resulting in a query speedup, this can
save CPU.
Maybe we can devise an artificial benchmark which will show a query
speedup, but for now I don't have one.

--
Best regards,
Kirill Reshke

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Date:
Thanks for all the reviews. I'm working on responding to your previous
mails with a new version.

On Wed, Aug 27, 2025 at 8:55 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> v6-0015:
> I chose to verify whether this single modification would be beneficial
> on the HEAD.
>
> Benchmark I did:
>
> ```
>
> \timing
> CREATE TABLE zz(i int);
> alter table zz set (autovacuum_enabled = false);
> TRUNCATE zz;
> copy zz from program 'yes 2 | head -n 180000000';
> copy zz from program 'yes 2 | head -n 180000000';
>
> delete from zz where (REPLACE(REPLACE(ctid::text, '(', '{'), ')',
> '}')::int[])[2] = 7 ;
>
> VACUUM FREEZE zz;
> ```
>
> And I checked the perf top footprint for the last statement (the vacuum). My
> detailed results are attached. It is a HEAD vs. HEAD+v6-0015 benchmark.
>
> TL;DR: the function inlining is indeed beneficial -- the TransactionIdPrecedes
> function disappears from the perf top footprint -- though the query runtime does
> not change much. So, while not resulting in a query speedup, this can
> save CPU.
> Maybe we can devise an artificial benchmark which will show a query
> speedup, but for now I don't have one.

I'm not surprised that vacuum freeze does not show a speed up from the
function inlining.

This patch was key for avoiding a regression in the most contrived
worst-case example of setting the VM on-access. That is, if
you are pruning only a single tuple on the page as part of a SELECT
query that returns no tuples (think SELECT * FROM foo OFFSET N where N
is greater than the number of rows in the table), and we additionally
determine whether the page is all-visible, then the overhead of the
extra function calls in heap_prune_record_unchanged_lp_normal() is
noticeable.

We might be able to come up with a similar example in vacuum without
freeze since it will try to determine if the page is all-visible. Your
example is still running on my machine, though, so I haven't verified
this yet :)

- Melanie



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Date:
Thanks for the review! Updates are in attached v7.

One note on 0022 in the set, which sets pd_prune_xid on insert: the
recently added index-killtuples isolation test was failing with this
patch applied. With the patch, the "access" step reports more heap
page hits than before. After some analysis, it seems one of the heap
pages in kill_prior_tuples table is now being pruned in an earlier
step. Somehow this leads to 4 hits counted instead of 3 (even though
there are only 4 blocks in the relation). I recall Bertrand mentioning
something in some other thread about hits being double counted with
AIO reads, so I'm going to try and go dig that up. But, for now, I've
modified the test -- I believe the patch is only revealing an issue
with instrumentation, not causing a bug.

On Tue, Aug 26, 2025 at 5:58 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> visibilitymap_set_vmbyte is introduced in 0001 and removed in 0012.
> This is strange to me; maybe we can avoid visibilitymap_set_vmbyte in
> the first place?

The reason for this is that in the earlier patch I introduce
visibilitymap_set_vmbyte() for one user while other users still use
visibilitymap_set(). But, by the end of the set, all users use
visibilitymap_set_vmbyte(). So I think it makes the most sense for it to
be named visibilitymap_set() at that point. Until all users are
committed, the two functions both have to exist and need different
names.

> In 0001:
> should we add "Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
> LW_EXCLUSIVE));" in visibilitymap_set_vmbyte?

I don't want any operations on the heap buffer (including asserts) in
visibilitymap_set_vmbyte(). The heap block is only provided to look up
the VM bits.

I think your idea is a good one for the existing visibilitymap_set(),
though, so I've added a new patch to the set (0002) which does this. I
also added a similar assertion for the vmbuffer to
visibilitymap_set_vmbyte().

> Also here  `Assert(visibilitymap_pin_ok(BufferGetBlockNumber(buffer),
> vmbuffer));` can be beneficial:

I had omitted this because the same logic is checked inside of
visibilitymap_set_vmbyte(), with an error occurring if the conditions are
not met. However, since the same is true in visibilitymap_set() and
heap_multi_insert() still asserted visibilitymap_pin_ok(), I've added
it back to my patch set.

> in heap_xlog_multi_insert:
> +
> + visibilitymap_pin(reln, blkno, &vmbuffer);
> + visibilitymap_set_vmbyte(....)
>
> Do we need to pin vmbuffer here? Looks like
> XLogReadBufferForRedoExtended already pins vmbuffer. I verified this
> with CheckBufferIsPinnedOnce(vmbuffer) just before visibilitymap_pin
> and COPY ... WITH (FREEZE true) test.

I thought the reason visibilitymap_set() did it was that it was
possible for the block of the VM corresponding to the block of the
heap to be different during recovery than it was when emitting the
record, and thus we needed the part of visiblitymap_pin() that
released the old vmbuffer and got the new one corresponding to the
heap block.

I can't quite think of how this could happen though.

Assuming it can't happen, then we can get rid of visibilitymap_pin()
(and add visibilitymap_pin_ok()) in both visibilitymap_set_vmbyte() and
visibilitymap_set(). I've done this to visibilitymap_set() in a
separate patch, 0001. I would like other opinions/confirmation that the
block of the VM corresponding to the heap block cannot differ during
recovery from what it was when the record was emitted during
normal operation, though.

> > +#ifdef TRACE_VISIBILITYMAP
> > + elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
> > +#endif
>
> I can see this merely copy-pasted from visibilitymap_set, but maybe
> display "flags" also?

Done in attached.

> 4) visibilitymap_set receives an XLogRecPtr recptr parameter, which is
> set to the WAL record LSN during recovery and to InvalidXLogRecPtr
> otherwise. visibilitymap_set manages the VM page LSN based on this recptr
> value (inside the function logic). visibilitymap_set_vmbyte behaves
> the other way around and makes its caller responsible for page LSN management.
> Maybe we should keep these two functions akin to each other?

So, with visibilitymap_set_vmbyte(), the whole idea is to just update
the VM and then leave the WAL logging and other changes to the caller
(like marking the buffer dirty, setting the page LSN, etc). The series
of operations needed to make a persistent change is up to the caller.
visibilitymap_set() is meant to just make sure that the correct bits
in the VM are set for the given heap block.

I looked at ways of making the current visibilitymap_set() API cleaner
-- like setting the heap page LSN with the VM recptr in the caller of
visibilitymap_set() instead. There wasn't a way of doing it that
seemed like enough of an improvement to merit the change.

Not to mention, the goal of the patchset is to remove the current
visibilitymap_set(), so I'm not too worried about parity between the
two functions. They may coexist for a while, but hopefully today's
visibilitymap_set() will eventually be removed.

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Date:
On Tue, Aug 26, 2025 at 4:01 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> Few comments on 0003.
>
> 1) This patch introduces XLHP_HAS_VMFLAGS. However it lacks some
> helpful comments about this new status bit.

I added the ones you suggested in my v7 posted here [1].

> 2) Should we move conflict_xid = visibility_cutoff_xid; assignment
> just after heap_page_is_all_visible_except_lpdead call in
> lazy_vacuum_heap_page?

Why would we want to do that? We only want to set it if the page is
all visible, so we would have to guard it similarly.

> 3) Looking at this diff, I do not comprehend one bit: how are we
> protected from passing an already all-visible page to lazy_vacuum_heap_page? I
> did not manage to reproduce such behaviour, though.
>
> + if ((vmflags & VISIBILITYMAP_VALID_BITS) != 0)
> + {
> + Assert(!PageIsAllVisible(page));
> + set_pd_all_vis = true;
> + LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
> + PageSetAllVisible(page);
> + visibilitymap_set_vmbyte(vacrel->rel,
> + blkno,

So, for one, there is an assert just above this code in
lazy_vacuum_heap_page() that nunused > 0 -- so we know that the page
couldn't have been all-visible already because it had unused line
pointers.

Otherwise, if it was possible for an already all-visible page to get
here, the same thing would happen that happens on master --
heap_page_is_all_visible[_except_lpdead()] would return true and we
would try to set the VM which would end up being a no-op.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_YD0ecXeAh%2BDmJpzQOJwcRzmMyGdcc5W_0pEF78rYSJkQ%40mail.gmail.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Kirill Reshke
Date:
On Thu, 28 Aug 2025 at 00:02, Melanie Plageman
<melanieplageman@gmail.com> wrote:

> > Do we need to pin vmbuffer here? Looks like
> > XLogReadBufferForRedoExtended already pins vmbuffer. I verified this
> > with CheckBufferIsPinnedOnce(vmbuffer) just before visibilitymap_pin
> > and COPY ... WITH (FREEZE true) test.
>
> I thought the reason visibilitymap_set() did it was that it was
> possible for the block of the VM corresponding to the block of the
> heap to be different during recovery than it was when emitting the
> record, and thus we needed the part of visibilitymap_pin() that
> released the old vmbuffer and got the new one corresponding to the
> heap block.
>
> I can't quite think of how this could happen though.
>
> Assuming it can't happen, then we can get rid of visibilitymap_pin()
> (and add visibilitymap_pin_ok()) in both visibilitymap_set_vmbyte() and
> visibilitymap_set(). I've done this to visibilitymap_set() in a
> separate patch, 0001. I would like other opinions/confirmation that the
> block of the VM corresponding to the heap block cannot differ during
> recovery from what it was when the record was emitted during
> normal operation, though.

I did micro git-blame research here. I spotted only one related change
[0]. Looks like before this change pin was indeed needed.
But not after this change, so this visibilitymap_pin is just an oversight?
Related thread is [1]. I quickly checked the discussion in this
thread, and it looks like no one was bothered about these lines or VM
logging changes (in this exact pin buffer aspect). The discussion was
of other aspects of this commit.

[0] https://github.com/postgres/postgres/commit/2c03216d8311
[1] https://www.postgresql.org/message-id/533D6CBF.6080203%40vmware.com


-- 
Best regards,
Kirill Reshke



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Date:
On Thu, Aug 28, 2025 at 5:12 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> I did micro git-blame research here. I spotted only one related change
> [0]. Looks like before this change pin was indeed needed.
> But not after this change, so this visibilitymap_pin is just an oversight?
> Related thread is [1]. I quickly checked the discussion in this
> thread, and it looks like no one was bothered about these lines or VM
> logging changes (in this exact pin buffer aspect). The discussion was
> of other aspects of this commit.

Wow, thanks so much for doing that research. Looking at it myself, it
does indeed seem like just an oversight. It isn't harmful since it
won't take another pin, but it is confusing, so I think we should at
least remove it in master. I'm not as sure about back branches.

I would like someone to confirm that there is no way we could end up
with a different block of the VM containing the vm bits for a heap
block during recovery than during normal operation.

- Melanie



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Date:
On Tue, Sep 2, 2025 at 5:52 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Aug 28, 2025 at 5:12 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
> >
> > I did micro git-blame research here. I spotted only one related change
> > [0]. Looks like before this change pin was indeed needed.
> > But not after this change, so this visibilitymap_pin is just an oversight?
> > Related thread is [1]. I quickly checked the discussion in this
> > thread, and it looks like no one was bothered about these lines or VM
> > logging changes (in this exact pin buffer aspect). The discussion was
> > of other aspects of this commit.
>
> Wow, thanks so much for doing that research. Looking at it myself, it
> does indeed seem like just an oversight. It isn't harmful since it
> won't take another pin, but it is confusing, so I think we should at
> least remove it in master. I'm not as sure about back branches.

I've updated the commit message in the patch set to reflect the
research you did in attached v8.

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Andres Freund
Date:
Hi,

On 2025-09-02 19:11:01 -0400, Melanie Plageman wrote:
> From dd98177294011ee93cac122405516abd89f4e393 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 27 Aug 2025 08:50:15 -0400
> Subject: [PATCH v8 01/22] Remove unneeded VM pin from VM replay

LGTM.


> From 7c5cb3edf89735eaa8bee9ca46111bd6c554720b Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 27 Aug 2025 10:07:29 -0400
> Subject: [PATCH v8 02/22] Add assert and log message to visibilitymap_set

LGTM.


> From 07f31099754636ec9dabf6cca06c33c4b19c230c Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 17 Jun 2025 17:22:10 -0400
> Subject: [PATCH v8 03/22] Eliminate xl_heap_visible in COPY FREEZE
>
> Instead of emitting a separate WAL record for setting the VM bits in
> xl_heap_visible, specify the changes to make to the VM block in the
> xl_heap_multi_insert record instead.
>
> Author: Melanie Plageman <melanieplageman@gmail.com>
> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
> Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com


> +        /*
> +         * If we're only adding already frozen rows to a previously empty
> +         * page, mark it as all-frozen and update the visibility map. We're
> +         * already holding a pin on the vmbuffer.
> +         */
>          else if (all_frozen_set)
> +        {
> +            Assert(visibilitymap_pin_ok(BufferGetBlockNumber(buffer), vmbuffer));
>              PageSetAllVisible(page);
> +            LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
> +            visibilitymap_set_vmbyte(relation,
> +                                     BufferGetBlockNumber(buffer),
> +                                     vmbuffer,
> +                                     VISIBILITYMAP_ALL_VISIBLE |
> +                                     VISIBILITYMAP_ALL_FROZEN);
> +        }

From an abstraction POV I don't love that heapam now is responsible for
acquiring and releasing the lock. But that ship already kind of has sailed, as
heapam.c is already responsible for releasing the vm buffer etc...

I've wondered about splitting the responsibilities up into multiple
visibilitymap_set_* functions, so that heapam.c wouldn't need to acquire the
lock and set the LSN. But it's probably not worth it.


> +    /*
> +     * Now read and update the VM block. Even if we skipped updating the heap
> +     * page due to the file being dropped or truncated later in recovery, it's
> +     * still safe to update the visibility map.  Any WAL record that clears
> +     * the visibility map bit does so before checking the page LSN, so any
> +     * bits that need to be cleared will still be cleared.
> +     *
> +     * It is only okay to set the VM bits without holding the heap page lock
> +     * because we can expect no other writers of this page.
> +     */
> +    if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET &&
> +        XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
> +                                      &vmbuffer) == BLK_NEEDS_REDO)
> +    {
> +        Relation    reln = CreateFakeRelcacheEntry(rlocator);
> +
> +        Assert(visibilitymap_pin_ok(blkno, vmbuffer));
> +        visibilitymap_set_vmbyte(reln, blkno,
> +                                 vmbuffer,
> +                                 VISIBILITYMAP_ALL_VISIBLE |
> +                                 VISIBILITYMAP_ALL_FROZEN);
> +
> +        /*
> +         * It is not possible that the VM was already set for this heap page,
> +         * so the vmbuffer must have been modified and marked dirty.
> +         */
> +        Assert(BufferIsDirty(vmbuffer));

How about making visibilitymap_set_vmbyte() return whether it needed to do
something? This seems somewhat indirect...

I think it might be good to encapsulate this code into a helper in
visibilitymap.c, there will be more callers in the subsequent patches.


> +/*
> + * Set flags in the VM block contained in the passed in vmBuf.
> + *
> + * This function is for callers which include the VM changes in the same WAL
> + * record as the modifications of the heap page which rendered it all-visible.
> + * Callers separately logging the VM changes should invoke visibilitymap_set()
> + * instead.
> + *
> + * Caller must have pinned and exclusive locked the correct block of the VM in
> + * vmBuf. This block should contain the VM bits for the given heapBlk.
> + *
> + * During normal operation (i.e. not recovery), this should be called in a
> + * critical section which also makes any necessary changes to the heap page
> + * and, if relevant, emits WAL.
> + *
> + * Caller is responsible for WAL logging the changes to the VM buffer and for
> + * making any changes needed to the associated heap page. This includes
> + * maintaining any invariants such as ensuring the buffer containing heapBlk
> + * is pinned and exclusive locked.
> + */
> +uint8
> +visibilitymap_set_vmbyte(Relation rel, BlockNumber heapBlk,
> +                         Buffer vmBuf, uint8 flags)

Why is it named vmbyte? This actually just sets the two bits corresponding to
the buffer, not the entire byte. So it seems somewhat misleading to reference
byte.




> From dc318358572f61efbd0e05aae2b9a077b422bcf5 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 18 Jun 2025 12:42:13 -0400
> Subject: [PATCH v8 05/22] Eliminate xl_heap_visible from vacuum phase III
>
> Instead of emitting a separate xl_heap_visible record for each page that
> is rendered all-visible by vacuum's third phase, include the updates to
> the VM in the already emitted xl_heap_prune record.

Reading through the change I didn't particularly like that there's another
optional field in xl_heap_prune, as it seemed like something that should be
encoded in flags.  Of course there aren't enough flag bits available.  But
that made me look at the rest of the record: Uh, what do we use the reason
field for?  As far as I can tell f83d709760d8 added it without introducing any
users? It doesn't even seem to be set.


> @@ -51,10 +52,15 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>             (xlrec.flags & (XLHP_HAS_REDIRECTIONS | XLHP_HAS_DEAD_ITEMS)) == 0);
>
>      /*
> -     * We are about to remove and/or freeze tuples.  In Hot Standby mode,
> -     * ensure that there are no queries running for which the removed tuples
> -     * are still visible or which still consider the frozen xids as running.
> -     * The conflict horizon XID comes after xl_heap_prune.
> +     * After xl_heap_prune is the optional snapshot conflict horizon.
> +     *
> +     * In Hot Standby mode, we must ensure that there are no running queries
> +     * which would conflict with the changes in this record. If pruning, that
> +     * means we cannot remove tuples still visible to transactions on the
> +     * standby. If freezing, that means we cannot freeze tuples with xids that
> +     * are still considered running on the standby. And for setting the VM, we
> +     * cannot do so if the page isn't all-visible to all transactions on the
> +     * standby.
>       */

I'm a bit confused by this new comment - it sounds like we're deciding whether
to remove tuple versions, but that decision has long been made, no?



> @@ -2846,8 +2848,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
>      OffsetNumber unused[MaxHeapTuplesPerPage];
>      int            nunused = 0;
>      TransactionId visibility_cutoff_xid;
> +    TransactionId conflict_xid = InvalidTransactionId;
>      bool        all_frozen;
>      LVSavedErrInfo saved_err_info;
> +    uint8        vmflags = 0;
> +    bool        set_pd_all_vis = false;
>
>      Assert(vacrel->do_index_vacuuming);
>
> @@ -2858,6 +2863,20 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
>                               VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
>                               InvalidOffsetNumber);
>
> +    if (heap_page_is_all_visible_except_lpdead(vacrel->rel, buffer,
> +                                               vacrel->cutoffs.OldestXmin,
> +                                               deadoffsets, num_offsets,
> +                                               &all_frozen, &visibility_cutoff_xid,
> +                                               &vacrel->offnum))
> +    {
> +        vmflags |= VISIBILITYMAP_ALL_VISIBLE;
> +        if (all_frozen)
> +        {
> +            vmflags |= VISIBILITYMAP_ALL_FROZEN;
> +            Assert(!TransactionIdIsValid(visibility_cutoff_xid));
> +        }
> +    }
> +
>      START_CRIT_SECTION();

I am rather confused - we never can set all-visible if there are any LP_DEAD
items left. If the idea is that we are removing the LP_DEAD items in
lazy_vacuum_heap_page() - what guarantees that all LP_DEAD items are being
removed? Couldn't some tuples get marked LP_DEAD by on-access pruning, after
vacuum visited the page and collected dead items?

Ugh, I see - it works because we pass in the set of dead items.  I think that
makes the name *really* misleading, it's not except LP_DEAD, it's except the
offsets passed in, no?

But then you actually check that the set of dead items didn't change - what
guarantees that?


I didn't look at the later patches, except that I did notice this:

> @@ -268,7 +264,7 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>          Relation    reln = CreateFakeRelcacheEntry(rlocator);
>
>          visibilitymap_pin(reln, blkno, &vmbuffer);
> -        old_vmbits = visibilitymap_set_vmbyte(reln, blkno, vmbuffer, vmflags);
> +        old_vmbits = visibilitymap_set(reln, blkno, vmbuffer, vmflags);
>          /* Only set VM page LSN if we modified the page */
>          if (old_vmbits != vmflags)
>              PageSetLSN(BufferGetPage(vmbuffer), lsn);
> @@ -279,143 +275,6 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>          UnlockReleaseBuffer(vmbuffer);
>  }

Why are we manually pinning the vm buffer here? Shouldn't the xlog machinery
have done so, as you noticed in one of the early on patches?

Greetings,

Andres Freund



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Kirill Reshke
Date:
On Wed, 3 Sept 2025 at 04:11, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Tue, Sep 2, 2025 at 5:52 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > On Thu, Aug 28, 2025 at 5:12 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
> > >
> > > I did micro git-blame research here. I spotted only one related change
> > > [0]. Looks like before this change pin was indeed needed.
> > > But not after this change, so this visibilitymap_pin is just an oversight?
> > > Related thread is [1]. I quickly checked the discussion in this
> > > thread, and it looks like no one was bothered about these lines or VM
> > > logging changes (in this exact pin buffer aspect). The discussion was
> > > of other aspects of this commit.
> >
> > Wow, thanks so much for doing that research. Looking at it myself, it
> > does indeed seem like just an oversight. It isn't harmful since it
> > won't take another pin, but it is confusing, so I think we should at
> > least remove it in master. I'm not as sure about back branches.
>
> I've updated the commit message in the patch set to reflect the
> research you did in attached v8.
>
> - Melanie



Hi!

small comments regarding new series

0001, 0002, 0017 LGTM


In 0015:

```
reshke@yezzey-cbdb-bench:~/postgres$ git diff
src/backend/access/heap/pruneheap.c
diff --git a/src/backend/access/heap/pruneheap.c
b/src/backend/access/heap/pruneheap.c
index 05b51bd8d25..0794af9ae89 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -1398,7 +1398,7 @@ heap_prune_record_unchanged_lp_normal(Page page,
PruneState *prstate, OffsetNumb
                                /*
                                 * For now always use prstate->cutoffs
for this test, because
                                 * we only update 'all_visible' when
freezing is requested. We
-                                * could use
GlobalVisTestIsRemovableXid instead, if a
+                                * could use GlobalVisXidVisibleToAll
instead, if a
                                 * non-freezing caller wanted to set the VM bit.
                                 */
                                Assert(prstate->cutoffs);
```

Also, maybe GlobalVisXidTestAllVisible is a slightly better name? (The
term 'all-visible' is one that we occasionally utilize)


--
Best regards,
Kirill Reshke



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Date:
Thanks for the review!

On Tue, Sep 2, 2025 at 7:54 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2025-09-02 19:11:01 -0400, Melanie Plageman wrote:
> > From dd98177294011ee93cac122405516abd89f4e393 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Wed, 27 Aug 2025 08:50:15 -0400
> > Subject: [PATCH v8 01/22] Remove unneeded VM pin from VM replay

I didn't push it yet because I did a new version that actually
eliminates the asserts in heap_multi_insert() before calling
visibilitymap_set() -- since they are redundant with checks inside
visibilitymap_set(). 0001 of attached v9 is what I plan to push,
barring any objections.

> > From 7c5cb3edf89735eaa8bee9ca46111bd6c554720b Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Wed, 27 Aug 2025 10:07:29 -0400
> > Subject: [PATCH v8 02/22] Add assert and log message to visibilitymap_set

I pushed this.

> From an abstraction POV I don't love that heapam now is responsible for
> acquiring and releasing the lock. But that ship already kind of has sailed, as
> heapam.c is already responsible for releasing the vm buffer etc...
>
> I've wondered about splitting the responsibilities up into multiple
> visibilitymap_set_* functions, so that heapam.c wouldn't need to acquire the
> lock and set the LSN. But it's probably not worth it.

Yea, I explored heap wrappers coupling heap operations related to
setting the VM along with the VM updates [1], but the results weren't
appealing. Setting the heap LSN and marking the heap buffer dirty and
such happens in a different place in different callers because it is
happening as part of the operations that actually end up rendering the
page all-visible.

And a VM-only helper would literally just acquire and release the lock
and set the LSN on the vm page -- which I don't think is worth it.

> > +     /*
> > +      * Now read and update the VM block. Even if we skipped updating the heap
> > +      * page due to the file being dropped or truncated later in recovery, it's
> > +      * still safe to update the visibility map.  Any WAL record that clears
> > +      * the visibility map bit does so before checking the page LSN, so any
> > +      * bits that need to be cleared will still be cleared.
> > +      *
> > +      * It is only okay to set the VM bits without holding the heap page lock
> > +      * because we can expect no other writers of this page.
> > +      */
> > +     if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET &&
> > +             XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
> > +                                                                       &vmbuffer) == BLK_NEEDS_REDO)
> > +     {
> > +             Relation        reln = CreateFakeRelcacheEntry(rlocator);
> > +
> > +             Assert(visibilitymap_pin_ok(blkno, vmbuffer));
> > +             visibilitymap_set_vmbyte(reln, blkno,
> > +                                                              vmbuffer,
> > +                                                              VISIBILITYMAP_ALL_VISIBLE |
> > +                                                              VISIBILITYMAP_ALL_FROZEN);
> > +
> > +             /*
> > +              * It is not possible that the VM was already set for this heap page,
> > +              * so the vmbuffer must have been modified and marked dirty.
> > +              */
> > +             Assert(BufferIsDirty(vmbuffer));
>
> How about making visibilitymap_set_vmbyte() return whether it needed to do
> something? This seems somewhat indirect...

It does return the state of the previous bits. But, I am specifically
asserting that the buffer is dirty because I am about to set the page
LSN. So I don't just care that changes were made, I care that we
remembered to mark the buffer dirty.

> I think it might be good to encapsulate this code into a helper in
> visibilitymap.c, there will be more callers in the subsequent patches.

By the end of the set, the different callers have different
expectations (some don't expect the buffer to have been dirtied
necessarily) and where they do the various related operations is
spread out depending on the caller. I just couldn't come up with a
helper solution I liked.

That being said, I definitely don't think it's needed for this patch
(logging setting the VM in xl_heap_multi_insert()).

> > +uint8
> > +visibilitymap_set_vmbyte(Relation rel, BlockNumber heapBlk,
> > +                                              Buffer vmBuf, uint8 flags)
>
> Why is it named vmbyte? This actually just sets the two bits corresponding to
> the buffer, not the entire byte. So it seems somewhat misleading to reference
> byte.

Renamed it to visibilitymap_set_vmbits.

> > Instead of emitting a separate xl_heap_visible record for each page that
> > is rendered all-visible by vacuum's third phase, include the updates to
> > the VM in the already emitted xl_heap_prune record.
>
> Reading through the change I didn't particularly like that there's another
> optional field in xl_heap_prune, as it seemed liked something that should be
> encoded in flags.  Of course there aren't enough flag bits available.  But
> that made me look at the rest of the record: Uh, what do we use the reason
> field for?  As far as I can tell f83d709760d8 added it without introducing any
> users? It doesn't even seem to be set.

yikes, you are right about the "reason" member. Attached 0002 removes
it, and I'll go ahead and fix it in the back branches too. I can't
fathom how that slipped through the cracks. We do pass the PruneReason
for setting the rmgr info about what type of record it is (i.e. if it
is one emitted by vacuum phase I, phase III, or on-access pruning).
But we don't need or use a separate member. I went back and tried to
figure out what the rationale was, but I couldn't find anything.

As for the VM flags being an optional unaligned member -- in v9, I've
expanded the flags member to a uint16 to make room for the extra
flags. We've been using up 2 bytes there all along anyway.
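For orientation, after those two changes the record header looks roughly like this (a sketch, not the exact v9 definition):

typedef struct xl_heap_prune
{
	uint16		flags;		/* widened from uint8; "reason" member removed */

	/*
	 * If XLHP_HAS_CONFLICT_HORIZON, a snapshot conflict horizon
	 * (TransactionId) follows, then the optional freeze plans and
	 * redirected/dead/unused offset arrays.
	 */
} xl_heap_prune;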

> > @@ -51,10 +52,15 @@ heap_xlog_prune_freeze(XLogReaderState *record)
> >                  (xlrec.flags & (XLHP_HAS_REDIRECTIONS | XLHP_HAS_DEAD_ITEMS)) == 0);
> >
> >       /*
> > -      * We are about to remove and/or freeze tuples.  In Hot Standby mode,
> > -      * ensure that there are no queries running for which the removed tuples
> > -      * are still visible or which still consider the frozen xids as running.
> > -      * The conflict horizon XID comes after xl_heap_prune.
> > +      * After xl_heap_prune is the optional snapshot conflict horizon.
> > +      *
> > +      * In Hot Standby mode, we must ensure that there are no running queries
> > +      * which would conflict with the changes in this record. If pruning, that
> > +      * means we cannot remove tuples still visible to transactions on the
> > +      * standby. If freezing, that means we cannot freeze tuples with xids that
> > +      * are still considered running on the standby. And for setting the VM, we
> > +      * cannot do so if the page isn't all-visible to all transactions on the
> > +      * standby.
> >        */
>
> I'm a bit confused by this new comment - it sounds like we're deciding whether
> to remove tuple versions, but that decision has long been made, no?

Well, the comment is a revision of a comment that was already there on
essentially why replaying this record could cause recovery conflicts.
It mentioned pruning and freezing, so I expanded it to mention setting
the VM. Taking into account your confusion, I tried rewording it in
attached v9.

> > +     if (heap_page_is_all_visible_except_lpdead(vacrel->rel, buffer,
> > +                                                vacrel->cutoffs.OldestXmin,
> > +                                                deadoffsets, num_offsets,
> > +                                                &all_frozen, &visibility_cutoff_xid,
> > +                                                &vacrel->offnum))
>
> I am rather confused - we never can set all-visible if there are any LP_DEAD
> items left. If the idea is that we are removing the LP_DEAD items in
> lazy_vacuum_heap_page() - what guarantees that all LP_DEAD items are being
> removed? Couldn't some tuples get marked LP_DEAD by on-access pruning, after
> vacuum visited the page and collected dead items?
>
> Ugh, I see - it works because we pass in the set of dead items.  I think that
> makes the name *really* misleading, it's not except LP_DEAD, it's except the
> offsets passed in, no?
>
> But then you actually check that the set of dead items didn't change - what
> guarantees that?

So, I pass in the deadoffsets we got from the TIDStore. If the only
dead items on the page are exactly those dead items, then the page
will be all-visible as soon as we set those LP_UNUSED -- which we do
unconditionally. And we have the lock on the page, so no one can
on-access prune and make new dead items while we are in
lazy_vacuum_heap_page().

Given your confusion, I've refactored this and used a different
approach -- I explicitly check the passed-in deadoffsets array when I
encounter a dead item and see if it is there. That should hopefully
make it more clear.
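Something along these lines (a hypothetical helper illustrating the check, not the patch's actual code; it assumes the usual postgres includes and that deadoffsets[] is sorted, as the offsets coming out of the TID store are):

static int
cmp_offsetnumber(const void *a, const void *b)
{
	OffsetNumber oa = *(const OffsetNumber *) a;
	OffsetNumber ob = *(const OffsetNumber *) b;

	return (oa > ob) - (oa < ob);
}

/*
 * A dead line pointer only prevents the page from becoming all-visible if it
 * is NOT among the offsets vacuum is about to mark LP_UNUSED.
 */
static bool
offset_is_being_removed(OffsetNumber offnum,
						const OffsetNumber *deadoffsets, int num_offsets)
{
	return bsearch(&offnum, deadoffsets, num_offsets,
				   sizeof(OffsetNumber), cmp_offsetnumber) != NULL;
}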

> I didn't look at the later patches, except that I did notice this:
<--snip-->
> Why are we manually pinning the vm buffer here? Shouldn't the xlog machinery
> have done so, as you noticed in one of the early on patches?

Fixed. Thanks!

- Melanie

[1] https://www.postgresql.org/message-id/flat/CAAKRu_Yj%3DyrL%2BgGGsqfYVQcYn7rDp6hDeoF1vN453JDp8dEY%2Bw%40mail.gmail.com#94602c599abdc8dfc5e438bd24bd8d50

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Wed, Sep 3, 2025 at 5:06 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> small comments regarding new series
>
> 0001, 0002, 0017 LGTM

Thanks for continuing to review!

> In 0015:
>
> Also, maybe GlobalVisXidTestAllVisible is a slightly better name? (The
> term 'all-visible' is one that we occasionally utilize)

Actually, I was trying to distinguish it from all-visible because I
interpret that to mean everything is visible -- as in, every tuple on
a page is visible to everyone. And here we are referring to one xid
and want to know if it is visible to everyone as no longer running. I
don't think my name ("visible-to-all") is good, but I'm hesitant to
co-opt "all-visible" here.

- Melanie



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Fri, Sep 5, 2025 at 6:20 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> > On 2025-09-02 19:11:01 -0400, Melanie Plageman wrote:
> > > From dd98177294011ee93cac122405516abd89f4e393 Mon Sep 17 00:00:00 2001
> > > From: Melanie Plageman <melanieplageman@gmail.com>
> > > Date: Wed, 27 Aug 2025 08:50:15 -0400
> > > Subject: [PATCH v8 01/22] Remove unneeded VM pin from VM replay
>
> I didn't push it yet because I did a new version that actually
> eliminates the asserts in heap_multi_insert() before calling
> visibilitymap_set() -- since they are redundant with checks inside
> visibilitymap_set(). 0001 of attached v9 is what I plan to push,
> barring any objections.

I pushed this, so rebased v10 is attached. I've added one new patch:
0002 adds ERRCODE_DATA_CORRUPTED to the existing log messages about
VM/data corruption in vacuum. Andrey Borodin earlier suggested this,
and I had neglected to include it.

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
On Fri, Sep 5, 2025 at 6:20 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> yikes, you are right about the "reason" member. Attached 0002 removes
> it, and I'll go ahead and fix it in the back branches too.

I think changing this in the back-branches is a super-bad idea. If you
want, you can add a comment in the back-branches saying "oops, we
shipped a field that isn't used for anything", but changing the struct
definition is very likely to make 0 people happy and >0 people
unhappy. On the other hand, changing this in master is a good idea and
you should go ahead and do that before this creates any more
confusion.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Mon, Sep 8, 2025 at 12:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Sep 5, 2025 at 6:20 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > yikes, you are right about the "reason" member. Attached 0002 removes
> > it, and I'll go ahead and fix it in the back branches too.
>
> I think changing this in the back-branches is a super-bad idea. If you
> want, you can add a comment in the back-branches saying "oops, we
> shipped a field that isn't used for anything", but changing the struct
> definition is very likely to make 0 people happy and >0 people
> unhappy. On the other hand, changing this in master is a good idea and
> you should go ahead and do that before this creates any more
> confusion.

Yes, that makes 100% sense. It should have occurred to me. I've pushed
the commit to master. I didn't put an updated set of patches here in
case someone was already reviewing them, as nothing else has changed.

- Melanie



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
On Mon, Sep 8, 2025 at 11:44 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I pushed this, so rebased v10 is  attached. I've added one new patch:
> 0002 adds ERRCODE_DATA_CORRUPTED to the existing log messages about
> VM/data corruption in vacuum. Andrey Borodin earlier suggested this,
> and I had neglected to include it.

Writing "ereport(WARNING, (errcode(ERRCODE_DATA_CORRUPTED)" is very
much a minority position. Generally the call to errcode() is on the
following line. I think the commit message could use a bit of work,
too. The first sentence heavily duplicates the second and the fourth,
and the third sentence isn't sufficiently well-connected to the rest
to make it clear why you're restating this general principle in this
commit message.

Perhaps something like:

Add error codes when VACUUM discovers VM corruption

Commit fd6ec93bf890314ac694dc8a7f3c45702ecc1bbd and other previous
work has established the principle that when an error is potentially
reachable in case of on-disk corruption, but is not expected to be
reached otherwise, ERRCODE_DATA_CORRUPTED should be used. This allows
log monitoring software to search for evidence of corruption by
filtering on the error code.

That kibitzing aside, I think this is pretty clearly the right thing to do.
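For reference, the more usual layout would be something like this (the message wording here is just illustrative, not the patch's actual text):

    ereport(WARNING,
            errcode(ERRCODE_DATA_CORRUPTED),
            errmsg("page %u of relation \"%s\" is marked all-visible in the visibility map but contains dead tuples",
                   blkno, RelationGetRelationName(rel)));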

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Mon, Sep 8, 2025 at 2:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Commit fd6ec93bf890314ac694dc8a7f3c45702ecc1bbd and other previous
> work has established the principle that when an error is potentially
> reachable in case of on-disk corruption, but is not expected to be
> reached otherwise, ERRCODE_DATA_CORRUPTED should be used. This allows
> log monitoring software to search for evidence of corruption by
> filtering on the error code.
>
> That kibitzing aside, I think this is pretty clearly the right thing to do.

Thanks for the suggested wording and the pointer to that thread.

I noticed that in that thread they decided to use errmsg_internal()
instead of errmsg() for a few different reasons -- one of which was
that the situation is not supposed to happen/cannot happen -- which I
don't really understand. It is a reachable code path. Another is that
it is extra work for translators, which I'm also not sure how to apply
to my situation. Are the VM corruption cases worth extra work to the
translators?

I think the most compelling reason is that people will want to search
for the error message in English online. So, for that reason, my
instinct is to use errmsg_internal() in my case as well.

- Melanie



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
On Mon, Sep 8, 2025 at 3:14 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I noticed that in that thread they decided to use errmsg_internal()
> instead of errmsg() for a few different reasons -- one of which was
> that the situation is not supposed to happen/cannot happen -- which I
> don't really understand. It is a reachable code path. Another is that
> it is extra work for translators, which I'm also not sure how to apply
> to my situation. Are the VM corruption cases worth extra work to the
> translators?
>
> I think the most compelling reason is that people will want to search
> for the error message in English online. So, for that reason, my
> instinct is to use errmsg_internal() in my case as well.

I don't find that reason particularly compelling -- people could want
to search for any error message, or they could equally want to be able
to read it without Google translate. Guessing which messages are
obscure enough that we need not translate them exceeds my powers. If I
were doing it, I'd make it errmsg() rather than errmsg_internal() and
let the translations team change it if they don't think it's worth
bothering with, because if you make it errmsg_internal() then they
won't see it until somebody complains about it not getting translated.
However, I suspect different committers would pursue different
strategies here.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
Reviewing 0003:

+               /*
+                * If we're only adding already frozen rows to a previously empty
+                * page, mark it as all-frozen and update the visibility map. We're
+                * already holding a pin on the vmbuffer.
+                */
                else if (all_frozen_set)
+               {
                        PageSetAllVisible(page);
+                       LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
+                       visibilitymap_set_vmbits(relation,
+                                                BufferGetBlockNumber(buffer),
+                                                vmbuffer,
+                                                VISIBILITYMAP_ALL_VISIBLE |
+                                                VISIBILITYMAP_ALL_FROZEN);

Locking a buffer in a critical section violates the order of
operations proposed in the 'Write-Ahead Log Coding' section of
src/backend/access/transam/README.

+        * Now read and update the VM block. Even if we skipped updating the heap
+        * page due to the file being dropped or truncated later in recovery, it's
+        * still safe to update the visibility map.  Any WAL record that clears
+        * the visibility map bit does so before checking the page LSN, so any
+        * bits that need to be cleared will still be cleared.
+        *
+        * It is only okay to set the VM bits without holding the heap page lock
+        * because we can expect no other writers of this page.

The first paragraph of this paraphrases similar content in
xlog_heap_visible(), but I don't see the variation in phrasing as an
improvement.

The second paragraph does not convince me at all. I see no reason to
believe that this is safe, or that it is a good idea. The code in
xlog_heap_visible() thinks it's OK to unlock and relock the page to
make visibilitymap_set() happy, which is cringy but probably safe for
lack of concurrent writers, but skipping locking altogether seems
deeply unwise.

- *             visibilitymap_set        - set a bit in a previously pinned page
+ *             visibilitymap_set        - set bit(s) in a previously pinned page and log
+ *      visibilitymap_set_vmbits - set bit(s) in a pinned page

I suspect the indentation was done with a different mix of spaces and
tabs here, because this doesn't align for me.

In general, this idea makes some sense to me -- there doesn't seem to
be any particularly good reason why the visibility-map update should
be handled by a different WAL record than the all-visible flag on the
page itself. It's a little hard for me to make that statement too
conclusively without studying more of the patches than I've had time
to do today, but off the top of my head it seems to make sense.
However, I'm not sure you've taken enough care with the details here.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Mon, Sep 8, 2025 at 4:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Reviewing 0003:
>
> Locking a buffer in a critical section violates the order of
> operations proposed in the 'Write-Ahead Log Coding' section of
> src/backend/access/transam/README.

Right, I noticed some other callers of visibilitymap_set() (like
lazy_scan_new_or_empty()) did call it in a critical section (and it
exclusively locks the VM page), so I thought perhaps it was better to
keep this operation as close as possible to where we update the VM
(similar to how it is in master in visibilitymap_set()).

But, I think you're right that maintaining the order of operations
proposed in transam/README is more important. As such, in attached
v11, I've modified this patch and the other patches where I replace
visibilitymap_set() with visibilitymap_set_vmbits() to exclusively
lock the vmbuffer before the critical section.
visibilitymap_set_vmbits() asserts that we have the vmbuffer
exclusively locked, so we should be good.
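The ordering now looks roughly like this (a sketch, not the patch's exact code; log_prune_record_with_vm is a placeholder for whatever record the caller actually emits, and the heap buffer is assumed to be exclusively locked already):

static void
set_page_all_frozen(Relation rel, Buffer heapbuf, Buffer vmbuffer)
{
	Page		page = BufferGetPage(heapbuf);

	/* lock the VM page before entering the critical section */
	LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);

	START_CRIT_SECTION();

	PageSetAllVisible(page);
	MarkBufferDirty(heapbuf);

	visibilitymap_set_vmbits(rel, BufferGetBlockNumber(heapbuf), vmbuffer,
							 VISIBILITYMAP_ALL_VISIBLE |
							 VISIBILITYMAP_ALL_FROZEN);

	if (RelationNeedsWAL(rel))
	{
		/* placeholder for the heap WAL record that now carries the VM info */
		XLogRecPtr	recptr = log_prune_record_with_vm(rel, heapbuf, vmbuffer);

		PageSetLSN(page, recptr);
		PageSetLSN(BufferGetPage(vmbuffer), recptr);
	}

	END_CRIT_SECTION();

	LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
}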

> +        * Now read and update the VM block. Even if we skipped
> updating the heap
> +        * page due to the file being dropped or truncated later in
> recovery, it's
> +        * still safe to update the visibility map.  Any WAL record that clears
> +        * the visibility map bit does so before checking the page LSN, so any
> +        * bits that need to be cleared will still be cleared.
> +        *
> +        * It is only okay to set the VM bits without holding the heap page lock
> +        * because we can expect no other writers of this page.
>
> The first paragraph of this paraphrases a similar content in
> xlog_heap_visible(), but I don't see the variation in phrasing as an
> improvement.

The only difference is I replaced the phrase "LSN interlock" with
"being dropped or truncated later in recovery" -- which is more
specific and, I thought, more clear. Without this comment, it took me
some time to understand the scenarios that might lead us to skip
updating the heap block. heap_xlog_visible() has cause to describe
this situation in an earlier comment -- which is why I think the LSN
interlock comment is less confusing there.

Anyway, I'm open to changing the comment. I could:
1) copy-paste the same comment as heap_xlog_visible()
2) refer to the comment in heap_xlog_visible() (comment seemed a bit
short for that)
3) diverge the comments further by improving the new comment in
heap_xlog_multi_insert() in some way
4) something else?

> The second paragraph does not convince me at all. I see no reason to
> believe that this is safe, or that it is a good idea. The code in
> xlog_heap_visible() thinks its OK to unlock and relock the page to
> make visibilitymap_set() happy, which is cringy but probably safe for
> lack of concurrent writers, but skipping locking altogether seems
> deeply unwise.

Actually in master, heap_xlog_visible() has no lock on the heap page
when it calls visibilitymap_set(). It releases that lock before
recording the freespace in the FSM and doesn't take it again.

It does unlock and relock the VM page -- because visibilitymap_set()
expects to take the lock on the VM.

I agree that not holding the heap lock while updating the VM is
unsatisfying. We can't hold it while doing the IO to read in the VM
block in XLogReadBufferForRedoExtended(). So, we could take it again
before calling visibilitymap_set(). But we don't always have the heap
buffer, though. I suspect this is partially why heap_xlog_visible()
unconditionally passes InvalidBuffer to visibilitymap_set() as the
heap buffer and has special case handling for recovery when we don't
have the heap buffer.

In any case, it isn't an active bug, and I don't think future-proofing
VM replay (i.e. against parallel recovery) is a prerequisite for
committing this patch since it is also that way on master.

> - *             visibilitymap_set        - set a bit in a previously pinned page
> + *             visibilitymap_set        - set bit(s) in a previously
> pinned page and log
> + *      visibilitymap_set_vmbits - set bit(s) in a pinned page
>
> I suspect the indentation was done with a different mix of spaces and
> tabs here, because this doesn't align for me.

oops, fixed.

I pushed the ERRCODE_DATA_CORRUPTED patch, so attached v11 is rebased
and also has the changes mentioned above.

Since you've started reviewing the set, I'll note that patches
0005-0011 are split up for ease of review and it may not necessarily
make sense to keep that separation for eventual commit. They are a
series of steps to move VM updates from lazy_scan_prune() into
pruneheap.c.

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
On Mon, Sep 8, 2025 at 6:29 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> But, I think you're right that maintaining the order of operations
> proposed in transam/README is more important. As such, in attached
> v11, I've modified this patch and the other patches where I replace
> visibilitymap_set() with visibilitymap_set_vmbits() to exclusively
> lock the vmbuffer before the critical section.
> visibilitymap_set_vmbits() asserts that we have the vmbuffer
> exclusively locked, so we should be good.

That sounds good. I think it is OK to keep some of the odd things that
we're currently doing if they're hard to eliminate, but if they're not
really needed then I'd rather see us standardize the code. I feel (and
I think you may agree, based on other conversations that we've had)
that the visibility map code is somewhat oddly structured, and I'd
like to see us push the amount of oddness down rather than up, if we
can reasonably do so without breaking everything.

> The only difference is I replaced the phrase "LSN interlock" with
> "being dropped or truncated later in recovery" -- which is more
> specific and, I thought, more clear. Without this comment, it took me
> some time to understand the scenarios that might lead us to skip
> updating the heap block. heap_xlog_visible() has cause to describe
> this situation in an earlier comment -- which is why I think the LSN
> interlock comment is less confusing there.
>
> Anyway, I'm open to changing the comment. I could:
> 1) copy-paste the same comment as heap_xlog_visible()
> 2) refer to the comment in heap_xlog_visible() (comment seemed a bit
> short for that)
> 3) diverge the comments further by improving the new comment in
> heap_xlog_multi_insert() in some way
> 4) something else?

IMHO, copying and pasting comments is not great, and comments with
identical intent and divergent wording are also not great. The former
is not great because having a whole bunch of copies of the same
comment, especially if it's a block comment rather than a 1-liner,
uses up a bunch of space and creates a maintenance hazard in the sense
that future updates might not get propagated to all copies. The latter
is not great because it makes it hard to grep for other instances that
should be adjusted when you adjust one, and also because if one
version really is better than the other then ideally we'd like to have
the good version everywhere. Of course, there's some tension between
these two goals. In this particular case, thinking a little harder
about your proposed change, it seems to me that "LSN interlock" is
more clear about what the immediate test is that would cause us to
skip updating the heap page, and "being dropped or truncated later in
recovery" is more clear about what the larger state of the world that
would lead to that situation is. But whatever preference anyone might
have about which way to go with that choice, it is hard to see why the
preference should go one way in one case and the other way in another
case. Therefore, I favor an approach that leads either to an identical
comment in both places, or to one comment referring to the other.

> > The second paragraph does not convince me at all. I see no reason to
> > believe that this is safe, or that it is a good idea. The code in
> xlog_heap_visible() thinks it's OK to unlock and relock the page to
> > make visibilitymap_set() happy, which is cringy but probably safe for
> > lack of concurrent writers, but skipping locking altogether seems
> > deeply unwise.
>
> Actually in master, heap_xlog_visible() has no lock on the heap page
> > when it calls visibilitymap_set(). It releases that lock before
> recording the freespace in the FSM and doesn't take it again.
>
> It does unlock and relock the VM page -- because visibilitymap_set()
> expects to take the lock on the VM.
>
> I agree that not holding the heap lock while updating the VM is
> unsatisfying. We can't hold it while doing the IO to read in the VM
> block in XLogReadBufferForRedoExtended(). So, we could take it again
> before calling visibilitymap_set(). But we don't always have the heap
> buffer, though. I suspect this is partially why heap_xlog_visible()
> unconditionally passes InvalidBuffer to visibilitymap_set() as the
> heap buffer and has special case handling for recovery when we don't
> have the heap buffer.

You know, I wasn't thinking carefully enough about the distinction
between the heap page and the visibility map page here. I thought you
were saying that you were modifying a page without a lock on that
page, but you aren't: you're saying you're modifying a page without a
lock on another page to which it is related. The former seems
disastrous, but the latter might be OK. However, I'm sort of confused
about what the comment is trying to say to justify that:

+        * It is only okay to set the VM bits without holding the heap page lock
+        * because we can expect no other writers of this page.

It is not exactly clear to me whether "this page" here refers to the
heap page or the VM page. If it means the heap page, why should that
be so if we haven't got any kind of lock? If it means the VM page,
then why is the heap page even relevant?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Tue, Sep 9, 2025 at 10:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Sep 8, 2025 at 6:29 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>
> > The only difference is I replaced the phrase "LSN interlock" with
> > "being dropped or truncated later in recovery" -- which is more
> > specific and, I thought, more clear. Without this comment, it took me
> > some time to understand the scenarios that might lead us to skip
> > updating the heap block. heap_xlog_visible() has cause to describe
> > this situation in an earlier comment -- which is why I think the LSN
> > interlock comment is less confusing there.
> >
> > Anyway, I'm open to changing the comment. I could:
> > 1) copy-paste the same comment as heap_xlog_visible()
> > 2) refer to the comment in heap_xlog_visible() (comment seemed a bit
> > short for that)
> > 3) diverge the comments further by improving the new comment in
> > heap_xlog_multi_insert() in some way
> > 4) something else?
>
> IMHO, copying and pasting comments is not great, and comments with
> identical intent and divergent wording are also not great. The former
> is not great because having a whole bunch of copies of the same
> comment, especially if it's a block comment rather than a 1-liner,
> uses up a bunch of space and creates a maintenance hazard in the sense
> that future updates might not get propagated to all copies. The latter
> is not great because it makes it hard to grep for other instances that
> should be adjusted when you adjust one, and also because if one
> version really is better than the other then ideally we'd like to have
> the good version everywhere. Of course, there's some tension between
> these two goals. In this particular case, thinking a little harder
> about your proposed change, it seems to me that "LSN interlock" is
> more clear about what the immediate test is that would cause us to
> skip updating the heap page, and "being dropped or truncated later in
> recovery" is more clear about what the larger state of the world that
> would lead to that situation is. But whatever preference anyone might
> have about which way to go with that choice, it is hard to see why the
> preference should go one way in one case and the other way in another
> case. Therefore, I favor an approach that leads either to an identical
> comment in both places, or to one comment referring to the other.

I see what you are saying.

For heap_xlog_visible() the LSN interlock comment is easier to parse
because of an earlier comment before reading the heap page:

    /*
     * Read the heap page, if it still exists. If the heap file has dropped or
     * truncated later in recovery, we don't need to update the page, but we'd
     * better still update the visibility map.
     */

I've gone with the direct copy-paste of the LSN interlock paragraph in
attached v12. I think referring to the other comment is too confusing
in context here. However, I also added a line about what could cause
the LSN interlock -- but above it, so as to retain grep-ability of the
other comment.

> > > The second paragraph does not convince me at all. I see no reason to
> > > believe that this is safe, or that it is a good idea. The code in
> > > xlog_heap_visible() thinks it's OK to unlock and relock the page to
> > > make visibilitymap_set() happy, which is cringy but probably safe for
> > > lack of concurrent writers, but skipping locking altogether seems
> > > deeply unwise.
> >
> > Actually in master, heap_xlog_visible() has no lock on the heap page
> > > when it calls visibilitymap_set(). It releases that lock before
> > recording the freespace in the FSM and doesn't take it again.
> >
> > It does unlock and relock the VM page -- because visibilitymap_set()
> > expects to take the lock on the VM.
> >
> > I agree that not holding the heap lock while updating the VM is
> > unsatisfying. We can't hold it while doing the IO to read in the VM
> > block in XLogReadBufferForRedoExtended(). So, we could take it again
> > before calling visibilitymap_set(). But we don't always have the heap
> > buffer, though. I suspect this is partially why heap_xlog_visible()
> > unconditionally passes InvalidBuffer to visibilitymap_set() as the
> > heap buffer and has special case handling for recovery when we don't
> > have the heap buffer.
>
> You know, I wasn't thinking carefully enough about the distinction
> between the heap page and the visibility map page here. I thought you
> were saying that you were modifying a page without a lock on that
> page, but you aren't: you're saying you're modifying a page without a
> lock on another page to which it is related. The former seems
> disastrous, but the latter might be OK. However, I'm sort of confused
> about what the comment is trying to say to justify that:
>
> +        * It is only okay to set the VM bits without holding the heap page lock
> +        * because we can expect no other writers of this page.
>
> It is not exactly clear to me whether "this page" here refers to the
> heap page or the VM page. If it means the heap page, why should that
> be so if we haven't got any kind of lock? If it means the VM page,
> then why is the heap page even relevant?

I've expanded the comment in v12. In normal operation we must have the
lock on the heap page when setting the VM bits because if another
backend cleared PD_ALL_VISIBLE, we could have the forbidden scenario
where PD_ALL_VISIBLE is clear and the VM is set. This is not allowed
because then someone else may read the VM, conclude the page is
all-visible, and then an index-only scan can return wrong results. In
recovery, there are no concurrent writers, so it can't happen.

It is worth discussing how to fix it in heap_xlog_visible() so that
future scenarios like parallel recovery could not break this. However,
this patch is not a deviation from the behavior on master, and,
technically the behavior on master works.

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
On Tue, Sep 9, 2025 at 12:24 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> For heap_xlog_visible() the LSN interlock comment is easier to parse
> because of an earlier comment before reading the heap page:
>
>     /*
>      * Read the heap page, if it still exists. If the heap file has dropped or
>      * truncated later in recovery, we don't need to update the page, but we'd
>      * better still update the visibility map.
>      */
>
> I've gone with the direct copy-paste of the LSN interlock paragraph in
> attached v12. I think referring to the other comment is too confusing
> in context here. However, I also added a line about what could cause
> the LSN interlock -- but above it, so as to retain grep-ability of the
> other comment.

I think that reads a little strangely. I would consolidate: Note that
the heap relation may have been dropped or truncated, leading us to
skip updating the heap block due to the LSN interlock. However, even
in that case, it's still safe to update the visibility map, etc. The
rest of the comment is perhaps a tad more explicit than our usual
practice, but that might be a good thing, because sometimes we're a
little too terse about these critical details.

I just realized that I don't like this:

+ /*
+ * If we're only adding already frozen rows to a previously empty
+ * page, mark it as all-frozen and update the visibility map. We're
+ * already holding a pin on the vmbuffer.
+ */

The thing is, we rarely position a block comment just before an "else
if". There are probably instances, but it's not typical. That's why
the existing comment contains two "if blah then blah" statements of
which you deleted the second -- because it needed to cover both the
"if" and the "else if". An alternative style is to move the comment
down a nesting level and rephrase without the conditional, i.e. "We're
only adding frozen rows to a previously empty page, so mark it as
all-frozen etc." But I don't know that I like doing that for one
branch of the "if" and not the other.

The rest of what's now 0001 looks OK to me now, although you might
want to wait for a review from somebody more knowledgeable about this
area.

Some very quick comments on the next few patches -- far from a full review:

0002. Looks boring, probably unobjectionable provided the payoff patch is OK.

0003. What you've done here with xl_heap_prune.flags is kind of
horrifying. The problem is that, while you've added code explaining
that VISIBILITYMAP_ALL_{VISIBLE,FROZEN} are honorary XLHP flags,
nobody who isn't looking directly at that comment is going to
understand the muddling of the two namespaces. I would suggest not
doing this, even if it means defining redundant constants and writing
technically-unnecessary code to translate between them.

0004. It is not clear to me why you need to get
log_heap_prune_and_freeze to do the work here. Why can't
log_newpage_buffer get the job done already?

0005. It looks a little curious that you delete the
identify-corruption logic from the end of the if-nest and add it to
the beginning. Ceteris paribus, you'd expect that to be worse, since
corruption is a rare case.

0006. "to me marked" -> "to be marked".

+                * If the heap page is all-visible but the VM bit is not set, we don't
+                * need to dirty the heap page.  However, if checksums are enabled, we
+                * do need to make sure that the heap page is dirtied before passing
+                * it to visibilitymap_set(), because it may be logged.
                 */
-               PageSetAllVisible(page);
-               MarkBufferDirty(buf);
+               if (!PageIsAllVisible(page) || XLogHintBitIsNeeded())
+               {
+                       PageSetAllVisible(page);
+                       MarkBufferDirty(buf);
+               }

I really hate this. Maybe you're going to argue that it's not the job
of this patch to fix the awfulness here, but surely marking a buffer
dirty in case some other function decides to WAL-log it is a
ridiculous plan.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
Thanks for the review! I've made the changes to comments and minor
fixes you suggested in attached v13 and have limited my inline
responses to areas where further discussion is required.

On Tue, Sep 9, 2025 at 3:26 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> 0003. What you've done here with xl_heap_prune.flags is kind of
> horrifying. The problem is that, while you've added code explaining
> that VISIBILITYMAP_ALL_{VISIBLE,FROZEN} are honorary XLHP flags,
> nobody who isn't looking directly at that comment is going to
> understand the muddling of the two namespaces. I would suggest not
> doing this, even if it means defining redundant constants and writing
> technically-unnecessary code to translate between them.

Fair. I've introduced new XLHP flags in attached v13. Hopefully it
puts an end to the horror.
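The shape of it is roughly as below -- bit values and names are illustrative and not necessarily the ones in v13, but it shows the separate namespace plus the "technically-unnecessary" translation you suggested:

/* new flag bits in xl_heap_prune.flags (existing XLHP_* bits occupy 1..7) */
#define XLHP_VM_ALL_VISIBLE		(1 << 8)
#define XLHP_VM_ALL_FROZEN		(1 << 9)

static inline uint8
xlhp_flags_to_vmbits(uint16 xlhp_flags)
{
	uint8		vmbits = 0;

	if (xlhp_flags & XLHP_VM_ALL_VISIBLE)
		vmbits |= VISIBILITYMAP_ALL_VISIBLE;
	if (xlhp_flags & XLHP_VM_ALL_FROZEN)
		vmbits |= VISIBILITYMAP_ALL_FROZEN;

	return vmbits;
}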

> 0004. It is not clear to me why you need to get
> log_heap_prune_and_freeze to do the work here. Why can't
> log_newpage_buffer get the job done already?

Well, I need something to emit the changes to the VM. I'm eliminating
all users of xl_heap_visible. Empty pages are the ones that benefit
the least from switching from xl_heap_visible -> xl_heap_prune. But,
if I don't transition them, we have to maintain all the
xl_heap_visible code (including visibilitymap_set() in its long form).

As for log_newpage_buffer(): if you think it is too confusing to change
log_heap_prune_and_freeze()'s API (by passing force_heap_fpi) to handle
this case, I can leave log_newpage_buffer() there and then call
log_heap_prune_and_freeze().

I just thought it seemed simple to avoid emitting the new page record
and the VM update record, so why not -- but I don't have strong
feelings.

> 0005. It looks a little curious that you delete the
> identify-corruption logic from the end of the if-nest and add it to
> the beginning. Ceteris paribus, you'd expect that to be worse, since
> corruption is a rare case.

On master, the two corruption cases are sandwiched between the normal
VM set cases. And I actually think doing it this way is brittle. If
you put the cases which set the VM first, you have to completely
bulletproof the if statements guarding them to foreclose any possible
corruption case from entering because otherwise you will overwrite the
corruption you then try to detect.

But, specifically, from a performance perspective:

I think moving up the third case doesn't matter because the check is so cheap:

    else if (presult.lpdead_items > 0 && PageIsAllVisible(page))

And as for moving up the second case (the other corruption case), the
non-cheap thing it does is call visibilitymap_get_status()

    else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
             visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)

But once you've called visibilitymap_get_status(), assuming there is
no corruption and you need to go set the VM, you've already got that
page of the VM read, so it is probably pretty cheap. Overall, I didn't
think this would add noticeable overhead or many wasted operations.

And I thought that reorganizing the code improved clarity as well as
decreased the likelihood of bugs from insufficiently guarding positive
cases against corrupt pages and overwriting corruption instead of
detecting it.

If we're really worried about it from a performance perspective, I
could add an extra test at the top of identify_and_fix_vm_corruption()
that dumps out early if (!all_visible_according_to_vm &&
presult.all_visible).
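That is, something like this (a sketch only; the names follow this discussion, not necessarily the patch):

static bool
vm_corruption_possible(bool all_visible_according_to_vm,
					   const PruneFreezeResult *presult)
{
	/* common case: VM bit not yet set and the page came out all-visible */
	if (!all_visible_according_to_vm && presult->all_visible)
		return false;

	return true;
}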

> +                * If the heap page is all-visible but the VM bit is not set, we don't
> +                * need to dirty the heap page.  However, if checksums are enabled, we
> +                * do need to make sure that the heap page is dirtied before passing
> +                * it to visibilitymap_set(), because it may be logged.
>                  */
> -               PageSetAllVisible(page);
> -               MarkBufferDirty(buf);
> +               if (!PageIsAllVisible(page) || XLogHintBitIsNeeded())
> +               {
> +                       PageSetAllVisible(page);
> +                       MarkBufferDirty(buf);
> +               }
>
> I really hate this. Maybe you're going to argue that it's not the job
> of this patch to fix the awfulness here, but surely marking a buffer
> dirty in case some other function decides to WAL-log it is a
> ridiculous plan.

Right, it isn't pretty. But I don't quite see what the alternative is.
We need to mark the buffer dirty before setting the LSN. We could
perhaps rewrite visibilitymap_set()'s API to return the LSN of the
xl_heap_visible record and stamp it on the heap buffer ourselves. But
1) I think visibilitymap_set() purposefully conceals its WAL logging
ways from the caller and propagating that info back up starts to make
the API messy in another way and 2) I'm a bit loath to make big
changes to visibilitymap_set() right now since my patch set eventually
resolves this by putting the changes to the VM and heap page in the
same WAL record.

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
On Tue, Sep 9, 2025 at 7:08 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Fair. I've introduced new XLHP flags in attached v13. Hopefully it
> puts an end to the horror.

I suggest not renumbering all of the existing flags and just adding
these new ones at the end. Less code churn and more likely to break in
an obvious way if you mix up the two sets of flags.

More on 0002:

+ set_heap_lsn = XLogHintBitIsNeeded() ? true : set_heap_lsn;

Maybe just if (XLogHintBitIsNeeded()) set_heap_lsn = true? I don't feel
super-strongly that what you've done is bad but it looks weird to my
eyes.

+ * If we released any space or line pointers or will be setting a page in
+ * the visibility map, measure the page's freespace to later update the

"setting a page in the visibility map" seems a little muddled to me.
You set bits, not pages.

+ * all-visible (or all-frozen, depending on the vacuum mode,) which is

This comma placement is questionable.

  /*
+ * Note that the heap relation may have been dropped or truncated, leading
+ * us to skip updating the heap block due to the LSN interlock. However,
+ * even in that case, it's still safe to update the visibility map. Any
+ * WAL record that clears the visibility map bit does so before checking
+ * the page LSN, so any bits that need to be cleared will still be
+ * cleared.
+ *
+ * Note that the lock on the heap page was dropped above. In normal
+ * operation this would never be safe because a concurrent query could
+ * modify the heap page and clear PD_ALL_VISIBLE -- violating the
+ * invariant that PD_ALL_VISIBLE must be set if the corresponding bit in
+ * the VM is set.
+ *
+ * In recovery, we expect no other writers, so writing to the VM page
+ * without holding a lock on the heap page is considered safe enough. It
+ * is done this way when replaying xl_heap_visible records (see
  */

How many copies of this comment do you plan to end up with?

The comment for log_heap_prune_and_freeze seems to be anticipating future work.

> > 0004. It is not clear to me why you need to get
> > log_heap_prune_and_freeze to do the work here. Why can't
> > log_newpage_buffer get the job done already?
>
> Well, I need something to emit the changes to the VM. I'm eliminating
> all users of xl_heap_visible. Empty pages are the ones that benefit
> the least from switching from xl_heap_visible -> xl_heap_prune. But,
> if I don't transition them, we have to maintain all the
> xl_heap_visible code (including visibilitymap_set() in its long form).
>
> As for log_newpage_buffer(), I could keep it if you think it is too
> confusing to change log_heap_prune_and_freeze()'s API (by passing
> force_heap_fpi) to handle this case, I can leave log_newpage_buffer()
> there and then call log_heap_prune_and_freeze().
>
> I just thought it seemed simple to avoid emitting the new page record
> and the VM update record, so why not -- but I don't have strong
> feelings.

Yeah, I'm not sure what the right thing to do here is. I think I was
again experiencing brain fade by forgetting that there is a heap page
and a VM page and, of course, log_heap_newpage() probably isn't going
to touch the latter. So that makes sense. On the other hand, we could
only have one type of WAL record for every single operation in the
system if we gave it enough flags, and force_heap_fpi seems
suspiciously like a flag that turns this into a whole different kind
of WAL record.

> > 0005. It looks a little curious that you delete the
> > identify-corruption logic from the end of the if-nest and add it to
> > the beginning. Ceteris paribus, you'd expect that to be worse, since
> > corruption is a rare case.
>
> On master, the two corruption cases are sandwiched between the normal
> VM set cases. And I actually think doing it this way is brittle. If
> you put the cases which set the VM first, you have to have completely
> bulletproof the if statements guarding them to foreclose any possible
> corruption case from entering because otherwise you will overwrite the
> corruption you then try to detect.

Hmm. In the current code, we first test (!all_visible_according_to_vm
&& presult.all_visible), then (all_visible_according_to_vm &&
!PageIsAllVisible(page) && visibilitymap_get_status(vacrel->rel,
blkno, &vmbuffer) != 0), and then (presult.lpdead_items > 0 &&
PageIsAllVisible(page)). The first and second can never coexist,
because they require opposite values of all_visible_according_to_vm.
The second and third cannot coexist because they require opposite
values of PageIsAllVisible(page). It is not entirely obvious that the
first and third tests couldn't both pass, but you'd have to have
presult.all_visible and presult.lpdead_items > 0, and it's a bit hard
to see how heap_page_prune_and_freeze() could ever allow that.
Consider:

    if (prstate.all_visible && prstate.lpdead_items == 0)
    {
        presult->all_visible = prstate.all_visible;
        presult->all_frozen = prstate.all_frozen;
    }
    else
    {
        presult->all_visible = false;
        presult->all_frozen = false;
    }
...
    presult->lpdead_items = prstate.lpdead_items;

So I don't really think I'm persuaded that the current way is brittle.
But that having been said, I agree with you that the order of the
checks is kind of random, and I don't think it really matters that
much for performance. What does matter is clarity. I feel like what
I'd ideally like this logic to do is say: do we want the VM bit for
the page to be set to all-frozen, just all-visible, or neither? Then
push the VM bit to the correct state, dragging the page-level bit
along behind. And the current logic sort of does that. It's roughly:

1. Should we go from not-all-visible to either all-visible or
all-frozen? If yes, do so.
2. Should we go from either all-visible or all-frozen to
not-all-visible? If yes, do so.
3. Should we go from either all-visible or all-frozen to
not-all-visible for a different reason? If yes, do so.
4. Should we go from all-visible to all-frozen? If yes, do so.

But what's weird is that all the tests are written differently, and we
have two different reasons for going to not-all-visible, namely
PD_ALL_VISIBLE-not-set and dead-items-on-page, whereas there's only
one test for each of the other state-transitions, because the
decision-making for those cases is fully completed at an earlier
stage. I would kind of like to see this expressed in a way that first
decides which state transition to make (forward-to-all-frozen,
forward-to-all-visible, backward-to-all-visible,
backward-to-not-all-visible, nothing) and then does the corresponding
work. What you're doing instead is splitting half of those functions
off into a helper function while keeping the other half where they are
without cleaning up any of the logic. Now, maybe that's OK: I'm far
from having grokked the whole patch set. But it is not any more clear
than what we have now, IMHO, and perhaps even a bit less so.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Wed, Sep 10, 2025 at 4:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Sep 9, 2025 at 7:08 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > Fair. I've introduced new XLHP flags in attached v13. Hopefully it
> > puts an end to the horror.
>
> I suggest not renumbering all of the existing flags and just adding
> these new ones at the end. Less code churn and more likely to break in
> an obvious way if you mix up the two sets of flags.

Makes sense. In my attached v14, I have not renumbered them.

> More on 0002:

After an off-list discussion we had about how to make the patch set
progressively improve the code instead of just mechanically
refactoring it, I have made some big changes to the intermediate
patches.

Before actually including the VM changes in the vacuum/prune WAL
records, I first include setting PD_ALL_VISIBLE with the other changes
to the heap page so that we can remove the heap page from the VM
setting WAL chain. This happens to fix the bug we discussed where if
you set an all-visible page all-frozen and checksums/wal_log_hints are
enabled, you may end up setting an LSN on a page that was not marked
dirty.
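The invariant that reordering restores is essentially the following (a generic sketch, not patch code): a page's LSN may only be advanced after the buffer holding it has been marked dirty in the same critical section.

static void
stamp_page_lsn(Buffer buf, XLogRecPtr recptr)
{
	/* buffer must already have been dirtied by whoever modified it */
	Assert(BufferIsDirty(buf));
	PageSetLSN(BufferGetPage(buf), recptr);
}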

0001 is RFC but waiting on one other reviewer
0002 - 0007 is a bit of cleanup I had later in the patch set but moved
up because I think it made the intermediate patches better
0008 - 0012 removes the heap page from the XLOG_HEAP2_VISIBLE WAL
chain (it makes all callers of visibilitymap_set() set PD_ALL_VISIBLE
in the same WAL record as changes to the heap page)
0013 - 0018 finish the job eliminating XLOG_HEAP2_VISIBLE and set VM
bits in the same WAL record as the heap changes
0019 - 0024 set the VM on-access

>   /*
> + * Note that the heap relation may have been dropped or truncated, leading
> + * us to skip updating the heap block due to the LSN interlock. However,
> + * even in that case, it's still safe to update the visibility map. Any
> + * WAL record that clears the visibility map bit does so before checking
> + * the page LSN, so any bits that need to be cleared will still be
> + * cleared.
> + *
> + * Note that the lock on the heap page was dropped above. In normal
> + * operation this would never be safe because a concurrent query could
> + * modify the heap page and clear PD_ALL_VISIBLE -- violating the
> + * invariant that PD_ALL_VISIBLE must be set if the corresponding bit in
> + * the VM is set.
> + *
> + * In recovery, we expect no other writers, so writing to the VM page
> + * without holding a lock on the heap page is considered safe enough. It
> + * is done this way when replaying xl_heap_visible records (see
>   */
>
> How many copies of this comment do you plan to end up with?

By the end, one for copy freeze replay and one for prune/freeze/vacuum
replay. I felt two wasn't too bad and was easier than meta-explaining
what the other comment was explaining.

> > > 0004. It is not clear to me why you need to get
> > > log_heap_prune_and_freeze to do the work here. Why can't
> > > log_newpage_buffer get the job done already?
> >
> > Well, I need something to emit the changes to the VM. I'm eliminating
> > all users of xl_heap_visible. Empty pages are the ones that benefit
> > the least from switching from xl_heap_visible -> xl_heap_prune. But,
> > if I don't transition them, we have to maintain all the
> > xl_heap_visible code (including visibilitymap_set() in its long form).
> >
> > As for log_newpage_buffer(), I could keep it if you think it is too
> > confusing to change log_heap_prune_and_freeze()'s API (by passing
> > force_heap_fpi) to handle this case, I can leave log_newpage_buffer()
> > there and then call log_heap_prune_and_freeze().
> >
> > I just thought it seemed simple to avoid emitting the new page record
> > and the VM update record, so why not -- but I don't have strong
> > feelings.
>
> Yeah, I'm not sure what the right thing to do here is. I think I was
> again experiencing brain fade by forgetting that there is a heap page
> and a VM page and, of course, log_heap_newpage() probably isn't going
> to touch the latter. So that makes sense. On the other hand, we could
> only have one type of WAL record for every single operation in the
> system if we gave it enough flags, and force_heap_fpi seems
> suspiciously like a flag that turns this into a whole different kind
> of WAL record.

I've kept log_heap_newpage() and used log_heap_prune_and_freeze() for
setting PD_ALL_VISIBLE and the VM.

> > > 0005. It looks a little curious that you delete the
> > > identify-corruption logic from the end of the if-nest and add it to
> > > the beginning. Ceteris paribus, you'd expect that to be worse, since
> > > corruption is a rare case.
> >
> > On master, the two corruption cases are sandwiched between the normal
> > VM set cases. And I actually think doing it this way is brittle. If
> > you put the cases which set the VM first, you have to have completely
> > bulletproof the if statements guarding them to foreclose any possible
> > corruption case from entering because otherwise you will overwrite the
> > corruption you then try to detect.
>
> Hmm. In the current code, we first test (!all_visible_according_to_vm
> && presult.all_visible), then (all_visible_according_to_vm &&
> !PageIsAllVisible(page) && visibilitymap_get_status(vacrel->rel,
> blkno, &vmbuffer) != 0), and then (presult.lpdead_items > 0 &&
> PageIsAllVisible(page)). The first and second can never coexist,
> because they require opposite values of all_visible_according_to_vm.
> The second and third cannot coexist because they require opposite
> values of PageIsAllVisible(page). It is not entirely obvious that the
> first and third tests couldn't both pass, but you'd have to have
> presult.all_visible and presult.lpdead_items > 0, and it's a bit hard
> to see how heap_page_prune_and_freeze() could ever allow that.
> Consider:
>
>     if (prstate.all_visible && prstate.lpdead_items == 0)
>     {
>         presult->all_visible = prstate.all_visible;
>         presult->all_frozen = prstate.all_frozen;
>     }
>     else
>     {
>         presult->all_visible = false;
>         presult->all_frozen = false;
>     }
> ...
>     presult->lpdead_items = prstate.lpdead_items;
>
> So I don't really think I'm persuaded that the current way is brittle.

I meant brittle because it has to be so carefully coded for it to work
out this way. If you ever wanted to change or enhance it, it's quite
hard to know how to make sure all of them are entirely mutually
exclusive.

> But that having been said, I agree with you that the order of the
> checks is kind of random, and I don't think it really matters that
> much for performance. What does matter is clarity. I feel like what
> I'd ideally like this logic to do is say: do we want the VM bit for
> the page to be set to all-frozen, just all-visible, or neither? Then
> push the VM bit to the correct state, dragging the page-level bit
> along behind. And the current logic sort of does that. It's roughly:
>
> 1. Should we go from not-all-visible to either all-visible or
> all-frozen? If yes, do so.
> 2. Should we go from either all-visible or all-frozen to
> not-all-visible? If yes, do so.
> 3. Should we go from either all-visible or all-frozen to
> not-all-visible for a different reason? If yes, do so.
> 4. Should we go from all-visible to all-frozen? If yes, do so.

I don't necessarily agree that fixing corruption and setting the VM
should be together -- they feel like separate things to me. But, I
don't feel strongly enough about it to push it.

> But what's weird is that all the tests are written differently, and we
> have two different reasons for going to not-all-visible, namely
> PD_ALL_VISIBLE-not-set and dead-items-on-page, whereas there's only
> one test for each of the other state-transitions, because the
> decision-making for those cases is fully completed at an earlier
> stage. I would kind of like to see this expressed in a way that first
> decides which state transition to make (forward-to-all-frozen,
> forward-to-all-visible, backward-to-all-visible,
> backward-to-not-all-visible, nothing) and then does the corresponding
> work. What you're doing instead is splitting half of those functions
> off into a helper function while keeping the other half where they are
> without cleaning up any of the logic. Now, maybe that's OK: I'm far
> from having grokked the whole patch set. But it is not any more clear
> than what we have now, IMHO, and perhaps even a bit less so.

In terms of my patch set, I do have to change something about this
mixture of fixing corruption and setting the VM because I need to set
the VM bits in the same critical section as making the other changes
to the heap page (pruning, etc) and include the VM set changes in the
same WAL record (note that clearing the VM to fix corruption is not
WAL-logged).

What I've gone with is determining what to set the VM bits to and, at
the same time, fixing any corruption. Then, later, when making the
changes to the heap page, I actually set the VM. This is kind of the
opposite of what you suggested above, which was to decide what to set
the bits to all at once, with the corruption and non-corruption cases
handled together. I don't think we can do that, though, because fixing
the corruption involves non-WAL-logged changes to the heap page and VM,
while setting the VM bits is a WAL-logged change. And you can't clear
bits with visibilitymap_set() (there's an assertion about that), so you
have to call different functions (not to mention emit distinct error
messages). I don't know that I've come up with the ideal solution, though.
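
To sketch the ordering I mean (heavily simplified, and with the variable
and helper names invented just for illustration):

    /* Outside the critical section: decide what the VM should say and
     * repair any corruption with non-WAL-logged changes. */
    uint8       vmflags = 0;

    if (corruption_found)
        visibilitymap_clear(rel, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
    else if (all_visible)
    {
        vmflags |= VISIBILITYMAP_ALL_VISIBLE;
        if (all_frozen)
            vmflags |= VISIBILITYMAP_ALL_FROZEN;
    }

    /* Inside the critical section: modify the heap page and, if vmflags
     * is non-zero, set the VM bits so both changes go into the single
     * prune/freeze WAL record. */
    START_CRIT_SECTION();
    /* ... prune/freeze the heap page ... */
    if (vmflags != 0)
    {
        /* ... set vmflags in the VM page ... */
    }
    /* ... emit one WAL record covering the heap page and the VM page ... */
    END_CRIT_SECTION();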

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Andres Freund
Date:
Hi,

On 2025-09-17 20:10:07 -0400, Melanie Plageman wrote:
> 0001 is RFC but waiting on one other reviewer

> From cacff6c95e38d370b87148bc48cf6ac5f086ed07 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 17 Jun 2025 17:22:10 -0400
> Subject: [PATCH v14 01/24] Eliminate COPY FREEZE use of XLOG_HEAP2_VISIBLE
> diff --git a/src/backend/access/heap/heapam_xlog.c b/src/backend/access/heap/heapam_xlog.c
> index cf843277938..faa7c561a8a 100644
> --- a/src/backend/access/heap/heapam_xlog.c
> +++ b/src/backend/access/heap/heapam_xlog.c
> @@ -551,6 +551,7 @@ heap_xlog_multi_insert(XLogReaderState *record)
>      int            i;
>      bool        isinit = (XLogRecGetInfo(record) & XLOG_HEAP_INIT_PAGE) != 0;
>      XLogRedoAction action;
> +    Buffer        vmbuffer = InvalidBuffer;
>
>      /*
>       * Insertion doesn't overwrite MVCC data, so no conflict processing is
> @@ -571,11 +572,11 @@ heap_xlog_multi_insert(XLogReaderState *record)
>      if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
>      {
>          Relation    reln = CreateFakeRelcacheEntry(rlocator);
> -        Buffer        vmbuffer = InvalidBuffer;
>
>          visibilitymap_pin(reln, blkno, &vmbuffer);
>          visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
>          ReleaseBuffer(vmbuffer);
> +        vmbuffer = InvalidBuffer;
>          FreeFakeRelcacheEntry(reln);
>      }
>
> @@ -662,6 +663,57 @@ heap_xlog_multi_insert(XLogReaderState *record)
>      if (BufferIsValid(buffer))
>          UnlockReleaseBuffer(buffer);
>
> +    buffer = InvalidBuffer;
> +
> +    /*
> +     * Now read and update the VM block.
> +     *
> +     * Note that the heap relation may have been dropped or truncated, leading
> +     * us to skip updating the heap block due to the LSN interlock.

I don't fully understand this - how does dropping/truncating the relation lead
to skipping due to the LSN interlock?


> +     * even in that case, it's still safe to update the visibility map. Any
> +     * WAL record that clears the visibility map bit does so before checking
> +     * the page LSN, so any bits that need to be cleared will still be
> +     * cleared.
> +     *
> +     * Note that the lock on the heap page was dropped above. In normal
> +     * operation this would never be safe because a concurrent query could
> +     * modify the heap page and clear PD_ALL_VISIBLE -- violating the
> +     * invariant that PD_ALL_VISIBLE must be set if the corresponding bit in
> +     * the VM is set.
> +     *
> +     * In recovery, we expect no other writers, so writing to the VM page
> +     * without holding a lock on the heap page is considered safe enough. It
> +     * is done this way when replaying xl_heap_visible records (see
> +     * heap_xlog_visible()).
> +     */
> +    if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET &&
> +        XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
> +                                      &vmbuffer) == BLK_NEEDS_REDO)
> +    {

Why are we using RBM_ZERO_ON_ERROR here? I know it's copied from
heap_xlog_visible(), but I don't immediately understand (or remember) why we
do so there either?


> +        Page        vmpage = BufferGetPage(vmbuffer);
> +        Relation    reln = CreateFakeRelcacheEntry(rlocator);

Hm. Do we really need to continue doing this ugly fake relcache stuff? I'd
really like to eventually get rid of that and given that the new "code shape"
delegates a lot more responsibility to the redo routines, they should have a
fairly easy time not needing a fake relcache?  Afaict the relation already is
not used outside of debugging paths?


> +        /* initialize the page if it was read as zeros */
> +        if (PageIsNew(vmpage))
> +            PageInit(vmpage, BLCKSZ, 0);
> +
> +        visibilitymap_set_vmbits(reln, blkno,
> +                                 vmbuffer,
> +                                 VISIBILITYMAP_ALL_VISIBLE |
> +                                 VISIBILITYMAP_ALL_FROZEN);
> +
> +        /*
> +         * It is not possible that the VM was already set for this heap page,
> +         * so the vmbuffer must have been modified and marked dirty.
> +         */

I assume that's because we a) checked the LSN interlock b) are replaying
something that needed to newly set the bit?


Except for the above comments, this looks pretty good to me.


Seems 0002 should just be applied...


Re 0003: I wonder if it's getting to the point that a struct should be used as
the argument.

Greetings,

Andres Freund



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Thu, Sep 18, 2025 at 12:48 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2025-09-17 20:10:07 -0400, Melanie Plageman wrote:
>
> > +     /*
> > +      * Now read and update the VM block.
> > +      *
> > +      * Note that the heap relation may have been dropped or truncated, leading
> > +      * us to skip updating the heap block due to the LSN interlock.
>
> I don't fully understand this - how does dropping/truncating the relation lead
> to skipping due to the LSN interlock?

Yes, this wasn't right. I misunderstood.

What I think it should say is that if the heap update was skipped due
to the LSN interlock, we still have to replay the updates to the VM:
each VM page contains bits for multiple heap blocks, and if the record
included a VM page FPI, subsequent updates to the VM may rely on that
FPI to avoid torn pages. We don't condition it on the heap redo having
been an FPI, probably because it is not worth it -- but I wonder if
that is worth calling out in the comment?
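
In other words, I'd expect the redo routine to be shaped roughly like
this (simplified; the point being that the VM part is not gated on what
happened with the heap block):

    /* heap page: redo may be skipped by the LSN interlock */
    if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, false,
                                      &buffer) == BLK_NEEDS_REDO)
    {
        /* ... redo the heap page change ... */
    }
    if (BufferIsValid(buffer))
        UnlockReleaseBuffer(buffer);

    /*
     * VM page: replayed regardless of what happened to the heap block,
     * since later records for other heap blocks covered by the same VM
     * page may depend on the FPI this record carries for the VM page.
     */
    if (XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
                                      &vmbuffer) == BLK_NEEDS_REDO)
    {
        /* ... set the bits in the VM page ... */
    }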

Do we also need to replay it when the heap redo returns BLK_NOTFOUND?
I assume that can happen when the relation has been dropped or
truncated -- but in that case there wouldn't be subsequent records
updating the VM for other heap blocks that we need to replay, because
the other heap blocks won't be found either, right?

> > +     if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET &&
> > +             XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
> > +                                                                       &vmbuffer) == BLK_NEEDS_REDO)
> > +     {
>
> Why are we using RBM_ZERO_ON_ERROR here? I know it's copied from
> heap_xlog_visible(), but I don't immediately understand (or remember) why we
> do so there either?

It has been RBM_ZERO_ON_ERROR since XLogReadBufferForRedoExtended()
was introduced here in 2c03216d8311.
I think we probably do this because vm_readbuf() passes ReadBuffer()
RBM_ZERO_ON_ERROR and has this comment

     * For reading we use ZERO_ON_ERROR mode, and initialize the page if
     * necessary. It's always safe to clear bits, so it's better to clear
     * corrupt pages than error out.

Do you think I also should have a comment in heap_xlog_multi_insert()?

> > +             Page            vmpage = BufferGetPage(vmbuffer);
> > +             Relation        reln = CreateFakeRelcacheEntry(rlocator);
>
> Hm. Do we really need to continue doing this ugly fake relcache stuff? I'd
> really like to eventually get rid of that and given that the new "code shape"
> delegates a lot more responsibility to the redo routines, they should have a
> fairly easy time not needing a fake relcache?  Afaict the relation already is
> not used outside of debugging paths?

Yes, interestingly we don't have the relname in recovery anyway, so it
does all this fake relcache stuff only to convert the relfilenode to a
string and uses that.

The fake relcache stuff will still be used by visibilitymap_pin()
which callers like heap_xlog_delete() use to get the VM page. And I
don't think it is worth coming up with a version of that that doesn't
use the relcache. But you're right that the Relation is not needed for
visibilitymap_set_vmbits(). I've changed it to just take the relation
name as a string.


> > +             /* initialize the page if it was read as zeros */
> > +             if (PageIsNew(vmpage))
> > +                     PageInit(vmpage, BLCKSZ, 0);
> > +
> > +             visibilitymap_set_vmbits(reln, blkno,
> > +                                                              vmbuffer,
> > +                                                              VISIBILITYMAP_ALL_VISIBLE |
> > +                                                              VISIBILITYMAP_ALL_FROZEN);
> > +
> > +             /*
> > +              * It is not possible that the VM was already set for this heap page,
> > +              * so the vmbuffer must have been modified and marked dirty.
> > +              */
>
> I assume that's because we a) checked the LSN interlock b) are replaying
> something that needed to newly set the bit?

Yes, perhaps it is not worth having the assert since it attracts extra
attention to an invariant that is unlikely to be in danger of
regression.

> Seems 0002 should just be applied...

Done

> Re 0003: I wonder if it's getting to the point that a struct should be used as
> the argument.

I have been thinking about this. I have yet to come up with a good
idea for a struct name or multiple struct names that seem to fit here.
I could move the other output parameters into the PruneFreezeResult
and then maybe make some kind of PruneFreezeParameters struct or
something?
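
Just to make that concrete, the rough shape I've been toying with would
be something like this (field and type names entirely tentative):

    typedef struct PruneFreezeParams
    {
        Relation    relation;       /* relation containing the target page */
        Buffer      buffer;         /* exclusively-locked heap buffer */
        PruneReason reason;         /* why we are pruning (vacuum, on-access) */
        int         options;        /* HEAP_PAGE_PRUNE_* flags */
        GlobalVisState *vistest;    /* visibility horizon to prune against */
        struct VacuumCutoffs *cutoffs;  /* freeze cutoffs; NULL if not freezing */
    } PruneFreezeParams;

with the remaining output parameters folded into PruneFreezeResult.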

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
I find this patch set quite hard to follow. 0001 altogether removes
the use of XLOG_HEAP2_VISIBLE in cases where we use
XLOG_HEAP2_MULTI_INSERT, but then 0007 (the next non-refactoring
patch) begins half-removing the dependency on XLOG_HEAP2_VISIBLE,
assisted by 0009 and 0010, and then later you come back and remove the
other half of the dependency. I know it was I who proposed (off-list)
first making the XLOG_HEAP2_VISIBLE record only deal with the VM page
and not the heap buffer, but I'm not sure that idea quite worked out
in terms of making this easier to follow. At the least, it seems weird
that COPY FREEZE is an exception that gets handled in a different way
than all the other cases, fully removing the dependency in one step.
It would also be nice if each time you repost this, or maybe in a
README that you post along beside the actual patches, you'd include
some kind of roadmap to help the reader understand the internal
structure of the patch set, like 1 does this, 2-9 get us to here,
10-whatever get us to this next place.

I don't really understand how the interlocking works. 0011 changes
visibilitymap_set so that it doesn't take the heap block as an
argument, but we'd better hold a lock on the heap page while setting
the VM bit, otherwise I think somebody could come along and modify the
heap page after we decided it was all-visible and before we actually
set the VM bit, which would be terrible. I would expect the comments
and the commit message to say something about that, but I don't see
that they do.

I find myself fearful of the way that 0007 propagates the existing
hacks around setting the VM bit into a new place:

+               /*
+                * We always emit a WAL record when setting PD_ALL_VISIBLE, but we are
+                * careful not to emit a full page image unless
+                * checksums/wal_log_hints are enabled. We only set the heap page LSN
+                * if full page images were an option when emitting WAL. Otherwise,
+                * subsequent modifications of the page may incorrectly skip emitting
+                * a full page image.
+                */
+               if (do_prune || nplans > 0 ||
+                       (xlrec.flags & XLHP_SET_PD_ALL_VIS && XLogHintBitIsNeeded()))
+                       PageSetLSN(page, lsn);

I suppose it's not the worst thing to duplicate this logic, because
you're later going to remove the original copy. But, it took me >10
minutes to find the text in src/backend/access/transam/README, in the
second half of the "Writing Hints" section, that explains the overall
principle here, and since the patch set doesn't seem to touch that
text, maybe you weren't even aware it was there. And, it's a little
weird to have a single WAL record that is either a hint or not a hint
depending on a complex set of conditions. (IMHO mixing & and &&
without parentheses is quite brave, and an explicit != 0 might not be
a bad idea either.)
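
Just to spell out what I mean, I'd find something along these lines
easier to read (same behavior, only the parenthesization and the
explicit comparison added):

    if (do_prune || nplans > 0 ||
        ((xlrec.flags & XLHP_SET_PD_ALL_VIS) != 0 && XLogHintBitIsNeeded()))
        PageSetLSN(page, lsn);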

Anyway, I kind of wonder if it's time to back out the hack that I
installed here many years ago. At the time, I thought that it would be
bad if a VACUUM swept over the visibility map setting VM bits and as a
result emitted an FPI for every page in the entire heap ... but
everyone who is running with checksums has accepted that cost already,
and with those being the default, that's probably going to be most
people. It would be even more compelling if we were going to freeze,
prune, and set all-visible on access, because then presumably the case
where we touch a page and ONLY set the VM bit would be rare, so the
cost of doing that wouldn't matter much, but I guess the patch doesn't
go that far -- we can freeze or set all-visible on access but not
prune, without which the scenario I was worrying about at the time is
still fairly plausible, I think, if checksums are turned off.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Wed, Sep 24, 2025 at 4:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> I find this patch set quite hard to follow. 0001 altogether removes
> the use of XLOG_HEAP2_VISIBLE in cases where we use
> XLOG_HEAP2_MULTI_INSERT, but then 0007 (the next non-refactoring
> patch) begins half-removing the dependency on XLOG_HEAP2_VISIBLE,
> assisted by 0009 and 0010, and then later you come back and remove the
> other half of the dependency. I know it was I who proposed (off-list)
> first making the XLOG_HEAP2_VISIBLE record only deal with the VM page
> and not the heap buffer, but I'm not sure that idea quite worked out
> in terms of making this easier to follow. At the least, it seems weird
> that COPY FREEZE is an exception that gets handled in a different way
> than all the other cases, fully removing the dependency in one step.
> It would also be nice if each time you repost this, or maybe in a
> README that you post along beside the actual patches, you'd include
> some kind of roadmap to help the reader understand the internal
> structure of the patch set, like 1 does this, 2-9 get us to here,
> 10-whatever get us to this next place.

In attached v16, I’ve reverted to removing XLOG_HEAP2_VISIBLE
entirely, rather than first removing each caller's heap page from the
VM WAL chain. I reordered changes and squashed several refactoring
patches to improve patch-by-patch readability. This should make the
set read differently from earlier versions that removed
XLOG_HEAP2_VISIBLE and had more step-by-step mechanical refactoring.

I think if we plan to go all the way with removing XLOG_HEAP2_VISIBLE,
having intermediate patches that just set PD_ALL_VISIBLE when making
other heap pages are more confusing than helpful. Also, I think having
separate flags for setting PD_ALL_VISIBLE in the WAL record
over-complicated the code.

0001:  remove XLOG_HEAP2_VISIBLE from COPY FREEZE
0002 - 0005: various refactoring in advance of removing
XLOG_HEAP2_VISIBLE in pruning
0006: Pruning and freezing by vacuum sets the VM and emits a single
WAL record with those changes
0007: Reaping (phase III) by vacuum sets the VM and sets line pointers
unused in a single WAL record
0008 - 0009: XLOG_HEAP2_VISIBLE is eliminated
0010 - 0012: preparation for setting VM on-access
0013: set VM on-access
0014: set pd_prune_xid on insert

> I find myself fearful of the way that 0007 propagates the existing
> hacks around setting the VM bit into a new place:
>
> +               /*
> +                * We always emit a WAL record when setting PD_ALL_VISIBLE, but we are
> +                * careful not to emit a full page image unless
> +                * checksums/wal_log_hints are enabled. We only set the heap page LSN
> +                * if full page images were an option when emitting WAL. Otherwise,
> +                * subsequent modifications of the page may incorrectly skip emitting
> +                * a full page image.
> +                */
> +               if (do_prune || nplans > 0 ||
> +                       (xlrec.flags & XLHP_SET_PD_ALL_VIS && XLogHintBitIsNeeded()))
> +                       PageSetLSN(page, lsn);
>
> I suppose it's not the worst thing to duplicate this logic, because
> you're later going to remove the original copy. But, it took me >10
> minutes to find the text in src/backend/access/transam/README, in the
> second half of the "Writing Hints" section, that explains the overall
> principle here, and since the patch set doesn't seem to touch that
> text, maybe you weren't even aware it was there.

I don't think src/backend/access/transam/README needs to change with
my patch. It is still true that if the only change we are making to
the heap page is setting PD_ALL_VISIBLE and checksums/wal_log_hints
are disabled, we explicitly avoid an FPI and thus can't stamp the page
LSN.
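
In code terms, the rule the patch implements boils down to roughly this
(variable names roughly as in the patch, slightly simplified):

    /*
     * An FPI for the heap page is possible only when we prune, freeze, or
     * when checksums/wal_log_hints force one for the PD_ALL_VISIBLE change;
     * only in those cases may we stamp the heap page LSN.
     */
    if (do_prune || nfrozen > 0 ||
        (do_set_vm && XLogHintBitIsNeeded()))
        PageSetLSN(BufferGetPage(buffer), recptr);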

> And, it's a little
> weird to have a single WAL record that is either a hint or not a hint
> depending on a complex set of conditions.

PD_ALL_VISIBLE is different from tuple hints and other page hints
because setting the VM is always WAL-logged, and replaying that record
always sets PD_ALL_VISIBLE, so PD_ALL_VISIBLE is effectively always
WAL-logged. The other hints aren't WAL-logged unless checksums are
enabled and we need an FPI. So PD_ALL_VISIBLE differs from other page
hints in multiple ways. We can't make it more like those hints because
we need to preserve the invariant that the VM bit is never set while
PD_ALL_VISIBLE on the heap page is clear. The only thing we could do
is stop omitting the FPI even when checksums are not enabled.

> Anyway, I kind of wonder if it's time to back out the hack that I
> installed here many years ago. At the time, I thought that it would be
> bad if a VACUUM swept over the visibility map setting VM bits and as a
> result emitted an FPI for every page in the entire heap ... but
> everyone who is running with checksums has accepted that cost already,
> and with those being the default, that's probably going to be most
> people.

I agree that PD_ALL_VISIBLE persistence is complicated, but we have
other special cases that complicate the code for a performance
benefit. I guess the question is if we are saying people shouldn't run
without checksums in production. If that's true, then it's fine to
remove this optimization. Otherwise, I'm not so sure.

I think cloud providers generally have checksums enabled, but I don't
know what is common on-prem.

> It would be even more compelling if we were going to freeze,
> prune, and set all-visible on access, because then presumably the case
> where we touch a page and ONLY set the VM bit would be rare, so the
> cost of doing that wouldn't matter much, but I guess the patch doesn't
> go that far -- we can freeze or set all-visible on access but not
> prune, without which the scenario I was worrying about at the time is
> still fairly plausible, I think, if checksums are turned off.

With the whole set applied, we can prune and set the VM on access but
not freeze. I have a patch to do that, but it introduced noticeable
CPU overhead to prepare the freeze plans. I'd have to spend much more
time studying it to avoid regressing workloads where we don't end up
freezing but prepare the freeze plans during SELECT queries.

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Mon, Oct 6, 2025 at 6:40 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> In attached v16, I’ve reverted to removing XLOG_HEAP2_VISIBLE
> entirely, rather than first removing each caller's heap page from the
> VM WAL chain. I reordered changes and squashed several refactoring
> patches to improve patch-by-patch readability. This should make the
> set read differently from earlier versions that removed
> XLOG_HEAP2_VISIBLE and had more step-by-step mechanical refactoring.
>
> I think if we plan to go all the way with removing XLOG_HEAP2_VISIBLE,
> having intermediate patches that just set PD_ALL_VISIBLE when making
> other heap page changes is more confusing than helpful. Also, I think
> having separate flags for setting PD_ALL_VISIBLE in the WAL record
> over-complicated the code.

I decided to reorder the patches to remove XLOG_HEAP2_VISIBLE from
vacuum phase III before removing it from vacuum phase I because
removing it from phase III doesn't require preliminary refactoring
patches. I've done that in the attached v17.

I've also added an experimental patch at the end that refactors large
chunks of heap_page_prune_and_freeze() into helpers. I got some
feedback off-list that heap_page_prune_and_freeze() is too unwieldy
now. I'm not sure how I feel about the helpers yet, so I haven't
documented them or moved them up in the patch set to before the changes
to heap_page_prune_and_freeze().

0001: Eliminate XLOG_HEAP2_VISIBLE from COPY FREEZE
0002: Eliminate XLOG_HEAP2_VISIBLE from phase III of vacuum
0003 - 0006: cleanup and refactoring to prepare for 0007
0007: Eliminate XLOG_HEAP2_VISIBLE from vacuum prune/freeze
0008 - 0009: Remove XLOG_HEAP2_VISIBLE
0010 - 0012: refactoring to prepare for 0013
0013: Set VM on-access
0014: Set pd_prune_xid on insert
0015: Experimental refactoring of heap_page_prune_and_freeze into helpers

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Andres Freund
Date:
Hi,

On 2025-10-08 18:54:25 -0400, Melanie Plageman wrote:
> +uint8
> +visibilitymap_set_vmbits(BlockNumber heapBlk,
> +                         Buffer vmBuf, uint8 flags,
> +                         const char *heapRelname)
> +{
> +    BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
> +    uint32        mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
> +    uint8        mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
> +    Page        page;
> +    uint8       *map;
> +    uint8        status;
> +
> +#ifdef TRACE_VISIBILITYMAP
> +    elog(DEBUG1, "vm_set flags 0x%02X for %s %d",
> +         flags, heapRelname, heapBlk);
> +#endif

I like that it doesn't take a Relation anymore, but I'd just pass the
SMgrRelation instead; then you don't need to allocate the string in the
caller when it's approximately never used.

Otherwise this looks pretty close to me.



> @@ -71,12 +84,12 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>      }
>  
>      /*
> -     * If we have a full-page image, restore it and we're done.
> +     * If we have a full-page image of the heap block, restore it and we're
> +     * done with the heap block.
>       */
> -    action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> -                                           (xlrec.flags & XLHP_CLEANUP_LOCK) != 0,
> -                                           &buffer);
> -    if (action == BLK_NEEDS_REDO)
> +    if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> +                                      (xlrec.flags & XLHP_CLEANUP_LOCK) != 0,
> +                                      &buffer) == BLK_NEEDS_REDO)
>      {
>          Page        page = BufferGetPage(buffer);
>          OffsetNumber *redirected;

Why move it around this way?


> @@ -138,36 +157,104 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>          /* There should be no more data */
>          Assert((char *) frz_offsets == dataptr + datalen);
>  
> +        if ((vmflags & VISIBILITYMAP_VALID_BITS))
> +            PageSetAllVisible(page);
> +
> +        MarkBufferDirty(buffer);
> +
> +        /*
> +         * Always emit a WAL record when setting PD_ALL_VISIBLE but only emit
> +         * an FPI if checksums/wal_log_hints are enabled.

This comment reads as if we're WAL-logging here, but this is a
Wendy's^Wrecovery.

> Advance the page LSN
> +         * only if the record could include an FPI, since recovery skips
> +         * records <= the stamped LSN. Otherwise it might skip an earlier FPI
> +         * needed to repair a torn page.
> +         */

This is confusing; it should probably just reference the stuff we did in
the !recovery case.


> +        if (do_prune || nplans > 0 ||
> +            ((vmflags & VISIBILITYMAP_VALID_BITS) && XLogHintBitIsNeeded()))
> +            PageSetLSN(page, lsn);
> +
>          /*
>           * Note: we don't worry about updating the page's prunability hints.
>           * At worst this will cause an extra prune cycle to occur soon.
>           */

Not your fault, but that seems odd? Why aren't we just doing the right thing?

>      /*
> -     * If we released any space or line pointers, update the free space map.
> +     * If we released any space or line pointers or set PD_ALL_VISIBLE or the
> +     * VM, update the freespace map.

I'd replace the first "or" with a "," ;)


> +     * Even when no actual space is freed (e.g., when only marking the page
> +     * all-visible or frozen), we still update the FSM. Because the FSM is
> +     * unlogged and maintained heuristically, it often becomes stale on
> +     * standbys. If such a standby is later promoted and runs VACUUM, it will
> +     * skip recalculating free space for pages that were marked all-visible
> +     * (or all-frozen, depending on the mode). FreeSpaceMapVacuum can then
> +     * propagate overly optimistic free space values upward, causing future
> +     * insertions to select pages that turn out to be unusable. In bulk, this
> +     * can lead to long stalls.
> +     *
> +     * To prevent this, always refresh the FSM’s view when a page becomes
> +     * all-visible or all-frozen.

I'd s/refresh/update/, because refresh sounds more like rereading the current
state of the FSM, rather than changing the FSM.


> +        /* We don't have relation name during recovery, so use relfilenode */
> +        relname = psprintf("%u", rlocator.relNumber);
> +        old_vmbits = visibilitymap_set_vmbits(blkno, vmbuffer, vmflags, relname);
>  
> -            XLogRecordPageWithFreeSpace(rlocator, blkno, freespace);
> +        /* Only set VM page LSN if we modified the page */
> +        if (old_vmbits != vmflags)
> +        {
> +            Assert(BufferIsDirty(vmbuffer));
> +            PageSetLSN(BufferGetPage(vmbuffer), lsn);
>          }
> -        else
> -            UnlockReleaseBuffer(buffer);
> +        pfree(relname);

Hm. When can we actually enter the old_vmbits == vmflags case?  It might also
be fine to just say that we don't expect it to change but are mirroring the
code in visibilitymap_set().


I wonder if the VM-specific redo portion should be in a common helper? Might
not be enough code to worry about, though...


> @@ -2070,8 +2079,24 @@ log_heap_prune_and_freeze(Relation relation, Buffer buffer,
>      xlhp_prune_items dead_items;
>      xlhp_prune_items unused_items;
>      OffsetNumber frz_offsets[MaxHeapTuplesPerPage];
> +    bool        do_prune = nredirected > 0 || ndead > 0 || nunused > 0;
> +    bool        do_set_vm = vmflags & VISIBILITYMAP_VALID_BITS;
>  
>      xlrec.flags = 0;
> +    regbuf_flags = REGBUF_STANDARD;
> +
> +    Assert((vmflags & VISIBILITYMAP_VALID_BITS) == vmflags);
> +
> +    /*
> +     * We can avoid an FPI if the only modification we are making to the heap
> +     * page is to set PD_ALL_VISIBLE and checksums/wal_log_hints are disabled.

Maybe s/an FPI/an FPI for the heap page/?


> +     * Note that if we explicitly skip an FPI, we must not set the heap page
> +     * LSN later.
> +     */
> +    if (!do_prune &&
> +        nfrozen == 0 &&
> +        (!do_set_vm || !XLogHintBitIsNeeded()))
> +        regbuf_flags |= REGBUF_NO_IMAGE;

>      /*
>       * Prepare data for the buffer.  The arrays are not actually in the
> @@ -2079,7 +2104,11 @@ log_heap_prune_and_freeze(Relation relation, Buffer buffer,
>       * page image, the arrays can be omitted.
>       */
>      XLogBeginInsert();
> -    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
> +    XLogRegisterBuffer(0, buffer, regbuf_flags);
> +
> +    if (do_set_vm)
> +        XLogRegisterBuffer(1, vmbuffer, 0);

Seems a bit confusing that it's named regbuf_flags but isn't used all the
XLogRegisterBuffer calls. Maybe make the name more specific
(regbuf_flags_heap?)...

>      }
>      recptr = XLogInsert(RM_HEAP2_ID, info);
>  
> -    PageSetLSN(BufferGetPage(buffer), recptr);
> +    if (do_set_vm)
> +    {
> +        Assert(BufferIsDirty(vmbuffer));
> +        PageSetLSN(BufferGetPage(vmbuffer), recptr);
> +    }

> +    /*
> +     * We must bump the page LSN if pruning or freezing. If we are only
> +     * updating PD_ALL_VISIBLE, though, we can skip doing this unless
> +     * wal_log_hints/checksums are enabled. Torn pages are possible if we
> +     * update PD_ALL_VISIBLE without bumping the LSN, but this is deemed okay
> +     * for page hint updates.
> +     */

Arguably it's not a torn page if we only modified something as narrow as a
hint bit, and are redoing that change after recovery. But that's extremely
nitpicky.

I wonder if the comment explaining this should be put in one place and
referenced from all the different places.

> @@ -2860,6 +2867,29 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
>                               VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
>                               InvalidOffsetNumber);
>  
> +    /*
> +     * Before marking dead items unused, check whether the page will become
> +     * all-visible once that change is applied.

So the function is named _would_ but here you say will :)


> This lets us reap the tuples
> +     * and mark the page all-visible within the same critical section,
> +     * enabling both changes to be emitted in a single WAL record. Since the
> +     * visibility checks may perform I/O and allocate memory, they must be
> +     * done outside the critical section.
> +     */
> +    if (heap_page_would_be_all_visible(vacrel, buffer,
> +                                       deadoffsets, num_offsets,
> +                                       &all_frozen, &visibility_cutoff_xid))
> +    {
> +        vmflags |= VISIBILITYMAP_ALL_VISIBLE;
> +        if (all_frozen)
> +        {
> +            vmflags |= VISIBILITYMAP_ALL_FROZEN;
> +            Assert(!TransactionIdIsValid(visibility_cutoff_xid));
> +        }
> +
> +        /* Take the lock on the vmbuffer before entering a critical section */
> +        LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);

It sure would be nice if we had documented the lock order between the heap
page and the corresponding VM page anywhere.  This is just doing what we did
before, so it's not this patch's fault, but I did get worried about it for a
moment.


> +/*
> + * Check whether the heap page in buf is all-visible except for the dead
> + * tuples referenced in the deadoffsets array.
> + *
> + * The visibility checks may perform IO and allocate memory so they must not
> + * be done in a critical section. This function is used by vacuum to determine
> + * if the page will be all-visible once it reaps known dead tuples. That way
> + * it can do both in the same critical section and emit a single WAL record.
> + *
> + * Returns true if the page is all-visible other than the provided
> + * deadoffsets and false otherwise.
> + *
> + * Output parameters:
> + *
> + *  - *all_frozen: true if every tuple on the page is frozen
> + *  - *visibility_cutoff_xid: newest xmin; valid only if page is all-visible
> + * Callers looking to verify that the page is already all-visible can call
> + * heap_page_is_all_visible().
> + *
> + * This logic is closely related to heap_prune_record_unchanged_lp_normal().
> + * If you modify this function, ensure consistency with that code. An
> + * assertion cross-checks that both remain in agreement. Do not introduce new
> + * side-effects.
> + */
> +static bool
> +heap_page_would_be_all_visible(LVRelState *vacrel, Buffer buf,
> +                               OffsetNumber *deadoffsets,
> +                               int ndeadoffsets,
> +                               bool *all_frozen,
> +                               TransactionId *visibility_cutoff_xid)
> +{
>      Page        page = BufferGetPage(buf);
>      BlockNumber blockno = BufferGetBlockNumber(buf);
>      OffsetNumber offnum,
>                  maxoff;
>      bool        all_visible = true;
> +    int            matched_dead_count = 0;
>  
>      *visibility_cutoff_xid = InvalidTransactionId;
>      *all_frozen = true;
>  
> +    Assert(ndeadoffsets == 0 || deadoffsets);
> +
> +#ifdef USE_ASSERT_CHECKING
> +    /* Confirm input deadoffsets[] is strictly sorted */
> +    if (ndeadoffsets > 1)
> +    {
> +        for (int i = 1; i < ndeadoffsets; i++)
> +            Assert(deadoffsets[i - 1] < deadoffsets[i]);
> +    }
> +#endif
> +
>      maxoff = PageGetMaxOffsetNumber(page);
>      for (offnum = FirstOffsetNumber;
>           offnum <= maxoff && all_visible;
> @@ -3649,9 +3712,15 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
>           */
>          if (ItemIdIsDead(itemid))
>          {
> -            all_visible = false;
> -            *all_frozen = false;
> -            break;
> +            if (!deadoffsets ||
> +                matched_dead_count >= ndeadoffsets ||
> +                deadoffsets[matched_dead_count] != offnum)
> +            {
> +                *all_frozen = all_visible = false;
> +                break;
> +            }
> +            matched_dead_count++;
> +            continue;
>          }
>  
>          Assert(ItemIdIsNormal(itemid));

Hm, what about an assert checking that matched_dead_count == ndeadoffsets at
the end?


> From 6b5fc27f0d80bab1df86a2e6fb51b64fd20c3cbb Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 15 Sep 2025 12:06:19 -0400
> Subject: [PATCH v17 03/15] Assorted trivial heap_page_prune_and_freeze cleanup

Seems like a good idea, but I'm too lazy to go through this in detail.


> From c69a5219a9b792f3c9f6dc730b8810a88d088ae6 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 16 Sep 2025 14:22:10 -0400
> Subject: [PATCH v17 04/15] Add helper for freeze determination to
>  heap_page_prune_and_freeze
> 
> After scanning through the line pointers on the heap page during
> vacuum's first phase, we use several statuses and information we
> collected to determine whether or not we will use the freeze plans we
> assembled.
> 
> Do this in a helper for better readability.


> @@ -663,85 +775,11 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
>       * Decide if we want to go ahead with freezing according to the freeze
>       * plans we prepared, or not.
>       */
> -    do_freeze = false;
> - ...
> +    do_freeze = heap_page_will_freeze(params->relation, buffer,
> +                                      did_tuple_hint_fpi,
> +                                      do_prune,
> +                                      do_hint_prune,
> +                                      &prstate);
>  

Assuming this is just moving the code, I like this quite a bit.


> From d4a4be3eed25853fc1ea84ebc2cbe0226afd823a Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 15 Sep 2025 16:25:44 -0400
> Subject: [PATCH v17 05/15] Update PruneState.all_[visible|frozen] earlier in
>  pruning
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> In the prune/freeze path, we currently delay clearing all_visible and
> all_frozen when dead items are present. This allows opportunistic
> freezing if the page would otherwise be fully frozen, since those dead
> items are later removed in vacuum’s third phase.
> 
> However, if no freezing will be attempted, there’s no need to delay.
> Clearing the flags promptly avoids extra bookkeeping in
> heap_prune_unchanged_lp_normal(). At present this has no runtime effect
> because all callers that consider setting the VM also attempt freezing,
> but future callers (e.g. on-access pruning) may want to set the VM
> without preparing freeze plans.

s/heap_prune_unchanged_lp_normal/heap_prune_record_unchanged_lp_normal/

I think this should make it clearer that this is about reducing overhead for
future use of this code in on-access pruning.


> We also used to defer clearing all_visible and all_frozen until after
> computing the visibility cutoff XID. By determining the cutoff earlier,
> we can update these flags immediately after deciding whether to
> opportunistically freeze. This is necessary if we want to set the VM in
> the same WAL record that prunes and freezes tuples on the page.

I think this last sentence needs to be first. This is the only really
important thing in this patch, afaict.



> From 86193a71d2ff9649b5b1c1e6963bd610285ad369 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 3 Oct 2025 15:57:02 -0400
> Subject: [PATCH v17 06/15] Make heap_page_is_all_visible independent of
>  LVRelState
> 
> Future commits will use this function inside of pruneheap.c where we do
> not have access to the LVRelState. We only need a few parameters from
> the LVRelState, so just pass those in explicitly.
> 
> Author: Melanie Plageman <melanieplageman@gmail.com>
> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
> Reviewed-by: Robert Haas <robertmhaas@gmail.com>
> Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com

Makes sense. I don't think we need to wait for other stuff to be committed to
commit this.


> From dde0dfc578137f7c93f9a0e34af38dcdb841b080 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 8 Oct 2025 15:39:01 -0400
> Subject: [PATCH v17 07/15] Eliminate XLOG_HEAP2_VISIBLE from vacuum
>  prune/freeze
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit

Seems very mildly odd that 0002 references phase III in the subject, but this
doesn't...

(I'm just very lightly skimming from this point on)


> During vacuum's first and third phases, we examine tuples' visibility
> to determine if we can set the page all-visible in the visibility map.
> 
> Previously, this check compared tuple xmins against a single XID chosen at
> the start of vacuum (OldestXmin). We now use GlobalVisState, which also
> enables future work to set the VM during on-access pruning, since ordinary
> queries have access to GlobalVisState but not OldestXmin.
> 
> This also benefits vacuum directly: GlobalVisState may advance
> during a vacuum, allowing more pages to become considered all-visible.
> In the rare case that it moves backward, VACUUM falls back to OldestXmin
> to ensure we don’t attempt to freeze a dead tuple that wasn’t yet
> prunable according to the GlobalVisState.

It could, but it currently won't advance in vacuum, right?


> From e412f9298b0735d1091f4769ace4d2d1a7e62312 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 29 Jul 2025 09:57:13 -0400
> Subject: [PATCH v17 12/15] Inline TransactionIdFollows/Precedes()
> 
> Calling these from on-access pruning code had noticeable overhead in a
> profile. There does not seem to be a reason not to inline them.

Makes sense, just commit this ahead of the more complicated rest.



> From 54fcba140e515eba0eb1f9d48e7d5875b92e7e39 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 29 Jul 2025 14:34:30 -0400
> Subject: [PATCH v17 13/15] Allow on-access pruning to set pages all-visible

Sorry, will have to look at this another time...

Greetings,

Andres Freund



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Kirill Reshke
Date:
On Thu, 9 Oct 2025 at 03:54, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Mon, Oct 6, 2025 at 6:40 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > In attached v16, I’ve reverted to removing XLOG_HEAP2_VISIBLE
> > entirely, rather than first removing each caller's heap page from the
> > VM WAL chain. I reordered changes and squashed several refactoring
> > patches to improve patch-by-patch readability. This should make the
> > set read differently from earlier versions that removed
> > XLOG_HEAP2_VISIBLE and had more step-by-step mechanical refactoring.
> >
> > I think if we plan to go all the way with removing XLOG_HEAP2_VISIBLE,
> > having intermediate patches that just set PD_ALL_VISIBLE when making
> > other heap page changes is more confusing than helpful. Also, I think
> > having separate flags for setting PD_ALL_VISIBLE in the WAL record
> > over-complicated the code.
>
> I decided to reorder the patches to remove XLOG_HEAP2_VISIBLE from
> vacuum phase III before removing it from vacuum phase I because
> removing it from phase III doesn't require preliminary refactoring
> patches. I've done that in the attached v17.
>
> I've also added an experimental patch on the end that refactors large
> chunks of heap_page_prune_and_freeze() into helpers. I got some
> feedback off-list that heap_page_prune_and_freeze() is too unwieldy
> now. I'm not sure how I feel about them yet, so I haven't documented
> them or moved them up in the patch set to before changes to
> heap_page_prune_and_freeze().
>
> 0001: Eliminate XLOG_HEAP2_VISIBLE from COPY FREEZE
> 0002: Eliminate XLOG_HEAP2_VISIBLE from phase III of vacuum
> 0003 - 0006: cleanup and refactoring to prepare for 0007
> 0007: Eliminate XLOG_HEAP2_VISIBLE from vacuum prune/freeze
> 0008 - 0009: Remove XLOG_HEAP2_VISIBLE
> 0010 - 0012: refactoring to prepare for 0013
> 0013: Set VM on-access
> 0014: Set pd_prune_xid on insert
> 0015: Experimental refactoring of heap_page_prune_and_freeze into helpers
>
> - Melanie

Hi! Should we also bump XLOG_PAGE_MAGIC after d96f87332 & add323da40a,
or do we wait for the full set to be committed?
--
Best regards,
Kirill Reshke



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Michael Paquier
Date:
On Tue, Oct 14, 2025 at 08:31:04AM +0500, Kirill Reshke wrote:
> Hi! Should we also bump XLOG_PAGE_MAGIC after d96f87332 & add323da40a
> or do we wait for full set to be committed?

I may be missing something, of course, but d96f87332 has not changed
the WAL format; VISIBILITYMAP_ALL_VISIBLE and VISIBILITYMAP_ALL_FROZEN
existed before that.  The change in xl_heap_prune done in
add323da40a6 should have bumped the format number.
--
Michael

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Mon, Oct 13, 2025 at 11:43 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Oct 14, 2025 at 08:31:04AM +0500, Kirill Reshke wrote:
> > Hi! Should we also bump XLOG_PAGE_MAGIC after d96f87332 & add323da40a
> > or do we wait for full set to be committed?
>
> I may be missing something, of course, but d96f87332 has not changed
> the WAL format, VISIBILITYMAP_ALL_VISIBLE and VISIBILITYMAP_ALL_FROZEN
> existing before that.  The change in xl_heap_prune as done in
> add323da40a6 should have bumped the format number.

Oops! Thanks for reporting.

I messed up and forgot to do this. And, if I'm not misunderstanding
the criteria, I did the same thing at the beginning of September with
4b5f206de2bb. I've committed the bump. Hopefully I learned my lesson.

- Melanie



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
Thanks so much for the review! I've addressed all your feedback except
what is commented on inline below.
I've gone ahead and committed the preliminary patches that you thought
were ready to commit.

Attached v18 is what remains.

0001 - 0003: refactoring
0004 - 0006: finish eliminating XLOG_HEAP2_VISIBLE
0007 - 0009: refactoring
0010: Set VM on-access
0011: Set prune xid on insert
0012: Some refactoring for discussion

For 0001, I got feedback that heap_page_prune_and_freeze() has too many
arguments, so I tried to address that. I'm interested to know whether
folks like this better.

0011 still needs a bit of investigation to understand fully if
anything else in the index-killtuples test needs to be changed to make
sure we have the same coverage.

0012 is sort of a WIP. I got feedback that heap_page_prune_and_freeze()
was too long and should be split up into helpers. I want to know if this
split makes sense. I can pull it down the patch stack if so.

Only 0001 and 0012 are optional amongst the refactoring patches. The
others are required to make on-access VM-setting possible or viable.

On Thu, Oct 9, 2025 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
>
> > @@ -71,12 +84,12 @@ heap_xlog_prune_freeze(XLogReaderState *record)
> >       }
> > -     action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> > -                                                                                (xlrec.flags & XLHP_CLEANUP_LOCK) != 0,
> > -                                                                                &buffer);
> > -     if (action == BLK_NEEDS_REDO)
> > +     if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> > +                                                                       (xlrec.flags & XLHP_CLEANUP_LOCK) != 0,
> > +                                                                       &buffer) == BLK_NEEDS_REDO)
> >       {
> >               Page            page = BufferGetPage(buffer);
> >               OffsetNumber *redirected;
>
> Why move it around this way?

Because there will also be an action returned by the visibility map's
XLogReadBufferForRedoExtended() call. I could have renamed the existing
variable heap_action, but it is used in only one place, so I preferred
to just cut it to avoid any confusion.

> > Advance the page LSN
> > +              * only if the record could include an FPI, since recovery skips
> > +              * records <= the stamped LSN. Otherwise it might skip an earlier FPI
> > +              * needed to repair a torn page.
> > +              */
>
> This is confusing, should probably just reference the stuff we did in the
> !recovery case.

I fixed this and addressed all your feedback related to this before committing.

> > +             if (do_prune || nplans > 0 ||
> > +                     ((vmflags & VISIBILITYMAP_VALID_BITS) && XLogHintBitIsNeeded()))
> > +                     PageSetLSN(page, lsn);
> > +
> >               /*
> >                * Note: we don't worry about updating the page's prunability hints.
> >                * At worst this will cause an extra prune cycle to occur soon.
> >                */
>
> Not your fault, but that seems odd? Why aren't we just doing the right thing?

The comment dates back to 6f10eb2. I imagine no one ever bothered to
fuss with extracting the XID. You could change
heap_page_prune_execute() to return the right value -- though that's a
bit ugly since it is used in normal operation as well as recovery.

> I wonder if the VM specific redo portion should be in a common helper? Might
> not be enough code to worry though...

I think it might be more code as a helper at this point.

> > @@ -2860,6 +2867,29 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
> >                                                        VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
> >                                                        InvalidOffsetNumber);
> >
> > +     /*
> > +      * Before marking dead items unused, check whether the page will become
> > +      * all-visible once that change is applied.
>
> So the function is named _would_ but here you say will :)

I thought about it more and still feel that this function name should
contain "would". From vacuum's perspective it is "will" -- because it
knows it will remove those dead items, but from the function's
perspective it is hypothetical. I changed the comment though.

> > +     if (heap_page_would_be_all_visible(vacrel, buffer,
> > +                                                                        deadoffsets, num_offsets,
> > +                                                                        &all_frozen, &visibility_cutoff_xid))
> > +     {
> > +             vmflags |= VISIBILITYMAP_ALL_VISIBLE;
> > +             if (all_frozen)
> > +             {
> > +                     vmflags |= VISIBILITYMAP_ALL_FROZEN;
> > +                     Assert(!TransactionIdIsValid(visibility_cutoff_xid));
> > +             }
> > +
> > +             /* Take the lock on the vmbuffer before entering a critical section */
> > +             LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
>
> It sure would be nice if we had documented the lock order between the heap
> page and the corresponding VM page anywhere.  This is just doing what we did
> before, so it's not this patch's fault, but I did get worried about it for a
> moment.

Well, the comment above the visibilitymap_set* functions says what
expectations they have for the heap page being locked.

> > +static bool
> > +heap_page_would_be_all_visible(LVRelState *vacrel, Buffer buf,
> > +                                                        OffsetNumber *deadoffsets,
> > +                                                        int ndeadoffsets,
> > +                                                        bool *all_frozen,
> > +                                                        TransactionId *visibility_cutoff_xid)
> > +{
> >       Page            page = BufferGetPage(buf);

> Hm, what about an assert checking that matched_dead_count == ndeadoffsets at
> the end?

I was going to put an Assert(ndeadoffsets <= matched_dead_count), but
then I started wondering if there is a way we could end up with fewer
dead items than we collected during phase I.

I had thought about the case where we drop an index and then do
on-access pruning -- but we don't allow setting LP_DEAD items LP_UNUSED
during on-access pruning. So, maybe this is safe... I can do a
follow-on commit to add the assert. But I'm just not 100% sure I've
thought of all the cases where we might end up with fewer dead items.
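
If it does turn out to be safe, the follow-on would just be something
along these lines:

    /* every dead offset the caller passed in should have been matched */
    Assert(matched_dead_count == ndeadoffsets);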

> > During vacuum's first and third phases, we examine tuples' visibility
> > to determine if we can set the page all-visible in the visibility map.
> >
> > Previously, this check compared tuple xmins against a single XID chosen at
> > the start of vacuum (OldestXmin). We now use GlobalVisState, which also
> > enables future work to set the VM during on-access pruning, since ordinary
> > queries have access to GlobalVisState but not OldestXmin.
> >
> > This also benefits vacuum directly: GlobalVisState may advance
> > during a vacuum, allowing more pages to become considered all-visible.
> > In the rare case that it moves backward, VACUUM falls back to OldestXmin
> > to ensure we don’t attempt to freeze a dead tuple that wasn’t yet
> > prunable according to the GlobalVisState.
>
> It could, but it currently won't advance in vacuum, right?

I thought it was possible for it to advance when calling
heap_prune_satisfies_vacuum() ->
GlobalVisTestIsRemovableXid()->...GlobalVisUpdate(). This case isn't
going to be common, but some things can cause us to update it.

We have talked about explicitly updating GlobalVisState more often
during vacuums of large tables. But I was under the impression that it
was at least possible for it to advance during vacuum now.

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Kirill Reshke
Date:
On Wed, 15 Oct 2025 at 04:27, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Thanks so much for the review! I've addressed all your feedback except
> what is commented on inline below.
> I've gone ahead and committed the preliminary patches that you thought
> were ready to commit.
>
> Attached v18 is what remains.
>
> 0001 - 0003: refactoring
> 0004 - 0006: finish eliminating XLOG_HEAP2_VISIBLE
> 0007 - 0009: refactoring
> 0010: Set VM on-access
> 0011: Set prune xid on insert
> 0012: Some refactoring for discussion
>
> For 0001, I got feedback heap_page_prune_and_freeze() has too many
> arguments, so I tried to address that. I'm interested to know if folks
> like this more.
>
> 0011 still needs a bit of investigation to understand fully if
> anything else in the index-killtuples test needs to be changed to make
> sure we have the same coverage.
>
> 0012 is sort of WIP. I got feedback heap_page_prune_and_freeze() was
> too long and should be split up into helpers. I want to know if this
> split makes sense. I can pull it down the patch stack if so.
>
> Only 0001 and 0012 are optional amongst the refactoring patches. The
> others are required to make on-access VM-setting possible or viable.
>
> On Thu, Oct 9, 2025 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > > @@ -71,12 +84,12 @@ heap_xlog_prune_freeze(XLogReaderState *record)
> > >       }
> > > -     action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> > > -                                                                                (xlrec.flags & XLHP_CLEANUP_LOCK) != 0,
> > > -                                                                                &buffer);
> > > -     if (action == BLK_NEEDS_REDO)
> > > +     if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> > > +                                                                       (xlrec.flags & XLHP_CLEANUP_LOCK) != 0,
> > > +                                                                       &buffer) == BLK_NEEDS_REDO)
> > >       {
> > >               Page            page = BufferGetPage(buffer);
> > >               OffsetNumber *redirected;
> >
> > Why move it around this way?
>
> Because there will be an action for the visibility map
> XLogReadBufferForRedoExtended(). I could have renamed it heap_action,
> but it is being used only in one place, so I preferred to just cut it
> to avoid any confusion.
>
> > > Advance the page LSN
> > > +              * only if the record could include an FPI, since recovery skips
> > > +              * records <= the stamped LSN. Otherwise it might skip an earlier FPI
> > > +              * needed to repair a torn page.
> > > +              */
> >
> > This is confusing, should probably just reference the stuff we did in the
> > !recovery case.
>
> I fixed this and addressed all your feedback related to this before committing.
>
> > > +             if (do_prune || nplans > 0 ||
> > > +                     ((vmflags & VISIBILITYMAP_VALID_BITS) && XLogHintBitIsNeeded()))
> > > +                     PageSetLSN(page, lsn);
> > > +
> > >               /*
> > >                * Note: we don't worry about updating the page's prunability hints.
> > >                * At worst this will cause an extra prune cycle to occur soon.
> > >                */
> >
> > Not your fault, but that seems odd? Why aren't we just doing the right thing?
>
> The comment dates back to 6f10eb2. I imagine no one ever bothered to
> fuss with extracting the XID. You could change
> heap_page_prune_execute() to return the right value -- though that's a
> bit ugly since it is used in normal operation as well as recovery.
>
> > I wonder if the VM specific redo portion should be in a common helper? Might
> > not be enough code to worry though...
>
> I think it might be more code as a helper at this point.
>
> > > @@ -2860,6 +2867,29 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
> > >                                                        VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
> > >                                                        InvalidOffsetNumber);
> > >
> > > +     /*
> > > +      * Before marking dead items unused, check whether the page will become
> > > +      * all-visible once that change is applied.
> >
> > So the function is named _would_ but here you say will :)
>
> I thought about it more and still feel that this function name should
> contain "would". From vacuum's perspective it is "will" -- because it
> knows it will remove those dead items, but from the function's
> perspective it is hypothetical. I changed the comment though.
>
> > > +     if (heap_page_would_be_all_visible(vacrel, buffer,
> > > +                                                                        deadoffsets, num_offsets,
> > > +                                                                        &all_frozen, &visibility_cutoff_xid))
> > > +     {
> > > +             vmflags |= VISIBILITYMAP_ALL_VISIBLE;
> > > +             if (all_frozen)
> > > +             {
> > > +                     vmflags |= VISIBILITYMAP_ALL_FROZEN;
> > > +                     Assert(!TransactionIdIsValid(visibility_cutoff_xid));
> > > +             }
> > > +
> > > +             /* Take the lock on the vmbuffer before entering a critical section */
> > > +             LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
> >
> > It sure would be nice if we had documented the lock order between the heap
> > page and the corresponding VM page anywhere.  This is just doing what we did
> > before, so it's not this patch's fault, but I did get worried about it for a
> > moment.
>
> Well, the comment above the visibilitymap_set* functions says what
> expectations they have for the heap page being locked.
>
> > > +static bool
> > > +heap_page_would_be_all_visible(LVRelState *vacrel, Buffer buf,
> > > +                                                        OffsetNumber *deadoffsets,
> > > +                                                        int ndeadoffsets,
> > > +                                                        bool *all_frozen,
> > > +                                                        TransactionId *visibility_cutoff_xid)
> > > +{
> > >       Page            page = BufferGetPage(buf);
>
> > Hm, what about an assert checking that matched_dead_count == ndeadoffsets at
> > the end?
>
> I was going to put an Assert(ndeadoffsets <= matched_dead_count), but
> then I started wondering if there is a way we could end up with fewer
> dead items than we collected during phase I.
>
> I had thought about if we dropped an index and then did on-access
> pruning -- but we don't allow setting LP_DEAD items LP_UNUSED in
> on-access pruning. So, maybe this is safe... I can do a follow-on
> commit to add the assert. But I'm just not 100% sure I've thought of
> all the cases where we might end up with fewer dead items.
>
> > > During vacuum's first and third phases, we examine tuples' visibility
> > > to determine if we can set the page all-visible in the visibility map.
> > >
> > > Previously, this check compared tuple xmins against a single XID chosen at
> > > the start of vacuum (OldestXmin). We now use GlobalVisState, which also
> > > enables future work to set the VM during on-access pruning, since ordinary
> > > queries have access to GlobalVisState but not OldestXmin.
> > >
> > > This also benefits vacuum directly: GlobalVisState may advance
> > > during a vacuum, allowing more pages to become considered all-visible.
> > > In the rare case that it moves backward, VACUUM falls back to OldestXmin
> > > to ensure we don’t attempt to freeze a dead tuple that wasn’t yet
> > > prunable according to the GlobalVisState.
> >
> > It could, but it currently won't advance in vacuum, right?
>
> I thought it was possible for it to advance when calling
> heap_prune_satisfies_vacuum() ->
> GlobalVisTestIsRemovableXid()->...GlobalVisUpdate(). This case isn't
> going to be common, but some things can cause us to update it.
>
> We have talked about explicitly updating GlobalVisState more often
> during vacuums of large tables. But I was under the impression that it
> was at least possible for it to advance during vacuum now.
>
> - Melanie


Hi!

First of all, I rechecked the v18 patches; they still reduce WAL volume.
In the no-index vacuum case my result is a 39% reduction in WAL bytes,
much like the numbers in your first message.

Here are my comments on the code. I may be nitpicking about minor
details; sorry for that.

In 0003:

The logic of the get_conflict_xid function is a bit strange to me: it
assigns conflict_xid to some value, but at the very end we have

> + /*
> +  * We can omit the snapshot conflict horizon if we are not pruning or
> +  * freezing any tuples and are setting an already all-visible page
> +  * all-frozen in the VM. In this case, all of the tuples on the page must
> +  * already be visible to all MVCC snapshots on the standby.
> +  */
> + if (!do_prune && !do_freeze &&
> +     do_set_vm && blk_already_av && set_blk_all_frozen)
> +     conflict_xid = InvalidTransactionId;

I feel like we should move this check to the beginning of the function
and just return InvalidTransactionId in that if condition.
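
Roughly what I have in mind, as a sketch only -- the parameter list and
names below are illustrative, not the actual signature from your patch:

    static TransactionId
    get_conflict_xid(bool do_prune, bool do_freeze, bool do_set_vm,
                     bool blk_already_av, bool set_blk_all_frozen)
    {
        TransactionId conflict_xid = InvalidTransactionId;

        /*
         * Nothing can conflict on the standby: we are only marking an
         * already all-visible page all-frozen, with no pruning or
         * freezing, so every tuple on the page is already visible to
         * all MVCC snapshots there.
         */
        if (!do_prune && !do_freeze &&
            do_set_vm && blk_already_av && set_blk_all_frozen)
            return InvalidTransactionId;

        /* ... the existing logic that computes conflict_xid goes here ... */

        return conflict_xid;
    }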

In 0004:

> + if (old_vmbits == new_vmbits)
> + {
> +     LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
> +     /* Unset so we don't emit WAL since no change occurred */
> +     do_set_vm = false;
> + }

and then

>  END_CRIT_SECTION();
> + if (do_set_vm)
> +     LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
> +

So, in the heap_page_prune_and_freeze function we release the buffer
lock both inside and outside the critical section. As I understand it,
this is actually safe. I also looked at the xlog coding practices of
other access methods (GiST, GIN, ...), and I can see that some of them
release buffers before leaving the critical section and some after. But
I still suggest staying in sync with the 'Write-Ahead Log Coding'
section of src/backend/access/transam/README, which says:

6. END_CRIT_SECTION()

7. Unlock and unpin the buffer(s).

Let's be consistent about this, at least within this single function.
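
For reference, that ordering looks roughly like this in code. This is a
generic sketch of the README's steps (buffer and record details are
illustrative), not the patch's actual code:

    START_CRIT_SECTION();

    /* ... apply the changes to the pinned, exclusively locked buffer ... */
    MarkBufferDirty(buffer);

    if (RelationNeedsWAL(relation))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
        /* ... XLogRegisterData() for the record-specific payload ... */
        recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_VACUUM_SCAN);
        PageSetLSN(BufferGetPage(buffer), recptr);
    }

    END_CRIT_SECTION();

    /* Only after leaving the critical section: unlock and unpin. */
    UnlockReleaseBuffer(buffer);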


In 0010:

I'm not terribly convinced that adding SO_ALLOW_VM_SET to the TAM
ScanOptions is the right thing to do. VM bits look like something that
makes sense for the heap AM but not for every TAM. So, don't we break a
layer of abstraction here? Would it be better for the heap AM to set
some flag in heap_beginscan?


Overall, 0001-0003 are mostly fine with me, and 0004-0006 are the right
thing to do IMHO, but maybe they need some more review from hackers. I
did not review the other patches in great detail; I will return to them
later.



--
Best regards,
Kirill Reshke



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От
Melanie Plageman
Дата:
Thanks for the review!

On Wed, Oct 29, 2025 at 7:03 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> The logic of the get_conflict_xid function is a bit strange to me: it
> assigns conflict_xid to some value, but at the very end we have
>
> > + /*
> > +  * We can omit the snapshot conflict horizon if we are not pruning or
> > +  * freezing any tuples and are setting an already all-visible page
> > +  * all-frozen in the VM. In this case, all of the tuples on the page must
> > +  * already be visible to all MVCC snapshots on the standby.
> > +  */
> > + if (!do_prune && !do_freeze &&
> > +     do_set_vm && blk_already_av && set_blk_all_frozen)
> > +     conflict_xid = InvalidTransactionId;
>
> I feel like we should move this check to the beginning of the function
> and just return InvalidTransactionId in that if condition.

You're right. I've changed it as you suggest in the attached v19.

> > + if (old_vmbits == new_vmbits)
> > + {
> > +     LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
> > +     /* Unset so we don't emit WAL since no change occurred */
> > +     do_set_vm = false;
> > + }
>
> and then
>
> >  END_CRIT_SECTION();
> > + if (do_set_vm)
> > +     LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
> > +
>
> So, in the heap_page_prune_and_freeze function we release the buffer
> lock both inside and outside the critical section. As I understand it,
> this is actually safe. I also looked at the xlog coding practices of
> other access methods (GiST, GIN, ...), and I can see that some of them
> release buffers before leaving the critical section and some after. But
> I still suggest staying in sync with the 'Write-Ahead Log Coding'
> section of src/backend/access/transam/README, which says:
>
> 6. END_CRIT_SECTION()
>
> 7. Unlock and unpin the buffer(s).
>
> Let's be consistent about this, at least within this single function.

I see what you are saying. However, I don't see a good way to
determine whether or not we need to unlock the VM without introducing
another local variable in the outermost scope -- like "unlock_vm".
This function already has a lot of local variables, so I'm loath to do
that. And we want do_set_vm to reflect whether or not we actually set
it in case it gets used in the future.

This function doesn't lock or unlock the heap buffer, so it doesn't seem
as urgent to me to follow the letter of the law in this case.

The attached patch doesn't have this change, but this is what it would
look like:

    /* Lock vmbuffer before entering a critical section */
+   unlock_vm = do_set_vm;
    if (do_set_vm)
        LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);

@@ -1112,12 +1114,9 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
            old_vmbits = visibilitymap_set(blockno,
                                           vmbuffer, new_vmbits,
                                           params->relation->rd_locator);
-           if (old_vmbits == new_vmbits)
-           {
-               LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
-               /* Unset so we don't emit WAL since no change occurred */
-               do_set_vm = false;
-           }
+
+           /* Unset so we don't emit WAL since no change occurred */
+           do_set_vm = old_vmbits != new_vmbits;
        }

        /*
@@ -1145,7 +1144,7 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,

    END_CRIT_SECTION();

-   if (do_set_vm)
+   if (unlock_vm)
        LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);

> In 0010:
>
> I'm not terribly convinced that adding SO_ALLOW_VM_SET to the TAM
> ScanOptions is the right thing to do. VM bits look like something that
> makes sense for the heap AM but not for every TAM. So, don't we break a
> layer of abstraction here? Would it be better for the heap AM to set
> some flag in heap_beginscan?

I don't see another good way of doing it.

The information about whether or not the relation is modified in the
query is gathered during planning and saved in the plan. We need to
get that information to the scan descriptor, which is all we have when
we call heap_page_prune_opt() during the scan. The scan descriptor is
created by the table AM implementations of scan_begin(). The table AM
callbacks don't pass down the plan -- which makes sense; the scan
shouldn't know about the plan. They do pass down flags, so I thought
it made the most sense to add a flag. Note that I was able to avoid
modifying the actual table and index AM callbacks (scan_begin() and
ambeginscan()). I only made new wrappers that took "modifies_rel".
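
To illustrate, such a wrapper has roughly the same shape as the existing
table_beginscan() helper in tableam.h; the wrapper name below is made up
and the exact flag handling is only a sketch of what the patch does:

    /*
     * Sketch: begin a scan, allowing on-access VM setting only when the
     * query does not modify the relation (SO_ALLOW_VM_SET is the flag
     * added by 0010).
     */
    static inline TableScanDesc
    table_beginscan_with_vm_opt(Relation rel, Snapshot snapshot,
                                int nkeys, struct ScanKeyData *key,
                                bool modifies_rel)
    {
        uint32      flags = SO_TYPE_SEQSCAN |
            SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;

        if (!modifies_rel)
            flags |= SO_ALLOW_VM_SET;

        return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key,
                                           NULL, flags);
    }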

Now, it is true that referring to the VM is somewhat of a layering
violation. However, other table AMs may use the information about
whether the query modifies the relation -- which is really what this
flag represents. The ScanOptions are usually either a type or a call to
action, which is why I felt a bit uncomfortable calling it something
like SO_MODIFIES_REL -- that is less of an option and more a piece of
information, and it makes it sound like the scan modifies the rel, which
is not the case. I wonder if there is another solution. Or maybe we call
it SO_QUERY_MODIFIES_REL?

- Melanie

Вложения