[Proposal] pg_stat_wal_records – per-record-type WAL generation statistics
From:
SATYANARAYANA NARLAPURAM
Date:
Hi Hackers,
I'd like to propose a new system view, pg_stat_wal_records, that exposes per-resource-manager, per-record-type WAL generation counts.
Sample Output:
postgres=# SELECT * FROM pg_stat_wal_records ORDER BY count DESC LIMIT 10;
resource_manager | record_type | count | stats_reset
------------------+----------------+--------+-------------------------------
Heap | INSERT | 500000 | 2026-03-26 22:15:00.12345+00
Transaction | COMMIT | 500000 |
Btree | INSERT_LEAF | 53821 |
Heap | HOT_UPDATE | 12744 |
XLOG | FPI | 8923 |
The Gap:
Postgres already has pg_stat_wal for aggregate WAL volume (bytes, full-page images, buffers), and pg_walinspect (superuser access required) for post-hoc forensic analysis of individual WAL segments. But there is no lightweight observability tool that answers, in real time, which record types are responsible for the WAL. Additionally, pg_walinspect runs against on-disk WAL files, which is expensive. This view would be cheap for monitoring systems to poll.
Use cases:
WAL volume investigation: see which record types dominate WAL generation in real time without touching disk.
Monitoring integration: Prometheus/Grafana can poll the view to track WAL composition over time and alert on anomalies.
Replication tuning: identify whether WAL volume is dominated by data changes, index maintenance, FPIs, or vacuum activity to guide tuning.
Extension debugging: custom WAL resource managers get visibility automatically.
Key design decisions
Counting mechanism:
The counting mechanism is a single backend-local array increment in XLogInsert():
pgstat_pending_wal_records[rmid][(info >> 4) & 0x0F]++;
This indexes into a uint64[256][16] array (32 KB per backend) using the rmgr ID and the 4-bit record-type subfield of the WAL info byte. Counters are flushed to shared memory via the standard pgstat infrastructure.
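To make the mechanism concrete, here is a small standalone model of that hot-path increment. This is an illustrative sketch, not the patch code: the array and helper names mirror the description above, but only the dimensions (256 rmgr IDs × 16 subtypes, 32 KB of uint64 counters) and the `(info >> 4) & 0x0F` decode are taken from it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Standalone model of the proposed backend-local pending counters.
 * Dimensions follow the proposal: 256 possible rmgr IDs x 16 record
 * subtypes, i.e. 256 * 16 * 8 bytes = 32 KB per backend.
 */
#define MAX_RMGR_IDS   256
#define MAX_REC_TYPES   16

static uint64_t pgstat_pending_wal_records[MAX_RMGR_IDS][MAX_REC_TYPES];

/*
 * Modeled hot-path increment from XLogInsert(): rmid is the resource
 * manager ID; the high 4 bits of the record's info byte select the
 * record subtype (the low 4 bits carry per-record flag bits, which
 * are deliberately masked out).
 */
static void
count_wal_record(uint8_t rmid, uint8_t info)
{
    pgstat_pending_wal_records[rmid][(info >> 4) & 0x0F]++;
}
```

The flag bits in the low nibble are ignored, so two records of the same subtype with different flags land in the same cell.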
I am using a per-backend pending array instead of direct shared-memory writes: the counter is incremented in backend-local memory and flushed to shared memory by the existing pgstat flush cycle, so I don't expect any contention in the hot path (please see the perf results below).
Fixed 256×16 matrix: all 256 possible rmgr IDs × 16 possible record types. This accommodates core resource managers and any custom WAL resource managers from extensions without configuration. The 32 KB per-backend cost is modest.
Human-readable names via rm_identify(): the SRF calls each resource manager's rm_identify callback to translate the info byte into a readable record type name (for example INSERT, COMMIT, VACUUM, HOT_UPDATE).
Reset support: pg_stat_reset_shared('wal_records'), consistent with the existing pattern for wal, bgwriter, archiver, etc.
View skips zero-count entries, keeping output clean.
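The view-side behaviour can be sketched the same way. The scan below models how the SRF might walk the matrix, skip zero cells, and turn each subtype back into a name via an rm_identify-style callback; demo_identify and its names are hypothetical stand-ins, not a real resource manager's callback.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_RMGR_IDS   256
#define MAX_REC_TYPES   16

static uint64_t wal_record_counts[MAX_RMGR_IDS][MAX_REC_TYPES];

/*
 * Hypothetical stand-in for a resource manager's rm_identify callback:
 * maps an info byte to a readable record-type name, or NULL when the
 * subtype is unknown. The names here are illustrative only.
 */
static const char *
demo_identify(uint8_t info)
{
    switch (info & 0xF0)
    {
        case 0x00:
            return "INSERT";
        case 0x20:
            return "HOT_UPDATE";
        default:
            return NULL;
    }
}

/*
 * Modeled SRF scan: emit one row per non-zero cell, skipping zero-count
 * entries so the output stays clean. Returns the number of rows.
 */
static int
emit_wal_record_rows(void)
{
    int     rows = 0;

    for (int rmid = 0; rmid < MAX_RMGR_IDS; rmid++)
    {
        for (int rt = 0; rt < MAX_REC_TYPES; rt++)
        {
            const char *name;

            if (wal_record_counts[rmid][rt] == 0)
                continue;       /* skip zero-count entries */
            /* reconstruct the info byte the callback expects */
            name = demo_identify((uint8_t) (rt << 4));
            printf("%3d | %-12s | %llu\n", rmid,
                   name ? name : "UNKNOWN",
                   (unsigned long long) wal_record_counts[rmid][rt]);
            rows++;
        }
    }
    return rows;
}
```

Reconstructing the info byte as `rt << 4` is what lets one callback serve both the hot-path decode and the display path.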
Performance overhead
Benchmarked with pgbench (scale 50, 16 clients, 16 threads, 30 s, synchronous_commit=off) on a 64-vCPU machine with data and WAL on NVMe:
 Configuration | Avg TPS
---------------+---------
 With patch    | 42,266
 Without patch | 42,053
The overhead is within measurement noise (~0.5%). The increment hits a backend-local, L1-hot array and is dwarfed by XLogInsert's existing CRC, locking, and memcpy work.
Attached is a draft patch; please share your thoughts.
Thanks,
Satya
Attachments
Re: [Proposal] pg_stat_wal_records – per-record-type WAL generation statistics
From:
Michael Paquier
Date:
On Thu, Mar 26, 2026 at 05:29:25PM -0700, SATYANARAYANA NARLAPURAM wrote:
> Postgres already has pg_stat_wal for aggregate WAL volume (bytes, full-page
> images, buffers), and pg_walinspect (superuser access required) for
> post-hoc forensic analysis of individual WAL segments. But I don't see a
> lightweight, observability tool that answers in real time which record
> types are responsible for the WAL. Additionally, pg_walinspect runs against
> on-disk WAL files, which is expensive. This view will be useful for
> monitoring systems to poll cheaply.
>
> *Use cases:*
> WAL volume investigation: see which record types dominate WAL generation in
> real time without touching disk.
> Monitoring integration: Prometheus/Grafana can poll the view to track WAL
> composition over time and alert on anomalies.
> Replication tuning: identify whether WAL volume is dominated by data
> changes, index maintenance, FPIs, or vacuum activity to guide tuning.
> Extension debugging: custom WAL resource managers get visibility
> automatically.

Why is it useful to have access to this information live for monitoring
purposes, divided by RMGR? What do you define as an anomaly in this
context, and what can be changed on the server side to get out of an
anomaly, based on the definition you would give to it?

The current WAL and IO stats are directly useful because they provide
numbers about flush, read and write quantity and timings. These are
useful metrics for benchmarking. This proposal only informs about the
number of records, and we have various record types of variable length,
which can influence the distribution of the data written on disk.

As a whole, I am doubtful that this information is worth counting live,
eating a stats kind ID. One could also implement a background worker
that provides more advanced aggregate stats outside the WAL insert
path, with a custom stats kind. No need to have that in core,
especially if the code that increments the stats eats more and more
cycles.
--
Michael
Attachments
Re: [Proposal] pg_stat_wal_records – per-record-type WAL generation statistics
From:
SATYANARAYANA NARLAPURAM
Date:
Thank you for your feedback!
On Thu, Mar 26, 2026 at 7:59 PM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Mar 26, 2026 at 05:29:25PM -0700, SATYANARAYANA NARLAPURAM wrote:
> Postgres already has pg_stat_wal for aggregate WAL volume (bytes, full-page
> images, buffers), and pg_walinspect (superuser access required) for
> post-hoc forensic analysis of individual WAL segments. But I don't see a
> lightweight, observability tool that answers in real time which record
> types are responsible for the WAL. Additionally, pg_walinspect runs against
> on-disk WAL files, which is expensive. This view will be useful for
> monitoring systems to poll cheaply.
>
> *Use cases:*
> WAL volume investigation: see which record types dominate WAL generation in
> real time without touching disk.
> Monitoring integration: Prometheus/Grafana can poll the view to track WAL
> composition over time and alert on anomalies.
> Replication tuning: identify whether WAL volume is dominated by data
> changes, index maintenance, FPIs, or vacuum activity to guide tuning.
> Extension debugging: custom WAL resource managers get visibility
> automatically.
> Why is it useful to have access to this information live for
> monitoring purposes, divided by RMGR?
A per-RMGR breakdown is essentially a real-time X-ray of what the
database is actually doing. Existing views help us understand what
queries were submitted; this view shows what physical operations
resulted from them.
> What do you define as an anomaly in this context, and what can be
> changed on the server side to get out of an anomaly, based on the
> definition you would give to it?
A few examples: the HOT ratio dropped, B-tree page splits spiked, some
app is adding logical decoding messages, GIN generates more WAL than
the corresponding heap modifications, high lock record counts, etc.
> The current WAL and IO stats are directly useful because they provide
> numbers about flush, read and write quantity and timings. These are
> useful metrics for benchmarking.
>
> This proposal only informs about the number of records, and we have
> various record types of variable length, which can influence the
> distribution of the data written on disk.
Yeah, that is a downside; I didn't add sizes in order to keep the overhead low.
> As a whole, I am doubtful that this information is worth counting
> live, eating a stats kind ID. One could also implement a background
> worker that provides more advanced aggregate stats outside the WAL
> insert path, with a custom stats kind. No need to have that in core,
> especially if the code that increments the stats eats more and more
> cycles.
Your argument makes sense to me. I was a bit hesitant to write a
background worker because of the potential side effects of tailing the
WAL. Let me try a different approach: on-demand sniffing of the WAL,
which can be implemented as an extension without changes to core. Do
you have thoughts on making it a contrib module, or on modifying the
existing pg_walinspect to accommodate these requirements?
Thanks,
Satya