Обсуждение: Metadata and record block access stats for indexes

Поиск

Список

Период

Сортировка

Metadata and record block access stats for indexes

От

Mircea Cadariu

Дата:

28 февраля 2025 г., 22:58:32

Hi,

For the purpose of writing a blog post I was checking the index stats 
recorded for a workload, but found them rather confusing. Following 
along the code with the debugger it eventually made sense, and I could 
eventually understand what's counted.  Looking around a bit, I 
discovered an older discussion [1] in the mailing lists and learned that 
the issue is known.  The proposal in that thread is to start counting 
separate metadata and record stats depending on what type of index block 
is retrieved.

I realized those would have helped me better understand the collected 
index stats, so I started working on a patch to add these in the system 
views. Attached is a WIP patch file with partial coverage of the B-Tree 
index code. The implementation follows the existing stats collection 
approach and the naming convention proposed in [1].  Let me know if what 
I'm doing is feasible and if there's any concerns I could address. Next 
steps would be to replace all places where I currently pass in NULL with 
proper counting, as well as update tests and docs.

Looking forward to your feedback! Thanks!

Cheers,
Mircea

[1]: 
https://www.postgresql.org/message-id/flat/CAH2-WzmdZqxCS1widYzjDAM%2BZ-Jz%3DejJoaWXDVw9Qy1UsK0tLA%40mail.gmail.com

Вложения

v1-0001-Preliminary-work-to-capture-and-expose-separate-r.patch

Re: Metadata and record block access stats for indexes

От

Mircea Cadariu

Дата:

19 апреля 2025 г., 17:28:35

Hi,

Just attaching the complete patch which now covers all index types, docs 
and tests.

You can run the following to see it in action:

create table test (id serial primary key);
insert into test select * from generate_series(0,30000);
select pg_stat_reset();
select * from test where id=3000;
select * from pg_statio_all_indexes where indexrelname = 'test_pkey';

This will show that there were 2 index blocks read from shared buffers 
(hit): 1 metadata and one record.

Cheers,

Mircea

On 28/02/2025 21:58, Mircea Cadariu wrote:
> Hi,
>
> For the purpose of writing a blog post I was checking the index stats 
> recorded for a workload, but found them rather confusing. Following 
> along the code with the debugger it eventually made sense, and I could 
> eventually understand what's counted.  Looking around a bit, I 
> discovered an older discussion [1] in the mailing lists and learned 
> that the issue is known.  The proposal in that thread is to start 
> counting separate metadata and record stats depending on what type of 
> index block is retrieved.
>
> I realized those would have helped me better understand the collected 
> index stats, so I started working on a patch to add these in the 
> system views. Attached is a WIP patch file with partial coverage of 
> the B-Tree index code. The implementation follows the existing stats 
> collection approach and the naming convention proposed in [1].  Let me 
> know if what I'm doing is feasible and if there's any concerns I could 
> address. Next steps would be to replace all places where I currently 
> pass in NULL with proper counting, as well as update tests and docs.
>
> Looking forward to your feedback! Thanks!
>
> Cheers,
> Mircea
>
> [1]: 
> https://www.postgresql.org/message-id/flat/CAH2-WzmdZqxCS1widYzjDAM%2BZ-Jz%3DejJoaWXDVw9Qy1UsK0tLA%40mail.gmail.com 
>

Вложения

0001-Add-separate-record-leaf-and-metadata-stats-for-inde.patch

Re: Metadata and record block access stats for indexes

От

Frédéric Yhuel

Дата:

20 июля 2025 г., 16:54:49


On 7/4/25 18:00, Mircea Cadariu wrote:
> Just attaching v2 of the patch.

Hi Mircea,

Your patch applies cleanly and seems to work well.

IIUC, the index hit ratio should be computed with the following formula:

(idx_blks_hit - idx_metadata_blks) / (idx_blks_hit - idx_metadata_blks + 
idx_blks_read)

because most of the index non-leaf pages should be in the cache. Right?

This should probably be documented somewhere?

Here is my testing:

  # select tree_level, internal_pages, leaf_pages from 
pgstatindex('pgbench_accounts_pkey');
  tree_level | internal_pages | leaf_pages
------------+----------------+------------
           2 |             20 |       5465
(1 row)

  # SELECT DISTINCT pg_buffercache_evict(bufferid)
   FROM pg_buffercache
  WHERE relfilenode = pg_relation_filenode('pgbench_accounts_pkey');
  pg_buffercache_evict
----------------------
  (t,f)
(1 row)

  # SELECT pg_stat_reset();
  pg_stat_reset
---------------

(1 row)

  # SELECT max(abalance) FROM pgbench_accounts WHERE aid = 100;
  max
-----
    0
(1 row)

  # select idx_blks_read, idx_blks_hit, idx_metadata_blks from 
pg_statio_all_indexes where indexrelname = 'pgbench_accounts_pkey';
  idx_blks_read | idx_blks_hit | idx_metadata_blks
---------------+--------------+-------------------
              3 |            0 |                 2
(1 row)

--> 3 pages: the root of the tree, one internal page and one leaf

  #
\q
fyhuel@framework:~$ psql bench
psql (19devel)
Type "help" for help.

  # SELECT max(abalance) FROM pgbench_accounts WHERE aid = 100;
  max
-----
    0
(1 row)

primary sleaf bench [42323] # select idx_blks_read, idx_blks_hit, 
idx_metadata_blks from pg_statio_all_indexes where indexrelname = 
'pgbench_accounts_pkey';
  idx_blks_read | idx_blks_hit | idx_metadata_blks
---------------+--------------+-------------------
              4 |            3 |                 5

--> 4 more pages: same as before, already in cache, plus the index meta 
page, read outside shared buffers because we started a new session?

Re: Metadata and record block access stats for indexes

От

Mircea Cadariu

Дата:

23 июля 2025 г., 04:35:55

Hi Frédéric,

Thanks a lot for trying out my (first) patch! Much appreciated.

On 20/07/2025 21:54, Frédéric Yhuel wrote:
> Your patch applies cleanly and seems to work well.

Cool!

> because most of the index non-leaf pages should be in the cache. Right?

Yes indeed, it's an assumption in the implementation, that the non-leaf 
pages will roughly always be in the cache.

> This should probably be documented somewhere?

I'm still familiarising myself about what to document where, whether 
things should be in the official docs or separate blog posts. In the 
patch I only documented the new column next to the existing ones for now.

> --> 3 pages: the root of the tree, one internal page and one leaf

Yes, this is correct.

> primary sleaf bench [42323] # select idx_blks_read, idx_blks_hit, 
> idx_metadata_blks from pg_statio_all_indexes where indexrelname = 
> 'pgbench_accounts_pkey';
>  idx_blks_read | idx_blks_hit | idx_metadata_blks
> ---------------+--------------+-------------------
>              4 |            3 |                 5
>
> --> 4 more pages: same as before, already in cache, plus the index 
> meta page, read outside shared buffers because we started a new session?

Yes, that's my understanding too.

Thanks!

Kind regards,

Mircea Cadariu

Re: Metadata and record block access stats for indexes

От

Mircea Cadariu

Дата:

20 ноября 2025 г., 17:04:30

Rebased and dusted off this patch.

Вложения

v3-0001-Add-metadata-non-leaf-index-block-statistics-to-p.patch

Re: Metadata and record block access stats for indexes

От

Aleksander Alekseev

Дата:

20 ноября 2025 г., 17:56:17

Hi Mircea,

> Rebased and dusted off this patch.

Thanks for the patch. Here are my two cents.

IMO it would be helpful if you could come up with a few more practical
use cases. This change is going to affect pretty much everyone. If
only a few users will benefit from it once in several years, the value
of the patch is arguably low. As an example, can you think of how the
new counters can be used for debugging, checking index integrity,
writing new access methods or perhaps writing property-based tests?
Just several examples that came to my mind first.

Also I'm a bit concerned about the performance impact. It's probably
next to nothing, but if you could measure it on a relatively large
amount of data that would be great. Note that it's not uncommon to
have dozens of different indexes for a single table.

-- 
Best regards,
Aleksander Alekseev

Re: Metadata and record block access stats for indexes

От

Mircea Cadariu

Дата:

28 ноября 2025 г., 18:18:02

Hi, Aleksander!

On 20/11/2025 14:56, Aleksander Alekseev wrote:
> Hi Mircea,
>
>> Rebased and dusted off this patch.
> Thanks for the patch. Here are my two cents.
>
> IMO it would be helpful if you could come up with a few more practical
> use cases. This change is going to affect pretty much everyone. If
> only a few users will benefit from it once in several years, the value
> of the patch is arguably low. As an example, can you think of how the
> new counters can be used for debugging, checking index integrity,
> writing new access methods or perhaps writing property-based tests?
> Just several examples that came to my mind first.
>
> Also I'm a bit concerned about the performance impact. It's probably
> next to nothing, but if you could measure it on a relatively large
> amount of data that would be great. Note that it's not uncommon to
> have dozens of different indexes for a single table.
>
Thanks! It hasn't caught on, might just stay an educational exercise on 
my side on patch writing and we move on. Thanks for thinking out loud, 
your ideas lead me to this: when working on improving the performance of 
index traversals, it could be a direct way to show improvement and even 
write regression tests: lower amount of internal pages are read than 
before. But, one could also just use the existing stats and still be 
able to track this..

For the performance checks, indeed, I'm using this script, I observe 
no meaningful difference on my laptop: 
https://gist.github.com/mcadariu/fc4a6d4eccd56b4447d1d9d05f9b5d79.

-- 
Thanks,
Mircea Cadariu

Re: Metadata and record block access stats for indexes

От

Mircea Cadariu

Дата:

12 декабря 2025 г., 15:55:22

Hi Aleksander,

On 20/11/2025 16:56, Aleksander Alekseev wrote:

Thanks for the patch. Here are my two cents.

This is a follow-up on my earlier answers to your questions. Below is a motivating example and a performance comparison with HEAD.

For database performance we want to ensure our working set fits in memory (shared buffers and OS page cache).

Index leaf-page hit ratios allows a DBA to detect when actual index lookups hit disk, which the overall ratio can mask, because the internal pages are typically cached.

To see this in action, we can do the following. After applying the patch:

CREATE TABLE test_data_large (
id SERIAL PRIMARY KEY,
search_key INTEGER,
data_col TEXT
);

INSERT INTO test_data_large (search_key, data_col)
SELECT i, 'Data-' || i
FROM generate_series(1, 3200000) i;

CREATE INDEX idx_search_key_large ON test_data_large(search_key);
ANALYZE test_data_large;

SELECT pg_stat_reset();

This is our workload we'll run and collect stats for:

DO $$
BEGIN
FOR i IN 1..20000 LOOP
PERFORM 1
FROM test_data_large
WHERE search_key = ((i * 160) % 3200000)
LIMIT 1;
END LOOP;
END $$;

SELECT pg_stat_force_next_flush();

Now we can inspect the stats:

SELECT
ROUND(100.0 * idx_blks_hit / (idx_blks_read + idx_blks_hit), 2) || '%' AS "Overall hit ratio",
ROUND(100.0 * (idx_blks_hit - idx_metadata_blks) / ((idx_blks_read + idx_blks_hit) - idx_metadata_blks), 2) || '%' AS "Leaf hit ratio"
FROM pg_statio_all_indexes
WHERE indexrelname = 'idx_search_key_large';

This is the result:

Cache hit ratio | Leaf hit ratio
-------------+-----------
85.37% | 56.11%
(1 row)

The overall cache hit ratio looks healthy at ~85%, but the leaf-page hit ratio is much lower at ~56%, indicating that a large share of index leaf blocks are actually being read from disk. This suggests that internal index pages remain cached (as expected, B-tree hierarchy) while the working set of leaf pages does not fit in memory, leading to more disk I/O during actual lookups.

Now showing the performance comparison of running this script [1] on my laptop on HEAD and patch.

HEAD
====

Point Queries
-----------
+------+--------+
| Run | TPS |
+------+--------+
| 1 | 30098 |
| 2 | 30164 |
| 3 | 29903 |
| 4 | 30024 |
| 5 | 29898 |
| 6 | 30023 |
| 7 | 29952 |
| 8 | 29413 |
| 9 | 30062 |
| 10 | 29821 |
+------+--------+
Median: 29988

Range Scans
----------
+------+------+
| Run | TPS |
+------+------+
| 1 | 586 |
| 2 | 584 |
| 3 | 584 |
| 4 | 584 |
| 5 | 584 |
| 6 | 587 |
| 7 | 578 |
| 8 | 562 |
| 9 | 583 |
| 10 | 586 |
+------+------+
Median: 584

Mixed Load
----------
+------+--------+
| Run | TPS |
+------+--------+
| 1 | 16446 |
| 2 | 15842 |
| 3 | 16701 |
| 4 | 16293 |
| 5 | 16633 |
| 6 | 16292 |
| 7 | 16753 |
| 8 | 17047 |
| 9 | 17094 |
| 10 | 17078 |
+------+--------+
Median: 16667

Patch
===

Point Queries
-----------
+------+--------+
| Run | TPS |
+------+--------+
| 1 | 30335 |
| 2 | 30448 |
| 3 | 30372 |
| 4 | 30447 |
| 5 | 30478 |
| 6 | 30482 |
| 7 | 30428 |
| 8 | 30443 |
| 9 | 30433 |
| 10 | 30478 |
+------+--------+
Median: 30445

Range Scans
----------
+------+------+
| Run | TPS |
+------+------+
| 1 | 578 |
| 2 | 586 |
| 3 | 585 |
| 4 | 586 |
| 5 | 587 |
| 6 | 585 |
| 7 | 586 |
| 8 | 586 |
| 9 | 586 |
| 10 | 586 |
+------+------+
Median: 586

Mixed Load
---------
+------+--------+
| Run | TPS |
+------+--------+
| 1 | 17002 |
| 2 | 17078 |
| 3 | 17042 |
| 4 | 17046 |
| 5 | 17007 |
| 6 | 17023 |
| 7 | 17056 |
| 8 | 17071 |
| 9 | 17084 |
| 10 | 17068 |
+------+--------+
Median: 17051

HEAD vs Patch Summary
===============
+----------------------+-----------+-----------+
| Test | HEAD TPS | Patch TPS |
+----------------------+-----------+-----------+
| Point Queries Median | 29988 | 30445 |
| Range Scans Median | 584 | 586 |
| Mixed Load Median | 16667 | 17051 |
+----------------------+-----------+-----------+

Any feedback appreciated on the performance methodology/patch as a whole welcome.

[1] https://gist.github.com/mcadariu/fc4a6d4eccd56b4447d1d9d05f9b5d79

-- 
Thanks,
Mircea Cadariu

Re: Metadata and record block access stats for indexes

От

lakshmi

Дата:

10 февраля, 14:53:02

Hi,

I tested the latest v3 patch on my source code
The patch applied cleanly,postgreSQL build and ran successfully,and the new idx_metadata_blks column is visible in pg_statio_all_indexes.
To evaluate usefulness,I reproduced a lookup workload on a large indexed table and compared the traditional overall index hit ratio with the leaf-page hit ratio enabled by the new counter.The observed results
were:
overall index hit ratio:85.37%
leaf-page hit ratio:56.11%
This indicates that cached internal/metadata pages can make overall ratio appear healthy while a significant portion of leaf pages are still read from disk.The new statistic therefore provides clear visibility into real index lookup behavior and working set pressure.

I also performed a simple pgbench comparison between patched and unpatched builds under similar conditions and did not observe evidence of a performance regression, suggesting negligible runtime overhead from the additional statistics tracking.

Overall, the patch applied cleanly, behaved as expected in testing, and the new counter appears practically useful for diagnosing index I/O patterns.

Thanks for working on improving index observability.

regards
lakshmi

On Tue, Feb 10, 2026 at 12:29 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:

Hi, Aleksander!

On 20/11/2025 14:56, Aleksander Alekseev wrote:
> Hi Mircea,
>
>> Rebased and dusted off this patch.
> Thanks for the patch. Here are my two cents.
>
> IMO it would be helpful if you could come up with a few more practical
> use cases. This change is going to affect pretty much everyone. If
> only a few users will benefit from it once in several years, the value
> of the patch is arguably low. As an example, can you think of how the
> new counters can be used for debugging, checking index integrity,
> writing new access methods or perhaps writing property-based tests?
> Just several examples that came to my mind first.
>
> Also I'm a bit concerned about the performance impact. It's probably
> next to nothing, but if you could measure it on a relatively large
> amount of data that would be great. Note that it's not uncommon to
> have dozens of different indexes for a single table.
>
Thanks! It hasn't caught on, might just stay an educational exercise on
my side on patch writing and we move on. Thanks for thinking out loud,
your ideas lead me to this: when working on improving the performance of
index traversals, it could be a direct way to show improvement and even
write regression tests: lower amount of internal pages are read than
before. But, one could also just use the existing stats and still be
able to track this..

For the performance checks, indeed, I'm using this script, I observe
no meaningful difference on my laptop:
https://gist.github.com/mcadariu/fc4a6d4eccd56b4447d1d9d05f9b5d79.

--
Thanks,
Mircea Cadariu

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Metadata and record block access stats for indexes

Вложения

Вложения

Вложения