Discussion: Add pg_stat_vfdcache view for VFD cache statistics
Hello hackers,
PostgreSQL's virtual file descriptor (VFD) layer maintains a
per-backend cache of open file descriptors, bounded by
max_files_per_process (default 1000). When the cache is full, the
least-recently-used entry is evicted and its OS fd closed so that a new
file can be opened. On the next access to that file, open() must be
called again, incurring a syscall that a larger cache would have
avoided.
A trivial example is partitioned tables: a table with 1500
partitions needs even more than 1500 file descriptors per full scan
(main fork, visibility map, ...), which exceeds the default limit and
causes evictions and reopens.
The problem is well understood and the fix is straightforward: raise
max_files_per_process. Tomas showed up to a 4-5x throughput improvement
in [1]; on my end I see somewhat less than that, depending on the query,
but the idea is clear.
This comes from Tomas's patch ideas on his website [1]; I thought this
patch makes sense to have.
AFAIK there is currently no way from inside PostgreSQL to know whether fd cache pressure is occurring.
The implementation is trivial: because the VFD cache is strictly
per-backend, the counters are also per-backend and require no shared
memory or locking. Three macros (pgstat_count_vfd_hit/miss/eviction)
update fields in PendingVfdCacheStats directly from fd.c.
I find this somewhat useful; I would love to hear anyone's thoughts on
whether it is useful or not.
Regards,
Ayoub
Attachments
On Sat, Mar 21, 2026 at 5:59 PM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
> Hello hackers,
>
> [...]
>
> I find this a bit useful, I would love to hear about anyone's thoughts
> whether this is useful or not.

Hi,

My $0.02: for this to be useful it would need to allow viewing the
global vfd cache picture (across all backends), not just from the
*current* backend. Applications wouldn't call this function anyway,
because they would have to be modified.

In order to get that you technically should collect the hits/misses in
the local pending pgstat io area (see e.g. pgstat_io or the simpler
pgstat_bgwriter / checkpointer) like you already do with
PendingVfdCacheStats, but then copy them to the shared memory pgstat
area (with some LWLock* protection) that would be queryable.

-J.
Hi!

On 23.03.2026 12:22, Jakub Wartak wrote:
> On Sat, Mar 21, 2026 at 5:59 PM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
>> PostgreSQL's virtual file descriptor (VFD) maintains a
>> per-backend cache of open file descriptors, bounded by
>> max_files_per_process (default 1000). [...]

That's one use-case. The other one that I've recently come across is
just knowing how many VFD cache entries there are in the first place.
While the number of open files is bounded by max_files_per_process, the
number of cache entries is unbounded. A large database can easily have
hundreds of thousands of files due to our segmentation scheme.
Workloads that access a big portion of these files can end up spending
very considerable amounts of memory on the VFD cache. For example,
100,000 VFD entries per backend * 80 bytes per VFD = ~7.6 MiB. With
1000 backends that's almost 8 GiB just for VFD entries, assuming that
each backend over time accumulates that many files.

A production database I looked at recently had ~300,000 files and many
thousand backends. It spent close to 30 GiB on the VFD cache.

I've looked at struct vfd and some simple changes to the struct would
already cut memory consumption in half. I can look into that.

Thoughts?

>> The problem is well-understood and the fix is straightforward: raise
>> max_files_per_process. Tomas showed a 4-5x throughput improvement in
>> [1] sometimes [...]

The question is what the kernel makes of that, especially in the
aforementioned case where the number of total files and backends is
large. In the Linux kernel each process that opens some file gets its
own struct file. sizeof(struct file) is ~200 bytes. Hence, increasing
max_files_per_process can measurably impact memory consumption if
changed lightheartedly. We should document that.

But I guess in most cases it's rather about changing it from 1k to 2k,
rather than changing it from 1k to 100k.

> My $0.02, for that to be useful it would need to allow viewing the
> global vfd cache picture (across all backends), not just from the
> *current* backend. [...]

+1

> In order to get that you technically should collect the hits/misses in
> the local pending pgstat io area (see e.g. pgstat_io or the simpler
> pgstat_bgwriter / checkpointer) like you already do with
> PendingVfdCacheStats, but then copy them to the shared memory pgstat
> area (with some LWLock* protection) that would be queryable.

I would include here the sum of VFD cache entries across all backends
and the total VFD cache size.

--
David Geier
Hello David and Jakub,
On Mon, Mar 23, 2026 at 1:35 PM David Geier <geidav.pg@gmail.com> wrote:
> Hi!
>
> On 23.03.2026 12:22, Jakub Wartak wrote:
>> On Sat, Mar 21, 2026 at 5:59 PM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
>>>
>>> Hello hackers,
>>>
>>> This comes from Tomas's patch idea from his website[1], i thought this patch makes sense to have.
>>>
>>> PostgreSQL's virtual file descriptor (VFD) maintains a
>>> per-backend cache of open file descriptors, bounded by
>>> max_files_per_process (default 1000). When the cache is full, the
>>> least-recently-used entry is evicted so its OS fd is closed, so a new
>>> file can be opened. On the next access to that file, open() must be
>>> called again, incurring a syscall that a larger cache would have
>>> avoided.
> That's one use-case. The other one that I've recently come across is
> just knowing how many VFD cache entries there are in the first place.
> While the number of open files is bounded by max_files_per_process, the
> number of cache entries is unbounded. A large database can easily have
> hundreds of thousands of files due to our segmentation scheme.
> Workloads that access a big portion of these files can end up spending
> very considerable amounts of memory on the VFD cache. For example,
> 100,000 VFD entries per backend * 80 bytes per VFD = ~7.6 MiB. With
> 1000 backends that's almost 8 GiB just for VFD entries, assuming that
> each backend over time accumulates that many files.
>
> A production database I looked at recently had ~300,000 files and many
> thousand backends. It spent close to 30 GiB on the VFD cache.
>
> I've looked at struct vfd and some simple changes to the struct would
> already cut memory consumption in half. I can look into that.
>
> Thoughts?
Looking forward to this.
What also bothers me in that space is that if a backend allocates 100K
entries in the VFD cache, that cache is never shrunk again; it only
grows (when it needs more than its lifetime maximum) until the backend
dies. This is useful insofar as free entries are reused instead of
newly allocated, but a spike in file openings can leave a long-lived
backend holding far more cache than it will ever need again. I don't
imagine this is common, though. What do you think about this issue,
from your experience?
> In the Linux kernel each process that opens some file gets its own
> struct file. sizeof(struct file) is ~200 bytes. Hence, increasing
> max_files_per_process can measurably impact memory consumption if
> changed lightheartedly. We should document that.
>
> But I guess in most cases it's rather about changing it from 1k to 2k,
> rather than changing it from 1k to 100k.
Indeed, I agree with this.
>>> AFAIK there is currently no way from inside PostgreSQL to know whether fd cache pressure is occurring.
>>>
>>> Implementation is trivial, because the VFD cache is strictly per-backend, the counters are also
>>> per-backend and require no shared memory or locking. Three macros (pgstat_count_vfd_hit/miss/eviction) update fields in PendingVfdCacheStats directly from fd.c.
>>>
>>> I find this a bit useful, I would love to hear about anyone's thoughts whether this is useful or not.
>>
>> Hi,
>>
>> My $0.02, for that to be useful it would need to allow viewing the
>> global vfd cache picture (across all backends), not just from the *current* backend.
>> Applications wouldn't call this function anyway, because they would have to be
>> modified.
> +1
Would it be reasonable to have both? I changed the way I was thinking
about it. It's clear one would want global VFD cache monitoring, since
its configuration is global, but its effects are mostly
backend-specific. That leads me to think it could be useful to ALSO
maintain per-backend metrics, to help identify which backends are
putting the most pressure on the VFD cache.
>> In order to get that you technically should collect the hits/misses in the local
>> pending pgstat io area (see e.g. pgstat_io or the simpler pgstat_bgwriter /
>> checkpointer) like you already do with PendingVfdCacheStats, but then copy them
>> to the shared memory pgstat area (with some LWLock* protection) that would be
>> queryable.
> I would include here the sum of VFD cache entries across all backends
> and the total VFD cache size.
I'll be doing this soon.
> --
> David Geier
Regards,
Ayoub Kazar
On 24.03.2026 00:36, KAZAR Ayoub wrote:
>> I've looked at struct vfd and some simple changes to the struct would
>> already cut memory consumption in half. I can look into that.
>>
>> Thoughts?
>
> Looking forward to this.

I try to come up with something the next days.

> What also bothers me in that space is if a backend allocates 100K entries
> in VFD cache, that cache is never shrank ever again,
> the cache only grows (if it needs more than its lifetime maximum) until the
> backend dies, [...] what do you think about this issue from your experience?

Currently the cache is directly mapped by the VFD index. That means we
could only resize down to the maximum used VFD index.

Being able to resize independently of the maximum VFD index would
require changing to a hash map like simplehash.h. I can take a look how
invasive such a change would be.

--
David Geier
Hello,
On Tue, Mar 24, 2026 at 3:09 PM David Geier <geidav.pg@gmail.com> wrote:
> On 24.03.2026 00:36, KAZAR Ayoub wrote:
>>> I've looked at struct vfd and some simple changes to the struct would
>>> already cut memory consumption in half. I can look into that.
>>>
>>> Thoughts?
>>
>> Looking forward to this.
>
> I try to come up with something the next days.
>
>> What also bothers me in that space is if a backend allocates 100K entries
>> in VFD cache, that cache is never shrank ever again,
>> the cache only grows (if it needs more than its lifetime maximum) until the
>> backend dies, although this is useful as entries are reused if free instead
>> of allocating entries, whether a spike in files openings effects a long living
>> backend to keep holding a useless amount of cache size it will need in the
>> future, i don't imagine this to be common though, what do you think about
>> this issue from your experience ?
>
> Currently the cache is directly mapped by the VFD index. That means we
> could only resize down to the maximum used VFD index.
>
> Being able to resize independently of the maximum VFD index would
> require changing to a hash map like simplehash.h. I can take a look how
> invasive such a change would be.
>
> --
> David Geier
I've implemented the recommended global stats view for the VFD cache.
The implementation is also straightforward, as it follows the same
cumulative shared statistics infrastructure as pgstat_bgwriter and
others.
Attached is a v2 patch that also contains what David suggested: global
cache size and entry count in the view.
Kind regards,
Ayoub
Attachments
Rebased v2.
Attachments
Hi!

On 29.03.2026 21:46, KAZAR Ayoub wrote:
>>> Currently the cache is directly mapped by the VFD index. That means we
>>> could only resize down to the maximum used VFD index.
>>>
>>> Being able to resize independently of the maximum VFD index would
>>> require changing to a hash map like simplehash.h. I can take a look how
>>> invasive such a change would be.

That would actually be doable without too much code churn. It would,
however, add some more overhead to each cache entry: 1 byte for the
simplehash.h status and 4 bytes for the hash, if we want to avoid
rehashing on each access. Probably we can get away without storing the
hash.

We would then have to monitor the cache size and recreate the hash
table when the size has shrunk by enough.

Another alternative is pallocing VFD entries so that we can freely move
them around in the VFD cache array. That would mean an ABI change (File
would be an 8-byte pointer instead of a 4-byte integer) but give us
much more flexibility for possible improvements. For example, apart
from compacting without a hash map, this would allow us to allocate
variable amounts of memory per entry to, e.g.:
- store the file name inline in the struct as a variable-length array
  (instead of a pstrdup() pointer), and
- depending on the type, allocate more or less memory. This is because
  non-temporary files don't use the ResOwner and the file size.

>> I've implemented the recommended global stats view on the vfd cache;
>> the implementation should also be straightforward as it follows the
>> same cumulative shared statistics infrastructure like pgstat_bgwriter
>> and others do.
>>
>> Attached is a v2 patch that also contains what David suggested for
>> global cache size and entries in the view.

I'll review the patch the next days. On quick inspection I saw that you
compute the size of a struct vfd by using sizeof(). That works, except
for the filename, which is stored as a pointer to a pstrdup()'d piece
of memory. I guess you can just say:

sizeof(VfdCache[i]) + GetMemoryChunkSpace(VfdCache[i].fileName)

--
David Geier
Hi,

Thanks for working on this, I think having some stats about the vfd
cache would be quite helpful. I took a quick look at the patch, and in
general it goes in the right direction. Here's a couple of comments /
suggestions:

1) sgml docs

- Formatting seems a bit wrong, both for the entry in the first table,
and then for the view description later. Clearly different from the
nearby tables after the documentation is built. Did you generate the
tables somehow?

- The ordering seems a bit random, but I'd argue the pg_stat_vfdcache
view should go before pg_stat_wal (at least in the first table).

2) system_views.sql

- missing the REVOKE command

- We don't align the AS clauses with spaces (I like to align my
queries, but here it's a question of consistency with the other
commands, and the alignment makes future diffs larger)

3) fd.c

- Wouldn't it be better to have pgstat_count_vfd_access(bool hit)? That
way you wouldn't even need the new else; it'd be enough to do
pgstat_count_vfd_access(!FileIsNotOpen(file)) at the beginning.

- GetVfdCacheOccupancy does not seem to be called from anywhere

4) pgstat_vfdcache.c

- I'm not sure usagecount makes sense for these stats, because that's
for cases with a single writer. These stats are written by backends, so
it probably needs an lwlock.

- I'm not sure updating PgStatShared_Backend from pgstat_vfdcache.c is
a good idea, when it's already synced from pgstat_backend. We don't do
that for WAL either, but there's pgstat_flush_backend_entry_wal in
pgstat_backend.c. I suppose vfdcache should do it the same way.

5) pg_proc.dat

- formatting seems a bit inconsistent

6) pgstat.h

- Aren't evictions mostly the same as misses, at least after a while?

- I think it would be useful to report how many file descriptors we
are allowed to open (it's less than max_files_per_process, depending
on the ulimits etc.)

- I know io_uring can consume quite a few descriptors, and it can cause
issues; I wonder if this would make it easier to observe.

I also suggest splitting the patch into smaller patches, to make it
easier to review and evaluate. Not because of size - the patch is
fairly small. But it's better not to mix multiple features with
different cost/benefit trade-offs, because then it's possible to
evaluate them separately. Maybe even commit the first part and continue
discussion about the following one(s).

This patch seems to mix two different types of stats - global stats of
the vfd cache, and then also per-backend stats. Those seem like very
different things, both in terms of overhead and benefits.

The global cache stats are going to be virtually free (at least the
hits/misses; I'm not sure about the number of entries and bytes), and
obviously useful for tuning the max_files_per_process GUC. I'd even
contemplate getting this into PG19, maybe.

The per-backend stats seem like a much harder sell to me, but I can be
convinced. Maybe it's not an issue in terms of overhead, maybe the
stats we get from that are worth it. Not sure. But I'd keep it in a
separate 0002 patch, on top of 0001 with just the "global" stats.

regards

--
Tomas Vondra
Hello,
Thanks for the review!
On Tue, Mar 31, 2026 at 8:27 PM Tomas Vondra <tomas@vondra.me> wrote:
> Hi,
>
> Thanks for working on this, I think having some stats about the vfd
> cache would be quite helpful. I took a quick look at the patch, and in
> general it goes in the right direction.
>
> Here's a couple comments / suggestions:
>
> 6) pgstat.h
>
> - Aren't evictions mostly the same as misses, at least after a while?
Correct; the window where they differ is pretty much meaningless, so I
removed it.
> - I think it would be useful to report how many file descriptors we
>   are allowed to open (it's less than max_files_per_process, depending
>   on the ulimits etc.)
Agreed. This should be max_safe_fds, as calculated by the postmaster. I
added it, but let me know whether exporting max_safe_fds the way I did
is acceptable.
> - I know io_uring can consume quite a few descriptors, and it can cause
>   issues, I wonder if this would make it easier to observe
>
> I also suggest to split the patch into smaller patches, to make it
> easier to review and evaluate. Not because of size - the patch is fairly
> small. But it's better to not mix multiple features with different
> cost/benefit trade offs, because then it's possible to evaluate them
> separately. Maybe even commit the first part and continue discussion
> about the following one(s).
>
> This patch seems to mix two different types of stats - global stats of
> the vfd cache, and then also per-backend stats. Those seem like very
> different things, both in terms of overhead and benefits.
>
> The global cache stats is going to be virtually free (at least the
> hits/misses, I'm not sure about the number of entries and bytes), and
> it's obviously useful for tuning the max_files_per_process GUC. I'd even
> contemplate getting this into PG19, maybe.
>
> The per-backend stats seem like a much harder sell to me, but I can be
> convinced. Maybe it's not an issue in terms of overhead, maybe the stats
> we get from that are worth it. Not sure. But I'd keep it in a separate
> 0002 patch, on top of 0001 with just the "global" stats.
>
> regards
>
> --
> Tomas Vondra
I fixed the style-related issues and followed your suggestion to split
the patch: global stats first, then the per-backend stats (cache size
and entry count). Attached is v3-0001, which does just the global stats
counting.
When we're sure this is correct, I'll proceed with the per-backend
stats patch.
Regards,
Ayoub
Attachments
Hi!
I finally got around to taking a look at this patch.
On 03.04.2026 15:53, KAZAR Ayoub wrote:
>> - I think it would be useful to report how many file descriptors we
>> are allowed to open (it's less than max_files_per_process, depending
>> on the ulimits etc.)
>>
> Agree, This should be max_safe_fds calculated by postmaster, I added this
> but let me know if its acceptable to export max_safe_fds in the way I did.
An alternative to including it in the view would be using a GUC of type
PGC_INTERNAL. That seems more in line with how we expose other
PostgreSQL-internal read-only variables that don't change.
Or is there an advantage to including max_safe_fds in the view?
>> The global cache stats is going to be virtually free (at least the
>> hits/misses, I'm not sure about the number of entries and bytes), and
>> it's obviously useful for tuning the max_files_per_process GUC. I'd even
>> contemplate getting this into PG19, maybe.
The number of used entries already exists, see nfile in fd.c.
Including the total cache size would also be virtually free if we don't
iterate over all VFDs each time, but update the size as we go. That
would have to happen when resizing the cache and when populating /
freeing a cache entry because extra memory is allocated / freed for
Vfd::fileName.
I'm happy to code this up if there's agreement that it's sensible to
include it, in the current version of the patch or a follow-up patch.
Beyond that:
While looking through the code I saw a mistake (repetition of "that") in
a comment in existing code. Maybe you want to fix that as well right away?
/*
* For variable-numbered stats: flush pending stats. Required if pending
* data is used. See flush_static_cb when dealing with stats data that
* that cannot use PgStat_EntryRef->pending.
*/
bool (*flush_pending_cb) (PgStat_EntryRef *sr, bool nowait);
The indentation of the type at the end of the following two structs is
inconsistent with the rest of the files.
typedef struct PgStatShared_VfdCache
{
    /* lock protects ->stats */
    LWLock      lock;
    PgStat_VfdCacheStats stats;
} PgStatShared_VfdCache;

typedef struct PgStat_VfdCacheStats
{
    PgStat_Counter vfd_hits;    /* fd was open, no open() was needed */
    PgStat_Counter vfd_misses;  /* fd was VFD_CLOSED, open() was required */
    TimestampTz stat_reset_timestamp;
} PgStat_VfdCacheStats;
Apart from these nit comments the patch looks good to me.
--
David Geier