Discussion: The ability of postgres to determine loss of files of the main fork


The ability of postgres to determine loss of files of the main fork

From: Frits Hoogland
Date:
Hi hackers,

Postgres determines the availability of the main fork, the actual data file,
upon reading it. This is also what amcheck will detect.

However, if a relation exceeds 1GB and thus has more than a single data file
segment, there is a possibility of silent data loss.  For example, if a table
consists of five segments and the third one goes missing, a sequential scan
will happily conclude that the table consists of only two segments and won't
report an error.  Only an index scan that tries to return a row from the
missing segment will report an error.

Currently, this kind of data loss cannot be detected by checking the
integrity of the table with verify_heapam().  Only if you manually use the
primary key index to fetch data from a missing segment will you get an error.

Therefore, I would like to request an enhancement: add an option to
verify_heapam() that causes the primary key index to be scanned and makes
sure that all line pointers in the index point to existing tuples.

An alternative might be to track the number of segments of a relation in
pg_class, but that may be difficult to make crash-safe.

Frits Hoogland




Re: The ability of postgres to determine loss of files of the main fork

From: Aleksander Alekseev
Date:
Hi Frits,

> Therefore, I would like to request an enhancement: add an option to
> verify_heapam() that causes the primary key index to be scanned and makes
> sure that all line pointers in the index point to existing tuples.

I'm a bit puzzled by your emphasis on primary keys. In Postgres it is
legal to have tables without PKs, indexes, or even columns:

=# create table my_table();
=# select * from my_table;

To clarify, are you proposing not to check such tables?

> An alternative might be to track the number of segments of a relation in
> pg_class, but that may be difficult to make crash-safe.

Hm... the fact that we have a segment on disk doesn't mean it is not
empty or corrupted. Let's say we add the check you are proposing. The
next person will complain that we don't check the size of the segments.
The next one, that we don't verify checksums.

So IMO there is little value in adding a check for the existence of
the segments for a single table. And the *real* check will not differ
much from something like SELECT * FROM my_table, or from making a
complete backup of the database.

-- 
Best regards,
Aleksander Alekseev



Re: The ability of postgres to determine loss of files of the main fork

From: Tom Lane
Date:
Aleksander Alekseev <aleksander@tigerdata.com> writes:
>> Therefore, I would like to request an enhancement: add an option to
>> verify_heapam() that causes the primary key index to be scanned and makes
>> sure that all line pointers in the index point to existing tuples.

> ... IMO there is little value in adding a check for the existence of
> the segments for a single table. And the *real* check will not differ
> much from something like SELECT * FROM my_table, or from making a
> complete backup of the database.

As Frits mentioned, neither of those actions will really notice if a
table has been truncated via loss of a segment.

However, I think the requested functionality already exists via
contrib/amcheck (see the heapallindexed option).  The user does have
to make a decision which index to check with, but I think that'd be
required anyway --- as you say, there isn't necessarily a primary key.

            regards, tom lane



Re: The ability of postgres to determine loss of files of the main fork

From: Laurenz Albe
Date:
On Tue, 2025-09-30 at 18:55 +0300, Aleksander Alekseev wrote:
> > Therefore, I would like to request an enhancement: add an option to
> > verify_heapam() that causes the primary key index to be scanned and makes
> > sure that all line pointers in the index point to existing tuples.
>
> I'm a bit puzzled by your emphasis on primary keys. In Postgres it is
> legal to have tables without PKs, indexes, or even columns:
>
> =# create table my_table();
> =# select * from my_table;
>
> To clarify, are you proposing not to check such tables?

If there is no index on a table, there is no way to determine a missing
segment.  I don't think that Frits deliberately wants to *not* check
such tables, it's just that there is no way to do it as things are now.

Just because a check is not possible in certain cases is no good reason
to forgo a check in those cases where it is possible.

Yours,
Laurenz Albe



Re: The ability of postgres to determine loss of files of the main fork

From: Michael Banck
Date:
Hi,

wow, this is one of the most terrifying threads I've ever seen...

On Tue, Sep 30, 2025 at 12:41:29PM -0400, Tom Lane wrote:
> Aleksander Alekseev <aleksander@tigerdata.com> writes:
> >> Therefore, I would like to request an enhancement: add an option to
> >> verify_heapam() that causes the primary key index to be scanned and makes
> >> sure that all line pointers in the index point to existing tuples.
> 
> > ... IMO there is little value in adding a check for the existence of
> > the segments for a single table. And the *real* check will not differ
> > much from something like SELECT * FROM my_table, or from making a
> > complete backup of the database.
> 
> As Frits mentioned, neither of those actions will really notice if a
> table has been truncated via loss of a segment.

Is there a valid case for a missing segment? If not, couldn't this be
caught somewhere in the storage manager?
 
> However, I think the requested functionality already exists via
> contrib/amcheck (see the heapallindexed option).  

It doesn't balk for me, am I doing something wrong?

|mbanck@mbanck-lin-1:~$ psql -c "SELECT COUNT(*) FROM pgbench_accounts"
|  count
|----------
| 20000000
|(1 row)
|
|mbanck@mbanck-lin-1:~$ rm /tmp/pg_virtualenv.h8ObRG/data/18/regress/base/5/16462.1
|mbanck@mbanck-lin-1:~$ ls /tmp/pg_virtualenv.h8ObRG/data/18/regress/base/5/16462* | grep -v 16462_
|/tmp/pg_virtualenv.h8ObRG/data/18/regress/base/5/16462
|/tmp/pg_virtualenv.h8ObRG/data/18/regress/base/5/16462.2
|mbanck@mbanck-lin-1:~$ psql -c "SELECT COUNT(*) FROM pgbench_accounts"
|  count  
|---------
| 7995392
|(1 row)
|
|mbanck@mbanck-lin-1:~$ /usr/lib/postgresql/18/bin/pg_amcheck -v --heapallindexed -t pgbench_accounts
|pg_amcheck: including database "postgres"
|pg_amcheck: in database "postgres": using amcheck version "1.5" in schema "public"
|pg_amcheck: checking heap table "postgres.public.pgbench_accounts"
|pg_amcheck: checking btree index "postgres.public.pgbench_accounts_pkey"
|mbanck@mbanck-lin-1:~$ echo $?
|0
|mbanck@mbanck-lin-1:~$ /usr/lib/postgresql/18/bin/pg_amcheck -v --heapallindexed -i pgbench_accounts_pkey
|pg_amcheck: including database "postgres"
|pg_amcheck: in database "postgres": using amcheck version "1.5" in schema "public"
|pg_amcheck: checking btree index "postgres.public.pgbench_accounts_pkey"
|mbanck@mbanck-lin-1:~$ echo $?
|0
|mbanck@mbanck-lin-1:~$

And neither pg_checksums nor pg_basebackup catch it either...


Michael



Re: The ability of postgres to determine loss of files of the main fork

From: Arseniy Mukhin
Date:
Hi,

On Wed, Oct 1, 2025 at 10:02 AM Michael Banck <mbanck@gmx.net> wrote:
>
> Hi,
>
> wow, this is one of the most terrifying threads I've ever seen...
>
> On Tue, Sep 30, 2025 at 12:41:29PM -0400, Tom Lane wrote:
> > Aleksander Alekseev <aleksander@tigerdata.com> writes:
> > >> Therefore, I would like to request an enhancement: add an option to
> > >> verify_heapam() that causes the primary key index to be scanned and makes
> > >> sure that all line pointers in the index point to existing tuples.
> >
> > > ... IMO there is little value in adding a check for the existence of
> > > the segments for a single table. And the *real* check will not differ
> > > much from something like SELECT * FROM my_table, or from making a
> > > complete backup of the database.
> >
> > As Frits mentioned, neither of those actions will really notice if a
> > table has been truncated via loss of a segment.
>
> Is there a valid case for a missing segment? If not, couldn't this be
> caught somewhere in the storage manager?
>
> > However, I think the requested functionality already exists via
> > contrib/amcheck (see the heapallindexed option).
>
> It doesn't balk for me, am I doing something wrong?
>
> [...]

I tried to reproduce it and got the same results with bt_index_check().
IIUC the reason amcheck doesn't report any corruption here is that the
heapallindexed check just builds a bloom filter from all index tuples and
then tests every heap tuple against it. So we never actually check that
every index tuple points to an existing segment/page/heap tuple.
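The direction of that check can be illustrated with a toy model (plain Python, not amcheck's real code; a set stands in for the bloom filter and TIDs are plain integers):

```python
# Index still references TIDs 0..9, but the heap lost a segment and only
# tuples 0..4 survive.  heapallindexed builds a filter from the index and
# probes the *surviving* heap tuples against it, so entries that point
# into the lost segment are never examined at all.
index_tids = set(range(10))
heap_tids = set(range(5))

bloom = set(index_tids)   # a real bloom filter is lossy; a set suffices here

# heapallindexed direction: heap -> index
failures = [tid for tid in heap_tids if tid not in bloom]

# the reverse direction (index -> heap) would catch the loss
dangling = [tid for tid in index_tids if tid not in heap_tids]

print(len(failures), len(dangling))
```

The heap-to-index probe reports zero failures even though half the index entries dangle; only an index-to-heap walk would surface them.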


Best regards,
Arseniy Mukhin



Re: The ability of postgres to determine loss of files of the main fork

From: Jakub Wartak
Date:
On Wed, Oct 1, 2025 at 9:02 AM Michael Banck <mbanck@gmx.net> wrote:
>
> Hi,
>
> wow, this is one of the most terrifying threads I've ever seen...

Same.

> On Tue, Sep 30, 2025 at 12:41:29PM -0400, Tom Lane wrote:
> > Aleksander Alekseev <aleksander@tigerdata.com> writes:
> > >> Therefore, I would like to request an enhancement: add an option to
> > >> verify_heapam() that causes the primary key index to be scanned and makes
> > >> sure that all line pointers in the index point to existing tuples.
> >
> > > ... IMO there is little value in adding a check for the existence of
> > > the segments for a single table. And the *real* check will not differ
> > > much from something like SELECT * FROM my_table, or from making a
> > > complete backup of the database.
> >
> > As Frits mentioned, neither of those actions will really notice if a
> > table has been truncated via loss of a segment.
>
> Is there a valid case for a missing segment? If not, couldn't this be
> caught somewhere in the storage manager?
>

I took a look at PG17: in _mdfd_openseg() there is an "if (fd < 0)
return NULL" after open(), but of its callers only _mdfd_getseg()
seems to alert on that NULL. To me this seems like a bug, because
I've seen way too many times people and software deleting files
randomly. Even simple crashes (with e2fsck, xfs_repair) can put
orphaned inodes into /lost+found. IMHO all files should be opened
at least on startup to check integrity, because the non-zero return
code for openat(2) (during such a SELECT) seems to be coming out of
RelationGetNumberOfBlocksInFork()->table_block_relation_size()->smgrnblocks()->mdnblocks()->_mdfd_openseg().
Now, if the first segment file were missing, we would complain in
mdopenfork(). mdnblocks() even claims that "all active segments of the
relation are opened...", but apparently even that is not true.

The bigger context seems to be that 049469e7e7cfe0c69 (2015) could be
the culprit here as well: it states that mdnblocks() could create
zero-length files back in the day, and it removed the ereport(ERROR)
issued when the file could not be accessed.

Another idea (other than this being a bug): Thomas had a large-relation
patchset back in [1], but I wouldn't be a fan of us operating on
31-32TB files ;)

-J.

[1] -
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BBGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0%3Dm6dDiA%40mail.gmail.com



Re: The ability of postgres to determine loss of files of the main fork

From: Aleksander Alekseev
Date:
Hi Jakub,

> IMHO all files should be opened at least on startup to check integrity,

That might be a lot of files to open. And even if you can open a file,
it doesn't mean it isn't empty or corrupted.

-- 
Best regards,
Aleksander Alekseev



Re: The ability of postgres to determine loss of files of the main fork

From: Frits Hoogland
Date:
I am proposing that the database have the ability to detect when it has missing segments.
One of the main and basic properties of a database is to store data consistently and reliably.
If the database cannot detect that it is missing things, that is a huge breach of safety and reliability.

And you're right that other things can be checked; this is just about this single ability.
Like Laurenz says, the fact that other checks could possibly be done is not a reason not to look at or consider this one.

Laurenz makes the right assumption about my message, and I'm sorry for suggesting an emphasis on a primary key
for doing the check. A primary key has a high chance of existing, and is pretty much guaranteed to contain all
table rows, and therefore is a good candidate for validating the table; but it is really just a practical
way to check whether a table has all its segments available without requiring changes to the catalog.

My aim is to point out what I think is an omission, where we are missing out.
I don't know what would fit best in the current engine; my suggestion is to keep track of the segments, or of the
last page that contains data and thus should be available, so that metadata is available that allows the database
to validate that the main fork and all its segments are present.


Frits Hoogland





Re: The ability of postgres to determine loss of files of the main fork

From: Frits Hoogland
Date:
Thank you for your answer, Tom.

As pointed out elsewhere in this thread: using the heapallindexed option, it is
not possible to detect that the table has missing segments and thus missing data.
What it will detect is whether the index is missing data that exists in the table;
it validates table->index.

Frits Hoogland





Re: The ability of postgres to determine loss of files of the main fork

From: Jakub Wartak
Date:
On Wed, Oct 1, 2025 at 1:46 PM Aleksander Alekseev
<aleksander@tigerdata.com> wrote:
>
> Hi Jakub,
>
> > IMHO all files should be opened at least on startup to check integrity,
>
> That might be a lot of files to open.

I was afraid of that, but let's say a modern high-end DB is 200TB;
that's about 200*1024 1GB files. I'm getting these time(1) timings
for 204k files on ext4:

$ time ./createfiles                     # real 0m2.157s (open(O_CREAT)+close())
$ time ls -l many_files_dir/ > /dev/null # real 0m0.734s
$ time ./openfiles                       # real 0m0.297s (already existing, hot)
$ time ./openfiles                       # real 0m1.456s (already existing, cold: echo 3 > drop_caches)

Not bad in my book as a one-time activity. It could potentially pose a
problem with high-latency open() calls, maybe NFS or something remote,
I guess.
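The ./createfiles and ./openfiles helpers aren't shown in the thread; a rough, scaled-down Python equivalent of the experiment (file names and counts are illustrative) might look like:

```python
import os
import tempfile
import time

N = 2048   # scaled down from the 204k files in the experiment above

with tempfile.TemporaryDirectory() as d:
    t0 = time.monotonic()
    for i in range(N):   # "createfiles": open(O_CREAT) + close()
        fd = os.open(os.path.join(d, f"seg{i}"), os.O_CREAT | os.O_WRONLY, 0o600)
        os.close(fd)
    create_s = time.monotonic() - t0

    t0 = time.monotonic()
    for i in range(N):   # "openfiles": open + close already-existing files
        fd = os.open(os.path.join(d, f"seg{i}"), os.O_RDONLY)
        os.close(fd)
    open_s = time.monotonic() - t0

    n_created = len(os.listdir(d))

print(f"created {n_created} files in {create_s:.3f}s, reopened in {open_s:.3f}s")
```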

> Even if you can open a file it doesn't mean it's not empty

Correct, I haven't investigated that rabbithole...

> or is not corrupted.

I think checksums guard users well in this case, as they would get
notified that stuff is wonky (much better than wrong results / silent
data loss).

-J.



Re: The ability of postgres to determine loss of files of the main fork

From: Frits Hoogland
Date:
Thank you for looking into this Jakub, Aleksander, Michael, Tom,

There are more scenarios where files might be missing: how about a backup somehow missing files?
But also deliberate sabotage: you can hide data this way, and it would be hard to detect, and even
harder to understand what happened, why it happened and how it happened. I dare say it is quite ideal
if you want a name to be removed from a database in a covert way, and you know something about the
database and have access, or can somehow manipulate the data directory.

Without having the exact source code files and functions handy, the way I think the segments
work in postgres is that the database, because of the absence of any indicator of size, will scan
the relfilenode, and when it reaches the set limit (1GB), it will try to open the next segment. If that
segment doesn't exist, the database assumes it has reached the end of the relation. That is all. That is
why there is no error: it looks exactly like the case where there truly is no more data, and there is no
metadata from which to understand that there is supposed to be more data.

And therefore I talked about the primary key: it contains "indirect" data that makes it possible to detect
that a table has a missing segment. But I think what is really needed is a max page number that records
the known size.
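The scan-until-missing behavior described above can be sketched with a toy simulation (plain Python, not the actual md.c code; block and segment sizes are scaled down):

```python
import os
import tempfile

BLOCKS_PER_SEG = 4   # stand-in for RELSEG_SIZE (normally 131072 blocks = 1GB)
BLOCK = 8192         # stand-in for BLCKSZ

def nblocks(prefix):
    """Mimic mdnblocks(): walk segment files 0, 1, 2, ... and treat the
    first one that doesn't exist as the end of the relation."""
    total, segno = 0, 0
    while True:
        seg = prefix if segno == 0 else f"{prefix}.{segno}"
        if not os.path.exists(seg):
            return total                      # missing segment reads as EOF
        size = os.path.getsize(seg)
        total += size // BLOCK
        if size < BLOCKS_PER_SEG * BLOCK:
            return total                      # short segment: it is the last one
        segno += 1

results = []
with tempfile.TemporaryDirectory() as d:
    prefix = os.path.join(d, "16462")
    for name in ("16462", "16462.1", "16462.2"):   # three full segments
        with open(os.path.join(d, name), "wb") as f:
            f.write(b"\0" * BLOCKS_PER_SEG * BLOCK)
    results.append(nblocks(prefix))   # 12 blocks
    os.unlink(prefix + ".1")          # "lose" the middle segment
    results.append(nblocks(prefix))   # 4 blocks: silent truncation, no error
print(results)
```

With the middle segment gone, the walk simply stops early and reports a smaller relation, which is exactly the silent-truncation behavior seen in the pgbench test earlier in the thread.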



Frits Hoogland





Re: The ability of postgres to determine loss of files of the main fork

From: Michael Banck
Date:
Hi,

On Wed, Oct 01, 2025 at 02:05:53PM +0200, Jakub Wartak wrote:
> On Wed, Oct 1, 2025 at 1:46 PM Aleksander Alekseev
> <aleksander@tigerdata.com> wrote:
> > > IMHO all files should be opened at least on startup to check
> > > integrity,

I would say s/startup/crash recovery/, if any.

> > That might be a lot of files to open.
> 
> I was afraid of that, but let's say modern high-end is 200TB big DB,
> that's like 200*1024 1GB files, but I'm getting such time(1) timings
> for 204k files on ext4:
> 
> $ time ./createfiles                      # real    0m2.157s, it's
> open(O_CREAT)+close()
> $ time ls -l many_files_dir/ > /dev/null # real    0m0.734s
> $ time ./openfiles                          # real    0m0.297s , for
> already existing ones (hot)
> $ time ./openfiles                          # real    0m1.456s , for
> already existing ones (cold, echo 3 > drop_caches sysctl)
>
> Not bad in my book as a one time activity. It could pose a problem
> potentially with some high latency open() calls, maybe NFS or
> something remote I guess.

Yeah, did you try on SAN as well? I am doubtful that will be performant.


Michael



Re: The ability of postgres to determine loss of files of the main fork

From: Laurenz Albe
Date:
On Wed, 2025-10-01 at 13:58 +0200, Frits Hoogland wrote:
> I am proposing the database to have the ability to detect when it has missing segments.

Just a random idea: one solution would be for each segment to carry a flag that
indicates whether it is the last segment or not.  But that would break the on-disk
storage format, unless there is room left for an extra flag somewhere in the
current layout.

Yours,
Laurenz Albe



Re: The ability of postgres to determine loss of files of the main fork

From: Andres Freund
Date:
Hi,

On 2025-10-01 15:39:04 +0200, Laurenz Albe wrote:
> On Wed, 2025-10-01 at 13:58 +0200, Frits Hoogland wrote:
> > I am proposing the database to have the ability to detect when it has missing segments.
> 
> Just a random idea: one solution would be if each segment has a flag that indicates
> if that is the last segment or not.  But that would break the on-disk storage format,
> unless there is room left for an extra flag somewhere in the current layout.

It'd also make extensions / truncations more complicated. I rather doubt we're
going there. Right now relation extensions aren't WAL-logged. While there might
be reasons to change that, I don't think this is enough justification for
doing so.

Greetings,

Andres



Re: The ability of postgres to determine loss of files of the main fork

From: Frits Hoogland
Date:
> On 1 Oct 2025, at 15:49, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-10-01 15:39:04 +0200, Laurenz Albe wrote:
>> On Wed, 2025-10-01 at 13:58 +0200, Frits Hoogland wrote:
>>> I am proposing the database to have the ability to detect when it has missing segments.
>>
>> Just a random idea: one solution would be if each segment has a flag that indicates
>> if that is the last segment or not.  But that would break the on-disk storage format,
>> unless there is room left for an extra flag somewhere in the current layout.
>
> It'd also make extensions / truncations more complicated. I rather doubt we're
> going there. Right now relation extension aren't WAL logged. While there might
> be reasons to change that, I don't think this is enough justification for
> doing so.
>

What would be an achievable way of making postgres understand the relation size?
How about a field in pg_class that keeps the final data page, so that the catalog
tracks the size, which would then allow utilities and the database itself to understand
how many segments should exist?

> Greetings,
>
> Andres




Re: The ability of postgres to determine loss of files of the main fork

From: Jakub Wartak
Date:
On Wed, Oct 1, 2025 at 2:15 PM Michael Banck <mbanck@gmx.net> wrote:

> > > That might be a lot of files to open.
> >
> > I was afraid of that, but let's say modern high-end is 200TB big DB,
> > that's like 200*1024 1GB files, but I'm getting such time(1) timings
> > for 204k files on ext4:
> >
> > $ time ./createfiles                      # real    0m2.157s, it's
> > open(O_CREAT)+close()
> > $ time ls -l many_files_dir/ > /dev/null # real    0m0.734s
> > $ time ./openfiles                          # real    0m0.297s , for
> > already existing ones (hot)
> > $ time ./openfiles                          # real    0m1.456s , for
> > already existing ones (cold, echo 3 > drop_caches sysctl)
> >
> > Not bad in my book as a one time activity. It could pose a problem
> > potentially with some high latency open() calls, maybe NFS or
> > something remote I guess.
>
> Yeah, did you try on SAN as well? I am doubtful that will be performant.


OK, I guess you wanted some latency + a big fs size. I don't have a
real SAN here, but I've tried it on noisy Azure Cloud with 4x Standard
__HDD__ disks (probably SSD/flash but throttled way down). Each LUN is
32TB (the max), giving me a 128TB striped fs in total:
- lvcreate -L 128G -n lv -i 4 -I 1M vgbig + mkfs.xfs
- ioping reports the following latencies for Direct IO:
  min/avg/max/mdev = 3.25 ms / 18.0 ms / 193.8 ms / 39.5 ms (yay!)
- so as per the above it doesn't have low latency at all
- a simple fio run (fio --name=rand_read_iops
  --filename=/xfs/many_files_dir/test1 --rw=randread --bs=4k --direct=1
  --ioengine=io_uring --iodepth=64 --size=4G --runtime=60 --time_based
  --group_reporting) reports just: IOPS=959, BW=3837KiB/s

Yet, to create/open 204k empty files to simulate that 200TB cluster on
that 128TB fs:

$ time ./createfiles                           # real 0m2.140s
$ time ls -l /xfs/many_files_dir/ > /dev/null  # real 0m0.697s
$ time ./openfiles                             # real 0m0.440s (hot)
$ time ./openfiles                             # real 0m29.872s (after echo 3 > drop_caches)
$ time ./openfiles                             # real 0m0.443s (hot)
$ time ./openfiles                             # real 0m31.050s (after echo 3 > drop_caches)

$ echo 3 > /proc/sys/vm/drop_caches
$ time ls -l /xfs/many_files_dir/ > /dev/null  # real 1m2.345s
$ time ./openfiles                             # real 0m0.437s (hot)

Anyway, I do not know whether opening all the files on startup (or just
on crash recovery?) is the proper way; personally, given the option, I
would take +30s to open a 200TB database any day over the risk of silent
data loss and wrong results. I'm sure, however, that ignoring missing
files on access is certainly not how things should look, and I would
speculate that files may even go missing while the database is up and
running, without you being notified in any way (due to human error:
imagine someone rsyncing in the wrong [opposite] direction by accident,
or something like that).

-J.



Re: The ability of postgres to determine loss of files of the main fork

From: Laurenz Albe
Date:
On Mon, 2025-10-06 at 11:19 +0200, Jakub Wartak wrote:
> Anyway, I do not know if opening all the files on startup (or just
> crash-recovery?) is the proper way

I am not sure if you understand the problem at hand: how can you
tell that a segment of a relation is missing?  You have to know that
there should be a file before you can try to open it.

Yours,
Laurenz Albe



Re: The ability of postgres to determine loss of files of the main fork

From: Jakub Wartak
Date:
On Mon, Oct 6, 2025 at 2:07 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Mon, 2025-10-06 at 11:19 +0200, Jakub Wartak wrote:
> > Anyway, I do not know if opening all the files on startup (or just
> > crash-recovery?) is the proper way
>
> I am not sure if you understand the problem at hand: how can you
> tell that a segment of a relation is missing?  You have to know that
> there should be a file before you can try to open it.

I'm pretty aware that PG doesn't track the relation segment count, but
possibly it should, as without that nasty stuff can happen. The
discussion then was mostly about how practical it would be to just open
all the files of big DBs during startup (if we know them in advance);
that is just one of the ideas, I suspect. I checked those timings out
of curiosity.

Anyway, there's also the earlier-referenced idea of a single-file
relation model by Thomas.

Another fun fact: see the READ_ONLY_OPEN_DELAYED details in Oracle.

-J.



Re: The ability of postgres to determine loss of files of the main fork

From: Robert Haas
Date:
On Wed, Oct 1, 2025 at 11:25 AM Frits Hoogland <frits.hoogland@gmail.com> wrote:
> What would be an achievable way of making postgres understand the relation size?
> How about a field in pg_class that keeps the final data page, so that the catalog
> tracks the size, which would then allow utilities and the database itself to understand
> how many segments should exist?

I think that would definitely be impractical. Your idea of having an
option for amcheck that is the reverse of heapallindexed
(indexallheaped?) seems perfectly reasonable as a debugging tool and
probably not that hard to implement, but actually noticing organically
would be tricky, both in terms of code complexity and also in terms of
performance.

Updating pg_class every time we extend any relation in the system by a
block is definitely going to be painfully slow -- and there's also the
problem that you can't very well track the length of pg_class itself
by updating pg_class, because you might not be able to update pg_class
without extending it. What seems more practical is to store metadata
in a metapage within each relation or in some separate storage.
However, even that is far from problem-free. Even in the best case
where there are no other problems, you're talking about emitting WAL
records upon relation extension, which I suspect would cause a quite
noticeable impact if you did it for every block.

An idea that I had was to keep track of the number of segments rather
than the entire length of the relation. That's not as good, because
then you can't detect truncation of the last file, but it would be
good enough to detect the disappearance of entire files, and it would
mean that the metadata only needs to be updated once per GB of the
relation rather than every time you extend.
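If such a segment count were durably stored somewhere, the cross-check itself would be cheap; a hedged sketch (hypothetical helper in plain Python, nothing like this exists in md.c today):

```python
import os
import tempfile

def missing_segments(prefix, expected_segments):
    """Hypothetical cross-check: given a durably stored segment count,
    return the main-fork segment files that should exist but don't."""
    missing = []
    for segno in range(expected_segments):
        seg = prefix if segno == 0 else f"{prefix}.{segno}"
        if not os.path.exists(seg):
            missing.append(os.path.basename(seg))
    return missing

with tempfile.TemporaryDirectory() as d:
    prefix = os.path.join(d, "16462")
    for name in ("16462", "16462.2"):      # segment .1 has gone missing
        open(os.path.join(d, name), "wb").close()
    lost = missing_segments(prefix, 3)

print(lost)
```

The hard engineering problem discussed next is not this check, but where the expected count lives and how the md.c layer gets at it.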

But even this has a lot of engineering challenges. To really be able
to do the cross-checks in a meaningful way, you'd want the md*
functions to have access to the information -- and I'm having some
difficulty imagining how we would arrange for that. For instance, if
mdread() is asked for a block and first needs to know whether that
block (or the containing segment) should exist, it's not going to have
access to the relcache to check some cached data. We could possibly
cache something in the SMgrRelation, but if the cache is not
populated, then we'd have to read the data from the original source.
But surely we can't have mdread() calling ReadBuffer(); that would be
a huge layering violation and would likely cause some very unpleasant
problems.

I expect there is some way to rejigger things so that the md.c layer
has to be told by the higher layers how many segments can exist and
then to figure out a way to bootstrap things, but it's probably all
quite complicated so I am definitely not volunteering to be the one to
do the work...

--
Robert Haas
EDB: http://www.enterprisedb.com