Discussion: Re: Make relfile tombstone files conditional on WAL level


Re: Make relfile tombstone files conditional on WAL level

From
Heikki Linnakangas
Date:
On 05/03/2021 00:02, Thomas Munro wrote:
> Hi,
> 
> I'm starting a new thread for this patch that originated as a
> side-discussion in [1], to give it its own CF entry in the next cycle.
> This is a WIP with an open question to research: what could actually
> break if we did this?

I don't see a problem.

It would indeed be nice to have some other mechanism to prevent the 
issue with wal_level=minimal, the tombstone files feel hacky and 
complicated. Maybe a new shared memory hash table to track the 
relfilenodes of dropped tables.

- Heikki



Re: Make relfile tombstone files conditional on WAL level

From
Robert Haas
Date:
On Thu, Jun 10, 2021 at 6:47 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> It would indeed be nice to have some other mechanism to prevent the
> issue with wal_level=minimal, the tombstone files feel hacky and
> complicated. Maybe a new shared memory hash table to track the
> relfilenodes of dropped tables.

Just to summarize the issue here as I understand it, if a relfilenode
is used for two unrelated relations during the same checkpoint cycle
with wal_level=minimal, and if the WAL-skipping optimization is
applied to the second of those but not to the first, then crash
recovery will lose our only copy of the new relation's data, because
we'll replay the removal of the old relfilenode but will not have
logged the new data. Furthermore, we've wondered about writing an
end-of-recovery record in all cases rather than sometimes writing an
end-of-recovery record and sometimes a checkpoint record. That would
allow another version of this same problem, since a single checkpoint
cycle could now span multiple server lifetimes. At present, we dodge
all this by keeping the first segment of the main fork around as a
zero-length file for the rest of the checkpoint cycle, which I think
prevents the problem in both cases. Now, apparently that caused some
problem with the AIO patch set so Thomas is curious about getting rid
of it, and Heikki concurs that it's a hack.
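
To make the first scenario concrete: table A with relfilenode N is
dropped; before the next checkpoint, a new table B happens to be
assigned relfilenode N again and its data is written without WAL thanks
to the wal_level=minimal optimization; the server then crashes; replay
of A's drop unlinks file N, taking B's never-logged contents with it.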

I guess my concern about this patch is that it just seems to be
reducing the number of cases where that hack is used without actually
getting rid of it. Rarely-taken code paths are more likely to have
undiscovered bugs, and that seems particularly likely in this case,
because this is a low-probability scenario to begin with. A lot of
clusters probably never have an OID counter wraparound ever, and even
in those that do, getting an OID collision with just the right timing
followed by a crash before a checkpoint can intervene has got to be
super-unlikely. Even as things are today, if this mechanism has subtle
bugs, it seems entirely possible that they could have escaped notice
up until now.

So I spent some time thinking about the question of getting rid of
tombstone files altogether. I don't think that Heikki's idea of a
shared memory hash table to track dropped relfilenodes can work. The
hash table will have to be of some fixed size N, and whatever the
value of N, the approach will break down if N+1 relfilenodes are
dropped in the same checkpoint cycle.

The two most principled solutions to this problem that I can see are
(1) remove wal_level=minimal and (2) use 64-bit relfilenodes. I have
been reluctant to support #1 because it's hard for me to believe that
there aren't cases where being able to skip a whole lot of WAL-logging
doesn't work out to a nice performance win, but I realize opinions on
that topic vary. And I'm pretty sure that Andres, at least, will hate
#2 because he's unhappy with the width of buffer tags already. So I
don't really have a good idea. I agree this tombstone system is a bit
of a wart, but I'm not sure that this patch really makes anything any
better, and I'm not really seeing another idea that seems better
either.

Maybe I am missing something...

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Andres Freund
Date:
Hi,

On 2021-08-02 16:03:31 -0400, Robert Haas wrote:
> The two most principled solutions to this problem that I can see are
> (1) remove wal_level=minimal and

I'm personally not opposed to this. It's not practically relevant and makes a
lot of stuff more complicated. We imo should rather focus on optimizing the
things wal_level=minimal accelerates a lot than adding complications for
wal_level=minimal. Such optimizations would have practical relevance, and
there's plenty of low-hanging fruit.


> (2) use 64-bit relfilenodes. I have
> been reluctant to support #1 because it's hard for me to believe that
> there aren't cases where being able to skip a whole lot of WAL-logging
> doesn't work out to a nice performance win, but I realize opinions on
> that topic vary. And I'm pretty sure that Andres, at least, will hate
> #2 because he's unhappy with the width of buffer tags already.

Yep :/

I guess there's a somewhat hacky way to get somewhere without actually
increasing the size. We could take 3 bytes from the fork number and use that
to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
everyone.

It's not like we can use those bytes in a useful way, due to alignment
requirements. Declaring that the high 7 bytes are for the relNode portion and
the low byte for the fork would still allow efficient comparisons and doesn't
seem too ugly.
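
To illustrate (this is only a sketch of the layout being described, not
code from any patch; the macro names are made up):

/*
 * Hypothetical packing of a 56-bit relNode and an 8-bit fork number into
 * one uint64, with the relNode in the high 7 bytes so that ordering by
 * the combined value orders by relNode first.
 */
#define PACK_RELNODE_FORK(relNode, forkNum) \
    ((((uint64) (relNode)) << 8) | ((uint64) (forkNum) & 0xFF))

#define PACKED_GET_RELNODE(packed)  ((uint64) (packed) >> 8)
#define PACKED_GET_FORKNUM(packed)  ((int) ((packed) & 0xFF))

Comparing two packed values with a plain uint64 comparison then sorts by
relNode and breaks ties on fork number, which is what makes this layout
comparison-friendly.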


> So I don't really have a good idea. I agree this tombstone system is a
> bit of a wart, but I'm not sure that this patch really makes anything
> any better, and I'm not really seeing another idea that seems better
> either.

> Maybe I am missing something...

What I proposed in the past was to have a new shared table that tracks
relfilenodes. I still think that's a decent solution for just the problem at
hand. But it'd also potentially be the way to redesign relation forks and even
slim down buffer tags:

Right now a buffer tag is:
- 4 byte tablespace oid
- 4 byte database oid
- 4 byte "relfilenode oid" (don't think we have a good name for this)
- 4 byte fork number
- 4 byte block number

If we had such a shared table we could put at least the tablespace and fork number
into that table, mapping them to an 8 byte "new relfilenode". That'd only make
the "new relfilenode" unique within a database, but that'd be sufficient for
our purposes.  It'd give us a buffertag consisting of the following:
- 4 byte database oid
- 8 byte "relfilenode"
- 4 byte block number

Of course, it'd add some complexity too, because a buffertag alone wouldn't be
sufficient to read data (as you'd need the tablespace oid from elsewhere). But
that's probably ok, I think all relevant places would have that information.
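
As a rough illustration only (no such table exists; all names here are
made up, not taken from any patch), the shared mapping entry and the
slimmed tag might look something like:

/* One entry in the hypothetical shared relfilenode-mapping table. */
typedef struct RelFileMappingEntry
{
    uint64      newRelNode;      /* 8-byte id, unique within a database */
    Oid         spcNode;         /* tablespace needed to locate the file */
    Oid         relNode;         /* on-disk file name component */
    ForkNumber  forkNum;         /* which fork of the relation */
} RelFileMappingEntry;

/* Buffer tag reduced to database oid + 8-byte id + block number. */
typedef struct SlimBufferTag
{
    Oid         dbNode;
    uint32      newRelNode_lo;   /* stored as two 32-bit halves to avoid */
    uint32      newRelNode_hi;   /* 8-byte alignment padding in the tag  */
    BlockNumber blockNum;
} SlimBufferTag;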


It's probably possible to remove the database oid from the tag as well, but
it'd make CREATE DATABASE trickier - we'd need to change the filenames of
tables as we copy, to adjust them to the differing oid.

Greetings,

Andres Freund



Re: Make relfile tombstone files conditional on WAL level

From
Robert Haas
Date:
On Mon, Aug 2, 2021 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:
> What I proposed in the past was to have a new shared table that tracks
> relfilenodes. I still think that's a decent solution for just the problem at
> hand.

It's not really clear to me what problem is at hand. The problems that
the tombstone system created for the async I/O stuff weren't really
explained properly, IMHO. And I don't think the current system is all
that ugly. It's not the most beautiful thing in the world, but we have
lots of way worse hacks. And, it's easy to understand, requires very
little code, and has few moving parts that can fail. As hacks go it's
a quality hack, I would say.

> But it'd also potentially be the way to redesign relation forks and even
> slim down buffer tags:
>
> Right now a buffer tag is:
> - 4 byte tablespace oid
> - 4 byte database oid
> - 4 byte "relfilenode oid" (don't think we have a good name for this)
> - 4 byte fork number
> - 4 byte block number
>
> If we had such a shared table we could put at least tablespace, fork number
> into that table mapping them to an 8 byte "new relfilenode". That'd only make
> the "new relfilenode" unique within a database, but that'd be sufficient for
> our purposes.  It'd give us a buffertag consisting of the following:
> - 4 byte database oid
> - 8 byte "relfilenode"
> - 4 byte block number

Yep. I think this is a good direction.

> Of course, it'd add some complexity too, because a buffertag alone wouldn't be
> sufficient to read data (as you'd need the tablespace oid from elsewhere). But
> that's probably ok, I think all relevant places would have that information.

I think the thing to look at would be the places that call
relpathperm() or relpathbackend(). I imagine this can be worked out,
but it might require some adjustment.

> It's probably possible to remove the database oid from the tag as well, but
> it'd make CREATE DATABASE trickier - we'd need to change the filenames of
> tables as we copy, to adjust them to the differing oid.

Yeah, I'm not really sure that works out to a win. I tend to think
that we should be trying to make databases within the same cluster
more rather than less independent of each other. If we switch to using
a radix tree for the buffer mapping table as you have previously
proposed, then presumably each backend can cache a pointer to the
second level, after the database OID has been resolved. Then you have
no need to compare database OIDs for every lookup. That might turn out
to be better for performance than shoving everything into the buffer
tag anyway, because then backends in different databases would be
accessing distinct parts of the buffer mapping data structure instead
of contending with one another.
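
A toy standalone sketch of that lookup shape (purely illustrative, not
PostgreSQL code; every name below is invented):

typedef unsigned int Oid;

typedef struct PerDatabaseBufferMap
{
    /* in reality a hash table or radix tree keyed by
     * (relfilenode, fork, block); elided here */
    void   *impl;
} PerDatabaseBufferMap;

typedef struct BufferMapTopLevel
{
    int                     ndbs;
    Oid                     dbOids[64];     /* toy fixed capacity */
    PerDatabaseBufferMap    dbMaps[64];
} BufferMapTopLevel;

/*
 * Resolve the database OID once and cache the returned pointer in the
 * backend; subsequent buffer lookups then touch only the per-database
 * structure, so backends in different databases don't contend on the
 * same part of the mapping data.
 */
static PerDatabaseBufferMap *
buffer_map_for_database(BufferMapTopLevel *top, Oid dbOid)
{
    for (int i = 0; i < top->ndbs; i++)
    {
        if (top->dbOids[i] == dbOid)
            return &top->dbMaps[i];
    }
    return NULL;
}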

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Thomas Munro
Date:
On Fri, Mar 5, 2021 at 11:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> This is a WIP with an open question to research: what could actually
> break if we did this?

I thought this part of bgwriter.c might be a candidate:

        if (FirstCallSinceLastCheckpoint())
        {
            /*
             * After any checkpoint, close all smgr files.  This is so we
             * won't hang onto smgr references to deleted files indefinitely.
             */
            smgrcloseall();
        }

Hmm, on closer inspection, isn't the lack of real interlocking with
checkpoints a bit suspect already?  What stops bgwriter from writing
to the previous relfilenode generation's fd, if a relfilenode is
recycled while BgBufferSync() is running?  Not sinval, and not the
above code that only runs between BgBufferSync() invocations.



Re: Make relfile tombstone files conditional on WAL level

From
Thomas Munro
Date:
On Wed, Aug 4, 2021 at 3:22 AM Robert Haas <robertmhaas@gmail.com> wrote:
> It's not really clear to me what problem is at hand. The problems that
> the tombstone system created for the async I/O stuff weren't really
> explained properly, IMHO. And I don't think the current system is all
> that ugly. it's not the most beautiful thing in the world but we have
> lots of way worse hacks. And, it's easy to understand, requires very
> little code, and has few moving parts that can fail. As hacks go it's
> a quality hack, I would say.

It's not really an AIO problem.  It's just that while testing the AIO
stuff across a lot of operating systems, we had tests failing on
Windows because the extra worker processes you get if you use
io_method=worker were holding cached descriptors and causing stuff
like DROP TABLESPACE to fail.  AFAIK every problem we discovered in
that vein is a current live bug in all versions of PostgreSQL for
Windows (it just takes other backends or the bgwriter to hold an fd at
the wrong moment).  The solution I'm proposing to that general class
of problem is https://commitfest.postgresql.org/34/2962/ .

In the course of thinking about that, it seemed natural to look into
the possibility of getting rid of the tombstones, so that at least
Unix systems don't find themselves having to suffer through a
CHECKPOINT just to drop a tablespace that happens to contain a
tombstone.



Re: Make relfile tombstone files conditional on WAL level

From
Thomas Munro
Date:
On Wed, Sep 29, 2021 at 4:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Hmm, on closer inspection, isn't the lack of real interlocking with
> checkpoints a bit suspect already?  What stops bgwriter from writing
> to the previous relfilenode generation's fd, if a relfilenode is
> recycled while BgBufferSync() is running?  Not sinval, and not the
> above code that only runs between BgBufferSync() invocations.

I managed to produce a case where live data is written to an unlinked
file and lost, with a couple of tweaks to get the right timing and
simulate OID wraparound.  See attached.  If you run the following
commands repeatedly with shared_buffers=256kB and
bgwriter_lru_multiplier=10, you should see a number lower than 10,000
from the last query in some runs, depending on timing.

create extension if not exists chaos;
create extension if not exists pg_prewarm;

drop table if exists t1, t2;
checkpoint;
vacuum pg_class;

select clobber_next_oid(200000);
create table t1 as select 42 i from generate_series(1, 10000);
select pg_prewarm('t1'); -- fill buffer pool with t1
update t1 set i = i; -- dirty t1 buffers so bgwriter writes some
select pg_sleep(2); -- give bgwriter some time

drop table t1;
checkpoint;
vacuum pg_class;

select clobber_next_oid(200000);
create table t2 as select 0 i from generate_series(1, 10000);
select pg_prewarm('t2'); -- fill buffer pool with t2
update t2 set i = 1 where i = 0; -- dirty t2 buffers so bgwriter writes some
select pg_sleep(2); -- give bgwriter some time

select pg_prewarm('pg_attribute'); -- evict all clean t2 buffers
select sum(i) as t2_sum_should_be_10000 from t2; -- have any updates been lost?

Attachments

Re: Make relfile tombstone files conditional on WAL level

From
Thomas Munro
Date:
On Thu, Sep 30, 2021 at 11:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I managed to produce a case where live data is written to an unlinked
> file and lost

I guess this must have been broken since release 9.2 moved checkpoints
out of here[1].  The connection between checkpoints, tombstone files
and file descriptor cache invalidation in auxiliary (non-sinval)
backends was not documented as far as I can see (or at least not
anywhere near the load-bearing parts).

How could it be fixed, simply and backpatchably?  If BgBufferSync()
did if-FirstCallSinceLastCheckpoint()-then-smgrcloseall() after
locking each individual buffer and before flushing, then I think it
might logically have the correct interlocking against relfilenode
wraparound, but that sounds a tad expensive :-(  I guess it could be
made cheaper by using atomics for the checkpoint counter instead of
spinlocks.  Better ideas?

[1]
https://www.postgresql.org/message-id/flat/CA%2BU5nMLv2ah-HNHaQ%3D2rxhp_hDJ9jcf-LL2kW3sE4msfnUw9gA%40mail.gmail.com



Re: Make relfile tombstone files conditional on WAL level

From
Thomas Munro
Date:
On Tue, Oct 5, 2021 at 4:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Sep 30, 2021 at 11:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > I managed to produce a case where live data is written to an unlinked
> > file and lost

In conclusion, there *is* something else that would break, so I'm
withdrawing this CF entry (#3030) for now.  Also, that something else
is already subtly broken, so I'll try to come up with a fix for that
separately.



Re: Make relfile tombstone files conditional on WAL level

From
Robert Haas
Date:
On Mon, Aug 2, 2021 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:
> I guess there's a somewhat hacky way to get somewhere without actually
> increasing the size. We could take 3 bytes from the fork number and use that
> to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
> everyone.
>
> It's not like we can use those bytes in a useful way, due to alignment
> requirements. Declaring that the high 7 bytes are for the relNode portion and
> the low byte for the fork would still allow efficient comparisons and doesn't
> seem too ugly.

I think this idea is worth more consideration. It seems like 2^56
relfilenodes ought to be enough for anyone, recalling that you can
only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
bunch of code that is there to guard against relfilenodes being
reused. In particular, we can remove the code that leaves a 0-length
tombstone file around until the next checkpoint to guard against
relfilenode reuse. On Windows, we still need
https://commitfest.postgresql.org/36/2962/ because of the problem that
Windows won't remove files from the directory listing until they are
both unlinked and closed. But in general this seems like it would lead
to cleaner code. For example, GetNewRelFileNode() needn't loop. If it
allocates the smallest unsigned integer that the cluster (or database)
has never previously assigned, the file should definitely not exist on
disk, and if it does, an ERROR is appropriate, as the database is
corrupted. This does assume that allocations from this new 56-bit
relfilenode counter are properly WAL-logged.
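
A minimal sketch of what that loop-free allocation could look like (all
of the names, locks and limits below are hypothetical, and the WAL
logging of the counter is only indicated by a comment):

static uint64
GetNewRelFileNumberSketch(void)
{
    uint64      newval;

    LWLockAcquire(RelFileNumberGenLock, LW_EXCLUSIVE);  /* hypothetical lock */
    newval = SharedNextRelFileNumber++;                 /* hypothetical counter */
    /* WAL-log (and, as discussed later in the thread, flush) the advance here */
    LWLockRelease(RelFileNumberGenLock);

    if (newval > MAX_RELFILENUMBER)                     /* hypothetical 2^56-1 limit */
        elog(ERROR, "out of relfilenumbers");

    return newval;
}

The point is that the caller no longer needs to probe the filesystem and
retry; if a file with the returned number already exists on disk, that is
grounds for an ERROR about corruption rather than for picking another value.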

I think this would also solve a problem Dilip mentioned to me today:
suppose you make ALTER DATABASE SET TABLESPACE WAL-logged, as he's
been trying to do. Then suppose you do "ALTER DATABASE foo SET
TABLESPACE used_recently_but_not_any_more". You might get an error
complaining that "some relations of database \"%s\" are already in
tablespace \"%s\"" because there could be tombstone files in that
database. With this combination of changes, you could just use the
barrier mechanism from https://commitfest.postgresql.org/36/2962/ to
wait for those files to disappear, because they've got to be
previously-unlinked files that Windows is still returning because
they're still open -- or else they could be a sign of a corrupted
database, but there are no other possibilities.

I think, anyway.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Dilip Kumar
Date:
On Thu, Jan 6, 2022 at 3:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Aug 2, 2021 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:
> > I guess there's a somewhat hacky way to get somewhere without actually
> > increasing the size. We could take 3 bytes from the fork number and use that
> > to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
> > everyone.
> >
> > It's not like we can use those bytes in a useful way, due to alignment
> > requirements. Declaring that the high 7 bytes are for the relNode portion and
> > the low byte for the fork would still allow efficient comparisons and doesn't
> > seem too ugly.
>
> I think this idea is worth more consideration. It seems like 2^56
> relfilenodes ought to be enough for anyone, recalling that you can
> only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
> bunch of code that is there to guard against relfilenodes being
> reused. In particular, we can remove the code that leaves a 0-length
> tombstone file around until the next checkpoint to guard against
> relfilenode reuse.

+1

>
> I think this would also solve a problem Dilip mentioned to me today:
> suppose you make ALTER DATABASE SET TABLESPACE WAL-logged, as he's
> been trying to do. Then suppose you do "ALTER DATABASE foo SET
> TABLESPACE used_recently_but_not_any_more". You might get an error
> complaining that "some relations of database \"%s\" are already in
> tablespace \"%s\"" because there could be tombstone files in that
> database. With this combination of changes, you could just use the
> barrier mechanism from https://commitfest.postgresql.org/36/2962/ to
> wait for those files to disappear, because they've got to be
> previously-unlinked files that Windows is still returning because
> they're still open -- or else they could be a sign of a corrupted
> database, but there are no other possibilities.

Yes, this approach will solve the problem for the WAL-logged ALTER
DATABASE we are facing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Dilip Kumar
Date:
On Thu, Jan 6, 2022 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> >
> > I think this idea is worth more consideration. It seems like 2^56
> > relfilenodes ought to be enough for anyone, recalling that you can
> > only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
> > bunch of code that is there to guard against relfilenodes being
> > reused. In particular, we can remove the code that leaves a 0-length
> > tombstone file around until the next checkpoint to guard against
> > relfilenode reuse.
>
> +1
>

IMHO a few top-level points for implementing this idea would be as listed here:

1) The "relfilenode" member inside the RelFileNode will now be 64
bits, and the "forkNum" will be removed altogether from the BufferTag.  So
now whenever we want to use the relfilenode or fork number we can use
the respective mask and fetch their values.
2) GetNewRelFileNode() will not loop for checking the file existence
and retry with another relfilenode.
3) Modify mdunlinkfork() so that we immediately perform the unlink
request, make sure to register_forget_request() before unlink.
4) In checkpointer, now we don't need any handling for pendingUnlinks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Thomas Munro
Date:
On Thu, Jan 6, 2022 at 9:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Jan 6, 2022 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > I think this idea is worth more consideration. It seems like 2^56
> > > relfilenodes ought to be enough for anyone, recalling that you can
> > > only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
> > > bunch of code that is there to guard against relfilenodes being
> > > reused. In particular, we can remove the code that leaves a 0-length
> > > tombstone file around until the next checkpoint to guard against
> > > relfilenode reuse.
> >
> > +1

+1

> I IMHO a few top level point for implementing this idea would be as listed here,
>
> 1) The "relfilenode" member inside the RelFileNode will now be 64
> bits, and the "forkNum" will be removed altogether from the BufferTag.  So
> now whenever we want to use the relfilenode or fork number we can use
> the respective mask and fetch their values.
> 2) GetNewRelFileNode() will not loop for checking the file existence
> and retry with other relfilenode.
> 3) Modify mdunlinkfork() so that we immediately perform the unlink
> request, make sure to register_forget_request() before unlink.
> 4) In checkpointer, now we don't need any handling for pendingUnlinks.

Another problem is that relfilenodes are normally allocated with
GetNewOidWithIndex(), and initially match a relation's OID.  We'd need
a new allocator, and they won't be able to match the OID in general
(while we have 32 bit OIDs at least).



Re: Make relfile tombstone files conditional on WAL level

From
Robert Haas
Date:
On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Another problem is that relfilenodes are normally allocated with
> GetNewOidWithIndex(), and initially match a relation's OID.  We'd need
> a new allocator, and they won't be able to match the OID in general
> (while we have 32 bit OIDs at least).

Personally I'm not sad about that. Values that are the same in simple
cases but diverge in more complex cases are kind of a trap for the
unwary. There's no real reason to have them ever match. Yeah, in
theory, it makes it easier to tell which file matches which relation,
but in practice, you always have to double-check in case the table has
ever been rewritten. It doesn't seem worth continuing to contort the
code for a property we can't guarantee anyway.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Andres Freund
Date:
On 2022-01-06 08:52:01 -0500, Robert Haas wrote:
> On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Another problem is that relfilenodes are normally allocated with
> > GetNewOidWithIndex(), and initially match a relation's OID.  We'd need
> > a new allocator, and they won't be able to match the OID in general
> > (while we have 32 bit OIDs at least).
> 
> Personally I'm not sad about that. Values that are the same in simple
> cases but diverge in more complex cases are kind of a trap for the
> unwary.

+1



Re: Make relfile tombstone files conditional on WAL level

From
Dilip Kumar
Date:
On Thu, Jan 6, 2022 at 7:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Another problem is that relfilenodes are normally allocated with
> > GetNewOidWithIndex(), and initially match a relation's OID.  We'd need
> > a new allocator, and they won't be able to match the OID in general
> > (while we have 32 bit OIDs at least).
>
> Personally I'm not sad about that. Values that are the same in simple
> cases but diverge in more complex cases are kind of a trap for the
> unwary. There's no real reason to have them ever match. Yeah, in
> theory, it makes it easier to tell which file matches which relation,
> but in practice, you always have to double-check in case the table has
> ever been rewritten. It doesn't seem worth continuing to contort the
> code for a property we can't guarantee anyway.

Makes sense. I have started working on this idea, and I will try to post the first version by early next week.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Make relfile tombstone files conditional on WAL level

From
Dilip Kumar
Date:
On Wed, Jan 19, 2022 at 10:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jan 6, 2022 at 7:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> > Another problem is that relfilenodes are normally allocated with
>> > GetNewOidWithIndex(), and initially match a relation's OID.  We'd need
>> > a new allocator, and they won't be able to match the OID in general
>> > (while we have 32 bit OIDs at least).
>>
>> Personally I'm not sad about that. Values that are the same in simple
>> cases but diverge in more complex cases are kind of a trap for the
>> unwary. There's no real reason to have them ever match. Yeah, in
>> theory, it makes it easier to tell which file matches which relation,
>> but in practice, you always have to double-check in case the table has
>> ever been rewritten. It doesn't seem worth continuing to contort the
>> code for a property we can't guarantee anyway.
>
>
> Make sense, I have started working on this idea, I will try to post the first version by early next week.

Here is the first working patch; with it we no longer need to
maintain the tombstone file until the next checkpoint.  This is still
a WIP patch, but with it I can see that the problem related to
WAL-logged ALTER DATABASE SET TABLESPACE, which Robert reported a
couple of mails above in this thread, is solved.

General idea of the patch:
- Change RelFileNode.relNode to be 64 bits wide, out of which 8 bits
are for the fork number and 56 bits for the relNode, as shown below. [1]
- GetNewRelFileNode() will just generate a new unique relfilenode and
check for file existence; if the file already exists it throws an error,
so there is no loop.  We also need logic for preserving nextRelNode
across restarts and WAL-logging it, but that is similar to preserving
nextOid.
- mdunlinkfork will directly forget the relfilenode and unlink it
immediately, so we get rid of all the deferred unlinking code.
- Now we don't need any post-checkpoint unlinking activity.

[1]
/*
* RelNodeId:
*
 * This is a storage type for RelNode.  The reasoning behind using it is the
 * same as for BlockId, so refer to the comment atop BlockId.
*/
typedef struct RelNodeId
{
      uint32 rn_hi;
      uint32 rn_lo;
} RelNodeId;
typedef struct RelFileNode
{
   Oid spcNode; /* tablespace */
   Oid dbNode; /* database */
   RelNodeId relNode; /* relation */
} RelFileNode;

TODO:

There are a couple of TODOs and FIXMEs which I am planning to address
by next week.  I am also planning to test the case where the relfilenode
consumes more than 32 bits; for that we can perhaps set
FirstNormalRelfileNode to a higher value for testing purposes.  I also
need to improve the comments.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: Make relfile tombstone files conditional on WAL level

From
Dilip Kumar
Date:
On Fri, Jan 28, 2022 at 8:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jan 19, 2022 at 10:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >

>
> TODO:
>
> There are a couple of TODOs and FIXMEs which I am planning to improve
> by next week.  I am also planning to do the testing where relfilenode
> consumes more than 32 bits, maybe for that we can set the
> FirstNormalRelfileNode to higher value for the testing purpose.  And,
> Improve comments.
>

I have fixed most of the TODOs and FIXMEs, but there are still a few I
could not decide on.  The main one: currently we do not have a uint8
data type, only int8, so I have used int8 for storing relfilenode +
forknumber.  That is sufficient because I don't think we will ever get
more than 128 fork numbers.  But my question is: should we think about
adding uint8 as a new data type, or in fact make RelNode itself a new
data type like we have Oid?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: Make relfile tombstone files conditional on WAL level

From
Robert Haas
Date:
On Mon, Jan 31, 2022 at 12:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> the main one currently we do not have uint8 data
> type only int8 is there so I have used int8 for storing relfilenode +
> forknumber.

I'm confused. We use int8 in tons of places, so I feel like it must exist.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Robert Haas
Date:
On Mon, Jan 31, 2022 at 9:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 31, 2022 at 12:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > the main one currently we do not have uint8 data
> > type only int8 is there so I have used int8 for storing relfilenode +
> > forknumber.
>
> I'm confused. We use int8 in tons of places, so I feel like it must exist.

Rather, we use uint8 in tons of places, so I feel like it must exist.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Dilip Kumar
Date:
On Mon, Jan 31, 2022 at 7:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jan 31, 2022 at 9:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Mon, Jan 31, 2022 at 12:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > the main one currently we do not have uint8 data
> > > type only int8 is there so I have used int8 for storing relfilenode +
> > > forknumber.
> >
> > I'm confused. We use int8 in tons of places, so I feel like it must exist.
>
> Rather, we use uint8 in tons of places, so I feel like it must exist.

Hmm, at least pg_type doesn't have anything with a name like uint8.

postgres[101702]=# select oid, typname from pg_type where typname like '%int8';
 oid  | typname
------+---------
   20 | int8
 1016 | _int8
(2 rows)

postgres[101702]=# select oid, typname from pg_type where typname like '%uint%';
 oid | typname
-----+---------
(0 rows)

I agree that we are using an 8-byte unsigned int in multiple places in
the code as uint64.  But I don't see it as an exposed data type, and it
is not used as part of any exposed function.  However, we will have to
use the relfilenode in exposed C functions, e.g.
binary_upgrade_set_next_heap_relfilenode().

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Robert Haas
Date:
On Mon, Jan 31, 2022 at 9:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I agree that we are using 8 bytes unsigned int multiple places in code
> as uint64.  But I don't see it as an exposed data type and not used as
> part of any exposed function.  But we will have to use the relfilenode
> in the exposed c function e.g.
> binary_upgrade_set_next_heap_relfilenode().

Oh, I thought we were talking about the C data type uint8 i.e. an
8-bit unsigned integer. Which in retrospect was a dumb thought because
you said you wanted to store the relfilenode AND the fork number
there, which only makes sense if you were talking about SQL data types
rather than C data types. It is confusing that we have an SQL data
type called int8 and a C data type called int8 and they're not the
same.

But if you're talking about SQL data types, why? pg_class only stores
the relfilenode and not the fork number currently, and I don't see why
that would change. I think that the data type for the relfilenode
column would change to a 64-bit signed integer (i.e. bigint or int8)
that only ever uses the low-order 56 bits, and then when you need to
store a relfilenode and a fork number in the same 8-byte quantity
you'd do that using either a struct with bit fields or by something
like combined = ((uint64) signed_representation_of_relfilenode) |
(((uint64) forknumber) << 56);

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Dilip Kumar
Date:
On Wed, Feb 2, 2022 at 6:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jan 31, 2022 at 9:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I agree that we are using 8 bytes unsigned int multiple places in code
> > as uint64.  But I don't see it as an exposed data type and not used as
> > part of any exposed function.  But we will have to use the relfilenode
> > in the exposed c function e.g.
> > binary_upgrade_set_next_heap_relfilenode().
>
> Oh, I thought we were talking about the C data type uint8 i.e. an
> 8-bit unsigned integer. Which in retrospect was a dumb thought because
> you said you wanted to store the relfilenode AND the fork number
> there, which only make sense if you were talking about SQL data types
> rather than C data types. It is confusing that we have an SQL data
> type called int8 and a C data type called int8 and they're not the
> same.
>
> But if you're talking about SQL data types, why? pg_class only stores
> the relfilenode and not the fork number currently, and I don't see why
> that would change. I think that the data type for the relfilenode
> column would change to a 64-bit signed integer (i.e. bigint or int8)
> that only ever uses the low-order 56 bits, and then when you need to
> store a relfilenode and a fork number in the same 8-byte quantity
> you'd do that using either a struct with bit fields or by something
> like combined = ((uint64) signed_representation_of_relfilenode) |
> (((uint64) forknumber) << 56);

Yeah you're right.  I think whenever we are using combined then we can
use uint64 C type and in pg_class we can keep it as int64 because that
is only representing the relfilenode part.  I think I was just
confused and tried to use the same data type everywhere whether it is
combined with fork number or not.  Thanks for your input, I will
change this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Dilip Kumar
Date:
On Wed, Feb 2, 2022 at 7:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Feb 2, 2022 at 6:57 PM Robert Haas <robertmhaas@gmail.com> wrote:

I have split the patch into multiple patches which are independently
committable and easy to review. I have explained the purpose and scope
of each patch in the respective commit messages.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: Make relfile tombstone files conditional on WAL level

From
Robert Haas
Date:
On Mon, Feb 7, 2022 at 12:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have splitted the patch into multiple patches which can be
> independently committable and easy to review. I have explained the
> purpose and scope of each patch in the respective commit messages.

Hmm. The parts of this I've looked at seem reasonably clean, but I
don't think I like the design choice. You're inventing
RelFileNodeSetFork(), but at present the RelFileNode struct doesn't
include a fork number. I feel like we should leave that alone, and
only change the definition of a BufferTag. What about adding accessors
for all of the BufferTag fields in 0001, and then in 0002 change it to
look something like this:

typedef struct BufferTag
{
    Oid     dbOid;
    Oid     tablespaceOid;
    uint32  fileNode_low;
    uint32  fileNode_hi:24;
    uint32  forkNumber:8;
    BlockNumber blockNumber;
} BufferTag;
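
A sketch of the kind of accessors 0001 might add (names and details are
illustrative, not taken from the patches):

#define BufTagGetFileNumber(tag) \
    ((((uint64) (tag)->fileNode_hi) << 32) | (uint64) (tag)->fileNode_low)

#define BufTagGetForkNum(tag)   ((ForkNumber) (tag)->forkNumber)

#define BufTagSetFileNumber(tag, filenumber) \
( \
    (tag)->fileNode_low = (uint32) (filenumber), \
    (tag)->fileNode_hi = (uint32) ((filenumber) >> 32) \
)

With accessors like these in place first, callers don't care whether the
56-bit value is stored as one field or as the low/hi pair above, which is
what lets 0002 change the layout without touching most call sites.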

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Dilip Kumar
Date:
On Mon, Feb 7, 2022 at 9:42 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Feb 7, 2022 at 12:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have splitted the patch into multiple patches which can be
> > independently committable and easy to review. I have explained the
> > purpose and scope of each patch in the respective commit messages.
>
> Hmm. The parts of this I've looked at seem reasonably clean, but I
> don't think I like the design choice. You're inventing
> RelFileNodeSetFork(), but at present the RelFileNode struct doesn't
> include a fork number. I feel like we should leave that alone, and
> only change the definition of a BufferTag. What about adding accessors
> for all of the BufferTag fields in 0001, and then in 0002 change it to
> look something like this:
>
> typedef struct BufferTag
> {
>     Oid     dbOid;
>     Oid     tablespaceOid;
>     uint32  fileNode_low;
>     uint32  fileNode_hi:24;
>     uint32  forkNumber:8;
>     BlockNumber blockNumber;
> } BufferTag;

Okay, we can do that.  But we cannot leave RelFileNode untouched; I
mean inside RelFileNode we will also have to change the relNode into
two 32-bit integers, like below.

typedef struct RelFileNode
{
    Oid     spcNode;
    Oid     dbNode;
    uint32  relNode_low;
    uint32  relNode_hi;
} RelFileNode;

For RelFileNode also we need to use two 32-bit integers so that we do
not add extra alignment padding because there are a few more
structures that include RelFileNode e.g. xl_xact_relfilenodes,
RelFileNodeBackend, and many other structures.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Robert Haas
Date:
On Mon, Feb 7, 2022 at 11:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> For RelFileNode also we need to use 2, 32-bit integers so that we do
> not add extra alignment padding because there are a few more
> structures that include RelFileNode e.g. xl_xact_relfilenodes,
> RelFileNodeBackend, and many other structures.

Are you sure that kind of stuff is really important enough to justify
the code churn? I don't think RelFileNodeBackend is used widely enough
or in sufficiently performance-critical places that we really need to
care about a few bytes of alignment padding. xl_xact_relfilenodes is
more concerning because that goes into the WAL format, but I don't
know that we use it often enough for an extra 4 bytes per record to
really matter, especially considering that this proposal also adds 4
bytes *per relfilenode* which has to be a much bigger deal than a few
padding bytes after 'nrels'. The reason why BufferTag matters a lot is
because (1) we have an array of this struct that can easily contain a
million or eight entries, so the alignment padding adds up a lot more
and (2) access to that array is one of the most performance-critical
parts of PostgreSQL.
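
As a rough illustration of the scale (assuming the default 8 kB block
size): with shared_buffers = 8GB there are 8 GB / 8 kB = 1,048,576
buffers, so the BufferTag appears about a million times in the buffer
descriptors and the buffer mapping table, and every 4 bytes of field or
padding in the tag costs on the order of 4 MB of shared memory plus the
corresponding cache footprint on that hot lookup path.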

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Dilip Kumar
Date:
On Mon, Feb 7, 2022 at 10:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Feb 7, 2022 at 11:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > For RelFileNode also we need to use 2, 32-bit integers so that we do
> > not add extra alignment padding because there are a few more
> > structures that include RelFileNode e.g. xl_xact_relfilenodes,
> > RelFileNodeBackend, and many other structures.
>
> Are you sure that kind of stuff is really important enough to justify
> the code churn? I don't think RelFileNodeBackend is used widely enough
> or in sufficiently performance-critical places that we really need to
> care about a few bytes of alignment padding. xl_xact_relfilenodes is
> more concerning because that goes into the WAL format, but I don't
> know that we use it often enough for an extra 4 bytes per record to
> really matter, especially considering that this proposal also adds 4
> bytes *per relfilenode* which has to be a much bigger deal than a few
> padding bytes after 'nrels'. The reason why BufferTag matters a lot is
> because (1) we have an array of this struct that can easily contain a
> million or eight entries, so the alignment padding adds up a lot more
> and (2) access to that array is one of the most performance-critical
> parts of PostgreSQL.

I agree with you that adding 4 extra bytes to these structures might
not be really critical.  I will make the changes based on this idea
and see how the changes look.



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Dilip Kumar
Date:
On Thu, Jan 6, 2022 at 1:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> 2) GetNewRelFileNode() will not loop for checking the file existence
> and retry with another relfilenode.

While working on this I realized that even if we make the relfilenode
56 bits we cannot remove the loop inside GetNewRelFileNode() for
checking the file existence.  It is always possible that the file
reaches the disk even before the WAL for advancing the next
relfilenode does, and if the system crashes in between, we might
generate a duplicate relfilenode, right?

I think the second paragraph in the XLogPutNextOid() function explains
this issue, and even after we get the wider relfilenode we will still
have it.  Correct?

I am also attaching the latest set of patches for reference; these
patches address the review comments given by Robert about moving the
dbOid, tbsOid and relNode directly into the buffer tag.

Open issues: there are currently 2 open issues in the patch.  1) The
issue discussed above about removing the loop; currently in this patch
the loop is removed.  2) During upgrade from a previous version we
need to advance the nextrelfilenode to the current relfilenode we are
setting for the object, in order to avoid a conflict.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: Make relfile tombstone files conditional on WAL level

From
Dilip Kumar
Date:
On Mon, Feb 21, 2022 at 1:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jan 6, 2022 at 1:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>  2) GetNewRelFileNode() will not loop for checking the file existence
> > and retry with other relfilenode.
>
>
> Open Issues- there are currently 2 open issues in the patch 1) Issue
> as discussed above about removing the loop, so currently in this patch
> the loop is removed.  2) During upgrade from the previous version we
> need to advance the nextrelfilenode to the current relfilenode we are
> setting for the object in order to avoid the conflict.


In this version I have fixed both of these issues.  Thanks Robert for
suggesting the solutions to both of these problems in our off-list
discussion.  Basically, for the first problem we can flush the xlog
immediately: since we only write the WAL record once for every 64
relfilenodes we allocate, this should not have much impact on
performance, and I have noted the same in the comments.  And during
pg_upgrade, whenever we assign a relfilenode as part of the
upgrade we will set that relfilenode + 1 as the nextRelFileNode to be
assigned so that we never generate the conflicting relfilenode.
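
A sketch of that allocation scheme, just to make the ordering explicit
(names are illustrative; the real patch may differ):

#define RELFILENUMBER_PREFETCH 64       /* assumed batch size, per the text above */

static uint64
GetNextRelFileNumberSketch(void)
{
    uint64      result;

    LWLockAcquire(RelFileNumberGenLock, LW_EXCLUSIVE);  /* hypothetical lock */
    if (SharedRelFileNumberCount == 0)
    {
        XLogRecPtr  lsn;

        /* Log the next batch of values ... */
        lsn = LogNextRelFileNumber(SharedNextRelFileNumber + RELFILENUMBER_PREFETCH);
        /* ... and flush before any value from the batch is handed out, so
         * a crash cannot roll the counter back behind a file that has
         * already reached disk. */
        XLogFlush(lsn);
        SharedRelFileNumberCount = RELFILENUMBER_PREFETCH;
    }
    result = SharedNextRelFileNumber++;
    SharedRelFileNumberCount--;
    LWLockRelease(RelFileNumberGenLock);

    return result;
}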

The only part I do not like in the patch is that before this patch we
could directly access buftag->rnode.  Now we no longer have the
relfilenode directly as part of the buffertag; instead we keep the
individual fields (i.e. dbOid, tbsOid and relNode) in the buffer tag,
so if we need the relfilenode directly we have to construct it.
However, those changes are limited to just one or two files, so maybe
that's not too bad.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: Make relfile tombstone files conditional on WAL level

From
Robert Haas
Date:
On Mon, Feb 21, 2022 at 2:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> While working on this I realized that even if we make the relfilenode
> 56 bits we can not remove the loop inside GetNewRelFileNode() for
> checking the file existence.  Because it is always possible that the
> file reaches to the disk even before the WAL for advancing the next
> relfilenode and if the system crashes in between that then we might
> generate the duplicate relfilenode right?

I agree.

> I think the second paragraph in XLogPutNextOid() function explain this
> issue and now even after we get the wider relfilenode we will have
> this issue.  Correct?

I think you are correct.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Robert Haas
Date:
On Fri, Mar 4, 2022 at 12:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> In this version I have fixed both of these issues.  Thanks Robert for
> suggesting the solution for both of these problems in our offlist
> discussion.  Basically, for the first problem we can flush the xlog
> immediately because we are actually logging the WAL every time after
> we allocate 64 relfilenode so this should not have much impact on the
> performance and I have added the same in the comments.  And during
> pg_upgrade, whenever we are assigning the relfilenode as part of the
> upgrade we will set that relfilenode + 1 as nextRelFileNode to be
> assigned so that we never generate the conflicting relfilenode.

Anyone else have an opinion on this?

> The only part I do not like in the patch is that before this patch we
> could directly access the buftag->rnode.  But since now we are not
> having directly relfilenode as part of the buffertag and instead of
> that we are keeping individual fields (i.e. dbOid, tbsOid and relNode)
> in the buffer tag.  So if we have to directly get the relfilenode we
> need to generate it.  However those changes are very limited to just 1
> or 2 file so maybe not that bad.

You're talking here about just needing to introduce BufTagGetFileNode
and BufTagSetFileNode, or something else? I don't find those macros to
be problematic.

BufTagSetFileNode could maybe assert that the OID isn't too big,
though. We should ereport() before we get to this point if we somehow
run out of values, but it might be nice to have a check here as a
backup.
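
Something as simple as this would do (a sketch, assuming the accessor
shape discussed earlier in the thread; the limit name is made up):

#define BufTagSetFileNumber(tag, filenumber) \
( \
    AssertMacro((uint64) (filenumber) <= MAX_RELFILENUMBER), \
    (tag)->fileNode_low = (uint32) (filenumber), \
    (tag)->fileNode_hi = (uint32) ((filenumber) >> 32) \
)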

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Dilip Kumar
Date:
On Tue, Mar 8, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote:

> > The only part I do not like in the patch is that before this patch we
> > could directly access the buftag->rnode.  But since now we are not
> > having directly relfilenode as part of the buffertag and instead of
> > that we are keeping individual fields (i.e. dbOid, tbsOid and relNode)
> > in the buffer tag.  So if we have to directly get the relfilenode we
> > need to generate it.  However those changes are very limited to just 1
> > or 2 file so maybe not that bad.
>
> You're talking here about just needing to introduce BufTagGetFileNode
> and BufTagSetFileNode, or something else? I don't find those macros to
> be problematic.

Yeah, I was talking about the BufTagGetFileNode macro only.  The reason
I did not like it is that earlier we could directly use buftag->rnode,
but now whenever we want the rnode we first need a separate variable to
prepare it using the BufTagGetFileNode macro.  But these changes are
very localized and in very few places, so I don't have much of a
problem with them.

>
> BufTagSetFileNode could maybe assert that the OID isn't too big,
> though. We should ereport() before we get to this point if we somehow
> run out of values, but it might be nice to have a check here as a
> backup.

Yeah, we could do that, I will do that in the next version.  Thanks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Robert Haas
Date:
On Fri, Mar 4, 2022 at 12:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> In this version I have fixed both of these issues.

Here's a bit of review for these patches:

- The whole relnode vs. relfilenode thing is really confusing. I
realize that there is some precedent for calling the number that
pertains to the file on disk "relnode" and that value when combined
with the database and tablespace OIDs "relfilenode," but it's
definitely not the most obvious thing, especially since
pg_class.relfilenode is a prominent case where we don't even adhere to
that convention. I'm kind of tempted to think that we should go the
other way and rename the RelFileNode struct to something like
RelFileLocator, and then maybe call the new data type RelFileNumber.
And then we could work toward removing references to "filenode" and
"relfilenode" in favor of either (rel)filelocator or (rel)filenumber.
Now the question (even assuming other people like this general
direction) is how far do we go with it? Renaming pg_class.relfilenode
itself wouldn't be the worst compatibility break we've ever had, but
it would definitely cause some pain. I'd be inclined to leave the
user-visible catalog column alone and just push in this direction for
internal stuff.

- What you're doing to pg_buffercache here is completely unacceptable.
You can't change the definition of an already-released version of the
extension. Please study how such issues have been handled in the past.

- It looks to me like you need to give significantly more thought to
the proper way of adjusting the relfilenode-related test cases in
alter_table.out.

- I think BufTagGetFileNode and BufTagGetSetFileNode should be
introduced in 0001 and then just update the definition in 0002 as
required. Note that as things stand you end up with both
BufTagGetFileNode and BuffTagGetRelFileNode which is an artifact of
the relnode/filenode/relfilenode confusion I mention above, and just
to make matters worse, one returns a value while the other produces an
out parameter. I think the renaming I'm talking about up above might
help somewhat here, but it seems like it might also be good to change
the one that uses an out parameter by doing Get -> Copy, just to help
the reader get a clue a little more easily.

- GetNewRelNode() needs to error out if we would wrap around, not wrap
around. Probably similar to what happens if we exhaust 2^64 bytes of
WAL.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Amul Sul
Date:
Hi Dilip,

On Fri, Mar 4, 2022 at 11:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Feb 21, 2022 at 1:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Jan 6, 2022 at 1:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >  2) GetNewRelFileNode() will not loop for checking the file existence
> > > and retry with other relfilenode.
> >
> >
> > Open Issues- there are currently 2 open issues in the patch 1) Issue
> > as discussed above about removing the loop, so currently in this patch
> > the loop is removed.  2) During upgrade from the previous version we
> > need to advance the nextrelfilenode to the current relfilenode we are
> > setting for the object in order to avoid the conflict.
>
>
> In this version I have fixed both of these issues.  Thanks Robert for
> suggesting the solution for both of these problems in our offlist
> discussion.  Basically, for the first problem we can flush the xlog
> immediately because we are actually logging the WAL every time after
> we allocate 64 relfilenode so this should not have much impact on the
> performance and I have added the same in the comments.  And during
> pg_upgrade, whenever we are assigning the relfilenode as part of the
> upgrade we will set that relfilenode + 1 as nextRelFileNode to be
> assigned so that we never generate the conflicting relfilenode.
>
> The only part I do not like in the patch is that before this patch we
> could directly access the buftag->rnode.  But since now we are not
> having directly relfilenode as part of the buffertag and instead of
> that we are keeping individual fields (i.e. dbOid, tbsOid and relNode)
> in the buffer tag.  So if we have to directly get the relfilenode we
> need to generate it.  However those changes are very limited to just 1
> or 2 file so maybe not that bad.
>

The v5 patch needs a rebase.  Here are a few comments for 0002 that I
found while reading it; hope that helps:

+/* Number of RelFileNode to prefetch (preallocate) per XLOG write */
+#define VAR_RFN_PREFETCH       8192
+

Should it be 64, as per comment in XLogPutNextRelFileNode for XLogFlush() ?
---

+   /*
+    * Check for the wraparound for the relnode counter.
+    *
+    * XXX Actually the relnode is 56 bits wide so we don't need to worry about
+    * the wraparound case.
+    */
+   if (ShmemVariableCache->nextRelNode > MAX_RELFILENODE)

Very rare case, should use unlikely()?
---

+/*
+ * Max value of the relfilnode.  Relfilenode will be of 56bits wide for more
+ * details refer comments atop BufferTag.
+ */
+#define MAX_RELFILENODE        ((((uint64) 1) << 56) - 1)

Should there be 57-bit shifts here? Instead, I think we should use
INT64CONST(0xFFFFFFFFFFFFFF) to be consistent with PG_*_MAX
declarations, thoughts?
---

+   /* If we run out of logged for use RelNode then we must log more */
+   if (ShmemVariableCache->relnodecount == 0)

relnodecount might never go below zero, but just to be safer it should
check <= 0 instead.
---

Few typos:
Simmialr
Simmilar
agains
idealy

Regards,
Amul



Re: Make relfile tombstone files conditional on WAL level

From
Dilip Kumar
Date:
On Thu, May 12, 2022 at 4:27 PM Amul Sul <sulamul@gmail.com> wrote:
>
Hi Amul,

Thanks for the review, actually based on some comments from Robert we
have planned to make some design changes.  So I am planning to work on
that for the July commitfest.  I will try to incorporate all your
review comments in the new version.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

From
Thomas Munro
Date:
I think you can get rid of SYNC_UNLINK_REQUEST, sync_unlinkfiletag,
mdunlinkfiletag as these are all now unused.
Are there any special hazards here if the plan in [1] goes ahead?  If
the relfilenode allocation is logged and replayed then it should be
fine to crash and recover multiple times in a row while creating and
dropping tables, with wal_level=minimal, I think.  It would be bad if
the allocator restarted from a value from the checkpoint, though.

[1]
https://www.postgresql.org/message-id/flat/CA%2BTgmoYmw%3D%3DTOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA%40mail.gmail.com



Re: Make relfile tombstone files conditional on WAL level

From
Dilip Kumar
Date:
On Mon, May 16, 2022 at 3:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> I think you can get rid of SYNC_UNLINK_REQUEST, sync_unlinkfiletag,
> mdunlinkfiletag as these are all now unused.

Correct.

> Are there any special hazards here if the plan in [1] goes ahead?

IMHO we should not have any problem.  In fact, we need this for [1]
right?  Otherwise, there is a risk of reusing the same relfilenode
within the same checkpoint cycle as discussed in [2].

> [1]
https://www.postgresql.org/message-id/flat/CA%2BTgmoYmw%3D%3DTOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA%40mail.gmail.com

[2] https://www.postgresql.org/message-id/CA+TgmoZZDL_2E_zuahqpJ-WmkuxmUi8+g7=dLEny=18r-+c-iQ@mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Make relfile tombstone files conditional on WAL level

От
Dilip Kumar
Дата:
On Tue, Mar 8, 2022 at 10:11 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Mar 4, 2022 at 12:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > In this version I have fixed both of these issues.
>
> Here's a bit of review for these patches:
>
> - The whole relnode vs. relfilenode thing is really confusing. I
> realize that there is some precedent for calling the number that
> pertains to the file on disk "relnode" and that value when combined
> with the database and tablespace OIDs "relfilenode," but it's
> definitely not the most obvious thing, especially since
> pg_class.relfilenode is a prominent case where we don't even adhere to
> that convention. I'm kind of tempted to think that we should go the
> other way and rename the RelFileNode struct to something like
> RelFileLocator, and then maybe call the new data type RelFileNumber.
> And then we could work toward removing references to "filenode" and
> "relfilenode" in favor of either (rel)filelocator or (rel)filenumber.
> Now the question (even assuming other people like this general
> direction) is how far do we go with it? Renaming pg_class.relfilenode
> itself wouldn't be the worst compatibility break we've ever had, but
> it would definitely cause some pain. I'd be inclined to leave the
> user-visible catalog column alone and just push in this direction for
> internal stuff.

I have worked on this renaming first; once we agree on that, I will
rebase the other patches on top of it and will also work on the other
review comments for those patches.
So basically, in this patch:
- The "RelFileNode" structure is renamed to "RelFileLocator", and the
internal members are renamed as below
typedef struct RelFileLocator
{
      Oid spcOid; /* tablespace */
      Oid dbOid; /* database */
      Oid relNumber; /* relation */
} RelFileLocator;
- All variables and internal functions that use the name
relfilenode/rnode and refer to this structure are renamed to
relfilelocator/rlocator.
- relNode/relfilenode, which refer to the actual file name on disk,
are renamed to relNumber/relfilenumber.
- Based on the new terminology, I have renamed the file names as well, e.g.
relfilenode.h -> relfilelocator.h
relfilenodemap.h -> relfilenumbermap.h

I haven't renamed the exposed catalog column and exposed functions;
here is the high-level list:
- pg_class.relfilenode
- pg_catalog.pg_relation_filenode()
- All test case variables referring to pg_class.relfilenode.
- exposed tool options that refer to pg_class.relfilenode (e.g.
-f, --filenode=FILENODE)
- exposed functions
pg_catalog.binary_upgrade_set_next_heap_relfilenode() and friends
- the pg_filenode.map file name; maybe we can rename this, but it is used
by other tools so I left it alone.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

making relfilenodes 56 bits

От
Robert Haas
Дата:
[ changing subject line so nobody misses what's under discussion ]

For a quick summary of the overall idea being discussed here and some
discussion of the problems it solves, see
http://postgr.es/m/CA+TgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg@mail.gmail.com

For discussion of the proposed renaming of non-user-visible references
to relfilenode to either RelFileLocator or RelFileNumber as
preparatory refactoring work for that change, see
http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com

On Thu, Jun 23, 2022 at 3:55 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have worked on this renaming stuff first and once we agree with that
> then I will rebase the other patches on top of this and will also work
> on the other review comments for those patches.
> So basically in this patch
> - The "RelFileNode" structure to "RelFileLocator" and also renamed
> other internal member as below
> typedef struct RelFileLocator
> {
>       Oid spcOid; /* tablespace */
>       Oid dbOid; /* database */
>       Oid relNumber; /* relation */
> } RelFileLocator;

I like those structure member names fine, but I'd like to see this
preliminary patch also introduce the RelFileNumber typedef as an alias
for Oid. Then the main patch can change it to be uint64.
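To sketch what I mean (the exact names here are just for illustration, not
a requirement), the preliminary patch would carry something like

    /* preliminary patch: purely a rename, same width as today */
    typedef Oid RelFileNumber;

    #define InvalidRelFileNumber    ((RelFileNumber) InvalidOid)

and then the main patch only has to flip the underlying type, e.g.

    /* main patch: widen the number used for file names on disk */
    typedef uint64 RelFileNumber;

    #define InvalidRelFileNumber    ((RelFileNumber) 0)

without touching most of the call sites a second time.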

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Fri, Jun 24, 2022 at 1:36 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> [ changing subject line so nobody misses what's under discussion ]
>
> For a quick summary of the overall idea being discussed here and some
> discussion of the problems it solves, see
> http://postgr.es/m/CA+TgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg@mail.gmail.com
>
> For discussion of the proposed renaming of non-user-visible references
> to relfilenode to either RelFileLocator or RelFileNumber as
> preparatory refactoring work for that change, see
> http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
>
> On Thu, Jun 23, 2022 at 3:55 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have worked on this renaming stuff first and once we agree with that
> > then I will rebase the other patches on top of this and will also work
> > on the other review comments for those patches.
> > So basically in this patch
> > - The "RelFileNode" structure to "RelFileLocator" and also renamed
> > other internal member as below
> > typedef struct RelFileLocator
> > {
> >       Oid spcOid; /* tablespace */
> >       Oid dbOid; /* database */
> >       Oid relNumber; /* relation */
> > } RelFileLocator;
>
> I like those structure member names fine, but I'd like to see this
> preliminary patch also introduce the RelFileNumber typedef as an alias
> for Oid. Then the main patch can change it to be uint64.

I have changed that. PFA, the updated patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Fri, Jun 24, 2022 at 7:08 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have changed that. PFA, the updated patch.

Apart from one minor nitpick (see below) I don't see a problem with
this in isolation. It seems like a pretty clean renaming. So I think
we need to move onto the question of how clean the rest of the patch
series looks with this as a base.

A preliminary refactoring that was discussed in the past and was
originally in 0001 was to move the fields included in BufferTag via
RelFileNode/Locator directly into the struct. I think maybe it doesn't
make sense to include that in 0001 as you have it here, but maybe that
could be 0002 with the main patch to follow as 0003, or something like
that. I wonder if we can get by with redefining BufferTag like this
in 0002:

typedef struct buftag
{
    Oid     spcOid;
    Oid     dbOid;
    RelFileNumber   fileNumber;
    ForkNumber  forkNum;
} BufferTag;

And then like this in 0003:

typedef struct buftag
{
    Oid     spcOid;
    Oid     dbOid;
    RelFileNumber   fileNumber:56;
    ForkNumber  forkNum:8;
} BufferTag;

- * from catalog OIDs to filenode numbers.  Each database has a map file for
+ * from catalog OIDs to filenumber.  Each database has a map file for

should be filenumbers

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Andres Freund
Дата:
Hi,

On 2022-06-24 10:59:25 -0400, Robert Haas wrote:
> A preliminary refactoring that was discussed in the past and was
> originally in 0001 was to move the fields included in BufferTag via
> RelFileNode/Locator directly into the struct. I think maybe it doesn't
> make sense to include that in 0001 as you have it here, but maybe that
> could be 0002 with the main patch to follow as 0003, or something like
> that. I wonder if we can get by with redefining RelFileNode like this
> in 0002:
> 
> typedef struct buftag
> {
>     Oid     spcOid;
>     Oid     dbOid;
>     RelFileNumber   fileNumber;
>     ForkNumber  forkNum;
> } BufferTag;

If we "inline" RelFileNumber, it's probably worth reorder the members so that
the most distinguishing elements come first, to make it quicker to detect hash
collisions. It shows up in profiles today...

I guess it should be blockNum, fileNumber, forkNumber, dbOid, spcOid? I think
as long as blockNum, fileNumber are first, the rest doesn't matter much.


> And then like this in 0003:
> 
> typedef struct buftag
> {
>     Oid     spcOid;
>     Oid     dbOid;
>     RelFileNumber   fileNumber:56;
>     ForkNumber  forkNum:8;
> } BufferTag;

Probably worth checking the generated code / the performance effects of using
bitfields (vs manual maskery). I've seen some awful cases, but here it's at a
byte boundary, so it might be ok.
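For comparison, the manual-masking variant would look roughly like this (a
sketch only; the member and macro names are invented, and note that a bare
uint64 member forces 8-byte alignment, so the real patch may need to split it
into two uint32 fields to keep the tag at 20 bytes):

    typedef struct buftag
    {
        BlockNumber blockNum;       /* most distinguishing members first */
        uint64      relForkNum;     /* relfilenumber in low 56 bits, fork in high 8 */
        Oid         dbOid;
        Oid         spcOid;
    } BufferTag;

    #define BUFTAG_RELNUMBER_MASK   UINT64CONST(0x00FFFFFFFFFFFFFF)

    #define BufTagGetRelNumber(tag) ((tag)->relForkNum & BUFTAG_RELNUMBER_MASK)
    #define BufTagGetForkNum(tag)   ((ForkNumber) ((tag)->relForkNum >> 56))
    #define BufTagSetRelForkNum(tag, relnum, forknum) \
        ((tag)->relForkNum = (relnum) | ((uint64) (forknum) << 56))

Either way the comparison/hashing path only ever sees plain integers.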

Greetings,

Andres Freund



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Fri, Jun 24, 2022 at 9:30 PM Andres Freund <andres@anarazel.de> wrote:
> If we "inline" RelFileNumber, it's probably worth reorder the members so that
> the most distinguishing elements come first, to make it quicker to detect hash
> collisions. It shows up in profiles today...
>
> I guess it should be blockNum, fileNumber, forkNumber, dbOid, spcOid? I think
> as long as blockNum, fileNumber are first, the rest doesn't matter much.

Hmm, I guess we could do that. Possibly as a separate, very small patch.

> > And then like this in 0003:
> >
> > typedef struct buftag
> > {
> >     Oid     spcOid;
> >     Oid     dbOid;
> >     RelFileNumber   fileNumber:56;
> >     ForkNumber  forkNum:8;
> > } BufferTag;
>
> Probably worth checking the generated code / the performance effects of using
> bitfields (vs manual maskery). I've seen some awful cases, but here it's at a
> byte boundary, so it might be ok.

One advantage of using bitfields is that it might mean we don't need
to introduce accessor macros. Now, if that's going to lead to terrible
performance I guess we should go ahead and add the accessor macros -
Dilip had those in an earlier patch anyway. But it'd be nice if it
weren't necessary.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Simon Riggs
Дата:
On Sat, 25 Jun 2022 at 02:30, Andres Freund <andres@anarazel.de> wrote:

> > And then like this in 0003:
> >
> > typedef struct buftag
> > {
> >     Oid     spcOid;
> >     Oid     dbOid;
> >     RelFileNumber   fileNumber:56;
> >     ForkNumber  forkNum:8;
> > } BufferTag;
>
> Probably worth checking the generated code / the performance effects of using
> bitfields (vs manual maskery). I've seen some awful cases, but here it's at a
> byte boundary, so it might be ok.

Another approach would be to condense spcOid and dbOid into a single
4-byte Oid-like number, since in most cases they are associated with
each other, and not often many of them anyway. So this new number
would indicate both the database and the tablespace. I know that we
want to be able to make file changes without doing catalog lookups,
but since the number of combinations is usually 1, but even then, low,
it can be cached easily in a smgr array and included in the checkpoint
record (or nearby) for ease of use.

typedef struct buftag
{
     Oid        db_spcOid;
     uint32     forkNum;        /* ForkNumber, with spare bits */
     uint64     relNumber;      /* RelFileNumber, full 64 bits */
} BufferTag;

That way we could just have a simple 64-bit RelFileNumber, without
restriction, and probably some spare bytes on the ForkNumber, if we
needed them later.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Tue, Jun 28, 2022 at 7:45 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> Another approach would be to condense spcOid and dbOid into a single
> 4-byte Oid-like number, since in most cases they are associated with
> each other, and not often many of them anyway. So this new number
> would indicate both the database and the tablespace. I know that we
> want to be able to make file changes without doing catalog lookups,
> but since the number of combinations is usually 1, but even then, low,
> it can be cached easily in a smgr array and included in the checkpoint
> record (or nearby) for ease of use.
>
> typedef struct buftag
> {
>      Oid     db_spcOid;
>      ForkNumber  uint32;
>      RelFileNumber   uint64;
> } BufferTag;

I've thought about this before too, because it does seem like the DB
OID and tablespace OID are a poor use of bit space. You might not even
need to keep the db_spcOid value in any persistent place, because it
could just be an alias for buffer mapping lookups that might change on
every restart. That does have the problem that you now need a
secondary hash table - in theory of unbounded size - to store mappings
from <dboid,tsoid> to db_spcOid, and that seems complicated and hard
to get right. It might be possible, though. Alternatively, you could
imagine a durable mapping that also affects the on-disk structure, but
I don't quite see how to make that work: for example, pg_basebackup
wants to produce a tar file for each tablespace directory, and if the
pathnames no longer contain the tablespace OID but only the db_spcOid,
then that doesn't work any more.

But the primary problem we're trying to solve here is that right now
we sometimes reuse the same filename for a whole new file, and that
results in bugs that only manifest themselves in obscure
circumstances, e.g. see 4eb2176318d0561846c1f9fb3c68bede799d640f.
There are residual failure modes even now related to the "tombstone"
files that are created when you drop a relation: remove everything but
the first file from the main fork but then keep that file (only)
around until after the next checkpoint. OID wraparound is another
annoyance that has influenced the design of quite a bit of code over
the years and where we probably still have bugs. If we don't reuse
relfilenodes, we can avoid a lot of that pain. Combining the DB OID
and TS OID fields doesn't solve that problem.

> That way we could just have a simple 64-bit RelFileNumber, without
> restriction, and probably some spare bytes on the ForkNumber, if we
> needed them later.

In my personal opinion, the ForkNumber system is an ugly wart which
has nothing to recommend it except that the VM and FSM forks are
awesome. But if we could have those things without needing forks, I
think that would be way better. Forks add code complexity in tons of
places, and it's barely possible to scale it to the 4 forks we have
already, let alone any larger number. Furthermore, there are really
negative performance effects from creating 3 files per small relation
rather than 1, and we sure can't afford to have that number get any
bigger. I'd rather kill the ForkNumber system with fire than expand it
further, but even if we do expand it, we're not close to being able to
cope with more than 256 forks per relation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Tue, Jun 28, 2022 at 11:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
> But the primary problem we're trying to solve here is that right now
> we sometimes reuse the same filename for a whole new file, and that
> results in bugs that only manifest themselves in obscure
> circumstances, e.g. see 4eb2176318d0561846c1f9fb3c68bede799d640f.
> There are residual failure modes even now related to the "tombstone"
> files that are created when you drop a relation: remove everything but
> the first file from the main fork but then keep that file (only)
> around until after the next checkpoint. OID wraparound is another
> annoyance that has influenced the design of quite a bit of code over
> the years and where we probably still have bugs. If we don't reuse
> relfilenodes, we can avoid a lot of that pain. Combining the DB OID
> and TS OID fields doesn't solve that problem.

Oh wait, I'm being stupid. You were going to combine those fields but
then also widen the relfilenode, so that would solve this problem
after all. Oops, I'm dumb.

I still think this is a lot more complicated though, to the point
where I'm not sure we can really make it work at all.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Matthias van de Meent
Дата:
On Tue, 28 Jun 2022 at 13:45, Simon Riggs <simon.riggs@enterprisedb.com> wrote:
> but since the number of combinations is usually 1, but even then, low,
> it can be cached easily in a smgr array and included in the checkpoint
> record (or nearby) for ease of use.

I was reading the thread to keep up with storage-related prototypes
and patches, and this specifically doesn't sound quite right to me. I
do not know what values you considered to be 'low' or what 'can be
cached easily', so here's some field data:

I have seen PostgreSQL clusters that utilized the relative isolation
of separate databases within the same cluster (instance / postmaster)
to provide higher guarantees of data access isolation while still
being able to share a resource pool, which resulted in several
clusters containing upwards of 100 databases.

I will be the first to admit that it is quite unlikely to be common
practice, but this workload increases the number of dbOid+spcOid
combinations to 100s (even while using only a single tablespace),
which in my opinion requires some more thought than just handwaving it
into an smgr array and/or checkpoint records.


Kind regards,

Matthias van de Meent



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Fri, Jun 24, 2022 at 8:29 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jun 24, 2022 at 7:08 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have changed that. PFA, the updated patch.
>
> Apart from one minor nitpick (see below) I don't see a problem with
> this in isolation. It seems like a pretty clean renaming. So I think
> we need to move onto the question of how clean the rest of the patch
> series looks with this as a base.
>

PFA, the remaining set of patches.   It might need to fix some
indentation but lets first see how is the overall idea then we can
work on it.  I have fixed all the open review comment from the
previous thread except this comment from Robert.

>- It looks to me like you need to give significantly more thought to
> the proper way of adjusting the relfilenode-related test cases in
> alter_table.out.

It seems to me that this test case is just testing whether the
table/child tables are rewritten or not after the ALTER TABLE.  For
that it was comparing the oid with the relfilenode; now that is not
possible, so I think it's quite reasonable to just compare the current
relfilenode with the old relfilenode, and if they are the same the
table has not been rewritten.  So I am not sure why the original test
case had the two cases 'own' and 'orig'.  With respect to this test
case they both have the same meaning; in fact, comparing the old
relfilenode with the current relfilenode is a better way of testing
than comparing the oid with the relfilenode.

diff --git a/src/test/regress/expected/alter_table.out
b/src/test/regress/expected/alter_table.out
index 5ede56d..80af97e 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -2164,7 +2164,6 @@ select relname,
   c.oid = oldoid as orig_oid,
   case relfilenode
     when 0 then 'none'
-    when c.oid then 'own'
     when oldfilenode then 'orig'
     else 'OTHER'
     end as storage,
@@ -2175,10 +2174,10 @@ select relname,
            relname            | orig_oid | storage |     desc
 ------------------------------+----------+---------+---------------
  at_partitioned               | t        | none    |
- at_partitioned_0             | t        | own     |
- at_partitioned_0_id_name_key | t        | own     | child 0 index
- at_partitioned_1             | t        | own     |
- at_partitioned_1_id_name_key | t        | own     | child 1 index
+ at_partitioned_0             | t        | orig    |
+ at_partitioned_0_id_name_key | t        | orig    | child 0 index
+ at_partitioned_1             | t        | orig    |
+ at_partitioned_1_id_name_key | t        | orig    | child 1 index
  at_partitioned_id_name_key   | t        | none    | parent index
 (6 rows)



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От
Simon Riggs
Дата:
On Tue, 28 Jun 2022 at 19:18, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

> I will be the first to admit that it is quite unlikely to be common
> practise, but this workload increases the number of dbOid+spcOid
> combinations to 100s (even while using only a single tablespace),

Which should still fit nicely in 32bits then. Why does that present a
problem to this idea?

The reason to mention this now is that it would give more space than
56bit limit being suggested here. I am not opposed to the current
patch, just finding ways to remove some objections mentioned by
others, if those became blockers.

> which in my opinion requires some more thought than just handwaving it
> into an smgr array and/or checkpoint records.

The idea is that we would store the mapping as an array, with the
value in the RelFileNode as the offset in the array. The array would
be mostly static, so would cache nicely.

For convenience, I imagine that the mapping could be included in WAL
in or near the checkpoint record, to ensure that the mapping was
available in all backups.
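A minimal sketch of the shape I have in mind (names invented here, sizing and
locking hand-waved; the point is just that lookup is a scan of a tiny,
mostly-static shared array):

    typedef struct DbSpcMapEntry
    {
        Oid     dbOid;
        Oid     spcOid;
    } DbSpcMapEntry;

    /* in shared memory; the combined id used in the buffer tag is the array index */
    typedef struct DbSpcMap
    {
        uint32          nentries;
        DbSpcMapEntry   entries[FLEXIBLE_ARRAY_MEMBER];
    } DbSpcMap;

    static uint32
    GetDbSpcId(DbSpcMap *map, Oid dbOid, Oid spcOid)
    {
        for (uint32 i = 0; i < map->nentries; i++)
        {
            if (map->entries[i].dbOid == dbOid &&
                map->entries[i].spcOid == spcOid)
                return i;
        }
        /* not seen before: append (caller holds the lock, WAL-logging elided) */
        map->entries[map->nentries].dbOid = dbOid;
        map->entries[map->nentries].spcOid = spcOid;
        return map->nentries++;
    }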

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: making relfilenodes 56 bits

От
Matthias van de Meent
Дата:
On Wed, 29 Jun 2022 at 14:41, Simon Riggs <simon.riggs@enterprisedb.com> wrote:
>
> On Tue, 28 Jun 2022 at 19:18, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>
> > I will be the first to admit that it is quite unlikely to be common
> > practise, but this workload increases the number of dbOid+spcOid
> > combinations to 100s (even while using only a single tablespace),
>
> Which should still fit nicely in 32bits then. Why does that present a
> problem to this idea?

It doesn't, or at least not the bitspace part. I think it is indeed
quite unlikely anyone will try to build as many tablespaces as the 100
million tables project, which utilized 1000 tablespaces to get around
file system limitations [0].

The potential problem is 'where to store such mapping efficiently'.
Especially considering that this mapping might (and likely: will)
change across restarts and when database churn (create + drop
database) happens in e.g. testing workloads.

> The reason to mention this now is that it would give more space than
> 56bit limit being suggested here. I am not opposed to the current
> patch, just finding ways to remove some objections mentioned by
> others, if those became blockers.
>
> > which in my opinion requires some more thought than just handwaving it
> > into an smgr array and/or checkpoint records.
>
> The idea is that we would store the mapping as an array, with the
> value in the RelFileNode as the offset in the array. The array would
> be mostly static, so would cache nicely.

That part is not quite clear to me. Any cluster may have anywhere
between 3 and hundreds or thousands of entries in that mapping. Do you
suggest to dynamically grow that (presumably shared, considering the
addressing is shared) array, or have a runtime parameter limiting the
amount of those entries (similar to max_connections)?

> For convenience, I imagine that the mapping could be included in WAL
> in or near the checkpoint record, to ensure that the mapping was
> available in all backups.

Why would we need this mapping in backups, considering that it seems
to be transient state that is lost on restart? Won't we still use full
dbOid and spcOid in anything we communicate or store on disk (file
names, WAL, pg_class rows, etc.), or did I misunderstand your
proposal?

Kind regards,

Matthias van de Meent


[0] https://www.pgcon.org/2013/schedule/attachments/283_Billion_Tables_Project-PgCon2013.pdf



Re: making relfilenodes 56 bits

От
Thomas Munro
Дата:
On Thu, Jun 30, 2022 at 12:41 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> The reason to mention this now is that it would give more space than
> 56bit limit being suggested here.

Isn't 2^56 enough, though?  Remembering that a cluster's time runs out
when we've generated 2^64 bytes of WAL, if you want to run out of
56-bit relfile numbers before the end of time you'll need to find a way
to allocate them in less than 2^64 / 2^56 = 2^8 = 256 bytes of WAL
apiece.  That's technically
possible, since SMgr CREATE records are only 42 bytes long, so you
could craft some C code to do nothing but create (and leak)
relfilenodes, but real usage is always accompanied by catalogue
insertions to connect the new relfilenode to a database object,
without which they are utterly useless.  So in real life, it takes
many hundreds or typically thousands of bytes, much more than 256.



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Tue, Jun 28, 2022 at 5:15 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
>
> On Sat, 25 Jun 2022 at 02:30, Andres Freund <andres@anarazel.de> wrote:
>
> > > And then like this in 0003:
> > >
> > > typedef struct buftag
> > > {
> > >     Oid     spcOid;
> > >     Oid     dbOid;
> > >     RelFileNumber   fileNumber:56;
> > >     ForkNumber  forkNum:8;
> > > } BufferTag;
> >
> > Probably worth checking the generated code / the performance effects of using
> > bitfields (vs manual maskery). I've seen some awful cases, but here it's at a
> > byte boundary, so it might be ok.
>
> Another approach would be to condense spcOid and dbOid into a single
> 4-byte Oid-like number, since in most cases they are associated with
> each other, and not often many of them anyway. So this new number
> would indicate both the database and the tablespace. I know that we
> want to be able to make file changes without doing catalog lookups,
> but since the number of combinations is usually 1, but even then, low,
> it can be cached easily in a smgr array and included in the checkpoint
> record (or nearby) for ease of use.
>
> typedef struct buftag
> {
>      Oid     db_spcOid;
>      ForkNumber  uint32;
>      RelFileNumber   uint64;
> } BufferTag;
>
> That way we could just have a simple 64-bit RelFileNumber, without
> restriction, and probably some spare bytes on the ForkNumber, if we
> needed them later.

Yeah, this is possible, but I am not seeing a clear advantage.  Of
course we can widen the RelFileNumber to 64 bits instead of 56, but with
the added complexity of storing the mapping.  I am not sure it is
really worth it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Simon Riggs
Дата:
On Thu, 30 Jun 2022 at 03:43, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Jun 30, 2022 at 12:41 AM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
> > The reason to mention this now is that it would give more space than
> > 56bit limit being suggested here.
>
> Isn't 2^56 enough, though?

For me, yes.

To the above comment, I followed with:

> I am not opposed to the current
> patch, just finding ways to remove some objections mentioned by
> others, if those became blockers.

So it seems we can continue with the patch.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Wed, Jun 29, 2022 at 5:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >- It looks to me like you need to give significantly more thought to
> > the proper way of adjusting the relfilenode-related test cases in
> > alter_table.out.
>
> It seems to me that this test case is just testing whether the
> table/child table are rewritten or not after the alter table.  And for
> that it is comparing the oid with the relfilenode, now that is not
> possible so I think it's quite reasonable to just compare the current
> relfilenode with the old relfilenode and if they are same the table is
> not rewritten.  So I am not sure why the original test case had two
> cases 'own' and 'orig'.  With respect to this test case they both have
> the same meaning, in fact comparing old relfilenode with current
> relfilenode is better way of testing than comparing the oid with
> relfilenode.

I think you're right. However, I don't really like OTHER showing up in
the output, because that looks like a string that was chosen to be
slightly alarming, especially given that it's in ALL CAPS. How about
if we change 'ORIG' to 'new'?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Wed, Jun 29, 2022 at 5:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
PFA, the remaining set of patches.  It might need some indentation
fixes, but let's first see how the overall idea looks and then we can
work on it.  I have fixed all the open review comments from the
previous thread except this comment from Robert.
So just playing around with this patch set, and also looking at the
code a bit, here are a few random observations:

- The patch assigns relfilenumbers starting with 1. I don't see any
specific problem with that, but I wonder if it would be a good idea to
start with a random larger value just in case we ever need some fixed
values for some purpose or other. Maybe we should start with 100000 or
something?

- If I use ALTER TABLE .. SET TABLESPACE to move a table around, then
the relfilenode changes each time, but if I use ALTER DATABASE .. SET
TABLESPACE to move a database around, the relfilenodes don't change.
So, what this guarantees is that if the same filename is used twice,
it will be for the same relation and not some unrelated relation.
That's enough to avoid the hazard described in the comments for
mdunlink(), because that scenario intrinsically involves confusion
caused by two relations using the same filename after an OID
wraparound. And it also means that if we pursue the idea of using an
end-of-recovery record in all cases, we don't need to start creating
tombstones during crash recovery. The forced checkpoint at the end of
crash recovery means we don't currently need to do that, but if we
change that, then the same hazard would exist there as we already have
in normal running, and this fixes it. However, I don't find it
entirely obvious that there are no hazards of any kind stemming from
repeated use of ALTER DATABASE .. SET TABLESPACE resulting in
filenames getting reused. On the other hand avoiding filename reuse
completely would be more work, not closely related to what the rest of
the patch set does, probably somewhat controversial in terms of what
it would have to do, and I'm not sure that we really need it. It does
seem like it would be quite a bit easier to reason about, though,
because the current guarantee is suspiciously similar to "we don't do
X, except when we do." This is not really so much a review comment for
Dilip as a request for input from others ... thoughts?

- Again, not a review comment for this patch specifically, but I'm
wondering if we could use this as infrastructure for a tool to clean
orphaned files out of the data directory. Suppose we create a file for
a new relation and then crash, leaving a potentially large file on
disk that will never be removed. Well, if the relfilenumber as it
exists on disk is not in pg_class and old enough that a transaction
inserting into pg_class can't still be running, then it must be safe
to remove that file. Maybe that's safe even today, but it's a little
hard to reason about it in the face of a possible OID wraparound that
might result in reusing the same numbers over again. It feels like
this makes it easier to identify which files are old stuff that can
never again be touched.

- I might be missing something here, but this isn't actually making
the relfilenode 56 bits, is it? The reason to do that is to make the
BufferTag smaller, so I expected to see that BufferTag either used
bitfields like RelFileNumber relNumber:56 and ForkNumber forkNum:8, or
else that it just declared a single field for both as uint64 and used
accessor macros or static inlines to separate them out. But it doesn't
seem to do either of those things, which seems like it can't be right.
On a related note, I think it would be better to declare RelFileNumber
as an unsigned type even though we have no use for the high bit; we
have, equally, no use for negative values. It's easier to reason about
bit-shifting operations with unsigned types.

- I also think that the cross-version compatibility stuff in
pg_buffercache isn't quite right. It does values[1] =
ObjectIdGetDatum(fctx->record[i].relfilenumber). But I think what it
ought to do is dependent on the output type. If the output type is
int8, then it ought to do values[1] = Int64GetDatum((int64)
fctx->record[i].relfilenumber), and if it's OID, then it ought to do
values[1] = ObjectIdGetDatum((Oid) fctx->record[i].relfilenumber)).
The  macro that you use needs to be based on the output SQL type, not
the C data type.

- I think it might be a good idea to allocate RelFileNumbers in much
smaller batches than we do OIDs. 8192 feels wasteful to me. It
shouldn't practically matter, because if we have 56 bits of bit space
and so even if we repeatedly allocate 2^13 RelFileNumbers and then
crash, we can still crash 2^41 times before we completely run out of
numbers, and 2 trillion crashes ought to be enough for anyone. But I
see little benefit from being so profligate. You can allocate an OID
as an identifier for a catalog tuple or a TOAST chunk, but a
RelFileNumber requires a filesystem operation, so the amount of work
that is needed to use up 8192 RelFileNumbers is a lot bigger than the
amount of work required to use up 8192 OIDs. If we dropped this down
to 128, or 64, or 256, would anything bad happen?

- Do we really want GetNewRelFileNumber() to call access() just for a
can't-happen scenario? Can't we catch this problem later when we
actually go to create the files on disk?

- The patch updates the comments in XLogPrefetcherNextBlock to talk
about relfilenumbers being reused rather than relfilenodes being
reused, which is fine except that we're sorta kinda not doing that any
more as noted above. I don't really know what these comments ought to
say instead but perhaps more than a mechanical update is in order.
This applies, even more, to the comments above mdunlink(). Apart from
updating the existing comments, I think that the patch needs a good
explanation of the new scheme someplace, and what it does and doesn't
guarantee, which relates to the point above about making sure we know
exactly what we're guaranteeing and why. I don't know where exactly
this text should be positioned yet, or what it should say, but it
needs to go someplace. This is a fairly significant change and needs
to be talked about somewhere.

- I think there's still a bit of a terminology problem here. With the
patch set, we use RelFileNumber to refer to a single, 56-bit integer
and RelFileLocator to refer to that integer combined with the DB and
TS OIDs. But sometimes in the comments we want to talk about the
logical sequence of files that is identified by a RelFileLocator, and
that's not quite the same as either of those things. For example, in
tableam.h we currently say "This callback needs to create a new
relation filenode for `rel`" and how should that be changed in this
new naming? We're not creating a new RelFileNumber - those would need
to be allocated, not created, as all the numbers in the universe exist
already. Neither are we creating a new locator; that sounds like it
means assembling it from pieces. What we're doing is creating the
first of what may end up being a series of similarly-named files on
disk. I'm not exactly sure how we can refer to that in a way that is
clear, but it's a problem that arises here and here throughout the
patch.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Thu, Jun 30, 2022 at 10:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 29, 2022 at 5:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >- It looks to me like you need to give significantly more thought to
> > > the proper way of adjusting the relfilenode-related test cases in
> > > alter_table.out.
> >
> > It seems to me that this test case is just testing whether the
> > table/child table are rewritten or not after the alter table.  And for
> > that it is comparing the oid with the relfilenode, now that is not
> > possible so I think it's quite reasonable to just compare the current
> > relfilenode with the old relfilenode and if they are same the table is
> > not rewritten.  So I am not sure why the original test case had two
> > cases 'own' and 'orig'.  With respect to this test case they both have
> > the same meaning, in fact comparing old relfilenode with current
> > relfilenode is better way of testing than comparing the oid with
> > relfilenode.
>
> I think you're right. However, I don't really like OTHER showing up in
> the output, because that looks like a string that was chosen to be
> slightly alarming, especially given that it's in ALL CAPS. How about
> if we change 'ORIG' to 'new'?

I think you meant renaming 'OTHER' to 'new'; yeah, that makes sense.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Fri, Jul 1, 2022 at 12:54 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 29, 2022 at 5:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > PFA, the remaining set of patches.   It might need to fix some
> > indentation but lets first see how is the overall idea then we can
> > work on it
>
> So just playing around with this patch set, and also looking at the
> code a bit, here are a few random observations:
>
> - The patch assigns relfilenumbers starting with 1. I don't see any
> specific problem with that, but I wonder if it would be a good idea to
> start with a random larger value just in case we ever need some fixed
> values for some purpose or other. Maybe we should start with 100000 or
> something?

Yeah, we can do that; I have changed it to 100000.

> - If I use ALTER TABLE .. SET TABLESPACE to move a table around, then
> the relfilenode changes each time, but if I use ALTER DATABASE .. SET
> TABLESPACE to move a database around, the relfilenodes don't change.
> So, what this guarantees is that if the same filename is used twice,
> it will be for the same relation and not some unrelated relation.
> That's enough to avoid the hazard described in the comments for
> mdunlink(), because that scenario intrinsically involves confusion
> caused by two relations using the same filename after an OID
> wraparound. And it also means that if we pursue the idea of using an
> end-of-recovery record in all cases, we don't need to start creating
> tombstones during crash recovery. The forced checkpoint at the end of
> crash recovery means we don't currently need to do that, but if we
> change that, then the same hazard would exist there as we already have
> in normal running, and this fixes it. However, I don't find it
> entirely obvious that there are no hazards of any kind stemming from
> repeated use of ALTER DATABASE .. SET TABLESPACE resulting in
> filenames getting reused. On the other hand avoiding filename reuse
> completely would be more work, not closely related to what the rest of
> the patch set does, probably somewhat controversial in terms of what
> it would have to do, and I'm not sure that we really need it. It does
> seem like it would be quite a bit easier to reason about, though,
> because the current guarantee is suspiciously similar to "we don't do
> X, except when we do." This is not really so much a review comment for
> Dilip as a request for input from others ... thoughts?

Yeah, that can be done, but maybe as a separate patch.  One option is
that when we support the WAL-logged method for ALTER TABLE .. SET
TABLESPACE, like we did for CREATE DATABASE, we will generate a new
relfilenumber as part of that.

> - Again, not a review comment for this patch specifically, but I'm
> wondering if we could use this as infrastructure for a tool to clean
> orphaned files out of the data directory. Suppose we create a file for
> a new relation and then crash, leaving a potentially large file on
> disk that will never be removed. Well, if the relfilenumber as it
> exists on disk is not in pg_class and old enough that a transaction
> inserting into pg_class can't still be running, then it must be safe
> to remove that file. Maybe that's safe even today, but it's a little
> hard to reason about it in the face of a possible OID wraparound that
> might result in reusing the same numbers over again. It feels like
> this makes easier to identify which files are old stuff that can never
> again be touched.

Correct.

> - I might be missing something here, but this isn't actually making
> the relfilenode 56 bits, is it? The reason to do that is to make the
> BufferTag smaller, so I expected to see that BufferTag either used
> bitfields like RelFileNumber relNumber:56 and ForkNumber forkNum:8, or
> else that it just declared a single field for both as uint64 and used
> accessor macros or static inlines to separate them out. But it doesn't
> seem to do either of those things, which seems like it can't be right.
> On a related note, I think it would be better to declare RelFileNumber
> as an unsigned type even though we have no use for the high bit; we
> have, equally, no use for negative values. It's easier to reason about
> bit-shifting operations with unsigned types.

Oops, somehow I missed merging that change into the patch.  I have
changed it like below and adjusted the macros.
typedef struct buftag
{
Oid spcOid; /* tablespace oid. */
Oid dbOid; /* database oid. */
uint32 relNumber_low; /* relfilenumber 32 lower bits */
uint32 relNumber_hi:24; /* relfilenumber 24 high bits */
uint32 forkNum:8; /* fork number */
BlockNumber blockNum; /* blknum relative to begin of reln */
} BufferTag;

I think we need to split it like this to keep the BufferTag 4-byte
aligned; otherwise the size of the structure would increase.
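With that layout the 56-bit value is reassembled with a pair of trivial
accessors, roughly (macro names tentative):

    #define BufTagGetRelNumber(tag) \
        ((((uint64) (tag)->relNumber_hi) << 32) | (tag)->relNumber_low)

    #define BufTagSetRelNumber(tag, relnumber) \
    ( \
        (tag)->relNumber_low = (uint32) (relnumber), \
        (tag)->relNumber_hi = (uint32) ((relnumber) >> 32) \
    )

so callers never see the split representation.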

> - I also think that the cross-version compatibility stuff in
> pg_buffercache isn't quite right. It does values[1] =
> ObjectIdGetDatum(fctx->record[i].relfilenumber). But I think what it
> ought to do is dependent on the output type. If the output type is
> int8, then it ought to do values[1] = Int64GetDatum((int64)
> fctx->record[i].relfilenumber), and if it's OID, then it ought to do
> values[1] = ObjectIdGetDatum((Oid) fctx->record[i].relfilenumber)).
> The  macro that you use needs to be based on the output SQL type, not
> the C data type.

Fixed

> - I think it might be a good idea to allocate RelFileNumbers in much
> smaller batches than we do OIDs. 8192 feels wasteful to me. It
> shouldn't practically matter, because if we have 56 bits of bit space
> and so even if we repeatedly allocate 2^13 RelFileNumbers and then
> crash, we can still crash 2^41 times before we completely run out of
> numbers, and 2 trillion crashes ought to be enough for anyone. But I
> see little benefit from being so profligate. You can allocate an OID
> as an identifier for a catalog tuple or a TOAST chunk, but a
> RelFileNumber requires a filesystem operation, so the amount of work
> that is needed to use up 8192 RelFileNumbers is a lot bigger than the
> amount of work required to use up 8192 OIDs. If we dropped this down
> to 128, or 64, or 256, would anything bad happen?

This makes sense, so I have changed it to 64.
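So the allocation path now looks roughly like this (simplified from the patch;
error handling and recovery checks omitted):

    #define VAR_RFN_PREFETCH    64      /* WAL-log this many relfilenumbers ahead */

    RelFileNumber
    GetNewRelFileNumber(void)
    {
        RelFileNumber   result;

        LWLockAcquire(RelFileNumberGenLock, LW_EXCLUSIVE);

        /* ran out of the logged-ahead range: log (and flush) another batch */
        if (ShmemVariableCache->relnumbercount == 0)
        {
            XLogPutNextRelFileNumber(ShmemVariableCache->nextRelFileNumber +
                                     VAR_RFN_PREFETCH);
            ShmemVariableCache->relnumbercount = VAR_RFN_PREFETCH;
        }

        result = ShmemVariableCache->nextRelFileNumber++;
        ShmemVariableCache->relnumbercount--;

        LWLockRelease(RelFileNumberGenLock);

        return result;
    }

so a crash can waste at most 64 numbers, which is nothing against a 56-bit space.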

> - Do we really want GetNewRelFileNumber() to call access() just for a
> can't-happen scenario? Can't we catch this problem later when we
> actually go to create the files on disk?

Yeah, we don't need to; actually, we can completely get rid of the
GetNewRelFileNumber() function and directly call
GenerateNewRelFileNumber(), and in fact we can rename
GenerateNewRelFileNumber() to GetNewRelFileNumber().  So I have done
these changes.

> - The patch updates the comments in XLogPrefetcherNextBlock to talk
> about relfilenumbers being reused rather than relfilenodes being
> reused, which is fine except that we're sorta kinda not doing that any
> more as noted above. I don't really know what these comments ought to
> say instead but perhaps more than a mechanical update is in order.

Changed

> This applies, even more, to the comments above mdunlink(). Apart from
> updating the existing comments, I think that the patch needs a good
> explanation of the new scheme someplace, and what it does and doesn't
> guarantee, which relates to the point above about making sure we know
> exactly what we're guaranteeing and why. I don't know where exactly
> this text should be positioned yet, or what it should say, but it
> needs to go someplace. This is a fairly significant change and needs
> to be talked about somewhere.

For now, in v4_0004**, I have removed the comment explaining why we
need to keep the tombstone file and added a note on why we do not need
to keep those files from PG16 onwards.

> - I think there's still a bit of a terminology problem here. With the
> patch set, we use RelFileNumber to refer to a single, 56-bit integer
> and RelFileLocator to refer to that integer combined with the DB and
> TS OIDs. But sometimes in the comments we want to talk about the
> logical sequence of files that is identified by a RelFileLocator, and
> that's not quite the same as either of those things. For example, in
> tableam.h we currently say "This callback needs to create a new
> relation filenode for `rel`" and how should that be changed in this
> new naming? We're not creating a new RelFileNumber - those would need
> to be allocated, not created, as all the numbers in the universe exist
> already. Neither are we creating a new locator; that sounds like it
> means assembling it from pieces. What we're doing is creating the
> first of what may end up being a series of similarly-named files on
> disk. I'm not exactly sure how we can refer to that in a way that is
> clear, but it's a problem that arises here and here throughout the
> patch.

I think the comment can say
"This callback needs to create a new relnumber file for 'rel' " ?

I have not modified this yet, I will check other places where we have
such terminology issues.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От
Andres Freund
Дата:
Hi,

I'm not feeling inspired by "locator", tbh. But I don't really have a great
alternative, so ...


On 2022-07-01 16:12:01 +0530, Dilip Kumar wrote:
> From f07ca9ef19e64922c6ee410707e93773d1a01d7c Mon Sep 17 00:00:00 2001
> From: dilip kumar <dilipbalaut@localhost.localdomain>
> Date: Sat, 25 Jun 2022 10:43:12 +0530
> Subject: [PATCH v4 2/4] Preliminary refactoring for supporting larger
>  relfilenumber

I don't think we have abbreviated buffer as 'buff' in many places? I think we
should either spell buffer out or use 'buf'. This is in regard to BuffTag etc.



> Subject: [PATCH v4 3/4] Use 56 bits for relfilenumber to avoid wraparound

>  /*
> + * GenerateNewRelFileNumber
> + *
> + * Similar to GetNewObjectId but instead of new Oid it generates new
> + * relfilenumber.
> + */
> +RelFileNumber
> +GetNewRelFileNumber(void)
> +{
> +    RelFileNumber        result;
> +
> +    /* Safety check, we should never get this far in a HS standby */

Normally we don't capitalize the first character of a comment that's not a
full sentence (i.e. ending with a punctuation mark).

> +    if (RecoveryInProgress())
> +        elog(ERROR, "cannot assign RelFileNumber during recovery");
> +
> +    LWLockAcquire(RelFileNumberGenLock, LW_EXCLUSIVE);
> +
> +    /* Check for the wraparound for the relfilenumber counter */
> +    if (unlikely (ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER))
> +        elog(ERROR, "relfilenumber is out of bound");
> +
> +    /* If we run out of logged for use RelFileNumber then we must log more */

"logged for use" - looks like you reformulated the sentence incompletely.


> +    if (ShmemVariableCache->relnumbercount == 0)
> +    {
> +        XLogPutNextRelFileNumber(ShmemVariableCache->nextRelFileNumber +
> +                                 VAR_RFN_PREFETCH);

I know this is just copied, but I find "XLogPut" as a prefix pretty unhelpful.


> diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
> index e1f4eef..1cf039c 100644
> --- a/src/include/catalog/pg_class.h
> +++ b/src/include/catalog/pg_class.h
> @@ -31,6 +31,10 @@
>   */
>  CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,RelationRelation_Rowtype_Id)
BKI_SCHEMA_MACRO
>  {
> +    /* identifier of physical storage file */
> +    /* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
> +    int64        relfilenode BKI_DEFAULT(0);
> +
>      /* oid */
>      Oid            oid;
>  
> @@ -52,10 +56,6 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
>      /* access method; 0 if not a table / index */
>      Oid            relam BKI_DEFAULT(heap) BKI_LOOKUP_OPT(pg_am);
>  
> -    /* identifier of physical storage file */
> -    /* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
> -    Oid            relfilenode BKI_DEFAULT(0);
> -
>      /* identifier of table space for relation (0 means default for database) */
>      Oid            reltablespace BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_tablespace);
>

What's the story behind moving relfilenode to the front? Alignment
consideration? Seems odd to move the relfilenode before the oid. If there's an
alignment issue, can't you just swap it with reltablespace or such to resolve
it?



> From f6e8e0e7412198b02671e67d1859a7448fe83f38 Mon Sep 17 00:00:00 2001
> From: dilip kumar <dilipbalaut@localhost.localdomain>
> Date: Wed, 29 Jun 2022 13:24:32 +0530
> Subject: [PATCH v4 4/4] Don't delay removing Tombstone file until next
>  checkpoint
> 
> Currently, we can not remove the unused relfilenode until the
> next checkpoint because if we remove them immediately then
> there is a risk of reusing the same relfilenode for two
> different relations during single checkpoint due to Oid
> wraparound.

Well, not quite "currently", because at this point we've fixed that in a prior
commit ;)


> Now as part of the previous patch set we have made relfilenode
> 56 bit wider and removed the risk of wraparound so now we don't
> need to wait till the next checkpoint for removing the unused
> relation file and we can clean them up on commit.

Hm. Wasn't there also some issue around crash-restarts benefiting from having
those files around until later? I think what I'm remembering is what is
referenced in this comment:


> - * For regular relations, we don't unlink the first segment file of the rel,
> - * but just truncate it to zero length, and record a request to unlink it after
> - * the next checkpoint.  Additional segments can be unlinked immediately,
> - * however.  Leaving the empty file in place prevents that relfilenumber
> - * from being reused.  The scenario this protects us from is:
> - * 1. We delete a relation (and commit, and actually remove its file).
> - * 2. We create a new relation, which by chance gets the same relfilenumber as
> - *      the just-deleted one (OIDs must've wrapped around for that to happen).
> - * 3. We crash before another checkpoint occurs.
> - * During replay, we would delete the file and then recreate it, which is fine
> - * if the contents of the file were repopulated by subsequent WAL entries.
> - * But if we didn't WAL-log insertions, but instead relied on fsyncing the
> - * file after populating it (as we do at wal_level=minimal), the contents of
> - * the file would be lost forever.  By leaving the empty file until after the
> - * next checkpoint, we prevent reassignment of the relfilenumber until it's
> - * safe, because relfilenumber assignment skips over any existing file.

This isn't related to oid wraparound, just crashes. It's possible that the
XLogFlush() in XLogPutNextRelFileNumber() prevents such a scenario, but if so
it still ought to be explained here, I think.



> + * Note that now we can immediately unlink the first segment of the regular
> + * relation as well because the relfilenumber is 56 bits wide since PG 16.  So
> + * we don't have to worry about relfilenumber getting reused for some unrelated
> + * relation file.

I'm doubtful it's a good idea to start dropping at the first segment. I'm
fairly certain that there's smgrexists() checks in some places, and they'll
now stop working, even if there are later segments that remained, e.g. because
of an error in the middle of removing later segments.



Greetings,

Andres Freund



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Sat, Jul 2, 2022 at 9:38 AM Andres Freund <andres@anarazel.de> wrote:

Thanks for the review,

> I'm not feeling inspired by "locator", tbh. But I don't really have a great
> alternative, so ...
>
>
> On 2022-07-01 16:12:01 +0530, Dilip Kumar wrote:
> > From f07ca9ef19e64922c6ee410707e93773d1a01d7c Mon Sep 17 00:00:00 2001
> > From: dilip kumar <dilipbalaut@localhost.localdomain>
> > Date: Sat, 25 Jun 2022 10:43:12 +0530
> > Subject: [PATCH v4 2/4] Preliminary refactoring for supporting larger
> >  relfilenumber
>
> I don't think we have abbreviated buffer as 'buff' in many places? I think we
> should either spell buffer out or use 'buf'. This is in regard to BuffTag etc.

Okay, I will change it to 'buf'

> > Subject: [PATCH v4 3/4] Use 56 bits for relfilenumber to avoid wraparound

> Normally we don't capitalize the first character of a comment that's not a
> full sentence (i.e. ending with a punctuation mark).

Okay.

> "logged for use" - looks like you reformulated the sentence incompletely.

Right, I will fix it.

> > +     if (ShmemVariableCache->relnumbercount == 0)
> > +     {
> > +             XLogPutNextRelFileNumber(ShmemVariableCache->nextRelFileNumber +
> > +                                                              VAR_RFN_PREFETCH);
>
> I know this is just copied, but I find "XLogPut" as a prefix pretty unhelpful.

Maybe we can change to LogNextRelFileNumber()?

> What's the story behind moving relfilenode to the front? Alignment
> consideration? Seems odd to move the relfilenode before the oid. If there's an
> alignment issue, can't you just swap it with reltablespace or such to resolve
> it?

Because of a test case added in commit
79b716cfb7a1be2a61ebb4418099db1258f35e30.  I did not like moving
relfilenode before oid either, but under that commit such a column is
expected to be properly aligned and to be kept before any NameData
column; see these comments:

===
+--
+--  Keep such columns before the first NameData column of the
+-- catalog, since packagers can override NAMEDATALEN to an odd number.
+--
===

>
> > From f6e8e0e7412198b02671e67d1859a7448fe83f38 Mon Sep 17 00:00:00 2001
> > From: dilip kumar <dilipbalaut@localhost.localdomain>
> > Date: Wed, 29 Jun 2022 13:24:32 +0530
> > Subject: [PATCH v4 4/4] Don't delay removing Tombstone file until next
> >  checkpoint
> >
> > Currently, we can not remove the unused relfilenode until the
> > next checkpoint because if we remove them immediately then
> > there is a risk of reusing the same relfilenode for two
> > different relations during single checkpoint due to Oid
> > wraparound.
>
> Well, not quite "currently", because at this point we've fixed that in a prior
> commit ;)

Right, I will change it, but I'm not sure whether we want to commit 0003
and 0004 as independent patches or as a single patch.

> > Now as part of the previous patch set we have made relfilenode
> > 56 bit wider and removed the risk of wraparound so now we don't
> > need to wait till the next checkpoint for removing the unused
> > relation file and we can clean them up on commit.
>
> Hm. Wasn't there also some issue around crash-restarts benefiting from having
> those files around until later? I think what I'm remembering is what is
> referenced in this comment:

I think that if, due to wraparound, a relfilenode gets reused by another
relation in the same checkpoint cycle, then there was an issue in crash
recovery with wal_level=minimal.  But the root of the issue is the
wraparound, right?

>
> > - * For regular relations, we don't unlink the first segment file of the rel,
> > - * but just truncate it to zero length, and record a request to unlink it after
> > - * the next checkpoint.  Additional segments can be unlinked immediately,
> > - * however.  Leaving the empty file in place prevents that relfilenumber
> > - * from being reused.  The scenario this protects us from is:
> > - * 1. We delete a relation (and commit, and actually remove its file).
> > - * 2. We create a new relation, which by chance gets the same relfilenumber as
> > - *     the just-deleted one (OIDs must've wrapped around for that to happen).
> > - * 3. We crash before another checkpoint occurs.
> > - * During replay, we would delete the file and then recreate it, which is fine
> > - * if the contents of the file were repopulated by subsequent WAL entries.
> > - * But if we didn't WAL-log insertions, but instead relied on fsyncing the
> > - * file after populating it (as we do at wal_level=minimal), the contents of
> > - * the file would be lost forever.  By leaving the empty file until after the
> > - * next checkpoint, we prevent reassignment of the relfilenumber until it's
> > - * safe, because relfilenumber assignment skips over any existing file.
>
> This isn't related to oid wraparound, just crashes. It's possible that the
> XLogFlush() in XLogPutNextRelFileNumber() prevents such a scenario, but if so
> it still ought to be explained here, I think.

I think the root cause of the problem is relfilenode reuse, which is a
consequence of wraparound, and the problem only materializes if there
is a crash afterwards.  Now that we have removed the wraparound there
can be no further reuse of a relfilenode, so there is no problem
during crash recovery.

In XLogPutNextRelFileNumber() we need the XLogFlush() simply to ensure
that we do not hand out an already-used relfilenumber after crash
recovery, because we no longer check for an existing file on disk now
that the wraparound is gone.

So, in short, the problem this comment was explaining is: if a
relfilenumber gets reused within the same checkpoint cycle due to
wraparound, then crash recovery will lose the contents of the new
relation that reused it, at wal_level=minimal.  By adding XLogFlush()
in XLogPutNextRelFileNumber() we ensure that after crash recovery we
never reuse the same relfilenumber, because the WAL record covering
the allocation reaches disk before we create the relation file on
disk.
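To be concrete, the logging helper is basically doing something like
this (just a sketch of the idea, not the exact patch code; the record
payload details are an assumption):

static void
XLogPutNextRelFileNumber(RelFileNumber nextrelnumber)
{
    XLogRecPtr  recptr;

    XLogBeginInsert();
    XLogRegisterData((char *) &nextrelnumber, sizeof(RelFileNumber));
    recptr = XLogInsert(RM_XLOG_ID, XLOG_NEXT_RELFILENUMBER);

    /*
     * The record covering the new range must reach disk before any
     * relation file using a number from that range is created;
     * otherwise recovery could hand out a relfilenumber that already
     * exists on disk.
     */
    XLogFlush(recptr);
}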

>
> > + * Note that now we can immediately unlink the first segment of the regular
> > + * relation as well because the relfilenumber is 56 bits wide since PG 16.  So
> > + * we don't have to worry about relfilenumber getting reused for some unrelated
> > + * relation file.
>
> I'm doubtful it's a good idea to start dropping at the first segment. I'm
> fairly certain that there's smgrexists() checks in some places, and they'll
> now stop working, even if there are later segments that remained, e.g. because
> of an error in the middle of removing later segments.

Okay, so you mean to say that we can first drop the remaining segment
and at last we drop the segment 0 right?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From:
Robert Haas
Date:
On Sat, Jul 2, 2022 at 4:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I'm doubtful it's a good idea to start dropping at the first segment. I'm
> > fairly certain that there's smgrexists() checks in some places, and they'll
> > now stop working, even if there are later segments that remained, e.g. because
> > of an error in the middle of removing later segments.
>
> Okay, so you mean to say that we can first drop the remaining segment
> and at last we drop the segment 0 right?

I think we need to do it in descending order, starting with the
highest-numbered segment and working down. md.c isn't smart about gaps
in the sequence of files, so it's better if we don't create any gaps.
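Roughly like this, I'd imagine (just a sketch with simplified error
handling; "path" here stands for the relation's segment-0 pathname):

/* Count higher-numbered segments: <path>.1, <path>.2, ... */
int         nsegments = 0;

for (;;)
{
    char       *segpath = psprintf("%s.%d", path, nsegments + 1);
    struct stat st;

    if (stat(segpath, &st) < 0)
    {
        pfree(segpath);
        break;
    }
    pfree(segpath);
    nsegments++;
}

/* Unlink the highest-numbered segment first, then work down. */
for (int segno = nsegments; segno > 0; segno--)
{
    char       *segpath = psprintf("%s.%d", path, segno);

    if (unlink(segpath) < 0)
        ereport(WARNING,
                (errcode_for_file_access(),
                 errmsg("could not remove file \"%s\": %m", segpath)));
    pfree(segpath);
}

/* Finally, segment 0 itself. */
if (unlink(path) < 0)
    ereport(WARNING,
            (errcode_for_file_access(),
             errmsg("could not remove file \"%s\": %m", path)));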

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From:
Andres Freund
Date:
Hi,

On 2022-07-02 14:23:08 +0530, Dilip Kumar wrote:
> > > +     if (ShmemVariableCache->relnumbercount == 0)
> > > +     {
> > > +             XLogPutNextRelFileNumber(ShmemVariableCache->nextRelFileNumber +
> > > +                                                              VAR_RFN_PREFETCH);
> >
> > I know this is just copied, but I find "XLogPut" as a prefix pretty unhelpful.
> 
> Maybe we can change to LogNextRelFileNumber()?

Much better.


Hm. Now that I think about it, isn't the XlogFlush() in
XLogPutNextRelFileNumber() problematic performance wise? Yes, we'll spread the
cost across a number of GetNewRelFileNumber() calls, but still, an additional
f[data]sync for every 64 relfilenodes assigned isn't cheap - today there's
zero fsyncs when creating a sequence or table inside a transaction (there are
some for indexes, but there's patches to fix that).

Not that I really see an obvious alternative.

I guess we could try to invent a flush-log-before-write type logic for
relfilenodes somehow? So that the first block actually written to a file needs
to ensure the WAL record that created the relation is flushed. But getting
that to work reliably seems nontrivial.
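E.g. something vaguely like this at the start of the md write paths
(entirely hypothetical; an smgr_create_lsn field like this does not
exist today):

/* hypothetical guard at the top of mdwrite()/mdextend() */
if (!XLogRecPtrIsInvalid(reln->smgr_create_lsn))
{
    /* make the file-creation record durable before the file gets data */
    XLogFlush(reln->smgr_create_lsn);
    reln->smgr_create_lsn = InvalidXLogRecPtr;
}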


One thing that would be good is to add an assertion to a few places ensuring
that relfilenodes aren't above ->nextRelFileNumber, most importantly somewhere
in the recovery path.
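For instance, something roughly like this (just a sketch; the exact
record and field names depend on how the patch ends up):

/* e.g. in smgr_redo(), when replaying a storage-creation record */
xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record);

Assert(xlrec->rlocator.relNumber < ShmemVariableCache->nextRelFileNumber);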


Why did you choose a quite small value for VAR_RFN_PREFETCH? VAR_OID_PREFETCH
is 8192, but you chose 64 for VAR_RFN_PREFETCH?

I'd spell out RFN in VAR_RFN_PREFETCH btw, it took me a bit to expand RFN to
relfilenode.


> > What's the story behind moving relfilenode to the front? Alignment
> > consideration? Seems odd to move the relfilenode before the oid. If there's an
> > alignment issue, can't you just swap it with reltablespace or such to resolve
> > it?
> 
> Because of a test case added by commit
> 79b716cfb7a1be2a61ebb4418099db1258f35e30.  I did not like moving
> relfilenode before oid either, but that commit expects such columns to
> be properly aligned and to be placed before the first NameData column,
> per this comment:
> 
> ===
> +--
> +--  Keep such columns before the first NameData column of the
> +-- catalog, since packagers can override NAMEDATALEN to an odd number.
> +--
> ===

This is embarrassing.  Trying to keep the C struct alignment and the
catalog alignment in sync on AIX, without actually making the system
understand the alignment rules, is a remarkably shortsighted approach.

I started a separate thread about it, since it's not really relevant to this thread:
https://postgr.es/m/20220702183354.a6uhja35wta7agew%40alap3.anarazel.de

Maybe we could at least make the field order to be something like
  oid, relam, relfilenode, relname

that should be ok alignment wise, keep oid first, and seems to make sense from
an "importance" POV? Can't really interpret later fields without knowing relam
etc.



> > > Now as part of the previous patch set we have made relfilenode
> > > 56 bit wider and removed the risk of wraparound so now we don't
> > > need to wait till the next checkpoint for removing the unused
> > > relation file and we can clean them up on commit.
> >
> > Hm. Wasn't there also some issue around crash-restarts benefiting from having
> > those files around until later? I think what I'm remembering is what is
> > referenced in this comment:
> 
> I think due to wraparound if relfilenode gets reused by another
> relation in the same checkpoint then there was an issue in crash
> recovery with wal level minimal.  But the root of the issue is a
> wraparound, right?

I'm not convinced the tombstones were required solely in the oid wraparound
case before, despite the comment saying so, with wal_level=minimal. I gotta do
some non-work stuff for a bit, so I need to stop pondering this now :)

I think it might be a good idea to have a few weeks in which we do *not*
remove the tombstones, but have assertion checks against such files existing
when we don't expect them to. I.e. commit 1-3, add the asserts, then commit 4
a bit later.


> > I'm doubtful it's a good idea to start dropping at the first segment. I'm
> > fairly certain that there's smgrexists() checks in some places, and they'll
> > now stop working, even if there are later segments that remained, e.g. because
> > of an error in the middle of removing later segments.
> 
> Okay, so you mean to say that we can first drop the remaining segment
> and at last we drop the segment 0 right?

I'd use the approach Robert suggested and delete from the end, going down.

Greetings,

Andres Freund



Re: making relfilenodes 56 bits

From:
Dilip Kumar
Date:
On Sun, Jul 3, 2022 at 12:59 AM Andres Freund <andres@anarazel.de> wrote:

> Hm. Now that I think about it, isn't the XlogFlush() in
> XLogPutNextRelFileNumber() problematic performance wise? Yes, we'll spread the
> cost across a number of GetNewRelFileNumber() calls, but still, an additional
> f[data]sync for every 64 relfilenodes assigned isn't cheap - today there's
> zero fsyncs when creating a sequence or table inside a transaction (there are
> some for indexes, but there's patches to fix that).
>
> Not that I really see an obvious alternative.

I think that to see the impact we need a workload which frequently
allocates relfilenodes; maybe we can run pgbench while concurrently
creating relations/indexes at a high rate and check how much of a
performance hit we see.  And if we do see an impact, then increasing
the VAR_RFN_PREFETCH value can help resolve it.

> I guess we could try to invent a flush-log-before-write type logic for
> relfilenodes somehow? So that the first block actually written to a file needs
> to ensure the WAL record that created the relation is flushed. But getting
> that to work reliably seems nontrivial.

>
> One thing that would be good is to add an assertion to a few places ensuring
> that relfilenodes aren't above ->nextRelFileNumber, most importantly somewhere
> in the recovery path.

Yes, it makes sense.

> Why did you choose a quite small value for VAR_RFN_PREFETCH? VAR_OID_PREFETCH
> is 8192, but you chose 64 for VAR_RFN_PREFETCH?

Earlier it was 8192, but then Robert commented that an Oid can be used
for many other things, such as an identifier for a catalog tuple or a
TOAST chunk, whereas a RelFileNumber requires a filesystem operation,
so the amount of work needed to use up 8192 RelFileNumbers is a lot
bigger than the amount of work required to use up 8192 OIDs.

I thought that made sense, so I reduced it to 64, but now I tend to
think we also need to consider that after consuming VAR_RFN_PREFETCH
numbers we are going to do an XLogFlush(), so it's better to keep it
high.  And as Robert said upthread, even at 8192 we can still crash
2^41 (roughly 2 trillion) times before we completely run out of
numbers.  So I think we can easily keep it at 8192, and I don't think
we really need to worry much about the performance impact of the
XLogFlush().

> I'd spell out RFN in VAR_RFN_PREFETCH btw, it took me a bit to expand RFN to
> relfilenode.

Okay.

> This is embarrassing.  Trying to keep the C struct alignment and the
> catalog alignment in sync on AIX, without actually making the system
> understand the alignment rules, is a remarkably shortsighted approach.
>
> I started a separate thread about it, since it's not really relevant to this thread:
> https://postgr.es/m/20220702183354.a6uhja35wta7agew%40alap3.anarazel.de
>
> Maybe we could at least make the field order to be something like
>   oid, relam, relfilenode, relname

Yeah that we can do.

> that should be ok alignment wise, keep oid first, and seems to make sense from
> an "importance" POV? Can't really interpret later fields without knowing relam
> etc.

Right.

> > I think due to wraparound if relfilenode gets reused by another
> > relation in the same checkpoint then there was an issue in crash
> > recovery with wal level minimal.  But the root of the issue is a
> > wraparound, right?
>
> I'm not convinced the tombstones were required solely in the oid wraparound
> case before, despite the comment saying so, with wal_level=minimal. I gotta do
> some non-work stuff for a bit, so I need to stop pondering this now :)
>
> I think it might be a good idea to have a few weeks in which we do *not*
> remove the tombstones, but have assertion checks against such files existing
> when we don't expect them to. I.e. commit 1-3, add the asserts, then commit 4
> a bit later.

I think this is a good idea.

> > Okay, so you mean to say that we can first drop the remaining segment
> > and at last we drop the segment 0 right?
>
> I'd use the approach Robert suggested and delete from the end, going down.

Yeah, I got it, thanks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From:
Robert Haas
Date:
On Sat, Jul 2, 2022 at 3:29 PM Andres Freund <andres@anarazel.de> wrote:
> Why did you choose a quite small value for VAR_RFN_PREFETCH? VAR_OID_PREFETCH
> is 8192, but you chose 64 for VAR_RFN_PREFETCH?

As Dilip mentioned, I suggested a lower value. If that's too low, we
can go higher, but I think there is value in not making this
excessively large. Somebody somewhere is going to have a database
that's crash-restarting like mad, and I don't want that person to run
through an insane number of relfilenodes for no reason. I don't think
there are going to be a lot of people creating thousands upon
thousands of relations in a short period of time, and I'm not sure
that it's a big deal if those who do end up having to wait for a few
extra xlog flushes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From:
Dilip Kumar
Date:
On Sun, Jul 3, 2022 at 8:02 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Jul 2, 2022 at 3:29 PM Andres Freund <andres@anarazel.de> wrote:
> > Why did you choose a quite small value for VAR_RFN_PREFETCH? VAR_OID_PREFETCH
> > is 8192, but you chose 64 for VAR_RFN_PREFETCH?
>
> As Dilip mentioned, I suggested a lower value. If that's too low, we
> can go higher, but I think there is value in not making this
> excessively large. Somebody somewhere is going to have a database
> that's crash-restarting like mad, and I don't want that person to run
> through an insane number of relfilenodes for no reason. I don't think
> there are going to be a lot of people creating thousands upon
> thousands of relations in a short period of time, and I'm not sure
> that it's a big deal if those who do end up having to wait for a few
> extra xlog flushes.

Here is the updated version of the patch.

Patches 0001-0003 are the same as before, with the review-comment
fixes from Andres applied.  0004 is the extra assert patch suggested
by Andres; it can be merged into 0003.  Basically, during recovery we
add asserts checking that "relfilenumbers aren't above
->nextRelFileNumber," and also an assert checking that after we
allocate a new relfilenumber the file does not already exist on disk.
Once we are confident that these assertions never fire, we should be
safe to remove the tombstone files immediately, which is what 0005
does.

In 0005 I also fixed the file deletion order, so now we delete in
descending order: first we count the number of segments by calling
stat() on each file, and then we unlink them in descending order.

VAR_RELFILENUMBER_PREFETCH is still 64, since we have not yet
concluded on a value; as discussed, I will run some performance tests
to see whether different values have any obvious impact.  Maybe I will
start with some very small numbers so that any impact is clearly
visible.

I thought about this comment from Robert
> that's not quite the same as either of those things. For example, in
> tableam.h we currently say "This callback needs to create a new
> relation filenode for `rel`" and how should that be changed in this
> new naming? We're not creating a new RelFileNumber - those would need
> to be allocated, not created, as all the numbers in the universe exist
> already. Neither are we creating a new locator; that sounds like it
> means assembling it from pieces.

I think that "This callback needs to create a new relation storage
for `rel`" looks better.

I have reviewed 0001 and 0003 again and found some discrepancies in
the usage of relfilenumber vs. relfilelocator and fixed those; in some
places InvalidOid was also used instead of InvalidRelFileNumber.



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: making relfilenodes 56 bits

From:
Robert Haas
Date:
On Fri, Jul 1, 2022 at 6:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > - I might be missing something here, but this isn't actually making
> > the relfilenode 56 bits, is it? The reason to do that is to make the
> > BufferTag smaller, so I expected to see that BufferTag either used
> > bitfields like RelFileNumber relNumber:56 and ForkNumber forkNum:8, or
> > else that it just declared a single field for both as uint64 and used
> > accessor macros or static inlines to separate them out. But it doesn't
> > seem to do either of those things, which seems like it can't be right.
> > On a related note, I think it would be better to declare RelFileNumber
> > as an unsigned type even though we have no use for the high bit; we
> > have, equally, no use for negative values. It's easier to reason about
> > bit-shifting operations with unsigned types.
>
> Oops, I somehow missed merging that change into the patch.  I changed
> it as below and adjusted the macros.
> typedef struct buftag
> {
>     Oid         spcOid;          /* tablespace oid */
>     Oid         dbOid;           /* database oid */
>     uint32      relNumber_low;   /* relfilenumber low 32 bits */
>     uint32      relNumber_hi:24; /* relfilenumber high 24 bits */
>     uint32      forkNum:8;       /* fork number */
>     BlockNumber blockNum;        /* blknum relative to begin of reln */
> } BufferTag;
>
> I think we need to split it like this to keep the BufferTag 4-byte
> aligned; otherwise the size of the structure would increase.

Well, I guess you're right. That's a bummer. In that case I'm a little
unsure whether it's worth using bit fields at all. Maybe we should
just write uint32 something[2] and use macros after that.

Another approach could be to accept the padding and define a constant
SizeOfBufferTag and use that as the hash table element size, like we
do for the sizes of xlog records.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From:
Robert Haas
Date:
On Tue, Jul 5, 2022 at 4:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I thought about this comment from Robert
> > that's not quite the same as either of those things. For example, in
> > tableam.h we currently say "This callback needs to create a new
> > relation filenode for `rel`" and how should that be changed in this
> > new naming? We're not creating a new RelFileNumber - those would need
> > to be allocated, not created, as all the numbers in the universe exist
> > already. Neither are we creating a new locator; that sounds like it
> > means assembling it from pieces.
>
> I think that "This callback needs to create a new relation storage
> for `rel`" looks better.

I like the idea, but it would sound better to say "create new relation
storage" rather than "create a new relation storage."

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From:
Dilip Kumar
Date:
On Wed, Jul 6, 2022 at 2:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jul 5, 2022 at 4:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I thought about this comment from Robert
> > > that's not quite the same as either of those things. For example, in
> > > tableam.h we currently say "This callback needs to create a new
> > > relation filenode for `rel`" and how should that be changed in this
> > > new naming? We're not creating a new RelFileNumber - those would need
> > > to be allocated, not created, as all the numbers in the universe exist
> > > already. Neither are we creating a new locator; that sounds like it
> > > means assembling it from pieces.
> >
> > I think that "This callback needs to create a new relation storage
> > for `rel`" looks better.
>
> I like the idea, but it would sound better to say "create new relation
> storage" rather than "create a new relation storage."

Okay, I changed that, and also changed a few more occurrences in 0001
along similar lines.  I also tested pgbench performance while
concurrently running a script that creates/drops relations, and I do
not see any regression with fairly small values of
VAR_RELNUMBER_PREFETCH; the smallest value I tried was 8.  That
doesn't mean I am suggesting such a small value, but I think we can
keep the value at something like 512 or 1024 without worrying much
about performance, so I changed it to 512 in the latest patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: making relfilenodes 56 bits

From:
Robert Haas
Date:
On Wed, Jul 6, 2022 at 7:55 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Okay, I changed that, and also changed a few more occurrences in 0001
> along similar lines.  I also tested pgbench performance while
> concurrently running a script that creates/drops relations, and I do
> not see any regression with fairly small values of
> VAR_RELNUMBER_PREFETCH; the smallest value I tried was 8.  That
> doesn't mean I am suggesting such a small value, but I think we can
> keep the value at something like 512 or 1024 without worrying much
> about performance, so I changed it to 512 in the latest patch.

OK, I have committed 0001 now with a few changes. pgindent did not
agree with some of your whitespace changes, and I also cleaned up a
few long lines. I replaced one instance of InvalidOid with
InvalidRelFileNumber also, and changed a word in a comment.

I think 0002 and 0003 need more work yet; I'll try to write a review
of those soon.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From:
Robert Haas
Date:
On Wed, Jul 6, 2022 at 11:57 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think 0002 and 0003 need more work yet; I'll try to write a review
> of those soon.

Regarding 0002:

I don't particularly like the names BufTagCopyRelFileLocator and
BufTagRelFileLocatorEquals. My suggestion is to rename
BufTagRelFileLocatorEquals to BufTagMatchesRelFileLocator, because it
doesn't really make sense to me to talk about equality between values
of different data types. Instead of BufTagCopyRelFileLocator I would
prefer BufTagGetRelFileLocator. That would make it more similar to
BufTagGetFileNumber and BufTagSetFileNumber, which I think would be a
good thing.

Other than that I think 0002 seems fine.

Regarding 0003:

                                        /*
                                         * Don't try to prefetch anything in this database until
-                                        * it has been created, or we might confuse the blocks of
-                                        * different generations, if a database OID or
-                                        * relfilenumber is reused.  It's also more efficient than
+                                        * it has been created, because it's more efficient than
                                         * discovering that relations don't exist on disk yet with
                                         * ENOENT errors.
                                         */

I'm worried that this might not be correct. The comment changes here
(and I think also in some other places) imply that we've eliminated
relfilenode reuse, but I think that's not true. createdb() and movedb()
don't seem to be modified, so I think it's possible to just copy a
template database over without change, which means that relfilenumbers
and even relfilelocators could be reused. So I feel like maybe this
and similar places shouldn't be modified in this way. Am I
misunderstanding?

        /*
-        * Relfilenumbers are not unique in databases across tablespaces, so we need
-        * to allocate a new one in the new tablespace.
+        * Generate a new relfilenumber. Although relfilenumber are unique within a
+        * cluster, we are unable to use the old relfilenumber since unused
+        * relfilenumber are not unlinked until commit.  So if within a
+        * transaction, if we set the old tablespace again, we will get conflicting
+        * relfilenumber file.
         */
-       newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL,
-                                              rel->rd_rel->relpersistence);
+       newrelfilenumber = GetNewRelFileNumber();

I can't clearly understand this comment. Is it saying that the code
which follows is broken and needs to be fixed by a future patch before
things are OK again? If so, that's not good.

- * callers should be GetNewOidWithIndex() and GetNewRelFileNumber() in
- * catalog/catalog.c.
+ * callers should be GetNewOidWithIndex() in catalog/catalog.c.

If there is only one, it should say "caller", not "callers".

 Orphan files are harmless --- at worst they waste a bit of disk space ---
-because we check for on-disk collisions when allocating new relfilenumber
-OIDs.  So cleaning up isn't really necessary.
+because relfilenumber is 56 bit wide so logically there should not be any
+collisions.  So cleaning up isn't really necessary.

I don't agree that orphaned files are harmless, but changing that is
beyond the scope of this patch. I think that the way you've ended the
sentence isn't sufficiently clear and correct even if we accept the
principle that orphaned files are harmless. What I think we should
say instead is "because the relfilenode counter is monotonically
increasing. The maximum value is 2^56-1, and there is no provision for
wraparound."

+       /*
+        * Check if we set the new relfilenumber then do we run out of the logged
+        * relnumber, if so then we need to WAL log again.  Otherwise, just adjust
+        * the relnumbercount.
+        */
+       relnumbercount = relnumber - ShmemVariableCache->nextRelFileNumber;
+       if (ShmemVariableCache->relnumbercount <= relnumbercount)
+       {
+               LogNextRelFileNumber(relnumber + VAR_RELNUMBER_PREFETCH);
+               ShmemVariableCache->relnumbercount = VAR_RELNUMBER_PREFETCH;
+       }
+       else
+               ShmemVariableCache->relnumbercount -= relnumbercount;

Would it be clearer, here and elsewhere, if VariableCacheData tracked
nextRelFileNumber and nextUnloggedRelFileNumber instead of
nextRelFileNumber and relnumbercount? I'm not 100% sure, but the idea
seems worth considering.

+        * Flush xlog record to disk before returning.  To protect against file
+        * system changes reaching the disk before the XLOG_NEXT_RELFILENUMBER log.

The way this is worded, you would need it to be just one sentence,
like "Flush xlog record to disk before returning to protect
against...". Or else add "this is," like "This is to protect
against..."

But I'm thinking maybe we could reword it a little more, perhaps
something like this: "Flush xlog record to disk before returning. We
want to be sure that the in-memory nextRelFileNumber value is always
larger than any relfilenumber that is already in use on disk. To
maintain that invariant, we must make sure that the record we just
logged reaches the disk before any new files are created."

This isn't a full review, I think, but I'm kind of out of time and
energy for today.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From:
Dilip Kumar
Date:
On Thu, Jul 7, 2022 at 2:54 AM Robert Haas <robertmhaas@gmail.com> wrote:

Thanks for committing the 0001.

> On Wed, Jul 6, 2022 at 11:57 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I think 0002 and 0003 need more work yet; I'll try to write a review
> > of those soon.
>
> Regarding 0002:
>
> I don't particularly like the names BufTagCopyRelFileLocator and
> BufTagRelFileLocatorEquals. My suggestion is to rename
> BufTagRelFileLocatorEquals to BufTagMatchesRelFileLocator, because it
> doesn't really make sense to me to talk about equality between values
> of different data types. Instead of BufTagCopyRelFileLocator I would
> prefer BufTagGetRelFileLocator. That would make it more similar to
> BufTagGetFileNumber and BufTagSetFileNumber, which I think would be a
> good thing.
>
> Other than that I think 0002 seems fine.

Changed as suggested.  Although I feel BufTagCopyRelFileLocator is
actually copying the relfilelocator from buffer tag to an input
variable, I am fine with BufTagGetRelFileLocator so that it is similar
to the other names.

I changed some other macro names as below, because the field they are
getting/setting is relNumber:
BufTagSetFileNumber -> BufTagSetRelNumber
BufTagGetFileNumber -> BufTagGetRelNumber

> Regarding 0003:

> I'm worried that this might not be correct. The comment changes here
> (and I think also in some other places) imply that we've eliminated
> relfilenode reuse, but I think that's not true. createdb() and movedb()
> don't seem to be modified, so I think it's possible to just copy a
> template database over without change, which means that relfilenumbers
> and even relfilelocators could be reused. So I feel like maybe this
> and similar places shouldn't be modified in this way. Am I
> misunderstanding?

I think you are right, so I changed it.

>         /*
> -        * Relfilenumbers are not unique in databases across
> tablespaces, so we need
> -        * to allocate a new one in the new tablespace.
> +        * Generate a new relfilenumber. Although relfilenumber are
> unique within a
> +        * cluster, we are unable to use the old relfilenumber since unused
> +        * relfilenumber are not unlinked until commit.  So if within a
> +        * transaction, if we set the old tablespace again, we will
> get conflicting
> +        * relfilenumber file.
>          */
> -       newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL,
> -
>             rel->rd_rel->relpersistence);
> +       newrelfilenumber = GetNewRelFileNumber();
>
> I can't clearly understand this comment. Is it saying that the code
> which follows is broken and needs to be fixed by a future patch before
> things are OK again? If so, that's not good.

No, it is not broken in this patch.  Basically, before this patch the
reason for allocating a new relfilenumber was that if we created the
file with the old relfilenumber in the new tablespace, a file with the
same name might already exist there, because a relfilenumber was only
unique within a particular database and tablespace, so there could be
a conflict.  That is no longer the case, but we still cannot reuse the
old relfilenumber, because the old relfilenumber's file in the old
tablespace is not removed until the next checkpoint; so if we move the
table back to the old tablespace again, there could be a conflict.
And even once we have the final patch, which removes the tombstone
file at commit, we still cannot reuse the old relfilenumber, because
within a transaction we can switch between tablespaces multiple times
and the relfilenumber file in the old tablespace is only removed at
commit.  This is what I am trying to explain in the comment.

Now I have modified the comment slightly, such that in 0002 I am
saying files are not removed until the next checkpoint and in 0004 I
am modifying that and saying not removed until commit.

> - * callers should be GetNewOidWithIndex() and GetNewRelFileNumber() in
> - * catalog/catalog.c.
> + * callers should be GetNewOidWithIndex() in catalog/catalog.c.
>
> If there is only one, it should say "caller", not "callers".
>
>  Orphan files are harmless --- at worst they waste a bit of disk space ---
> -because we check for on-disk collisions when allocating new relfilenumber
> -OIDs.  So cleaning up isn't really necessary.
> +because relfilenumber is 56 bit wide so logically there should not be any
> +collisions.  So cleaning up isn't really necessary.
>
> I don't agree that orphaned files are harmless, but changing that is
> beyond the scope of this patch. I think that the way you've ended the
> sentence isn't sufficiently clear and correct even if we accept the
> principle that orphaned files are harmless. What I think we should
> say instead is "because the relfilenode counter is monotonically
> increasing. The maximum value is 2^56-1, and there is no provision for
> wraparound."

Done

> +       /*
> +        * Check if we set the new relfilenumber then do we run out of
> the logged
> +        * relnumber, if so then we need to WAL log again.  Otherwise,
> just adjust
> +        * the relnumbercount.
> +        */
> +       relnumbercount = relnumber - ShmemVariableCache->nextRelFileNumber;
> +       if (ShmemVariableCache->relnumbercount <= relnumbercount)
> +       {
> +               LogNextRelFileNumber(relnumber + VAR_RELNUMBER_PREFETCH);
> +               ShmemVariableCache->relnumbercount = VAR_RELNUMBER_PREFETCH;
> +       }
> +       else
> +               ShmemVariableCache->relnumbercount -= relnumbercount;
>
> Would it be clearer, here and elsewhere, if VariableCacheData tracked
> nextRelFileNumber and nextUnloggedRelFileNumber instead of
> nextRelFileNumber and relnumbercount? I'm not 100% sure, but the idea
> seems worth considering.

I think it is in line with oidCount, what do you think?

>
> +        * Flush xlog record to disk before returning.  To protect against file
> +        * system changes reaching the disk before the
> XLOG_NEXT_RELFILENUMBER log.
>
> The way this is worded, you would need it to be just one sentence,
> like "Flush xlog record to disk before returning to protect
> against...". Or else add "this is," like "This is to protect
> against..."
>
> But I'm thinking maybe we could reword it a little more, perhaps
> something like this: "Flush xlog record to disk before returning. We
> want to be sure that the in-memory nextRelFileNumber value is always
> larger than any relfilenumber that is already in use on disk. To
> maintain that invariant, we must make sure that the record we just
> logged reaches the disk before any new files are created."

Done

> This isn't a full review, I think, but I'm kind of out of time and
> energy for today.

I have updated some other comments as well.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: making relfilenodes 56 bits

From:
Robert Haas
Date:
Trying to compile with 0001 and 0002 applied and -Wall -Werror in use, I get:

buf_init.c:119:4: error: implicit truncation from 'int' to bit-field
changes value from -1 to 255 [-Werror,-Wbitfield-constant-conversion]
                        CLEAR_BUFFERTAG(buf->tag);
                        ^~~~~~~~~~~~~~~~~~~~~~~~~
../../../../src/include/storage/buf_internals.h:122:14: note: expanded
from macro 'CLEAR_BUFFERTAG'
        (a).forkNum = InvalidForkNumber, \
                    ^ ~~~~~~~~~~~~~~~~~
1 error generated.

More review comments:

In pg_buffercache_pages_internal(), I suggest that we add an error
check. If fctx->record[i].relfilenumber is greater than the largest
value that can be represented as an OID, then let's do something like:

ERROR: relfilenode is too large to be represented as an OID
HINT: Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE

That way, instead of confusing people by giving them an incorrect
answer, we'll push them toward a step that they may have overlooked.
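Something like this, perhaps (sketch only; the cutoff assumes OID_MAX
from postgres_ext.h, and the struct/field names follow the existing
extension code loosely):

if (fctx->record[i].relfilenumber > OID_MAX)
    ereport(ERROR,
            (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
             errmsg("relfilenode %llu is too large to be represented as an OID",
                    (unsigned long long) fctx->record[i].relfilenumber),
             errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE.")));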

In src/backend/access/transam/README, I think the sentence "So
cleaning up isn't really necessary." isn't too helpful. I suggest
replacing it with "Thus, on-disk collisions aren't possible."

> I think it is in line with oidCount, what do you think?

Oh it definitely is, and maybe it's OK the way you have it. But the
OID stuff has wraparound to worry about, and this doesn't; and this
has the SetNextRelFileNumber and that doesn't; so it is not
necessarily the case that the design which is best for that case is
also best for this case.

I believe that the persistence model for SetNextRelFileNumber needs
more thought. Right now I believe it's relying on the fact that, after
we try to restore the dump, we'll try to perform a clean shutdown of
the server before doing anything important, and that will persist the
final value, whatever it ends up being. However, there's no comment
explaining that theory of operation, and it seems pretty fragile
anyway. What if things don't go as planned? Suppose the power goes out
halfway through restoring the dump, and the user for some reason then
gives up on running pg_upgrade and just tries to do random things with
that server? Then I think there will be trouble, because nothing has
updated the nextrelfilenumber value and yet there are potentially new
files on disk. Maybe that's a stretch since I think other things might
also break if you do that, but I'm also not sure that's the only
scenario to worry about, especially if you factor in the possibility
of future code changes, like changes to the timing of when we shut
down and restart the server during pg_upgrade, or other uses of
binary-upgrade mode, or whatever. I don't know. Perhaps it's not
actually broken but I'm inclined to think it should be logging its
changes.

A related thought is that I don't think this patch has as many
cross-checks as it could have. For instance, suppose that when we
replay a WAL record that creates relation storage, we cross-check that
the value is less than the counter. I think you have a check in there
someplace that will error out if there is an actual collision --
although I can't find it at the moment, and possibly we want to add
some comments there even if it's in existing code -- but this kind of
thing would detect bugs that could lead to collisions even if no
collision actually occurs, e.g. because a duplicate relfilenumber is
used but in a different database or tablespace. It might be worth
spending some time thinking about other possible cross-checks too.
We're trying to create a system where the relfilenumber counter is
always ahead of all the relfilenumbers used on disk, but the coupling
between the relfilenumber-advancement machinery and the
make-files-on-disk machinery is pretty loose, and so there is a risk
that bugs could escape detection. Whatever we can do to increase the
probability of noticing when things have gone wrong, and/or to notice
it quicker, will be good.

+       if (!IsBinaryUpgrade)
+               elog(ERROR, "the RelFileNumber can be set only during binary upgrade");

I think you should remove the word "the". Primary error messages are
written telegram-style and "the" is usually omitted, especially at the
beginning of the message.

+        * This should not impact the performance, since we are not WAL logging
+        * it for every allocation, but only after allocating 512 RelFileNumber.

I think this claim is overly bold, and it would be better if the
current value of the constant weren't encoded in the comment. I'm not
sure we really need this part of the comment at all, but if we do,
maybe it should be reworded to something like: This is potentially a
somewhat expensive operation, but fortunately we only need to do it
for every VAR_RELNUMBER_PREFETCH new relfilenodes. Or maybe it's
better to put this explanation in GetNewRelFileNumber instead, e.g.
"If we run out of logged RelFileNumbers, then we must log more, and
also wait for the xlog record to be flushed to disk. This is somewhat
expensive, but hopefully VAR_RELNUMBER_PREFETCH is large enough that
this doesn't slow things down too much."

One thing that isn't great about this whole scheme is that it can lead
to lock pile-ups. Once somebody is waiting for an
XLOG_NEXT_RELFILENUMBER record to reach the disk, any other backend
that tries to get a new relfilenumber is going to block waiting for
RelFileNumberGenLock. I wonder whether this effect is observable in
practice: suppose we just create relations in a tight loop from inside
a stored procedure, and do that simultaneously in multiple backends?
What does the wait event distribution look like? Can we observe a lot
of RelFileNumberGenLock events or not really? I guess if we reduce
VAR_RELNUMBER_PREFETCH enough we can probably create a problem, but
how small a value is needed?

One thing we could think about doing here is try to stagger the xlog
and the flush. When we've used VAR_RELNUMBER_PREFETCH/2
relfilenumbers, log a record reserving VAR_RELNUMBER_PREFETCH from
where we are now, and remember the LSN. When we've used up our entire
previous allocation, XLogFlush() that record before allowing the
additional values to be used. The bookkeeping would be a bit more
complicated than currently, but I don't think it would be too bad. I'm
not sure how much it would actually help, though, or whether we need
it. If new relfilenumbers are being used up really quickly, then maybe
the record won't get flushed into the background before we run out of
available numbers anyway, and if they aren't, then maybe it doesn't
matter. On the other hand, even one transaction commit between when
the record is logged and when we run out of the previous allocation is
enough to force a flush, at least with synchronous_commit=on, so maybe
the chances of being able to piggyback on an existing flush are not so
bad after all. I'm not sure.
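To make the bookkeeping concrete, here is roughly what I have in mind
(just a sketch: pendingRelNumberRecPtr is an invented field, and this
assumes the XLogFlush() is taken out of LogNextRelFileNumber() itself,
which would instead just return the record's LSN):

RelFileNumber
GetNewRelFileNumber(void)
{
    RelFileNumber result;

    LWLockAcquire(RelFileNumberGenLock, LW_EXCLUSIVE);

    /*
     * Halfway through the current batch: log a record reserving the
     * next batch, but don't flush it yet; just remember where it is.
     */
    if (XLogRecPtrIsInvalid(ShmemVariableCache->pendingRelNumberRecPtr) &&
        ShmemVariableCache->relnumbercount <= VAR_RELNUMBER_PREFETCH / 2)
        ShmemVariableCache->pendingRelNumberRecPtr =
            LogNextRelFileNumber(ShmemVariableCache->nextRelFileNumber +
                                 ShmemVariableCache->relnumbercount +
                                 VAR_RELNUMBER_PREFETCH);

    /*
     * Previous allocation exhausted: the staged record must now be on
     * disk before we hand out any value it covers.
     */
    if (ShmemVariableCache->relnumbercount == 0)
    {
        XLogFlush(ShmemVariableCache->pendingRelNumberRecPtr);
        ShmemVariableCache->pendingRelNumberRecPtr = InvalidXLogRecPtr;
        ShmemVariableCache->relnumbercount = VAR_RELNUMBER_PREFETCH;
    }

    result = ShmemVariableCache->nextRelFileNumber++;
    ShmemVariableCache->relnumbercount--;

    LWLockRelease(RelFileNumberGenLock);

    return result;
}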

+        * Generate a new relfilenumber.  We can not reuse the old relfilenumber
+        * because the unused relfilenumber files are not unlinked until the next
+        * checkpoint.  So if move the relation to the old tablespace again, we
+        * will get the conflicting relfilenumber file.

This is much clearer now but the grammar has some issues, e.g. "the
unused relfilenumber" should be just "unused relfilenumber" and "So if
move" is not right either. I suggest: We cannot reuse the old
relfilenumber because of the possibility that that relation will be
moved back to the original tablespace before the next checkpoint. At
that point, the first segment of the main fork won't have been
unlinked yet, and an attempt to create new relation storage with that
same relfilenumber will fail."

In theory I suppose there's another way we could solve this problem:
keep using the same relfilenumber, and if the scenario described here
occurs, just reuse the old file. The reason why we can't do that today
is because we could be running with wal_level=minimal and replace a
relation with one whose contents aren't logged. If WAL replay then
replays the drop, we're in trouble. But if the only time we reuse a
relfilenumber for new relation storage is when relations are moved
around, then I think that scenario can't happen. However, I think
assigning a new relfilenumber is probably better, because it gets us
closer to a world in which relfilenumbers are never reused at all. It
doesn't get us all the way there because of createdb() and movedb(),
but it gets us closer and I prefer that.

+ * XXX although this all was true when the relfilenumbers were 32 bits wide but
+ * now the relfilenumbers are 56 bits wide so we don't have risk of
+ * relfilenumber being reused so in future we can immediately unlink the first
+ * segment as well.  Although we can reuse the relfilenumber during createdb()
+ * using file copy method or during movedb() but the above scenario is only
+ * applicable when we create a new relation.

Here is an edited version:

XXX. Although all of this was true when relfilenumbers were 32 bits wide, they
are now 56 bits wide and do not wrap around, so in the future we can change
the code to immediately unlink the first segment of the relation along
with all the
others. We still do reuse relfilenumbers when createdb() is performed using the
file-copy method or during movedb(), but the scenario described above can only
happen when creating a new relation.

I think that pg_filenode_relation,
binary_upgrade_set_next_heap_relfilenode, and other functions that are
now going to be accepting a RelFileNode using the SQL int8 datatype
should bounds-check the argument. It could be <0 or >2^56, and I
believe it'd be best to throw an error for that straight off. The
three functions in pg_upgrade_support.c could share a static
subroutine for this, to avoid duplicating code.

This bounds-checking issue also applies to the -f argument to pg_checksums.
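For example, something like this shared helper (a sketch; the
MAX_RELFILENUMBER constant is assumed here to be defined as 2^56-1
somewhere in the patch):

#define MAX_RELFILENUMBER   ((RelFileNumber) ((UINT64CONST(1) << 56) - 1))

static RelFileNumber
relfilenumber_from_int64(int64 value)
{
    if (value < 0 || (uint64) value > MAX_RELFILENUMBER)
        ereport(ERROR,
                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("relfilenode %lld is out of range",
                        (long long) value)));
    return (RelFileNumber) value;
}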

I notice that the patch makes no changes to relmapper.c, and I think
that's a problem. Notice in particular:

#define MAX_MAPPINGS            62  /* 62 * 8 + 16 = 512 */

I believe that making RelFileNumber into a 64-bit value will cause the
8 in the calculation above to change to 16, defeating the intention
that the size of the file ought to be the smallest imaginable size of
a disk sector. It does seem like it would have been smart to include a
StaticAssertStmt in this file someplace that checks that the data
structure has the expected size, and now might be a good time, perhaps
in a separate patch, to add one. If we do nothing fancy here, the
maximum number of mappings will have to be reduced from 62 to 31,
which is a problem because global/pg_filenode.map currently has 48
entries. We could try to arrange to squeeze padding out of the
RelMapping struct, which would let us use just 12 bytes per mapping,
which would increase the limit to 41, but that's still less than we're
using already, never mind leaving room for future growth.
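For the assertion, even something as simple as this, placed e.g. at
the top of write_relmap_file(), would do (a sketch; the exact wording
and location are just a suggestion):

StaticAssertStmt(sizeof(RelMapFile) <= 512,
                 "RelMapFile must fit within the smallest assumed sector size");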

I don't know what to do about this exactly. I believe it's been
previously suggested that the actual minimum sector size on reasonably
modern hardware is never as small as 512 bytes, so maybe the file size
can just be increased to 1kB or something. If that idea is judged
unsafe, I can think of two other possible approaches offhand. One is
that we could move away from the idea of storing the OIDs in the file
along with the RelFileNodes, and instead store the offset for a given
RelFileNode at a fixed offset in the file. That would require either
hard-wiring offset tables into the code someplace, or generating them
as part of the build process, with separate tables for shared and
database-local relation map files. The other is that we could have
multiple 512-byte sectors and try to arrange for each relation to be
in the same sector with the indexes of that relation, since the
comments in relmapper.c say this:

 * aborts.  An important factor here is that the indexes and toast table of
 * a mapped catalog must also be mapped, so that the rewrites/relocations of
 * all these files commit in a single map file update rather than being tied
 * to transaction commit.

This suggests that atomicity is required across a table and its
indexes, but not that it's needed across arbitrary sets of entries in
the file.

Whatever we do, we shouldn't forget to bump RELMAPPER_FILEMAGIC.

--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -34,6 +34,13 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
        /* oid */
        Oid                     oid;

+       /* access method; 0 if not a table / index */
+       Oid                     relam BKI_DEFAULT(heap) BKI_LOOKUP_OPT(pg_am);
+
+       /* identifier of physical storage file */
+       /* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
+       int64           relfilenode BKI_DEFAULT(0);
+
        /* class name */

        NameData        relname;

@@ -49,13 +56,6 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
        /* class owner */
        Oid                     relowner BKI_DEFAULT(POSTGRES) BKI_LOOKUP(pg_authid);

-       /* access method; 0 if not a table / index */
-       Oid                     relam BKI_DEFAULT(heap) BKI_LOOKUP_OPT(pg_am);
-
-       /* identifier of physical storage file */
-       /* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
-       Oid                     relfilenode BKI_DEFAULT(0);
-
        /* identifier of table space for relation (0 means default for database) */
        Oid                     reltablespace BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_tablespace);

As Andres said elsewhere, this stinks. Not sure what the resolution of
the discussion over on the "AIX support" thread is going to be yet,
but hopefully not this.

+       uint32          relNumber_low;  /* relfilenumber 32 lower bits */
+       uint32          relNumber_hi:24;        /* relfilenumber 24 high bits */
+       uint32          forkNum:8;              /* fork number */

I still think we'd be better off with something like uint32
relForkDetails[2]. The bitfields would be nice if they meant that we
didn't have to do bit-shifting and masking operations ourselves, but
with the field split this way, we do anyway. So what's the point in
mixing the approaches?
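To illustrate, the packed representation plus accessors could look
roughly like this (a sketch: the exact bit layout is arbitrary as long
as the accessors agree on it, and BufTagSetRelForkDetails is an
invented name):

typedef struct buftag
{
    Oid         spcOid;             /* tablespace */
    Oid         dbOid;              /* database */
    uint32      relForkDetails[2];  /* relNumber high bits + forkNum, relNumber low bits */
    BlockNumber blockNum;           /* block number within the fork */
} BufferTag;

/* relForkDetails[0]: bits 8..31 hold relNumber bits 32..55, bits 0..7 the fork */

static inline RelFileNumber
BufTagGetRelNumber(const BufferTag *tag)
{
    uint64      hi = tag->relForkDetails[0] >> 8;

    return (RelFileNumber) ((hi << 32) | tag->relForkDetails[1]);
}

static inline ForkNumber
BufTagGetForkNum(const BufferTag *tag)
{
    /* cast through int8 so that a stored 0xFF reads back as InvalidForkNumber */
    return (ForkNumber) (int8) (tag->relForkDetails[0] & 0xFF);
}

static inline void
BufTagSetRelForkDetails(BufferTag *tag, RelFileNumber relnumber, ForkNumber forknum)
{
    tag->relForkDetails[0] = ((uint32) (relnumber >> 32) << 8) | (uint8) forknum;
    tag->relForkDetails[1] = (uint32) relnumber;
}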

  * relNumber identifies the specific relation.  relNumber corresponds to
  * pg_class.relfilenode (NOT pg_class.oid, because we need to be able
  * to assign new physical files to relations in some situations).
- * Notice that relNumber is only unique within a database in a particular
- * tablespace.
+ * Notice that relNumber is unique within a cluster.

I think this paragraph would benefit from more revision. I think that
we should just nuke the parenthesized part altogether, since we'll now
never use pg_class.oid as relNumber, and to suggest otherwise is just
confusing. As for the last sentence, "Notice that relNumber is unique
within a cluster." isn't wrong, but I think we could be more precise
and informative. Perhaps: "relNumber values are assigned by
GetNewRelFileNumber(), which will only ever assign the same value once
during the lifetime of a cluster. However, since CREATE DATABASE
duplicates the relfilenumbers of the template database, the values are
in practice only unique within a database, not globally."

That's all I've got for now.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From:
Dilip Kumar
Date:
On Thu, Jul 7, 2022 at 10:56 PM Robert Haas <robertmhaas@gmail.com> wrote:

I have accepted all the suggestion, find my inline replies where we
need more thoughts.

> buf_init.c:119:4: error: implicit truncation from 'int' to bit-field
> changes value from -1 to 255 [-Werror,-Wbitfield-constant-conversion]
>                         CLEAR_BUFFERTAG(buf->tag);
>                         ^~~~~~~~~~~~~~~~~~~~~~~~~
> ../../../../src/include/storage/buf_internals.h:122:14: note: expanded
> from macro 'CLEAR_BUFFERTAG'
>         (a).forkNum = InvalidForkNumber, \
>                     ^ ~~~~~~~~~~~~~~~~~
> 1 error generated.

Hmm, so now we are using an unsigned int field, so IMHO we can make
InvalidForkNumber 255 instead of -1?


> > I think it is in line with oidCount, what do you think?
>
> Oh it definitely is, and maybe it's OK the way you have it. But the
> OID stuff has wraparound to worry about, and this doesn't; and this
> has the SetNextRelFileNumber and that doesn't; so it is not
> necessarily the case that the design which is best for that case is
> also best for this case.

Yeah, right, but now with the latest changes for piggybacking on the
XLogFlush() I think it is cleaner to have the count.

> I believe that the persistence model for SetNextRelFileNumber needs
> more thought. Right now I believe it's relying on the fact that, after
> we try to restore the dump, we'll try to perform a clean shutdown of
> the server before doing anything important, and that will persist the
> final value, whatever it ends up being. However, there's no comment
> explaining that theory of operation, and it seems pretty fragile
> anyway. What if things don't go as planned? Suppose the power goes out
> halfway through restoring the dump, and the user for some reason then
> gives up on running pg_upgrade and just tries to do random things with
> that server? Then I think there will be trouble, because nothing has
> updated the nextrelfilenumber value and yet there are potentially new
> files on disk. Maybe that's a stretch since I think other things might
> also break if you do that, but I'm also not sure that's the only
> scenario to worry about, especially if you factor in the possibility
> of future code changes, like changes to the timing of when we shut
> down and restart the server during pg_upgrade, or other uses of
> binary-upgrade mode, or whatever. I don't know. Perhaps it's not
> actually broken but I'm inclined to think it should be logging its
> changes.

But we are already logging this whenever the relfilenumber being set
is beyond the already-logged range; am I missing something?  See this
change:
+    relnumbercount = relnumber - ShmemVariableCache->nextRelFileNumber;
+    if (ShmemVariableCache->relnumbercount <= relnumbercount)
+    {
+        LogNextRelFileNumber(relnumber + VAR_RELNUMBER_PREFETCH, NULL);
+        ShmemVariableCache->relnumbercount = VAR_RELNUMBER_PREFETCH;
+    }
+    else
+        ShmemVariableCache->relnumbercount -= relnumbercount;

> A related thought is that I don't think this patch has as many
> cross-checks as it could have. For instance, suppose that when we
> replay a WAL record that creates relation storage, we cross-check that
> the value is less than the counter. I think you have a check in there
> someplace that will error out if there is an actual collision --
> although I can't find it at the moment, and possibly we want to add
> some comments there even if it's in existing code -- but this kind of
> thing would detect bugs that could lead to collisions even if no
> collision actually occurs, e.g. because a duplicate relfilenumber is
> used but in a different database or tablespace. It might be worth
> spending some time thinking about other possible cross-checks too.
> We're trying to create a system where the relfilenumber counter is
> always ahead of all the relfilenumbers used on disk, but the coupling
> between the relfilenumber-advancement machinery and the
> make-files-on-disk machinery is pretty loose, and so there is a risk
> that bugs could escape detection. Whatever we can do to increase the
> probability of noticing when things have gone wrong, and/or to notice
> it quicker, will be good.

I had those changes in v7-0003; now I have merged them into 0002.
This adds assert checks while replaying the WAL for smgr create and
smgr truncate, and in the normal path, when allocating a new
relfilenumber, we assert that no file with that number already exists
on disk.

> One thing that isn't great about this whole scheme is that it can lead
> to lock pile-ups. Once somebody is waiting for an
> XLOG_NEXT_RELFILENUMBER record to reach the disk, any other backend
> that tries to get a new relfilenumber is going to block waiting for
> RelFileNumberGenLock. I wonder whether this effect is observable in
> practice: suppose we just create relations in a tight loop from inside
> a stored procedure, and do that simultaneously in multiple backends?
> What does the wait event distribution look like? Can we observe a lot
> of RelFileNumberGenLock events or not really? I guess if we reduce
> VAR_RELNUMBER_PREFETCH enough we can probably create a problem, but
> how small a value is needed?

I have done some performance tests: with very small values I can see a
lot of wait events for RelFileNumberGen, but with bigger numbers like
256 or 512 it is not really bad.  See the results at the end of this
mail [1].

> One thing we could think about doing here is try to stagger the xlog
> and the flush. When we've used VAR_RELNUMBER_PREFETCH/2
> relfilenumbers, log a record reserving VAR_RELNUMBER_PREFETCH from
> where we are now, and remember the LSN. When we've used up our entire
> previous allocation, XLogFlush() that record before allowing the
> additional values to be used. The bookkeeping would be a bit more
> complicated than currently, but I don't think it would be too bad. I'm
> not sure how much it would actually help, though, or whether we need
> it. If new relfilenumbers are being used up really quickly, then maybe
> the record won't get flushed into the background before we run out of
> available numbers anyway, and if they aren't, then maybe it doesn't
> matter. On the other hand, even one transaction commit between when
> the record is logged and when we run out of the previous allocation is
> enough to force a flush, at least with synchronous_commit=on, so maybe
> the chances of being able to piggyback on an existing flush are not so
> bad after all. I'm not sure.

I have made these changes in GetNewRelFileNumber(); this required
tracking the last logged record pointer as well, but I think it looks
clean.  With this I can see some reduction in RelFileNumberGen wait
events [1].

> In theory I suppose there's another way we could solve this problem:
> keep using the same relfilenumber, and if the scenario described here
> occurs, just reuse the old file. The reason why we can't do that today
> is because we could be running with wal_level=minimal and replace a
> relation with one whose contents aren't logged. If WAL replay then
> replays the drop, we're in trouble. But if the only time we reuse a
> relfilenumber for new relation storage is when relations are moved
> around, then I think that scenario can't happen. However, I think
> assigning a new relfilenumber is probably better, because it gets us
> closer to a world in which relfilenumbers are never reused at all. It
> doesn't get us all the way there because of createdb() and movedb(),
> but it gets us closer and I prefer that.

I agree with you.

> I notice that the patch makes no changes to relmapper.c, and I think
> that's a problem. Notice in particular:
>
> #define MAX_MAPPINGS            62  /* 62 * 8 + 16 = 512 */
>
> I believe that making RelFileNumber into a 64-bit value will cause the
> 8 in the calculation above to change to 16, defeating the intention
> that the size of the file ought to be the smallest imaginable size of
> a disk sector. It does seem like it would have been smart to include a
> StaticAssertStmt in this file someplace that checks that the data
> structure has the expected size, and now might be a good time, perhaps
> in a separate patch, to add one. If we do nothing fancy here, the
> maximum number of mappings will have to be reduced from 62 to 31,
> which is a problem because global/pg_filenode.map currently has 48
> entries. We could try to arrange to squeeze padding out of the
> RelMapping struct, which would let us use just 12 bytes per mapping,
> which would increase the limit to 41, but that's still less than we're
> using already, never mind leaving room for future growth.
>
> I don't know what to do about this exactly. I believe it's been
> previously suggested that the actual minimum sector size on reasonably
> modern hardware is never as small as 512 bytes, so maybe the file size
> can just be increased to 1kB or something. If that idea is judged
> unsafe, I can think of two other possible approaches offhand. One is
> that we could move away from the idea of storing the OIDs in the file
> along with the RelFileNodes, and instead store the offset for a given
> RelFileNode at a fixed offset in the file. That would require either
> hard-wiring offset tables into the code someplace, or generating them
> as part of the build process, with separate tables for shared and
> database-local relation map files. The other is that we could have
> multiple 512-byte sectors and try to arrange for each relation to be
> in the same sector with the indexes of that relation, since the
> comments in relmapper.c say this:
>
>  * aborts.  An important factor here is that the indexes and toast table of
>  * a mapped catalog must also be mapped, so that the rewrites/relocations of
>  * all these files commit in a single map file update rather than being tied
>  * to transaction commit.
>
> This suggests that atomicity is required across a table and its
> indexes, but not that it's needed across arbitrary sets of entries in the
> file.
>
> Whatever we do, we shouldn't forget to bump RELMAPPER_FILEMAGIC.

I am not sure what the best solution is here, but I agree that most
modern hardware will have a sector size bigger than 512, so we can
just change the file size to 1024.

The current value of RELMAPPER_FILEMAGIC is 0x592717.  I am not sure
how this version ID was decided; is it just a random magic number, or
is it based on some logic?

>
> +       uint32          relNumber_low;  /* relfilenumber 32 lower bits */
> +       uint32          relNumber_hi:24;        /* relfilenumber 24 high bits */
> +       uint32          forkNum:8;              /* fork number */
>
> I still think we'd be better off with something like uint32
> relForkDetails[2]. The bitfields would be nice if they meant that we
> didn't have to do bit-shifting and masking operations ourselves, but
> with the field split this way, we do anyway. So what's the point in
> mixing the approaches?

Actually, with this we were able to access the forkNum directly, but I
also think changing it to relForkDetails[2] is cleaner, so I have done
that.  And as part of the related changes in 0001 I have removed the
direct access to forkNum.
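
To make the new layout concrete, the accessors now look roughly like the
below.  This is only a simplified sketch: BUFFERTAG_RELNUMBER_HIGH_BITS
and the function names are illustrative, and RelFileNumber is assumed to
be the widened 64-bit type.

#define BUFFERTAG_RELNUMBER_HIGH_BITS 24    /* high part of the 56-bit number */

static inline void
BufTagSetRelForkDetails(BufferTag *tag, RelFileNumber relnumber,
                        ForkNumber forknum)
{
    Assert(forknum <= MAX_FORKNUM);

    /* low 32 bits go into [1]; high 24 bits plus the fork number into [0] */
    tag->relForkDetails[0] = (uint32) (relnumber >> 32) |
        ((uint32) forknum << BUFFERTAG_RELNUMBER_HIGH_BITS);
    tag->relForkDetails[1] = (uint32) relnumber;
}

static inline RelFileNumber
BufTagGetRelNumber(const BufferTag *tag)
{
    uint64      hi = tag->relForkDetails[0] &
        ((1U << BUFFERTAG_RELNUMBER_HIGH_BITS) - 1);

    return (RelFileNumber) ((hi << 32) | tag->relForkDetails[1]);
}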

[1] Wait event details

Procedure:
CREATE OR REPLACE FUNCTION create_table(count int) RETURNS void AS $$
DECLARE
  relname varchar;
  pid int;
  i   int;
BEGIN
  SELECT pg_backend_pid() INTO pid;
  relname := 'test_' || pid;
  FOR i IN 1..count LOOP
    EXECUTE format('CREATE TABLE %s(a int)', relname);

    EXECUTE format('DROP TABLE %s', relname);
  END LOOP;
END;
$$ LANGUAGE plpgsql;

Target test: Executed "select create_table(100);" query from pgbench
with 32 concurrent backends.

VAR_RELNUMBER_PREFETCH = 8

    905  LWLock          | LockManager
    346  LWLock          | RelFileNumberGen
    192
    190  Activity        | WalWriterMain

VAR_RELNUMBER_PREFETCH=128
   1187  LWLock          | LockManager
    247  LWLock          | RelFileNumberGen
    139  Activity        | CheckpointerMain

VAR_RELNUMBER_PREFETCH=256

   1029  LWLock          | LockManager
    158  LWLock          | BufferContent
    134  Activity        | CheckpointerMain
    134  Activity        | AutoVacuumMain
    133  Activity        | BgWriterMain
    132  Activity        | WalWriterMain
    130  Activity        | LogicalLauncherMain
    123  LWLock          | RelFileNumberGen

VAR_RELNUMBER_PREFETCH=512

  1174  LWLock          | LockManager
    136  Activity        | CheckpointerMain
    136  Activity        | BgWriterMain
    136  Activity        | AutoVacuumMain
    134  Activity        | WalWriterMain
    134  Activity        | LogicalLauncherMain
     99  LWLock          | BufferContent
     35  LWLock          | RelFileNumberGen

VAR_RELNUMBER_PREFETCH=2048
   1070  LWLock          | LockManager
    160  LWLock          | BufferContent
    156  Activity        | CheckpointerMain
    156
    155  Activity        | BgWriterMain
    154  Activity        | AutoVacuumMain
    153  Activity        | WalWriterMain
    149  Activity        | LogicalLauncherMain
     31  LWLock          | RelFileNumberGen
     28  Timeout         | VacuumDelay


VAR_RELNUMBER_PREFETCH=4096
Note: no wait event for RelFileNumberGen at value 4096

New patch with piggybacking XLogFlush()

VAR_RELNUMBER_PREFETCH = 8

  1105  LWLock          | LockManager
    143  LWLock          | BufferContent
    140  Activity        | CheckpointerMain
    140  Activity        | BgWriterMain
    139  Activity        | WalWriterMain
    138  Activity        | AutoVacuumMain
    137  Activity        | LogicalLauncherMain
    115  LWLock          | RelFileNumberGen

VAR_RELNUMBER_PREFETCH = 256
   1130  LWLock          | LockManager
    141  Activity        | CheckpointerMain
    139  Activity        | BgWriterMain
    137  Activity        | AutoVacuumMain
    136  Activity        | LogicalLauncherMain
    135  Activity        | WalWriterMain
     69  LWLock          | BufferContent
     31  LWLock          | RelFileNumberGen

VAR_RELNUMBER_PREFETCH = 1024
Note: no wait event for RelFileNumberGen at value 1024


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Mon, Jul 11, 2022 at 7:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > buf_init.c:119:4: error: implicit truncation from 'int' to bit-field
> > changes value from -1 to 255 [-Werror,-Wbitfield-constant-conversion]
> >                         CLEAR_BUFFERTAG(buf->tag);
> >                         ^~~~~~~~~~~~~~~~~~~~~~~~~
> > ../../../../src/include/storage/buf_internals.h:122:14: note: expanded
> > from macro 'CLEAR_BUFFERTAG'
> >         (a).forkNum = InvalidForkNumber, \
> >                     ^ ~~~~~~~~~~~~~~~~~
> > 1 error generated.
>
> Hmm so now we are using an unsigned int field so IMHO we can make
> InvalidForkNumber to 255 instead of -1?

If we're going to do that I think we had better do it as a separate,
preparatory patch.

It also makes me wonder why we're using macros rather than static
inline functions in buf_internals.h. I wonder whether we could do
something like this, for example, and keep InvalidForkNumber as -1:

static inline ForkNumber
BufTagGetForkNum(BufferTag *tagPtr)
{
    int8 ret;

    StaticAssertStmt(MAX_FORKNUM <= INT8_MAX);
    ret = (int8) (tagPtr->relForkDetails[0] >> BUFFERTAG_RELNUMBER_BITS);
    return (ForkNumber) ret;
}

Even if we don't use that particular trick, I think we've generally
been moving toward using static inline functions rather than macros,
because it provides better type-safety and the code is often easier to
read. Maybe we should also approach it that way here. Or even commit a
preparatory patch replacing the existing macros with inline functions.
Or maybe it's best to leave it alone, not sure.

It feels like some of the changes to buf_internals.h in 0002 could be
moved into 0001. If we're going to introduce a combined method to set
the relnumber and fork, I think we could do that in 0001 rather than
making 0001 introduce a macro to set just the relfilenumber and then
having 0002 change it around again.

BUFFERTAG_RELNUMBER_BITS feels like a lie. It's defined to be 24, but
based on the name you'd expect it to be 56.

> But we are already logging this if we are setting the relfilenumber
> which is out of the already logged range, am I missing something?
> Check this change.
> +    relnumbercount = relnumber - ShmemVariableCache->nextRelFileNumber;
> +    if (ShmemVariableCache->relnumbercount <= relnumbercount)
> +    {
> +        LogNextRelFileNumber(relnumber + VAR_RELNUMBER_PREFETCH, NULL);
> +        ShmemVariableCache->relnumbercount = VAR_RELNUMBER_PREFETCH;
> +    }
> +    else
> +        ShmemVariableCache->relnumbercount -= relnumbercount;

Oh, I guess I missed that.

> I had those changes in v7-0003, now I have merged with 0002.  This has
> assert check while replaying the WAL for smgr create and smgr
> truncate, and while during normal path when allocating the new
> relfilenumber we are asserting for any existing file.

I think a test-and-elog might be better. Most users won't be running
assert-enabled builds, but this seems worth checking regardless.
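
Something as simple as this is what I have in mind (just a sketch, with
a made-up function name and message wording):

/* Sanity check: a freshly allocated relfilenumber must not exist on disk. */
static void
CheckNewRelFileNumberIsUnused(RelFileLocator rlocator)
{
    char       *path = relpathperm(rlocator, MAIN_FORKNUM);

    if (access(path, F_OK) == 0)
        elog(ERROR, "newly allocated relfilenumber %llu already exists on disk",
             (unsigned long long) rlocator.relNumber);
    pfree(path);
}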

> I have done some performance tests, with very small values I can see a
> lot of wait events for RelFileNumberGen but with bigger numbers like
> 256 or 512 it is not really bad.  See results at the end of the
> mail[1]

It's a little hard to interpret these results because you don't say
how often you were checking the wait events, or how long the
operation took to complete. I suppose we can guess the relative time
scale from the number of Activity events: if there were 190
WalWriterMain events observed, then the time to complete the operation
is probably 190 times how often you were checking the wait events, but
was that every second or every half second or every tenth of a second?

> I have done these changes during GetNewRelFileNumber() this required
> to track the last logged record pointer as well but I think this looks
> clean.  With this I can see some reduction in RelFileNumberGen wait
> event[1]

I find the code you wrote here a little bit magical. I believe it
depends heavily on choosing to issue the new WAL record when we've
exhausted exactly 50% of the available space. I suggest having two
constants, one of which is the number of relfilenumber values per WAL
record, and the other of which is the threshold for issuing a new WAL
record. Maybe something like RFN_VALUES_PER_XLOG and
RFN_NEW_XLOG_THRESHOLD, or something. And then write code that works
correctly for any value of RFN_NEW_XLOG_THRESHOLD between 0 (don't log
new RFNs until old allocation is completely exhausted) and
RFN_VALUES_PER_XLOG - 1 (log new RFNs after using just 1 item from the
previous allocation). That way, if in the future someone decides to
change the constant values, they can do that and the code still works.
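
To sketch what I mean (the constants and variables below are
placeholders; in the real patch this state would live in shared memory,
and LogNextRelFileNumber() is assumed to return the LSN of the record it
writes):

#define RFN_VALUES_PER_XLOG     512     /* values reserved by one WAL record */
#define RFN_NEW_XLOG_THRESHOLD  (RFN_VALUES_PER_XLOG / 2)   /* when to log again */

static RelFileNumber nextRelFileNumber;     /* next value to hand out */
static RelFileNumber loggedRelFileNumber;   /* end of range covered by WAL */
static RelFileNumber flushedRelFileNumber;  /* end of range known flushed */
static XLogRecPtr loggedRecPtr;             /* LSN of the latest range record */

RelFileNumber
GetNewRelFileNumber(void)
{
    RelFileNumber result;

    LWLockAcquire(RelFileNumberGenLock, LW_EXCLUSIVE);

    /*
     * When the remaining headroom drops to the threshold, reserve another
     * range in WAL, but don't flush it yet; just remember where it is.
     */
    if (loggedRelFileNumber - nextRelFileNumber <= RFN_NEW_XLOG_THRESHOLD)
    {
        loggedRecPtr = LogNextRelFileNumber(loggedRelFileNumber +
                                            RFN_VALUES_PER_XLOG);
        loggedRelFileNumber += RFN_VALUES_PER_XLOG;
    }

    /* Out of values known to be flushed: now we must wait for the WAL. */
    if (nextRelFileNumber >= flushedRelFileNumber)
    {
        XLogFlush(loggedRecPtr);
        flushedRelFileNumber = loggedRelFileNumber;
    }

    result = nextRelFileNumber++;

    LWLockRelease(RelFileNumberGenLock);

    return result;
}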

> I am not sure what is the best solution here, but I agree that most of
> the modern hardware will have bigger sector size than 512 so we can
> just change file size of 1024.

I went looking for previous discussion of this topic. Here's Heikki
doubting whether even 512 is too big:

http://postgr.es/m/f03d9166-ad12-2a3c-f605-c1873ee86ae4@iki.fi

Here's Thomas saying that he thinks it's probably mostly 4kB these
days, except when it isn't:

http://postgr.es/m/CAEepm=1e91zMk-vZszCOGDtKd=DhMLQjgENRSxcbSEhxuEPpfA@mail.gmail.com

Here's Tom with another idea how to reduce space usage:

http://postgr.es/m/7235.1566626302@sss.pgh.pa.us

It doesn't look to me like there's a consensus that some bigger number is safe.

> The current value of RELMAPPER_FILEMAGIC is 0x592717, I am not sure
> how this version ID is decide is this some random magic number or
> based on some logic?

Hmm, maybe we're not supposed to bump this value after all. I guess
maybe it's intended strictly as a magic number, rather than as a
version indicator. At least, we've never changed it up until now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Andres Freund
Дата:
Hi,

On 2022-07-07 13:26:29 -0400, Robert Haas wrote:
> We're trying to create a system where the relfilenumber counter is
> always ahead of all the relfilenumbers used on disk, but the coupling
> between the relfilenumber-advancement machinery and the
> make-files-on-disk machinery is pretty loose, and so there is a risk
> that bugs could escape detection. Whatever we can do to increase the
> probability of noticing when things have gone wrong, and/or to notice
> it quicker, will be good.

ISTM that we should redefine pg_class_tblspc_relfilenode_index to only cover
relfilenode - afaics there's no real connection to the tablespace
anymore. That'd a) reduce the size of the index b) guarantee uniqueness across
tablespaces.

I don't know where we could fit a sanity check that connects to all databases
and detects duplicates across all the pg_class instances. Perhaps pg_amcheck?


It may be worth changing RelidByRelfilenumber() / its infrastructure to not
use reltablespace anymore.


> One thing we could think about doing here is try to stagger the xlog
> and the flush. When we've used VAR_RELNUMBER_PREFETCH/2
> relfilenumbers, log a record reserving VAR_RELNUMBER_PREFETCH from
> where we are now, and remember the LSN. When we've used up our entire
> previous allocation, XLogFlush() that record before allowing the
> additional values to be used. The bookkeeping would be a bit more
> complicated than currently, but I don't think it would be too bad. I'm
> not sure how much it would actually help, though, or whether we need
> it.

I think that's a very good idea. My concern around doing an XLogFlush() is
that it could lead to a lot of tiny f[data]syncs(), because not much else
needs to be written out. But the scheme you describe would likely lead to the
XLogFlush() flushing plenty of other WAL writes out, addressing that.


> If new relfilenumbers are being used up really quickly, then maybe
> the record won't get flushed into the background before we run out of
> available numbers anyway, and if they aren't, then maybe it doesn't
> matter. On the other hand, even one transaction commit between when
> the record is logged and when we run out of the previous allocation is
> enough to force a flush, at least with synchronous_commit=on, so maybe
> the chances of being able to piggyback on an existing flush are not so
> bad after all. I'm not sure.

Even if the record isn't yet flushed out by the time we need to, the
deferred-ness means that there's a good chance more useful records can also be
flushed out at the same time...


> I notice that the patch makes no changes to relmapper.c, and I think
> that's a problem. Notice in particular:
> 
> #define MAX_MAPPINGS            62  /* 62 * 8 + 16 = 512 */
> 
> I believe that making RelFileNumber into a 64-bit value will cause the
> 8 in the calculation above to change to 16, defeating the intention
> that the size of the file ought to be the smallest imaginable size of
> a disk sector. It does seem like it would have been smart to include a
> StaticAssertStmt in this file someplace that checks that the data
> structure has the expected size, and now might be a good time, perhaps
> in a separate patch, to add one.

+1

Perhaps MAX_MAPPINGS should be at least partially computed instead of doing
the math in a comment? sizeof(RelMapping) could directly be used, and we could
define SIZEOF_RELMAPFILE_START with a StaticAssert() enforcing it to be equal
to offsetof(RelMapFile, mappings), if we move crc & pad to *before* mappings -
afaics that should be entirely doable.
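
Roughly along these lines, I mean (a sketch only; the RelMapping field
names, the 512 budget and SIZEOF_RELMAPFILE_START are approximations,
not a proposal for the exact values):

typedef struct RelMapping
{
    Oid             mapoid;         /* OID of a catalog */
    RelFileNumber   mapfilenumber;  /* its relfilenumber (now 64 bits) */
} RelMapping;

#define RELMAP_FILE_SIZE        512     /* sector budget, whatever we pick */
#define SIZEOF_RELMAPFILE_START 16      /* magic + num_mappings + crc + pad */
#define MAX_MAPPINGS \
    ((RELMAP_FILE_SIZE - SIZEOF_RELMAPFILE_START) / sizeof(RelMapping))

typedef struct RelMapFile
{
    int32       magic;              /* always RELMAPPER_FILEMAGIC */
    int32       num_mappings;       /* number of valid RelMapping entries */
    pg_crc32c   crc;                /* CRC of the mappings */
    int32       pad;                /* keep the header at 16 bytes */
    RelMapping  mappings[MAX_MAPPINGS];
} RelMapFile;

StaticAssertDecl(offsetof(RelMapFile, mappings) == SIZEOF_RELMAPFILE_START,
                 "SIZEOF_RELMAPFILE_START is out of sync with RelMapFile");
StaticAssertDecl(sizeof(RelMapFile) <= RELMAP_FILE_SIZE,
                 "RelMapFile does not fit in RELMAP_FILE_SIZE");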


> If we do nothing fancy here, the maximum number of mappings will have to be
> reduced from 62 to 31, which is a problem because global/pg_filenode.map
> currently has 48 entries. We could try to arrange to squeeze padding out of
> the RelMapping struct, which would let us use just 12 bytes per mapping,
> which would increase the limit to 41, but that's still less than we're using
> already, never mind leaving room for future growth.

Ugh.


> I don't know what to do about this exactly. I believe it's been
> previously suggested that the actual minimum sector size on reasonably
> modern hardware is never as small as 512 bytes, so maybe the file size
> can just be increased to 1kB or something.

I'm not so sure that's a good idea - while the hardware sector size likely
isn't 512 on much storage anymore, it's still the size that most storage
protocols use. Which then means you need to be confident not just that you can
rely on storage atomicity, but also that nothing in the
  filesystem <-> block layer <-> driver
stack somehow causes a single larger write to be split up into two.

And if you use a filesystem with a smaller filesystem block size, there might
not even be a choice: the write may end up being split in two anyway. E.g. XFS still
supports 512 byte blocks (although I think it's deprecating block size < 1024).


Maybe the easiest fix here would be to replace the file atomically. Then we
don't need this <= 512 byte stuff. These are done rarely enough that I don't
think the overhead of creating a separate file, fsyncing that, renaming,
fsyncing, would be a problem?

Greetings,

Andres Freund



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Mon, Jul 11, 2022 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:
> ISTM that we should redefine pg_class_tblspc_relfilenode_index to only cover
> relfilenode - afaics there's no real connection to the tablespace
> anymore. That'd a) reduce the size of the index b) guarantee uniqueness across
> tablespaces.

Sounds like a good idea.

> I don't know where we could fit a sanity check that connects to all databases
> and detects duplicates across all the pg_class instances. Perhaps pg_amcheck?

Unless we're going to change the way CREATE DATABASE works, uniqueness
across databases is not guaranteed.

> I think that's a very good idea. My concern around doing an XLogFlush() is
> that it could lead to a lot of tiny f[data]syncs(), because not much else
> needs to be written out. But the scheme you describe would likely lead the
> XLogFlush() flushing plenty other WAL writes out, addressing that.

Oh, interesting. I hadn't considered that angle.

> Maybe the easiest fix here would be to replace the file atomically. Then we
> don't need this <= 512 byte stuff. These are done rarely enough that I don't
> think the overhead of creating a separate file, fsyncing that, renaming,
> fsyncing, would be a problem?

Anything we can reasonably do to reduce the number of places where
we're relying on things being <= 512 bytes seems like a step in the
right direction to me. It's very difficult to know whether such code
is correct, or what the probability is that crossing the 512-byte
boundary would break anything.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Andres Freund
Дата:
Hi,

On 2022-07-11 15:08:57 -0400, Robert Haas wrote:
> On Mon, Jul 11, 2022 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:
> > I don't know where we could fit a sanity check that connects to all databases
> > and detects duplicates across all the pg_class instances. Perhaps pg_amcheck?
> 
> Unless we're going to change the way CREATE DATABASE works, uniqueness
> across databases is not guaranteed.

You could likely address that by not flagging conflicts iff oid also matches?
Not sure if worth it, but ...


> > Maybe the easiest fix here would be to replace the file atomically. Then we
> > don't need this <= 512 byte stuff. These are done rarely enough that I don't
> > think the overhead of creating a separate file, fsyncing that, renaming,
> > fsyncing, would be a problem?
> 
> Anything we can reasonably do to reduce the number of places where
> we're relying on things being <= 512 bytes seems like a step in the
> right direction to me. It's very difficult to know whether such code
> is correct, or what the probability is that crossing the 512-byte
> boundary would break anything.

Seems pretty simple to do. Have write_relmapper_file() write to a .tmp file
first (likely adding O_TRUNC to flags), use durable_rename() to rename it into
place.  The tempfile should probably be written out before the XLogInsert(),
the durable_rename() after, although I think it'd also be correct to more
closely approximate the current sequence.
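
I.e. roughly this sequence (a sketch; variable declarations, error
handling and the contents of the WAL record are elided, and xlrec stands
in for the existing xl_relmap_update struct):

/* write the new map into a temp file and make its contents durable */
snprintf(temppath, sizeof(temppath), "%s.tmp", mappath);
fd = OpenTransientFile(temppath, O_WRONLY | O_CREAT | O_TRUNC | PG_BINARY);
write(fd, newmap, sizeof(RelMapFile));
pg_fsync(fd);
CloseTransientFile(fd);

/* WAL-log the update before the new file can become the real one */
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xlrec));
lsn = XLogInsert(RM_RELMAP_ID, XLOG_RELMAP_UPDATE);

/* atomically swap it into place; fsyncs the new name and its directory */
durable_rename(temppath, mappath, ERROR);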

It's a lot more problematic to do this for the control file, because we can
end up updating that at a high frequency on standbys, due to minRecoveryPoint.

I have wondered about maintaining that in a dedicated file instead, and
perhaps even doing so on a primary.

Greetings,

Andres Freund



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Mon, Jul 11, 2022 at 3:34 PM Andres Freund <andres@anarazel.de> wrote:
> Seems pretty simple to do. Have write_relmapper_file() write to a .tmp file
> first (likely adding O_TRUNC to flags), use durable_rename() to rename it into
> place.  The tempfile should probably be written out before the XLogInsert(),
> the durable_rename() after, although I think it'd also be correct to more
> closely approximate the current sequence.

Something like this?

I chose not to use durable_rename() here, because that allowed me to
do more of the work before starting the critical section, and it's
probably slightly more efficient this way, too. That could be changed,
though, if you really want to stick with durable_rename().

I haven't done anything about actually making the file variable-length
here, either, which I think is what we would want to do. If this seems
more or less all right, I can work on that next.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От
Andres Freund
Дата:
On 2022-07-11 16:11:53 -0400, Robert Haas wrote:
> On Mon, Jul 11, 2022 at 3:34 PM Andres Freund <andres@anarazel.de> wrote:
> > Seems pretty simple to do. Have write_relmapper_file() write to a .tmp file
> > first (likely adding O_TRUNC to flags), use durable_rename() to rename it into
> > place.  The tempfile should probably be written out before the XLogInsert(),
> > the durable_rename() after, although I think it'd also be correct to more
> > closely approximate the current sequence.
> 
> Something like this?

Yea. I've not looked carefully, but on a quick skim it looks good.


> I chose not to use durable_rename() here, because that allowed me to
> do more of the work before starting the critical section, and it's
> probably slightly more efficient this way, too. That could be changed,
> though, if you really want to stick with durable_rename().

I guess I'm not enthused about duplicating the necessary knowledge in ever more
places. We've forgotten one of the magic incantations in the past, and needing
to find all the places that need to be patched is a bit bothersome.

Perhaps we could extract helpers out of durable_rename()?

OTOH, I don't really see what we gain by keeping things out of the critical
section? It does seem good to have the temp-file creation/truncation and write
separately, but after that I don't think it's worth much to avoid a
PANIC. What legitimate issue does it avoid?


Greetings,

Andres Freund



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Mon, Jul 11, 2022 at 7:22 PM Andres Freund <andres@anarazel.de> wrote:
> I guess I'm not enthused in duplicating the necessary knowledge in evermore
> places. We've forgotten one of the magic incantations in the past, and needing
> to find all the places that need to be patched is a bit bothersome.
>
> Perhaps we could add extract helpers out of durable_rename()?
>
> OTOH, I don't really see what we gain by keeping things out of the critical
> section? It does seem good to have the temp-file creation/truncation and write
> separately, but after that I don't think it's worth much to avoid a
> PANIC. What legitimate issue does it avoid?

OK, so then I think we should just use durable_rename(). Here's a
patch that does it that way. I briefly considered the idea of
extracting helpers, but it doesn't seem worthwhile to me. There's not
that much code in durable_rename() in the first place.

In this version, I also removed the struct padding, changed the limit
on the number of entries to a nice round 64, and made some comment
updates. I considered trying to go further and actually make the file
variable-size, so that we never again need to worry about the limit on
the number of entries, but I don't actually think that's a good idea.
It would require substantially more changes to the code in this file,
and that means there's more risk of introducing bugs, and I don't see
that there's much value anyway, because if we ever do hit the current
limit, we can just raise the limit.

If we were going to split up durable_rename(), the only intelligible
split I can see would be to have a second version of the function, or
a flag to the existing function, that caters to the situation where
the old file is already known to have been fsync()'d.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Mon, Jul 11, 2022 at 9:49 PM Robert Haas <robertmhaas@gmail.com> wrote:
>

> It also makes me wonder why we're using macros rather than static
> inline functions in buf_internals.h. I wonder whether we could do
> something like this, for example, and keep InvalidForkNumber as -1:
>
> static inline ForkNumber
> BufTagGetForkNum(BufferTag *tagPtr)
> {
>     int8 ret;
>
>     StaticAssertStmt(MAX_FORKNUM <= INT8_MAX);
>     ret = (int8) (tagPtr->relForkDetails[0] >> BUFFERTAG_RELNUMBER_BITS);
>     return (ForkNumber) ret;
> }
>
> Even if we don't use that particular trick, I think we've generally
> been moving toward using static inline functions rather than macros,
> because it provides better type-safety and the code is often easier to
> read. Maybe we should also approach it that way here. Or even commit a
> preparatory patch replacing the existing macros with inline functions.
> Or maybe it's best to leave it alone, not sure.

I think it makes sense to convert the existing macros as well; I have
attached a patch for the same.
>
> > I had those changes in v7-0003, now I have merged with 0002.  This has
> > assert check while replaying the WAL for smgr create and smgr
> > truncate, and while during normal path when allocating the new
> > relfilenumber we are asserting for any existing file.
>
> I think a test-and-elog might be better. Most users won't be running
> assert-enabled builds, but this seems worth checking regardless.

IMHO we can convert the recovery-time asserts to elog, but isn't the one
we do after each GetNewRelFileNumber() better kept as an assert, since
it does file access and so can be costly?

> > I have done some performance tests, with very small values I can see a
> > lot of wait events for RelFileNumberGen but with bigger numbers like
> > 256 or 512 it is not really bad.  See results at the end of the
> > mail[1]
>
> It's a little hard to interpret these results because you don't say
> how often you were checking the wait events, or how long the
> operation took to complete. I suppose we can guess the relative time
> scale from the number of Activity events: if there were 190
> WalWriterMain events observed, then the time to complete the operation
> is probably 190 times how often you were checking the wait events, but
> was that every second or every half second or every tenth of a second?

I am executing it every 0.5 sec using the below script in psql:
\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5

And running the test for 60 sec:
./pgbench -c 32 -j 32 -T 60 -f create_script.sql -p 54321  postgres

$ cat create_script.sql
select create_table(100);

// function body 'create_table'
CREATE OR REPLACE FUNCTION create_table(count int) RETURNS void AS $$
DECLARE
  relname varchar;
  pid int;
  i   int;
BEGIN
  SELECT pg_backend_pid() INTO pid;
  relname := 'test_' || pid;
  FOR i IN 1..count LOOP
    EXECUTE format('CREATE TABLE %s(a int)', relname);

    EXECUTE format('DROP TABLE %s', relname);
  END LOOP;
END;
$$ LANGUAGE plpgsql;



> > I have done these changes during GetNewRelFileNumber() this required
> > to track the last logged record pointer as well but I think this looks
> > clean.  With this I can see some reduction in RelFileNumberGen wait
> > event[1]
>
> I find the code you wrote here a little bit magical. I believe it
> depends heavily on choosing to issue the new WAL record when we've
> exhausted exactly 50% of the available space. I suggest having two
> constants, one of which is the number of relfilenumber values per WAL
> record, and the other of which is the threshold for issuing a new WAL
> record. Maybe something like RFN_VALUES_PER_XLOG and
> RFN_NEW_XLOG_THRESHOLD, or something. And then write code that works
> correctly for any value of RFN_NEW_XLOG_THRESHOLD between 0 (don't log
> new RFNs until old allocation is completely exhausted) and
> RFN_VALUES_PER_XLOG - 1 (log new RFNs after using just 1 item from the
> previous allocation). That way, if in the future someone decides to
> change the constant values, they can do that and the code still works.

ok



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От
Andres Freund
Дата:
Hi,

On 2022-07-12 09:51:12 -0400, Robert Haas wrote:
> On Mon, Jul 11, 2022 at 7:22 PM Andres Freund <andres@anarazel.de> wrote:
> > I guess I'm not enthused in duplicating the necessary knowledge in evermore
> > places. We've forgotten one of the magic incantations in the past, and needing
> > to find all the places that need to be patched is a bit bothersome.
> >
> > Perhaps we could add extract helpers out of durable_rename()?
> >
> > OTOH, I don't really see what we gain by keeping things out of the critical
> > section? It does seem good to have the temp-file creation/truncation and write
> > separately, but after that I don't think it's worth much to avoid a
> > PANIC. What legitimate issue does it avoid?
> 
> OK, so then I think we should just use durable_rename(). Here's a
> patch that does it that way. I briefly considered the idea of
> extracting helpers, but it doesn't seem worthwhile to me. There's not
> that much code in durable_rename() in the first place.

Cool.


> In this version, I also removed the struct padding, changed the limit
> on the number of entries to a nice round 64, and made some comment
> updates.

What does currently happen if we exceed that?

I wonder if we should just reference a new define generated by genbki.pl
documenting the number of relations that need to be tracked. Then we don't
need to maintain this manually going forward.


> I considered trying to go further and actually make the file
> variable-size, so that we never again need to worry about the limit on
> the number of entries, but I don't actually think that's a good idea.

Yea, I don't really see what we'd gain. For this stuff to change we need to
recompile anyway.


> If we were going to split up durable_rename(), the only intelligible
> split I can see would be to have a second version of the function, or
> a flag to the existing function, that caters to the situation where
> the old file is already known to have been fsync()'d.

I was thinking of something like durable_rename_prep() that'd fsync the
file/directories under their old names, and then durable_rename_exec() that
actually renames and then fsyncs.  But without a clear usecase...
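
Signature-wise I was imagining something like the below (hypothetical,
these helpers don't exist today; fsync_fname_ext() and
fsync_parent_path() are the static helpers already in fd.c):

/* flush the file under its old name, so its contents are durable */
static int
durable_rename_prep(const char *oldfile, int elevel)
{
    if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
        return -1;
    return 0;
}

/* do the actual rename, then make the rename itself durable */
static int
durable_rename_exec(const char *oldfile, const char *newfile, int elevel)
{
    if (rename(oldfile, newfile) < 0)
    {
        ereport(elevel,
                (errcode_for_file_access(),
                 errmsg("could not rename file \"%s\" to \"%s\": %m",
                        oldfile, newfile)));
        return -1;
    }

    /* fsync the file under its new name, plus the containing directory */
    if (fsync_fname_ext(newfile, false, false, elevel) != 0)
        return -1;
    return fsync_parent_path(newfile, elevel);
}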


> +    /* Write new data to the file. */
> +    pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_WRITE);
> +    if (write(fd, newmap, sizeof(RelMapFile)) != sizeof(RelMapFile))
...
> +    pgstat_report_wait_end();
> +

Not for this patch, but we eventually should move this sequence into a
wrapper. Perhaps combined with retry handling for short writes, the ENOSPC
stuff and an error message when the write fails. It's a bit insane how many
copies of this we have.
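
Something in this direction, maybe (a sketch; the function name and the
wait-event parameter are made up):

static void
write_file_or_error(int fd, const void *buf, size_t len,
                    const char *path, uint32 wait_event_info)
{
    ssize_t     rc;

    pgstat_report_wait_start(wait_event_info);
    rc = write(fd, buf, len);
    pgstat_report_wait_end();

    if (rc != (ssize_t) len)
    {
        /* if write() didn't set errno, assume the problem is disk space */
        if (rc >= 0)
            errno = ENOSPC;
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write file \"%s\": %m", path)));
    }
}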


> diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
> index b578e2ec75..5d3775ccde 100644
> --- a/src/include/utils/wait_event.h
> +++ b/src/include/utils/wait_event.h
> @@ -193,7 +193,7 @@ typedef enum
>      WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
>      WAIT_EVENT_LOGICAL_REWRITE_WRITE,
>      WAIT_EVENT_RELATION_MAP_READ,
> -    WAIT_EVENT_RELATION_MAP_SYNC,
> +    WAIT_EVENT_RELATION_MAP_RENAME,

Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
since it includes fsync etc?

Greetings,

Andres Freund



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Tue, Jul 12, 2022 at 1:09 PM Andres Freund <andres@anarazel.de> wrote:
> What does currently happen if we exceed that?

elog

> > diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
> > index b578e2ec75..5d3775ccde 100644
> > --- a/src/include/utils/wait_event.h
> > +++ b/src/include/utils/wait_event.h
> > @@ -193,7 +193,7 @@ typedef enum
> >       WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
> >       WAIT_EVENT_LOGICAL_REWRITE_WRITE,
> >       WAIT_EVENT_RELATION_MAP_READ,
> > -     WAIT_EVENT_RELATION_MAP_SYNC,
> > +     WAIT_EVENT_RELATION_MAP_RENAME,
>
> Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
> since it includes fsync etc?

Sure, I had it that way for a while and changed it at the last minute.
I can change it back.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Hannu Krosing
Дата:
Re: staticAssertStmt(MAX_FORKNUM <= INT8_MAX);

Have you really thought through making the ForkNum 8-bit ?

For example, this would limit columnar storage with each column
stored in its own fork (which I'd say is not entirely unreasonable)
to having only about ~250 columns.

And there can easily be other use cases where we do not want to limit
the number of forks so much.

Cheers
Hannu

On Tue, Jul 12, 2022 at 10:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jul 12, 2022 at 1:09 PM Andres Freund <andres@anarazel.de> wrote:
> > What does currently happen if we exceed that?
>
> elog
>
> > > diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
> > > index b578e2ec75..5d3775ccde 100644
> > > --- a/src/include/utils/wait_event.h
> > > +++ b/src/include/utils/wait_event.h
> > > @@ -193,7 +193,7 @@ typedef enum
> > >       WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
> > >       WAIT_EVENT_LOGICAL_REWRITE_WRITE,
> > >       WAIT_EVENT_RELATION_MAP_READ,
> > > -     WAIT_EVENT_RELATION_MAP_SYNC,
> > > +     WAIT_EVENT_RELATION_MAP_RENAME,
> >
> > Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
> > since it includes fsync etc?
>
> Sure, I had it that way for a while and changed it at the last minute.
> I can change it back.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>
>



Re: making relfilenodes 56 bits

От
Andres Freund
Дата:
Hi,

Please don't top quote - as mentioned a couple times recently.

On 2022-07-12 23:00:22 +0200, Hannu Krosing wrote:
> Re: staticAssertStmt(MAX_FORKNUM <= INT8_MAX);
> 
> Have you really thought through making the ForkNum 8-bit ?

MAX_FORKNUM is way lower right now. And hardcoded. So this doesn't imply a new
restriction. As we iterate over 0..MAX_FORKNUM in a bunch of places (with
filesystem access each time), it's not feasible to make that number large.

Greetings,

Andres Freund



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Tue, Jul 12, 2022 at 6:02 PM Andres Freund <andres@anarazel.de> wrote:
> MAX_FORKNUM is way lower right now. And hardcoded. So this doesn't imply a new
> restriction. As we iterate over 0..MAX_FORKNUM in a bunch of places (with
> filesystem access each time), it's not feasible to make that number large.

Yeah. TBH, what I'd really like to do is kill the entire fork system
with fire and replace it with something more scalable, which would
maybe permit the sort of thing Hannu suggests here. With the current
system, forget it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Tue, Jul 12, 2022 at 7:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
>

> In this version, I also removed the struct padding, changed the limit
> on the number of entries to a nice round 64, and made some comment
> updates. I considered trying to go further and actually make the file
> variable-size, so that we never again need to worry about the limit on
> the number of entries, but I don't actually think that's a good idea.
> It would require substantially more changes to the code in this file,
> and that means there's more risk of introducing bugs, and I don't see
> that there's much value anyway, because if we ever do hit the current
> limit, we can just raise the limit.
>
> If we were going to split up durable_rename(), the only intelligible
> split I can see would be to have a second version of the function, or
> a flag to the existing function, that caters to the situation where
> the old file is already known to have been fsync()'d.

The patch looks good except one minor comment

+ * corruption.  Since the file might be more tha none standard-size disk
+ * sector in size, we cannot rely on overwrite-in-place. Instead, we generate

typo "more tha none" -> "more than one"

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Wed, Jul 13, 2022 at 9:35 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jul 12, 2022 at 7:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
>
> > In this version, I also removed the struct padding, changed the limit
> > on the number of entries to a nice round 64, and made some comment
> > updates. I considered trying to go further and actually make the file
> > variable-size, so that we never again need to worry about the limit on
> > the number of entries, but I don't actually think that's a good idea.
> > It would require substantially more changes to the code in this file,
> > and that means there's more risk of introducing bugs, and I don't see
> > that there's much value anyway, because if we ever do hit the current
> > limit, we can just raise the limit.
> >
> > If we were going to split up durable_rename(), the only intelligible
> > split I can see would be to have a second version of the function, or
> > a flag to the existing function, that caters to the situation where
> > the old file is already known to have been fsync()'d.
>
> The patch looks good except one minor comment
>
> + * corruption.  Since the file might be more tha none standard-size disk
> + * sector in size, we cannot rely on overwrite-in-place. Instead, we generate
>
> typo "more tha none" -> "more than one"
>
I have fixed this and included this change in the new patch series.

Apart from this I have fixed all the pending issues, which include:

- Change existing macros to inline functions done in 0001.
- Change pg_class index from (tbspc, relfilenode) to relfilenode and
also change RelidByRelfilenumber().  In RelidByRelfilenumber I have
changed the hash to be maintained based on just the relfilenumber, but we
still need to pass the tablespace to identify whether it is a shared
relation or not.  If we want we can make it bool but I don't think
that is really needed here.
- Changed logic of GetNewRelFileNumber() based on what Robert
described, and instead of tracking the pending logged relnumbercount
now I am tracking the last loggedRelNumber, which helps a little bit in
SetNextRelFileNumber in making the code cleaner, but otherwise it doesn't
make much difference.
- Some new asserts in buf_internal inline function to validate value
of computed/input relfilenumber.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Thu, Jul 14, 2022 at 5:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> Apart from this I have fixed all the pending issues that includes
>
> - Change existing macros to inline functions done in 0001.
> - Change pg_class index from (tbspc, relfilenode) to relfilenode and
> also change RelidByRelfilenumber().  In RelidByRelfilenumber I have
> changed the hash to maintain based on just the relfilenumber but we
> still need to pass the tablespace to identify whether it is a shared
> relation or not.  If we want we can make it bool but I don't think
> that is really needed here.
> - Changed logic of GetNewRelFileNumber() based on what Robert
> described, and instead of tracking the pending logged relnumbercount
> now I am tracking last loggedRelNumber, which help little bit in
> SetNextRelFileNumber in making code cleaner, but otherwise it doesn't
> make much difference.
> - Some new asserts in buf_internal inline function to validate value
> of computed/input relfilenumber.

I was doing some more testing by setting FirstNormalRelFileNumber to a
high value (more than 32 bits), and I noticed a couple of problems
there, e.g. relpath is still using the OIDCHARS macro, which says a
relfilenumber file name can be at most 10 characters long, which is no
longer true.  So we need to change this value to 20 and also need to
carefully rename the macros and other variable names used for this
purpose.

Similarly there was some issue in macro in buf_internal.h while
fetching the relfilenumber.  So I will relook into all those issues
and repost the patch soon.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Mon, Jul 18, 2022 at 4:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I was doing some more testing by setting the FirstNormalRelFileNumber
> to a high value(more than 32 bits) I have noticed a couple of problems
> there e.g. relpath is still using OIDCHARS macro which says max
> relfilenumber file name can be only 10 character long which is no
> longer true.  So there we need to change this value to 20 and also
> need to carefully rename the macros and other variable names used for
> this purpose.
>
> Similarly there was some issue in macro in buf_internal.h while
> fetching the relfilenumber.  So I will relook into all those issues
> and repost the patch soon.

I have fixed these existing issues and there was also some issue in
pg_dump.c which was creating problems in upgrading to the same version
while using a higher range of the relfilenumber.

There was also an issue where the relfilenode of a user table from the
old cluster could conflict with a system table of the new cluster.  As
a solution, currently for system table objects (while creating their
storage for the first time) we keep to the low range of relfilenumbers;
basically we use the same relfilenumber as the OID, so that during
upgrade a normal user table from the old cluster will not conflict
with the system tables in the new cluster.  But with this solution
Robert told me (in an off-list chat) about a problem: if in future we
want to make relfilenumbers completely unique within a cluster by
implementing CREATEDB differently, then we cannot do that, as we have
created fixed relfilenodes for the system tables.

I am not sure what exactly we can do to avoid that, because even if we
do something to avoid it in the new cluster, the old cluster might
already be using non-unique relfilenodes, so after upgrading the new
cluster will also get those non-unique relfilenodes.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От
Thomas Munro
Дата:
On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> [v10 patch set]

Hi Dilip, I'm experimenting with these patches and will hopefully have
more to say soon, but I just wanted to point out that this builds with
warnings and failed on 3/4 of the CI OSes on cfbot's last run.  Maybe
there is the good kind of uninitialised data on Linux, and the bad
kind of uninitialised data on those other pesky systems?



Re: making relfilenodes 56 bits

От
vignesh C
Дата:
On Wed, Jul 20, 2022 at 4:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 18, 2022 at 4:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I was doing some more testing by setting the FirstNormalRelFileNumber
> > to a high value(more than 32 bits) I have noticed a couple of problems
> > there e.g. relpath is still using OIDCHARS macro which says max
> > relfilenumber file name can be only 10 character long which is no
> > longer true.  So there we need to change this value to 20 and also
> > need to carefully rename the macros and other variable names used for
> > this purpose.
> >
> > Similarly there was some issue in macro in buf_internal.h while
> > fetching the relfilenumber.  So I will relook into all those issues
> > and repost the patch soon.
>
> I have fixed these existing issues and there was also some issue in
> pg_dump.c which was creating problems in upgrading to the same version
> while using a higher range of the relfilenumber.
>
> There was also an issue where the user table from the old cluster's
> relfilenode could conflict with the system table of the new cluster.
> As a solution currently for system table object (while creating
> storage first time) we are keeping the low range of relfilenumber,
> basically we are using the same relfilenumber as OID so that during
> upgrade the normal user table from the old cluster will not conflict
> with the system tables in the new cluster.  But with this solution
> Robert told me (in off list chat) a problem that in future if we want
> to make relfilenumber completely unique within a cluster by
> implementing the CREATEDB differently then we can not do that as we
> have created fixed relfilenodes for the system tables.
>
> I am not sure what exactly we can do to avoid that because even if we
> do something  to avoid that in the new cluster the old cluster might
> be already using the non-unique relfilenode so after upgrading the new
> cluster will also get those non-unique relfilenode.

Thanks for the patch, my comments from the initial review:
1) Since we have changed the macros to inline functions, should we
change the function names similar to the other inline functions in the
same file like: ClearBufferTag, InitBufferTag & BufferTagsEqual:
-#define BUFFERTAGS_EQUAL(a,b) \
-( \
-       RelFileLocatorEquals((a).rlocator, (b).rlocator) && \
-       (a).blockNum == (b).blockNum && \
-       (a).forkNum == (b).forkNum \
-)
+static inline void
+CLEAR_BUFFERTAG(BufferTag *tag)
+{
+       tag->rlocator.spcOid = InvalidOid;
+       tag->rlocator.dbOid = InvalidOid;
+       tag->rlocator.relNumber = InvalidRelFileNumber;
+       tag->forkNum = InvalidForkNumber;
+       tag->blockNum = InvalidBlockNumber;
+}

2) We could move this macros along with the other macros at the top of the file:
+/*
+ * The freeNext field is either the index of the next freelist entry,
+ * or one of these special values:
+ */
+#define FREENEXT_END_OF_LIST   (-1)
+#define FREENEXT_NOT_IN_LIST   (-2)

3) typo thn should be then:
+ * can raise it as necessary if we end up with more mapped relations. For
+ * now, we just pick a round number that is modestly larger thn the expected
+ * number of mappings.
+ */

4) There is one whitespace issue:
git am v10-0004-Widen-relfilenumber-from-32-bits-to-56-bits.patch
Applying: Widen relfilenumber from 32 bits to 56 bits
.git/rebase-apply/patch:1500: space before tab in indent.
(relfilenumber)))); \
warning: 1 line adds whitespace errors.

Regards,
Vignesh



Re: making relfilenodes 56 bits

От
Ashutosh Sharma
Дата:
Hi,

As oid and relfilenumber are linked with each other, I still see that if the oid value reaches the threshold limit, we are unable to create a table with storage. For example I set FirstNormalObjectId to 4294967294 (one value less than the range limit of 2^32 -1 = 4294967295). Now when I try to create a table, the CREATE TABLE command gets stuck because it is unable to find the OID for the comp type although it can find a new relfilenumber.

postgres=# create table t1(a int);
CREATE TABLE

postgres=# select oid, reltype, relfilenode from pg_class where relname = 't1';
    oid     |  reltype   | relfilenode
------------+------------+-------------
 4294967295 | 4294967294 |      100000
(1 row)

postgres=# create table t2(a int);
^CCancel request sent
ERROR:  canceling statement due to user request

Creation of the t2 table gets stuck as it is unable to find a new oid. Basically, the point that I am trying to make here is that even though we will be able to find a new relfilenumber by increasing the relfilenumber size, commands like the above will still not execute if the oid value (of 32 bits) has reached the threshold limit.

--
With Regards,
Ashutosh Sharma.



On Wed, Jul 20, 2022 at 4:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Jul 18, 2022 at 4:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I was doing some more testing by setting the FirstNormalRelFileNumber
> to a high value(more than 32 bits) I have noticed a couple of problems
> there e.g. relpath is still using OIDCHARS macro which says max
> relfilenumber file name can be only 10 character long which is no
> longer true.  So there we need to change this value to 20 and also
> need to carefully rename the macros and other variable names used for
> this purpose.
>
> Similarly there was some issue in macro in buf_internal.h while
> fetching the relfilenumber.  So I will relook into all those issues
> and repost the patch soon.

I have fixed these existing issues and there was also some issue in
pg_dump.c which was creating problems in upgrading to the same version
while using a higher range of the relfilenumber.

There was also an issue where the user table from the old cluster's
relfilenode could conflict with the system table of the new cluster.
As a solution currently for system table object (while creating
storage first time) we are keeping the low range of relfilenumber,
basically we are using the same relfilenumber as OID so that during
upgrade the normal user table from the old cluster will not conflict
with the system tables in the new cluster.  But with this solution
Robert told me (in off list chat) a problem that in future if we want
to make relfilenumber completely unique within a cluster by
implementing the CREATEDB differently then we can not do that as we
have created fixed relfilenodes for the system tables.

I am not sure what exactly we can do to avoid that because even if we
do something  to avoid that in the new cluster the old cluster might
be already using the non-unique relfilenode so after upgrading the new
cluster will also get those non-unique relfilenode.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От
Amul Sul
Дата:
On Fri, Jul 22, 2022 at 4:21 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, Jul 20, 2022 at 4:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 18, 2022 at 4:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I was doing some more testing by setting the FirstNormalRelFileNumber
> > > to a high value(more than 32 bits) I have noticed a couple of problems
> > > there e.g. relpath is still using OIDCHARS macro which says max
> > > relfilenumber file name can be only 10 character long which is no
> > > longer true.  So there we need to change this value to 20 and also
> > > need to carefully rename the macros and other variable names used for
> > > this purpose.
> > >
> > > Similarly there was some issue in macro in buf_internal.h while
> > > fetching the relfilenumber.  So I will relook into all those issues
> > > and repost the patch soon.
> >
> > I have fixed these existing issues and there was also some issue in
> > pg_dump.c which was creating problems in upgrading to the same version
> > while using a higher range of the relfilenumber.
> >
> > There was also an issue where the user table from the old cluster's
> > relfilenode could conflict with the system table of the new cluster.
> > As a solution currently for system table object (while creating
> > storage first time) we are keeping the low range of relfilenumber,
> > basically we are using the same relfilenumber as OID so that during
> > upgrade the normal user table from the old cluster will not conflict
> > with the system tables in the new cluster.  But with this solution
> > Robert told me (in off list chat) a problem that in future if we want
> > to make relfilenumber completely unique within a cluster by
> > implementing the CREATEDB differently then we can not do that as we
> > have created fixed relfilenodes for the system tables.
> >
> > I am not sure what exactly we can do to avoid that because even if we
> > do something  to avoid that in the new cluster the old cluster might
> > be already using the non-unique relfilenode so after upgrading the new
> > cluster will also get those non-unique relfilenode.
>
> Thanks for the patch, my comments from the initial review:
> 1) Since we have changed the macros to inline functions, should we
> change the function names similar to the other inline functions in the
> same file like: ClearBufferTag, InitBufferTag & BufferTagsEqual:
> -#define BUFFERTAGS_EQUAL(a,b) \
> -( \
> -       RelFileLocatorEquals((a).rlocator, (b).rlocator) && \
> -       (a).blockNum == (b).blockNum && \
> -       (a).forkNum == (b).forkNum \
> -)
> +static inline void
> +CLEAR_BUFFERTAG(BufferTag *tag)
> +{
> +       tag->rlocator.spcOid = InvalidOid;
> +       tag->rlocator.dbOid = InvalidOid;
> +       tag->rlocator.relNumber = InvalidRelFileNumber;
> +       tag->forkNum = InvalidForkNumber;
> +       tag->blockNum = InvalidBlockNumber;
> +}
>
> 2) We could move this macros along with the other macros at the top of the file:
> +/*
> + * The freeNext field is either the index of the next freelist entry,
> + * or one of these special values:
> + */
> +#define FREENEXT_END_OF_LIST   (-1)
> +#define FREENEXT_NOT_IN_LIST   (-2)
>
> 3) typo thn should be then:
> + * can raise it as necessary if we end up with more mapped relations. For
> + * now, we just pick a round number that is modestly larger thn the expected
> + * number of mappings.
> + */
>

Few more typos in 0004 patch as well:

the a value
interger
previosly
currenly

> 4) There is one whitespace issue:
> git am v10-0004-Widen-relfilenumber-from-32-bits-to-56-bits.patch
> Applying: Widen relfilenumber from 32 bits to 56 bits
> .git/rebase-apply/patch:1500: space before tab in indent.
> (relfilenumber)))); \
> warning: 1 line adds whitespace errors.
>
> Regards,
> Vignesh
>

Regards,
Amul



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Mon, Jul 25, 2022 at 9:51 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi,
>
> As oid and relfilenumber are linked with each other, I still see that if the oid value reaches the threshold limit,
> we are unable to create a table with storage. For example I set FirstNormalObjectId to 4294967294 (one value less than
> the range limit of 2^32 -1 = 4294967295). Now when I try to create a table, the CREATE TABLE command gets stuck because
> it is unable to find the OID for the comp type although it can find a new relfilenumber.
>

First of all, if the OID value reaches the max oid then it should wrap
around to FirstNormalObjectId and find a new non-conflicting OID.
Since in your case the first normal oid is 4294967294 and the max oid
is 4294967295, there is no scope for wraparound: you can create at
most one object, and once you have created it there are no more unused
oids left, and the current patch is not trying to do anything about
this.

Now, coming to the problem we are trying to solve with the 56-bit
relfilenode: here we are not trying to extend the limit of the system
to create more than 4294967294 objects.  What we are trying to solve
is the reuse of the same disk filenames for different objects.  Also
notice that relfilenodes can get consumed much faster than oids, so
the chances of wraparound are higher; you can truncate/rewrite the
same relation multiple times, so that relation will keep the same
oid but will consume multiple relfilenodes.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Thu, Jul 21, 2022 at 9:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > [v10 patch set]
>
> Hi Dilip, I'm experimenting with these patches and will hopefully have
> more to say soon, but I just wanted to point out that this builds with
> warnings and failed on 3/4 of the CI OSes on cfbot's last run.  Maybe
> there is the good kind of uninitialised data on Linux, and the bad
> kind of uninitialised data on those other pesky systems?

Thanks, I have figured out the issue, I will post the patch soon.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Fri, Jul 22, 2022 at 4:21 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, Jul 20, 2022 at 4:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >

> Thanks for the patch, my comments from the initial review:
> 1) Since we have changed the macros to inline functions, should we
> change the function names similar to the other inline functions in the
> same file like: ClearBufferTag, InitBufferTag & BufferTagsEqual:

I thought about that while doing it, but I am not sure whether it is a
good idea: before my change these were all macros with two naming
conventions, and I only changed them to inline functions, so why change
the names?

> -#define BUFFERTAGS_EQUAL(a,b) \
> -( \
> -       RelFileLocatorEquals((a).rlocator, (b).rlocator) && \
> -       (a).blockNum == (b).blockNum && \
> -       (a).forkNum == (b).forkNum \
> -)
> +static inline void
> +CLEAR_BUFFERTAG(BufferTag *tag)
> +{
> +       tag->rlocator.spcOid = InvalidOid;
> +       tag->rlocator.dbOid = InvalidOid;
> +       tag->rlocator.relNumber = InvalidRelFileNumber;
> +       tag->forkNum = InvalidForkNumber;
> +       tag->blockNum = InvalidBlockNumber;
> +}
>
> 2) We could move this macros along with the other macros at the top of the file:
> +/*
> + * The freeNext field is either the index of the next freelist entry,
> + * or one of these special values:
> + */
> +#define FREENEXT_END_OF_LIST   (-1)
> +#define FREENEXT_NOT_IN_LIST   (-2)

Yeah we can do that.

> 3) typo thn should be then:
> + * can raise it as necessary if we end up with more mapped relations. For
> + * now, we just pick a round number that is modestly larger thn the expected
> + * number of mappings.
> + */
>
> 4) There is one whitespace issue:
> git am v10-0004-Widen-relfilenumber-from-32-bits-to-56-bits.patch
> Applying: Widen relfilenumber from 32 bits to 56 bits
> .git/rebase-apply/patch:1500: space before tab in indent.
> (relfilenumber)))); \
> warning: 1 line adds whitespace errors.

Okay, I will fix it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Tue, Jul 26, 2022 at 10:05 AM Amul Sul <sulamul@gmail.com> wrote:
>
> Few more typos in 0004 patch as well:
>
> the a value
> interger
> previosly
> currenly
>

Thanks for the review, I will fix it in the next version.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Thu, Jul 21, 2022 at 9:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > [v10 patch set]
>
> Hi Dilip, I'm experimenting with these patches and will hopefully have
> more to say soon, but I just wanted to point out that this builds with
> warnings and failed on 3/4 of the CI OSes on cfbot's last run.  Maybe
> there is the good kind of uninitialised data on Linux, and the bad
> kind of uninitialised data on those other pesky systems?

Here is the patch to fix the issue.  Basically, while asserting that the
file exists, the code was not setting the relfilenumber in the
relfilelocator before generating the path, so it was checking the
existence of a random path and the assertion fired randomly.
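
The kind of fix being described looks roughly like this (a sketch rather
than the actual hunk; the variable names are illustrative):

    /* set the relfilenumber before building the path, so the Assert checks
     * the file we actually care about instead of a random path */
    rlocator.relNumber = relnumber;
    path = relpathperm(rlocator, MAIN_FORKNUM);
    Assert(access(path, F_OK) != 0);    /* the file must not already exist */
    pfree(path);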

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: making relfilenodes 56 bits

From
Ashutosh Sharma
Date:
        /*
         * If relfilenumber is unspecified by the caller then create storage
-        * with oid same as relid.
+        * with relfilenumber same as relid if it is a system table otherwise
+        * allocate a new relfilenumber.  For more details read comments atop
+        * FirstNormalRelFileNumber declaration.
         */
        if (!RelFileNumberIsValid(relfilenumber))
-           relfilenumber = relid;
+       {
+           relfilenumber = relid < FirstNormalObjectId ?
+               relid : GetNewRelFileNumber();

Above code says that in the case of system table we want relfilenode to be the same as object id. This technically means that the relfilenode or oid for the system tables would not be exceeding 16383. However in the below lines of code added in the patch, it says there is some chance for the storage path of the user tables from the old cluster conflicting with the storage path of the system tables in the new cluster. Assuming that the OIDs for the user tables on the old cluster would start with 16384 (the first object ID), I see no reason why there would be a conflict.

+/* ----------
+ * RelFileNumber zero is InvalidRelFileNumber.
+ *
+ * For the system tables (OID < FirstNormalObjectId) the initial storage
+ * will be created with the relfilenumber same as their oid.  And, later for
+ * any storage the relfilenumber allocated by GetNewRelFileNumber() will start
+ * at 100000.  Thus, when upgrading from an older cluster, the relation storage
+ * path for the user table from the old cluster will not conflict with the
+ * relation storage path for the system table from the new cluster.  Anyway,
+ * the new cluster must not have any user tables while upgrading, so we needn't
+ * worry about them.
+ * ----------
+ */
+#define FirstNormalRelFileNumber   ((RelFileNumber) 100000)

==

When WAL-logging the next object ID we have chosen the xlog threshold value of 8192, whereas for relfilenode it is 512. Any reason for choosing this low arbitrary value in the case of relfilenumber?

--
With Regards,
Ashutosh Sharma.


Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Tue, Jul 26, 2022 at 6:06 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi,
Note: please avoid top posting.

>         /*
>          * If relfilenumber is unspecified by the caller then create storage
> -        * with oid same as relid.
> +        * with relfilenumber same as relid if it is a system table otherwise
> +        * allocate a new relfilenumber.  For more details read comments atop
> +        * FirstNormalRelFileNumber declaration.
>          */
>         if (!RelFileNumberIsValid(relfilenumber))
> -           relfilenumber = relid;
> +       {
> +           relfilenumber = relid < FirstNormalObjectId ?
> +               relid : GetNewRelFileNumber();
>
> Above code says that in the case of system table we want relfilenode to be the same as object id. This technically
> means that the relfilenode or oid for the system tables would not be exceeding 16383. However in the below lines of code
> added in the patch, it says there is some chance for the storage path of the user tables from the old cluster
> conflicting with the storage path of the system tables in the new cluster. Assuming that the OIDs for the user tables on
> the old cluster would start with 16384 (the first object ID), I see no reason why there would be a conflict.


Basically, the above comment says that the initial system table storage
will be created with the same relfilenumber as the OID, so you are right
that it will not exceed 16383.  And the code below explains the reason:
we do it this way in order to avoid conflicts with user tables from an
older cluster.  Otherwise, in the new design, we have no intention of
keeping the relfilenode the same as the OID.  But an older cluster,
which does not follow this new design, might have user-table
relfilenodes that conflict with system tables in the new cluster, so
with the new design we have to ensure that when creating the initial
cluster we keep the system-table relfilenodes in a low range; directly
using the OID is the simplest way to do that, instead of defining a
completely new range and maintaining a separate counter for it.

> +/* ----------
> + * RelFileNumber zero is InvalidRelFileNumber.
> + *
> + * For the system tables (OID < FirstNormalObjectId) the initial storage
> + * will be created with the relfilenumber same as their oid.  And, later for
> + * any storage the relfilenumber allocated by GetNewRelFileNumber() will start
> + * at 100000.  Thus, when upgrading from an older cluster, the relation storage
> + * path for the user table from the old cluster will not conflict with the
> + * relation storage path for the system table from the new cluster.  Anyway,
> + * the new cluster must not have any user tables while upgrading, so we needn't
> + * worry about them.
> + * ----------
> + */
> +#define FirstNormalRelFileNumber   ((RelFileNumber) 100000)
>
> ==
>
> When WAL-logging the next object ID we have chosen the xlog threshold value of 8192, whereas for relfilenode it is
> 512. Any reason for choosing this low arbitrary value in the case of relfilenumber?

For OIDs, when we cross the max value we wrap around, whereas for
relfilenumbers we cannot allow wraparound within the cluster lifetime.
So it is better not to log ahead as large a number of relfilenumbers as
we do for OIDs.  OTOH, if we make it really low, like 64, then we can
see RelFileNumberGenLock showing up as a wait event under very high
concurrency, e.g. 32 backends continuously creating/dropping tables.
So we chose 512: not so low that it creates lock contention, and not so
high that we need to worry about wasting that many relfilenumbers on a
crash.
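
A rough sketch of the log-ahead scheme described here (the lock and field
names follow the thread, but the logging function name is illustrative):

    #define RELNUMBER_LOG_AHEAD     512

    LWLockAcquire(RelFileNumberGenLock, LW_EXCLUSIVE);

    if (ShmemVariableCache->nextRelFileNumber >=
        ShmemVariableCache->loggedRelFileNumber)
    {
        /* WAL-log a batch of 512 values in advance, so that after a crash we
         * resume at or beyond anything that may already have been handed out;
         * at most those 512 values are wasted */
        LogNextRelFileNumber(ShmemVariableCache->nextRelFileNumber +
                             RELNUMBER_LOG_AHEAD);
        ShmemVariableCache->loggedRelFileNumber =
            ShmemVariableCache->nextRelFileNumber + RELNUMBER_LOG_AHEAD;
    }

    result = ShmemVariableCache->nextRelFileNumber++;
    LWLockRelease(RelFileNumberGenLock);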

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Ashutosh Sharma
Date:
Thanks Dilip. Here are a few comments that I could find upon quickly reviewing the v11 patch:

 /*
+ * Similar to the XLogPutNextOid but instead of writing NEXTOID log record it
+ * writes a NEXT_RELFILENUMBER log record.  If '*prevrecptr' is a valid
+ * XLogRecPtrthen flush the wal upto this record pointer otherwise flush upto

XLogRecPtrthen -> XLogRecPtr then

==

+       switch (relpersistence)
+       {
+           case RELPERSISTENCE_TEMP:
+               backend = BackendIdForTempRelations();
+               break;
+           case RELPERSISTENCE_UNLOGGED:
+           case RELPERSISTENCE_PERMANENT:
+               backend = InvalidBackendId;
+               break;
+           default:
+               elog(ERROR, "invalid relpersistence: %c", relpersistence);
+               return InvalidRelFileNumber;    /* placate compiler */
+       }


I think the above check should be added at the beginning of the function, so that if we hit the default switch case we haven't already acquired the lwlock and done the other work needed to get a new relfilenumber.
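
Something along these lines seems to be what is suggested (a sketch only;
the real function name, signature and body are abbreviated):

    RelFileNumber
    GetNewRelFileNumber(char relpersistence /* , other parameters elided */)
    {
        BackendId   backend;

        /* resolve relpersistence first, before taking any lock or doing
         * any other work towards allocating the new value */
        switch (relpersistence)
        {
            case RELPERSISTENCE_TEMP:
                backend = BackendIdForTempRelations();
                break;
            case RELPERSISTENCE_UNLOGGED:
            case RELPERSISTENCE_PERMANENT:
                backend = InvalidBackendId;
                break;
            default:
                elog(ERROR, "invalid relpersistence: %c", relpersistence);
                return InvalidRelFileNumber;    /* placate compiler */
        }

        /* ... acquire RelFileNumberGenLock and allocate the value here ... */
        return InvalidRelFileNumber;    /* body elided in this sketch */
    }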

==

-   newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL,
+    * Generate a new relfilenumber.  We cannot reuse the old relfilenumber
+    * because of the possibility that that relation will be moved back to the

that that relation -> that relation.

==

+ * option_parse_relfilenumber
+ *
+ * Parse relfilenumber value for an option.  If the parsing is successful,
+ * returns; if parsing fails, returns false.
+ */

If parsing is successful, returns true;

--
With Regards,
Ashutosh Sharma.


Re: making relfilenodes 56 bits

From
Robert Haas
Date:
On Tue, Jul 26, 2022 at 2:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have thought about it while doing so but I am not sure whether it is
> a good idea or not, because before my change these all were macros
> with 2 naming conventions so I just changed to inline function so why
> to change the name.

Well, the reason to change the name would be for consistency. It feels
weird to have some NAMES_LIKETHIS() and other NamesLikeThis().

Now, an argument against that is that it will make back-patching more
annoying, if any code using these functions/macros is touched. But
since the calling sequence is changing anyway (you now have to pass a
pointer rather than the object itself) that argument doesn't really
carry any weight. So I would favor ClearBufferTag(), InitBufferTag(),
etc.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Robert Haas
Date:
On Tue, Jul 12, 2022 at 4:35 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
> > since it includes fsync etc?
>
> Sure, I had it that way for a while and changed it at the last minute.
> I can change it back.

Committed that way, also with the fix for the typo Dilip found.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Ashutosh Sharma
Date:
Some more comments:

==

Shouldn't we retry for a new relfilenumber if "ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER"? There can be cases where some tables are dropped by the user and the relfilenumbers of those tables could be reused, in which case we would need to find a relfilenumber that can be reused. For example, consider the scenario below:

postgres=# create table t1(a int);
CREATE TABLE

postgres=# create table t2(a int);
CREATE TABLE

postgres=# create table t3(a int);
ERROR:  relfilenumber is out of bound

postgres=# drop table t1, t2;
DROP TABLE

postgres=# checkpoint;
CHECKPOINT

postgres=# vacuum;
VACUUM

Now if I try to recreate table t3, it should succeed, shouldn't it? But it doesn't because we simply error out by seeing the nextRelFileNumber saved in the shared memory.

postgres=# create table t1(a int);
ERROR:  relfilenumber is out of bound

I think, above should have worked.

==

<caution>
<para>
Note that while a table's filenode often matches its OID, this is
<emphasis>not</emphasis> necessarily the case; some operations, like
<command>TRUNCATE</command>, <command>REINDEX</command>, <command>CLUSTER</command> and some forms
of <command>ALTER TABLE</command>, can change the filenode while preserving the OID.

I think this note needs some improvement in storage.sgml. It says the table's relfilenode mostly matches its OID, but it doesn't. This will happen only in case of system table and maybe never in case of user table.

==

postgres=# create table t2(a int);
ERROR:  relfilenumber is out of bound

Since this is a user-visible error, I think it would be good to mention relfilenode instead of relfilenumber. Elsewhere (including the user manual) we refer to this as a relfilenode.

--
With Regards,
Ashutosh Sharma.


Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Wed, Jul 27, 2022 at 1:24 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Some more comments:

Note: Please don't top post.

> ==
>
> Shouldn't we retry for a new relfilenumber if "ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER"? There
> can be cases where some tables are dropped by the user and the relfilenumbers of those tables could be reused, in
> which case we would need to find a relfilenumber that can be reused. For example, consider the scenario below:
>
> postgres=# create table t1(a int);
> CREATE TABLE
>
> postgres=# create table t2(a int);
> CREATE TABLE
>
> postgres=# create table t3(a int);
> ERROR:  relfilenumber is out of bound
>
> postgres=# drop table t1, t2;
> DROP TABLE
>
> postgres=# checkpoint;
> CHECKPOINT
>
> postgres=# vacuum;
> VACUUM
>
> Now if I try to recreate table t3, it should succeed, shouldn't it? But it doesn't because we simply error out by
> seeing the nextRelFileNumber saved in the shared memory.
>
> postgres=# create table t1(a int);
> ERROR:  relfilenumber is out of bound
>
> I think, above should have worked.

No, it should not; the whole point of this design is to never reuse a
relfilenumber within a cluster's lifetime.  You might want to read this
mail [1], which argues that by the time we use 2^56 relfilenumbers the
cluster will have reached the end of its life due to other factors anyway.

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2BZrDms7gSjckme8YV2tzxgZ0KVfGcsjaFoKyzQX_f_Mw%40mail.gmail.com
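
(For a rough sense of the scale, my arithmetic rather than the referenced
mail's: 2^56 = 72,057,594,037,927,936, so even consuming one million
relfilenumbers per second nonstop would take roughly 7.2e10 seconds, i.e.
well over 2,000 years, to exhaust the range.)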

> ==
>
> <caution>
> <para>
> Note that while a table's filenode often matches its OID, this is
> <emphasis>not</emphasis> necessarily the case; some operations, like
> <command>TRUNCATE</command>, <command>REINDEX</command>, <command>CLUSTER</command> and some forms
> of <command>ALTER TABLE</command>, can change the filenode while preserving the OID.
>
> I think this note needs some improvement in storage.sgml. It says the table's relfilenode mostly matches its OID, but
> it doesn't. This will happen only in case of system table and maybe never in case of user table.

Yes, this should be changed.

> postgres=# create table t2(a int);
> ERROR:  relfilenumber is out of bound
>
> Since this is a user-visible error, I think it would be good to mention relfilenode instead of relfilenumber.
> Elsewhere (including the user manual) we refer to this as a relfilenode.

No, this is expected to be an internal error, because during the cluster
lifetime we should ideally never reach this number.  We are adding this
check only so that we do not reach it due to some other
computational/programming mistake.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
vignesh C
Date:
On Tue, Jul 26, 2022 at 1:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jul 21, 2022 at 9:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > [v10 patch set]
> >
> > Hi Dilip, I'm experimenting with these patches and will hopefully have
> > more to say soon, but I just wanted to point out that this builds with
> > warnings and failed on 3/4 of the CI OSes on cfbot's last run.  Maybe
> > there is the good kind of uninitialised data on Linux, and the bad
> > kind of uninitialised data on those other pesky systems?
>
> Here is the patch to fix the issue, basically, while asserting for the
> file existence it was not setting the relfilenumber in the
> relfilelocator before generating the path so it was just checking for
> the existence of the random path so it was asserting randomly.

Thanks for the updated patch, Few comments:
1) The format specifier should be changed from %u to INT64_FORMAT
autoprewarm.c -> apw_load_buffers
...............
if (fscanf(file, "%u,%u,%u,%u,%u\n", &blkinfo[i].database,
   &blkinfo[i].tablespace, &blkinfo[i].filenumber,
   &forknum, &blkinfo[i].blocknum) != 5)
...............

2) The format specifier should be changed from %u to INT64_FORMAT
autoprewarm.c -> apw_dump_now
...............
ret = fprintf(file, "%u,%u,%u,%u,%u\n",
  block_info_array[i].database,
  block_info_array[i].tablespace,
  block_info_array[i].filenumber,
  (uint32) block_info_array[i].forknum,
  block_info_array[i].blocknum);
...............

3) should the comment "entry point for old extension version" be on
top of pg_buffercache_pages, as the current version will use
pg_buffercache_pages_v1_4
+
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+       return pg_buffercache_pages_internal(fcinfo, OIDOID);
+}
+
+/* entry point for old extension version */
+Datum
+pg_buffercache_pages_v1_4(PG_FUNCTION_ARGS)
+{
+       return pg_buffercache_pages_internal(fcinfo, INT8OID);
+}

4) we could use the new style of ereport by removing the brackets
around errcode:
+                               if (fctx->record[i].relfilenumber > OID_MAX)
+                                       ereport(ERROR,
+                                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                                errmsg("relfilenode" INT64_FORMAT " is too large to be represented as an OID",
+                                                       fctx->record[i].relfilenumber),
+                                                errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE")));

like:
ereport(ERROR,
        errcode(ERRCODE_INVALID_PARAMETER_VALUE),
        errmsg("relfilenode" INT64_FORMAT " is too large to be represented as an OID",
               fctx->record[i].relfilenumber),
        errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE"));

5) Similarly in the below code too:
+       /* check whether the relfilenumber is within a valid range */
+       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
+               ereport(ERROR,
+                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                errmsg("relfilenumber " INT64_FORMAT " is out of range",
+                                               (relfilenumber))));


6) Similarly in the below code too:
+#define CHECK_RELFILENUMBER_RANGE(relfilenumber)                       \
+do {                                                                   \
+       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+               ereport(ERROR,                                          \
+                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),              \
+                                errmsg("relfilenumber " INT64_FORMAT " is out of range",       \
+                                               (relfilenumber))));     \
+} while (0)
+


7) This error code looks similar to CHECK_RELFILENUMBER_RANGE, can
this macro be used here too:
pg_filenode_relation(PG_FUNCTION_ARGS)
 {
        Oid                     reltablespace = PG_GETARG_OID(0);
-       RelFileNumber relfilenumber = PG_GETARG_OID(1);
+       RelFileNumber relfilenumber = PG_GETARG_INT64(1);
        Oid                     heaprel;

+       /* check whether the relfilenumber is within a valid range */
+       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
+               ereport(ERROR,
+                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                errmsg("relfilenumber " INT64_FORMAT " is out of range",
+                                               (relfilenumber))));


8) I felt this include is not required:
diff --git a/src/backend/access/transam/varsup.c
b/src/backend/access/transam/varsup.c
index 849a7ce..a2f0d35 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -13,12 +13,16 @@

 #include "postgres.h"

+#include <unistd.h>
+
 #include "access/clog.h"
 #include "access/commit_ts.h"

9) should we change elog to ereport to use the New-style error reporting API
+       /* safety check, we should never get this far in a HS standby */
+       if (RecoveryInProgress())
+               elog(ERROR, "cannot assign RelFileNumber during recovery");
+
+       if (IsBinaryUpgrade)
+               elog(ERROR, "cannot assign RelFileNumber during binary upgrade");

10) Here nextRelFileNumber is protected by RelFileNumberGenLock, but the
comment states OidGenLock. It should be slightly adjusted.
typedef struct VariableCacheData
{
    /*
     * These fields are protected by OidGenLock.
     */
    Oid         nextOid;        /* next OID to assign */
    uint32      oidCount;       /* OIDs available before must do XLOG work */
    RelFileNumber nextRelFileNumber;        /* next relfilenumber to assign */
    RelFileNumber loggedRelFileNumber;      /* last logged relfilenumber */
    XLogRecPtr  loggedRelFileNumberRecPtr;  /* xlog record pointer w.r.t.
                                             * loggedRelFileNumber */
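
The adjustment being asked for would presumably look something like this
(a sketch, splitting the comment; the exact layout is up to the patch):

    typedef struct VariableCacheData
    {
        /*
         * These fields are protected by OidGenLock.
         */
        Oid         nextOid;        /* next OID to assign */
        uint32      oidCount;       /* OIDs available before must do XLOG work */

        /*
         * These fields are protected by RelFileNumberGenLock.
         */
        RelFileNumber nextRelFileNumber;        /* next relfilenumber to assign */
        RelFileNumber loggedRelFileNumber;      /* last logged relfilenumber */
        XLogRecPtr  loggedRelFileNumberRecPtr;  /* xlog record pointer w.r.t.
                                                 * loggedRelFileNumber */
        ...
    } VariableCacheData;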

Regards,
Vignesh



Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Wed, Jul 27, 2022 at 3:27 PM vignesh C <vignesh21@gmail.com> wrote:
>

> Thanks for the updated patch, Few comments:
> 1) The format specifier should be changed from %u to INT64_FORMAT
> autoprewarm.c -> apw_load_buffers
> ...............
> if (fscanf(file, "%u,%u,%u,%u,%u\n", &blkinfo[i].database,
>    &blkinfo[i].tablespace, &blkinfo[i].filenumber,
>    &forknum, &blkinfo[i].blocknum) != 5)
> ...............
>
> 2) The format specifier should be changed from %u to INT64_FORMAT
> autoprewarm.c -> apw_dump_now
> ...............
> ret = fprintf(file, "%u,%u,%u,%u,%u\n",
>   block_info_array[i].database,
>   block_info_array[i].tablespace,
>   block_info_array[i].filenumber,
>   (uint32) block_info_array[i].forknum,
>   block_info_array[i].blocknum);
> ...............
>
> 3) should the comment "entry point for old extension version" be on
> top of pg_buffercache_pages, as the current version will use
> pg_buffercache_pages_v1_4
> +
> +Datum
> +pg_buffercache_pages(PG_FUNCTION_ARGS)
> +{
> +       return pg_buffercache_pages_internal(fcinfo, OIDOID);
> +}
> +
> +/* entry point for old extension version */
> +Datum
> +pg_buffercache_pages_v1_4(PG_FUNCTION_ARGS)
> +{
> +       return pg_buffercache_pages_internal(fcinfo, INT8OID);
> +}
>
> 4) we could use the new style or ereport by removing the brackets
> around errcode:
> +                               if (fctx->record[i].relfilenumber > OID_MAX)
> +                                       ereport(ERROR,
> +
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +
> errmsg("relfilenode" INT64_FORMAT " is too large to be represented as
> an OID",
> +
>  fctx->record[i].relfilenumber),
> +
> errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache
> UPDATE")));
>
> like:
> ereport(ERROR,
>
> errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>
> errmsg("relfilenode" INT64_FORMAT " is too large to be represented as
> an OID",
>
> fctx->record[i].relfilenumber),
>
> errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache
> UPDATE"));
>
> 5) Similarly in the below code too:
> +       /* check whether the relfilenumber is within a valid range */
> +       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
> +               ereport(ERROR,
> +                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                                errmsg("relfilenumber " INT64_FORMAT
> " is out of range",
> +                                               (relfilenumber))));
>
>
> 6) Similarly in the below code too:
> +#define CHECK_RELFILENUMBER_RANGE(relfilenumber)
>          \
> +do {
>                                                          \
> +       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
> +               ereport(ERROR,
>                                                  \
> +
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),                      \
> +                                errmsg("relfilenumber " INT64_FORMAT
> " is out of range",       \
> +                                               (relfilenumber)))); \
> +} while (0)
> +
>
>
> 7) This error code looks similar to CHECK_RELFILENUMBER_RANGE, can
> this macro be used here too:
> pg_filenode_relation(PG_FUNCTION_ARGS)
>  {
>         Oid                     reltablespace = PG_GETARG_OID(0);
> -       RelFileNumber relfilenumber = PG_GETARG_OID(1);
> +       RelFileNumber relfilenumber = PG_GETARG_INT64(1);
>         Oid                     heaprel;
>
> +       /* check whether the relfilenumber is within a valid range */
> +       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
> +               ereport(ERROR,
> +                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                                errmsg("relfilenumber " INT64_FORMAT
> " is out of range",
> +                                               (relfilenumber))));
>
>
> 8) I felt this include is not required:
> diff --git a/src/backend/access/transam/varsup.c
> b/src/backend/access/transam/varsup.c
> index 849a7ce..a2f0d35 100644
> --- a/src/backend/access/transam/varsup.c
> +++ b/src/backend/access/transam/varsup.c
> @@ -13,12 +13,16 @@
>
>  #include "postgres.h"
>
> +#include <unistd.h>
> +
>  #include "access/clog.h"
>  #include "access/commit_ts.h"
>
> 9) should we change elog to ereport to use the New-style error reporting API
> +       /* safety check, we should never get this far in a HS standby */
> +       if (RecoveryInProgress())
> +               elog(ERROR, "cannot assign RelFileNumber during recovery");
> +
> +       if (IsBinaryUpgrade)
> +               elog(ERROR, "cannot assign RelFileNumber during binary
> upgrade");
>
> 10) Here nextRelFileNumber is protected by RelFileNumberGenLock, the
> comment stated OidGenLock. It should be slightly adjusted.
> typedef struct VariableCacheData
> {
> /*
> * These fields are protected by OidGenLock.
> */
> Oid nextOid; /* next OID to assign */
> uint32 oidCount; /* OIDs available before must do XLOG work */
> RelFileNumber nextRelFileNumber; /* next relfilenumber to assign */
> RelFileNumber loggedRelFileNumber; /* last logged relfilenumber */
> XLogRecPtr loggedRelFileNumberRecPtr; /* xlog record pointer w.r.t.
> * loggedRelFileNumber */

Thanks for the review, I have fixed these except:
> 9) should we change elog to ereport to use the New-style error reporting API
I think these are internal errors, so if we used ereport we would need to
give an error code and so on, and I don't think that is necessary for
internal errors.
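
The distinction, as a sketch (both messages are taken from hunks quoted
earlier in the thread):

    /* internal "should not happen" condition: plain elog, no errcode, not translated */
    elog(ERROR, "cannot assign RelFileNumber during recovery");

    /* user-facing condition: ereport with an errcode and a translatable message */
    ereport(ERROR,
            errcode(ERRCODE_INVALID_PARAMETER_VALUE),
            errmsg("relfilenumber " INT64_FORMAT " is out of range",
                   relfilenumber));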

> 8) I felt this include is not required:
the code uses the access() API, so we do need <unistd.h>.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Wed, Jul 27, 2022 at 12:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jul 26, 2022 at 2:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have thought about it while doing so but I am not sure whether it is
> > a good idea or not, because before my change these all were macros
> > with 2 naming conventions so I just changed to inline function so why
> > to change the name.
>
> Well, the reason to change the name would be for consistency. It feels
> weird to have some NAMES_LIKETHIS() and other NamesLikeThis().
>
> Now, an argument against that is that it will make back-patching more
> annoying, if any code using these functions/macros is touched. But
> since the calling sequence is changing anyway (you now have to pass a
> pointer rather than the object itself) that argument doesn't really
> carry any weight. So I would favor ClearBufferTag(), InitBufferTag(),
> etc.

Okay, so I have renamed these 2 functions and BUFFERTAGS_EQUAL as well
to BufferTagEqual().

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Wed, 27 Jul 2022 at 9:49 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Jul 27, 2022 at 12:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jul 26, 2022 at 2:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have thought about it while doing so but I am not sure whether it is
> > a good idea or not, because before my change these all were macros
> > with 2 naming conventions so I just changed to inline function so why
> > to change the name.
>
> Well, the reason to change the name would be for consistency. It feels
> weird to have some NAMES_LIKETHIS() and other NamesLikeThis().
>
> Now, an argument against that is that it will make back-patching more
> annoying, if any code using these functions/macros is touched. But
> since the calling sequence is changing anyway (you now have to pass a
> pointer rather than the object itself) that argument doesn't really
> carry any weight. So I would favor ClearBufferTag(), InitBufferTag(),
> etc.

Okay, so I have renamed these 2 functions and BUFFERTAGS_EQUAL as well
to BufferTagEqual().

Just realised that this should have been BufferTagsEqual instead of BufferTagEqual

I will modify this and send an updated patch tomorrow.

Dilip
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

From
Robert Haas
Date:
On Wed, Jul 27, 2022 at 12:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Just realised that this should have been BufferTagsEqual instead of BufferTagEqual
>
> I will modify this and send an updated patch tomorrow.

I changed it and committed.

What was formerly 0002 will need minor rebasing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Wed, Jul 27, 2022 at 11:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jul 27, 2022 at 12:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Just realised that this should have been BufferTagsEqual instead of BufferTagEqual
> >
> > I will modify this and send an updated patch tomorrow.
>
> I changed it and committed.
>
> What was formerly 0002 will need minor rebasing.

Thanks, I have rebased the other patches; actually, there is a new 0001
patch now.  It seems that during the renaming of relnode-related Oids to
RelFileNumber some references were missed.  In the last patch set I kept
those changes as part of the main patch 0003, but I think it's better to
keep them separate, so I took them out and created 0001.  If you think
they should be committed as part of 0003 only, that's also fine with me.

I have done some cleanup in 0002 as well: earlier we were storing the
result of BufTagGetRelFileLocator() in a separate variable, which is not
required everywhere, so wherever possible I have avoided using the
intermediate variable.
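
For example, the kind of change meant here (a sketch, not an actual hunk
from 0002):

    /* before: the locator is stored in an intermediate variable */
    RelFileLocator rlocator = BufTagGetRelFileLocator(&bufHdr->tag);

    smgr = smgropen(rlocator, InvalidBackendId);

    /* after: pass the result directly where it is used only once */
    smgr = smgropen(BufTagGetRelFileLocator(&bufHdr->tag), InvalidBackendId);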

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: making relfilenodes 56 bits

From
Robert Haas
Date:
On Thu, Jul 28, 2022 at 7:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Thanks, I have rebased other patches,  actually, there is a new 0001
> patch now.  It seems during renaming relnode related Oid to
> RelFileNumber, some of the references were missed and in the last
> patch set I kept it as part of main patch 0003, but I think it's
> better to keep it separate.  So took out those changes and created
> 0001, but you think this can be committed as part of 0003 only then
> also it's fine with me.

I committed this in part. I took out the introduction of
RELNUMBERCHARS as I think that should probably be a separate commit,
but added in a comment change that you seem to have overlooked.

> I have done some cleanup in 0002 as well, basically, earlier we were
> storing the result of the BufTagGetRelFileLocator() in a separate
> variable which is not required everywhere.  So wherever possible I
> have avoided using the intermediate variable.

I'll have a look at this next.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Alvaro Herrera
Date:
Not a full review, just a quick skim of 0003.

On 2022-Jul-28, Dilip Kumar wrote:

> +    if (!shutdown)
> +    {
> +        if (ShmemVariableCache->loggedRelFileNumber < checkPoint.nextRelFileNumber)
> +            elog(ERROR, "nextRelFileNumber can not go backward from " INT64_FORMAT "to" INT64_FORMAT,
> +                 checkPoint.nextRelFileNumber, ShmemVariableCache->loggedRelFileNumber);
> +
> +        checkPoint.nextRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
> +    }

Please don't do this; rather use %llu and cast to (long long).
Otherwise the string becomes mangled for translation.  I think there are
many uses of this sort of pattern in strings, but not all of them are
translatable so maybe we don't care -- for example contrib doesn't have
translations.  And the rmgrdesc routines don't translate either, so we
probably don't care about it there; and nothing that uses elog either.
But this one in particular I think should be an ereport, not an elog.
There are several other ereports in various places of the patch also.

> @@ -2378,7 +2378,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
>          if (memcmp(replay_image_masked, primary_image_masked, BLCKSZ) != 0)
>          {
>              elog(FATAL,
> -                 "inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
> +                 "inconsistent page found, rel %u/%u/" INT64_FORMAT ", forknum %u, blkno %u",
>                   rlocator.spcOid, rlocator.dbOid, rlocator.relNumber,
>                   forknum, blkno);

Should this one be an ereport, and thus you do need to change it to that
and handle it like that?


> +        if (xlrec->rlocator.relNumber > ShmemVariableCache->nextRelFileNumber)
> +            elog(ERROR, "unexpected relnumber " INT64_FORMAT "that is bigger than nextRelFileNumber " INT64_FORMAT,
> +                 xlrec->rlocator.relNumber, ShmemVariableCache->nextRelFileNumber);

You missed one whitespace here after the INT64_FORMAT.

> diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
> index c390ec5..f727078 100644
> --- a/src/bin/pg_controldata/pg_controldata.c
> +++ b/src/bin/pg_controldata/pg_controldata.c
> @@ -250,6 +250,8 @@ main(int argc, char *argv[])
>      printf(_("Latest checkpoint's NextXID:          %u:%u\n"),
>             EpochFromFullTransactionId(ControlFile->checkPointCopy.nextXid),
>             XidFromFullTransactionId(ControlFile->checkPointCopy.nextXid));
> +    printf(_("Latest checkpoint's NextRelFileNumber:  " INT64_FORMAT "\n"),
> +           ControlFile->checkPointCopy.nextRelFileNumber);

This one must definitely be translatable.

>  /* Characters to allow for an RelFileNumber in a relation path */
> -#define RELNUMBERCHARS    OIDCHARS    /* same as OIDCHARS */
> +#define RELNUMBERCHARS    20        /* max chars printed by %lu */

Maybe say %llu here instead.


I do wonder why we keep relfilenodes limited to decimal digits.  Why
not use hex digits?  Then we know the limit is 14 chars, as in
0x00FFFFFFFFFFFFFF in the MAX_RELFILENUMBER definition.
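
(As a sketch of the idea, not a proposal; relNumber here stands for any
RelFileNumber value: 56 bits fit in 14 hex digits, e.g.

    char    buf[14 + 1];

    snprintf(buf, sizeof(buf), "%014llX", (unsigned long long) relNumber);

whereas printing MAX_RELFILENUMBER in decimal takes 17 digits, and
RELNUMBERCHARS is sized at 20 to cover any unsigned 64-bit value.)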

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end." (2nd Commandment for C programmers)



Re: making relfilenodes 56 bits

From
Robert Haas
Date:
On Thu, Jul 28, 2022 at 11:59 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I do wonder why do we keep relfilenodes limited to decimal digits.  Why
> not use hex digits?  Then we know the limit is 14 chars, as in
> 0x00FFFFFFFFFFFFFF in the MAX_RELFILENUMBER definition.

Hmm, but surely we want the error messages to be printed using the
same format that we use for the actual filenames. We could make the
filenames use hex characters too, but I'm not wild about changing
user-visible details like that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Joshua Drake
Date:


On Thu, Jul 28, 2022 at 9:52 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jul 28, 2022 at 11:59 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I do wonder why do we keep relfilenodes limited to decimal digits.  Why
> not use hex digits?  Then we know the limit is 14 chars, as in
> 0x00FFFFFFFFFFFFFF in the MAX_RELFILENUMBER definition.

Hmm, but surely we want the error messages to be printed using the
same format that we use for the actual filenames. We could make the
filenames use hex characters too, but I'm not wild about changing
user-visible details like that.

From a DBA perspective this would be a regression in usability.

JD

--

Re: making relfilenodes 56 bits

From
Alvaro Herrera
Date:
On 2022-Jul-28, Robert Haas wrote:

> On Thu, Jul 28, 2022 at 11:59 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > I do wonder why do we keep relfilenodes limited to decimal digits.  Why
> > not use hex digits?  Then we know the limit is 14 chars, as in
> > 0x00FFFFFFFFFFFFFF in the MAX_RELFILENUMBER definition.
> 
> Hmm, but surely we want the error messages to be printed using the
> same format that we use for the actual filenames.

Of course.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as entropy
 sources, so let's move on."                               (Nathaniel Smith)



Re: making relfilenodes 56 bits

From
Ashutosh Sharma
Date:
On Thu, Jul 28, 2022 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

+/* ----------
+ * RelFileNumber zero is InvalidRelFileNumber.
+ *
+ * For the system tables (OID < FirstNormalObjectId) the initial storage

The above comment says that RelFileNumber zero is invalid, which is technically correct because we don't have any relation file on disk with number zero. But if someone reads the definition of CHECK_RELFILENUMBER_RANGE below, they might get confused, because as per this definition relfilenumber zero is valid.

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber)               \
+do {                                                               \
+   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+       ereport(ERROR,                                              \
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),          \
+                errmsg("relfilenumber " INT64_FORMAT " is out of range",   \
+                       (relfilenumber)))); \
+} while (0)
+

+   RelFileNumber relfilenumber = PG_GETARG_INT64(0);
+   CHECK_RELFILENUMBER_RANGE(relfilenumber);

It seems like the relfilenumber in the above definition represents the relfilenode value in pg_class, which can hold zero, meaning the relation is a mapped relation. I think it would be good to provide some clarity here.

--
With Regards,
Ashutosh Sharma.

Re: making relfilenodes 56 bits

From
Ashutosh Sharma
Date:
On Fri, Jul 29, 2022 at 6:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
On Thu, Jul 28, 2022 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

+/* ----------
+ * RelFileNumber zero is InvalidRelFileNumber.
+ *
+ * For the system tables (OID < FirstNormalObjectId) the initial storage

Above comment says that RelFileNumber zero is invalid which is technically correct because we don't have any relation file in disk with zero number. But the point is that if someone reads below definition of CHECK_RELFILENUMBER_RANGE he/she might get confused because as per this definition relfilenumber zero is valid.

Please ignore the above comment from my previous email; it was a bit of over-thinking on my part. Sorry for that. Here are the other comments I have:

+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_buffercache DROP VIEW pg_buffercache;
+ALTER EXTENSION pg_buffercache DROP FUNCTION pg_buffercache_pages();
+
+/* Then we can drop them */
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+/* Now redefine */
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages_v1_4'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE VIEW pg_buffercache AS
+   SELECT P.* FROM pg_buffercache_pages() AS P
+   (bufferid integer, relfilenode int8, reltablespace oid, reldatabase oid,
+    relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+    pinning_backends int4);

As we are dropping the function and view I think it would be good if we *don't* use the "OR REPLACE" keyword when re-defining them.

==

+                   ereport(ERROR,
+                           (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                            errmsg("relfilenode" INT64_FORMAT " is too large to be represented as an OID",
+                                   fctx->record[i].relfilenumber),
+                            errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE")));

I think it would be good to recommend that users upgrade to the latest version instead of just saying to upgrade pg_buffercache using ALTER EXTENSION ....

==

--- a/contrib/pg_walinspect/sql/pg_walinspect.sql
+++ b/contrib/pg_walinspect/sql/pg_walinspect.sql
@@ -39,10 +39,10 @@ SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal(:'wal_lsn1');
 -- Test for filtering out WAL records of a particular table
 -- ===================================================================

-SELECT oid AS sample_tbl_oid FROM pg_class WHERE relname = 'sample_tbl' \gset
+SELECT relfilenode AS sample_tbl_relfilenode FROM pg_class WHERE relname = 'sample_tbl' \gset

Is this change required? The original query is just trying to fetch the table OID, not the relfilenode, and AFAIK we haven't changed anything about table OIDs.

==

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber)               \
+do {                                                               \
+   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+       ereport(ERROR,                                              \
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),          \
+                errmsg("relfilenumber " INT64_FORMAT " is out of range",   \
+                       (relfilenumber)))); \
+} while (0)
+

I think we can shift this macro to some header file and reuse it at several places.

==


+    * Generate a new relfilenumber.  We cannot reuse the old relfilenumber
+    * because of the possibility that that relation will be moved back to the

that that relation -> that relation

--
With Regards,
Ashutosh Sharma.

Re: making relfilenodes 56 bits

From
vignesh C
Date:
On Wed, Jul 27, 2022 at 6:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jul 27, 2022 at 3:27 PM vignesh C <vignesh21@gmail.com> wrote:
> >
>
> > Thanks for the updated patch, Few comments:
> > 1) The format specifier should be changed from %u to INT64_FORMAT
> > autoprewarm.c -> apw_load_buffers
> > ...............
> > if (fscanf(file, "%u,%u,%u,%u,%u\n", &blkinfo[i].database,
> >    &blkinfo[i].tablespace, &blkinfo[i].filenumber,
> >    &forknum, &blkinfo[i].blocknum) != 5)
> > ...............
> >
> > 2) The format specifier should be changed from %u to INT64_FORMAT
> > autoprewarm.c -> apw_dump_now
> > ...............
> > ret = fprintf(file, "%u,%u,%u,%u,%u\n",
> >   block_info_array[i].database,
> >   block_info_array[i].tablespace,
> >   block_info_array[i].filenumber,
> >   (uint32) block_info_array[i].forknum,
> >   block_info_array[i].blocknum);
> > ...............
> >
> > 3) should the comment "entry point for old extension version" be on
> > top of pg_buffercache_pages, as the current version will use
> > pg_buffercache_pages_v1_4
> > +
> > +Datum
> > +pg_buffercache_pages(PG_FUNCTION_ARGS)
> > +{
> > +       return pg_buffercache_pages_internal(fcinfo, OIDOID);
> > +}
> > +
> > +/* entry point for old extension version */
> > +Datum
> > +pg_buffercache_pages_v1_4(PG_FUNCTION_ARGS)
> > +{
> > +       return pg_buffercache_pages_internal(fcinfo, INT8OID);
> > +}
> >
> > 4) we could use the new style or ereport by removing the brackets
> > around errcode:
> > +                               if (fctx->record[i].relfilenumber > OID_MAX)
> > +                                       ereport(ERROR,
> > +
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +
> > errmsg("relfilenode" INT64_FORMAT " is too large to be represented as
> > an OID",
> > +
> >  fctx->record[i].relfilenumber),
> > +
> > errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache
> > UPDATE")));
> >
> > like:
> > ereport(ERROR,
> >
> > errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> >
> > errmsg("relfilenode" INT64_FORMAT " is too large to be represented as
> > an OID",
> >
> > fctx->record[i].relfilenumber),
> >
> > errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache
> > UPDATE"));
> >
> > 5) Similarly in the below code too:
> > +       /* check whether the relfilenumber is within a valid range */
> > +       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
> > +               ereport(ERROR,
> > +                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +                                errmsg("relfilenumber " INT64_FORMAT
> > " is out of range",
> > +                                               (relfilenumber))));
> >
> >
> > 6) Similarly in the below code too:
> > +#define CHECK_RELFILENUMBER_RANGE(relfilenumber)
> >          \
> > +do {
> >                                                          \
> > +       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
> > +               ereport(ERROR,
> >                                                  \
> > +
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),                      \
> > +                                errmsg("relfilenumber " INT64_FORMAT
> > " is out of range",       \
> > +                                               (relfilenumber)))); \
> > +} while (0)
> > +
> >
> >
> > 7) This error code looks similar to CHECK_RELFILENUMBER_RANGE, can
> > this macro be used here too:
> > pg_filenode_relation(PG_FUNCTION_ARGS)
> >  {
> >         Oid                     reltablespace = PG_GETARG_OID(0);
> > -       RelFileNumber relfilenumber = PG_GETARG_OID(1);
> > +       RelFileNumber relfilenumber = PG_GETARG_INT64(1);
> >         Oid                     heaprel;
> >
> > +       /* check whether the relfilenumber is within a valid range */
> > +       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
> > +               ereport(ERROR,
> > +                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +                                errmsg("relfilenumber " INT64_FORMAT
> > " is out of range",
> > +                                               (relfilenumber))));
> >
> >
> > 8) I felt this include is not required:
> > diff --git a/src/backend/access/transam/varsup.c
> > b/src/backend/access/transam/varsup.c
> > index 849a7ce..a2f0d35 100644
> > --- a/src/backend/access/transam/varsup.c
> > +++ b/src/backend/access/transam/varsup.c
> > @@ -13,12 +13,16 @@
> >
> >  #include "postgres.h"
> >
> > +#include <unistd.h>
> > +
> >  #include "access/clog.h"
> >  #include "access/commit_ts.h"
> >
> > 9) should we change elog to ereport to use the New-style error reporting API
> > +       /* safety check, we should never get this far in a HS standby */
> > +       if (RecoveryInProgress())
> > +               elog(ERROR, "cannot assign RelFileNumber during recovery");
> > +
> > +       if (IsBinaryUpgrade)
> > +               elog(ERROR, "cannot assign RelFileNumber during binary
> > upgrade");
> >
> > 10) Here nextRelFileNumber is protected by RelFileNumberGenLock, the
> > comment stated OidGenLock. It should be slightly adjusted.
> > typedef struct VariableCacheData
> > {
> > /*
> > * These fields are protected by OidGenLock.
> > */
> > Oid nextOid; /* next OID to assign */
> > uint32 oidCount; /* OIDs available before must do XLOG work */
> > RelFileNumber nextRelFileNumber; /* next relfilenumber to assign */
> > RelFileNumber loggedRelFileNumber; /* last logged relfilenumber */
> > XLogRecPtr loggedRelFileNumberRecPtr; /* xlog record pointer w.r.t.
> > * loggedRelFileNumber */
>
> Thanks for the review I have fixed these except,
> > 9) should we change elog to ereport to use the New-style error reporting API
> I think this is internal error so if we use ereport we need to give
> error code and all and I think for internal that is not necessary?

Ok, sounds reasonable.

> > 8) I felt this include is not required:
> it is using access API so we do need <unistd.h>

Ok, it worked for me because I had not used the assert-enabled flag
during compilation.

Regards,
Vignesh



Re: making relfilenodes 56 bits

From
Robert Haas
Date:
On Thu, Jul 28, 2022 at 10:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I have done some cleanup in 0002 as well, basically, earlier we were
> > storing the result of the BufTagGetRelFileLocator() in a separate
> > variable which is not required everywhere.  So wherever possible I
> > have avoided using the intermediate variable.
>
> I'll have a look at this next.

I was taught that when programming in C one should avoid returning a
struct type, as BufTagGetRelFileLocator does. I would have expected it
to return void and take an argument of type RelFileLocator * into
which it writes the results. On the other hand, I was also taught that
one should avoid passing a struct type as an argument, and smgropen()
has been doing that since Tom Lane committed
87bd95638552b8fc1f5f787ce5b862bb6fc2eb80 all the way back in 2004. So
maybe this isn't that relevant any more on modern compilers? Or maybe
for small structs it doesn't matter much? I dunno.
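
For illustration, here is a minimal sketch of the two conventions being weighed here (the type and function names are mine, not from the patch):

#include <stdint.h>

typedef struct DemoLocator
{
    uint32_t    spcOid;
    uint32_t    dbOid;
    uint64_t    relNumber;
} DemoLocator;

/* style 1: return the struct by value, as BufTagGetRelFileLocator does */
static inline DemoLocator
make_locator_byvalue(uint32_t spc, uint32_t db, uint64_t rel)
{
    DemoLocator r = {spc, db, rel};

    return r;
}

/* style 2: write the result through an out-parameter instead */
static inline void
make_locator_outparam(DemoLocator *out, uint32_t spc, uint32_t db, uint64_t rel)
{
    out->spcOid = spc;
    out->dbOid = db;
    out->relNumber = rel;
}

The by-value form lets the result feed directly into another call, e.g. some_function(make_locator_byvalue(...)), while the out-parameter form forces the caller to declare a temporary variable.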

Other than that, I think your 0002 looks fine.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Alvaro Herrera
Date:
On 2022-Jul-29, Robert Haas wrote:

> I was taught that when programming in C one should avoid returning a
> struct type, as BufTagGetRelFileLocator does.

Doing it like that helps RelFileLocatorSkippingWAL, which takes a bare
RelFileLocator as argument.  With this coding you can call one function
with the other function as its argument.

However, with the current definition of relpathbackend() and siblings,
it looks quite disastrous -- BufTagGetRelFileLocator is being called
three times.  You could argue that a solution would be to turn those
macros into static inline functions.
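
To make the multiple-evaluation concern concrete, a minimal sketch with simplified, made-up names (not the actual relpath macros):

#include <stdint.h>

typedef struct DemoLoc
{
    uint32_t    spc;
    uint32_t    db;
    uint64_t    rel;
} DemoLoc;

extern DemoLoc expensive_lookup(void);  /* stands in for BufTagGetRelFileLocator(tag) */

/* macro form: the argument expression is expanded, and evaluated, three times */
#define DEMO_PATH_KEY(loc) \
    ((loc).spc + (loc).db + (loc).rel)

/* so DEMO_PATH_KEY(expensive_lookup()) calls expensive_lookup() three times */

/* static inline form: the argument is evaluated exactly once */
static inline uint64_t
demo_path_key(DemoLoc loc)
{
    return loc.spc + loc.db + loc.rel;
}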

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"I'm impressed how quickly you are fixing this obscure issue. I came from 
MS SQL and it would be hard for me to put into words how much of a better job
you all are doing on [PostgreSQL]."
 Steve Midgley, http://archives.postgresql.org/pgsql-sql/2008-08/msg00000.php



Re: making relfilenodes 56 bits

From
Robert Haas
Date:
On Fri, Jul 29, 2022 at 2:12 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Jul-29, Robert Haas wrote:
> > I was taught that when programming in C one should avoid returning a
> > struct type, as BufTagGetRelFileLocator does.
>
> Doing it like that helps RelFileLocatorSkippingWAL, which takes a bare
> RelFileLocator as argument.  With this coding you can call one function
> with the other function as its argument.
>
> However, with the current definition of relpathbackend() and siblings,
> it looks quite disastrous -- BufTagGetRelFileLocator is being called
> three times.  You could argue that a solution would be to turn those
> macros into static inline functions.

Yeah, if we think it's OK to pass around structs, then that seems like
the right solution. Otherwise functions that take RelFileLocator
should be changed to take const RelFileLocator * and we should adjust
elsewhere accordingly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Alvaro Herrera
Date:
On 2022-Jul-29, Robert Haas wrote:

> Yeah, if we think it's OK to pass around structs, then that seems like
> the right solution. Otherwise functions that take RelFileLocator
> should be changed to take const RelFileLocator * and we should adjust
> elsewhere accordingly.

We do that in other places.  See get_object_address() for another
example.  Now, I don't see *why* they do it.  I suppose there's
notational convenience; for get_object_address() I think it'd be uglier
with another out argument (it already has *relp).  For smgropen() it's
not clear at all that there is any.

For the new function, there's at least a couple of places that the
calling convention makes simpler, so I don't see why you wouldn't use it
that way.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"



Re: making relfilenodes 56 bits

From
Robert Haas
Date:
On Fri, Jul 29, 2022 at 3:18 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Jul-29, Robert Haas wrote:
> > Yeah, if we think it's OK to pass around structs, then that seems like
> > the right solution. Otherwise functions that take RelFileLocator
> > should be changed to take const RelFileLocator * and we should adjust
> > elsewhere accordingly.
>
> We do that in other places.  See get_object_address() for another
> example.  Now, I don't see *why* they do it.  I suppose there's
> notational convenience; for get_object_address() I think it'd be uglier
> with another out argument (it already has *relp).  For smgropen() it's
> not clear at all that there is any.
>
> For the new function, there's at least a couple of places that the
> calling convention makes simpler, so I don't see why you wouldn't use it
> that way.

All right, perhaps it's fine as Dilip has it, then.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2022-Jul-29, Robert Haas wrote:
>> Yeah, if we think it's OK to pass around structs, then that seems like
>> the right solution. Otherwise functions that take RelFileLocator
>> should be changed to take const RelFileLocator * and we should adjust
>> elsewhere accordingly.

> We do that in other places.  See get_object_address() for another
> example.  Now, I don't see *why* they do it.

If it's a big struct then avoiding copying it is good; but RelFileLocator
isn't that big.

While researching that statement I did happen to notice that no one has
bothered to update the comment immediately above struct RelFileLocator,
and it is something that absolutely does require attention if there
are plans to make RelFileNumber something other than 32 bits.

 * Note: various places use RelFileLocator in hashtable keys.  Therefore,
 * there *must not* be any unused padding bytes in this struct.  That
 * should be safe as long as all the fields are of type Oid.
 */
typedef struct RelFileLocator
{
    Oid            spcOid;            /* tablespace */
    Oid            dbOid;             /* database */
    RelFileNumber  relNumber;         /* relation */
} RelFileLocator;

            regards, tom lane



Re: making relfilenodes 56 bits

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I was taught that when programming in C one should avoid returning a
> struct type, as BufTagGetRelFileLocator does.

FWIW, I think that was invalid pre-ANSI-C, and maybe even in C89.
C99 and later requires it.  But it is pass-by-value and you have
to think twice about whether you want the struct to be copied.

            regards, tom lane



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Wed, Jul 20, 2022 at 7:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> There was also an issue where the user table from the old cluster's
> relfilenode could conflict with the system table of the new cluster.
> As a solution currently for system table object (while creating
> storage first time) we are keeping the low range of relfilenumber,
> basically we are using the same relfilenumber as OID so that during
> upgrade the normal user table from the old cluster will not conflict
> with the system tables in the new cluster.  But with this solution
> Robert told me (in off list chat) a problem that in future if we want
> to make relfilenumber completely unique within a cluster by
> implementing the CREATEDB differently then we can not do that as we
> have created fixed relfilenodes for the system tables.
>
> I am not sure what exactly we can do to avoid that because even if we
> do something  to avoid that in the new cluster the old cluster might
> be already using the non-unique relfilenode so after upgrading the new
> cluster will also get those non-unique relfilenode.

I think this aspect of the patch could use some more discussion.

To recap, the problem is that pg_upgrade mustn't discover that a
relfilenode that is being migrated from the old cluster is being used
for some other table in the new cluster. Since the new cluster should
only contain system tables that we assume have never been rewritten,
they'll all have relfilenodes equal to their OIDs, and thus less than
16384. On the other hand all the user tables from the old cluster will
have relfilenodes greater than 16384, so we're fine. pg_largeobject,
which also gets migrated, is a special case. Since we don't change OID
assignments from version to version, it should have either the same
relfilenode value in the old and new clusters, if never rewritten, or
else the value in the old cluster will be greater than 16384, in which
case no conflict is possible.

But if we just assign all relfilenode values from a central counter,
then we have got trouble. If the new version has more system catalog
tables than the old version, then some value that got used for a user
table in the old version might get used for a system table in the new
version, which is a problem. One idea for fixing this is to have two
RelFileNumber ranges: a system range (small values) and a user range.
System tables get values in the system range initially, and in the
user range when first rewritten. User tables always get values in the
user range. Everything works fine in this scenario except maybe for
pg_largeobject: what if it gets one value from the system range in the
old cluster, and a different value from the system range in the new
cluster, but some other system table in the new cluster gets the value
that pg_largeobject had in the old cluster? Then we've got trouble. It
doesn't help if we assign pg_largeobject a starting relfilenode from
the user range, either: now a relfilenode that needs to end up
containing some user table from the old cluster might find itself
blocked by pg_largeobject in the new cluster.

One solution to all this is to do as Dilip proposes here: for system
relations, keep assigning the OID as the initial relfilenumber.
Actually, we really only need to do this for pg_largeobject; all the
other relfilenumber values could be assigned from a counter, as long
as they're assigned from a range distinct from what we use for user
relations.

But I don't really like that, because I feel like the whole thing
where we start out with relfilenumber=oid is a recipe for hidden bugs.
I believe we'd be better off if we decouple those concepts more
thoroughly. So here's another idea: what if we set the
next-relfilenumber counter for the new cluster to the value from the
old cluster, and then rewrote all the (thus-far-empty) system tables?
Then every system relation in the new cluster has a relfilenode value
greater than any in use in the old cluster, so we can afterwards
migrate over every relfilenode from the old cluster with no risk of
conflicting with anything. Then all the special cases go away. We
don't need system and user ranges for relfilenodes, and
pg_largeobject's not a special case, either. We can assign relfilenode
values to system relations in exactly the same we do for user
relations: assign a value from the global counter and forget about it.
If this cluster happens to be the "new cluster" for a pg_upgrade
attempt, the procedure described at the beginning of this paragraph
will move everything that might conflict out of the way.

One thing to perhaps not like about this is that it's a little more
expensive: clustering every system table in every database on a new
cluster isn't completely free. Perhaps it's not expensive enough to be
a big problem, though.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Thomas Munro
Date:
On Sat, Jul 30, 2022 at 8:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I was taught that when programming in C one should avoid returning a
> > struct type, as BufTagGetRelFileLocator does.
>
> FWIW, I think that was invalid pre-ANSI-C, and maybe even in C89.
> C99 and later requires it.  But it is pass-by-value and you have
> to think twice about whether you want the struct to be copied.

C89 had that.

As for what it actually does in a non-inlined function: on all modern
Unix-y systems, 128 bit first arguments and return values are
transferred in register pairs[1].  So if you define a struct that
holds uint32_t, uint32_t, uint64_t and compile a function that takes
one and returns it, you see the struct being transferred directly from
input registers to output registers:

   0x0000000000000000 <+0>:    mov    %rdi,%rax
   0x0000000000000003 <+3>:    mov    %rsi,%rdx
   0x0000000000000006 <+6>:    ret

Similar on ARM64.  There it's an empty function, so it must be using
the same register in and out[2].

The MSVC calling convention is different and doesn't seem to be able
to pass it through registers, so it schleps it out to memory at a
return address[3].  But that's pretty similar to the proposed
alternative anyway, so surely no worse.  *shrug*  And of course those
"constructor"-like functions are inlined anyway.

[1] https://en.wikipedia.org/wiki/X86_calling_conventions#System_V_AMD64_ABI
[2] https://gcc.godbolt.org/z/qfPzhW7YM
[3] https://gcc.godbolt.org/z/WqvYz6xjs



Re: making relfilenodes 56 bits

From
Thomas Munro
Date:
On Sat, Jul 30, 2022 at 9:11 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> on all modern Unix-y systems,

(I meant to write AMD64 there)



Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Thu, Jul 28, 2022 at 9:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Not a full review, just a quick skim of 0003.

Thanks for the review

> > +     if (!shutdown)
> > +     {
> > +             if (ShmemVariableCache->loggedRelFileNumber < checkPoint.nextRelFileNumber)
> > +                     elog(ERROR, "nextRelFileNumber can not go backward from " INT64_FORMAT "to" INT64_FORMAT,
> > +                              checkPoint.nextRelFileNumber, ShmemVariableCache->loggedRelFileNumber);
> > +
> > +             checkPoint.nextRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
> > +     }
>
> Please don't do this; rather use %llu and cast to (long long).
> Otherwise the string becomes mangled for translation.  I think there are
> many uses of this sort of pattern in strings, but not all of them are
> translatable so maybe we don't care -- for example contrib doesn't have
> translations.  And the rmgrdesc routines don't translate either, so we
> probably don't care about it there; and nothing that uses elog either.
> But this one in particular I think should be an ereport, not an elog.
> There are several other ereports in various places of the patch also.

Okay, actually I did not understand the exact logic of when to use
%llu and when to use (U)INT64_FORMAT.  They are both used for 64-bit
integers.  So do you think it is fine to replace all INT64_FORMAT in
my patch with %llu?

> > @@ -2378,7 +2378,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
> >               if (memcmp(replay_image_masked, primary_image_masked, BLCKSZ) != 0)
> >               {
> >                       elog(FATAL,
> > -                              "inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
> > +                              "inconsistent page found, rel %u/%u/" INT64_FORMAT ", forknum %u, blkno %u",
> >                                rlocator.spcOid, rlocator.dbOid, rlocator.relNumber,
> >                                forknum, blkno);
>
> Should this one be an ereport, and thus you do need to change it to that
> and handle it like that?

Okay, so you mean that irrespective of this patch, this should be
converted to ereport?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Alvaro Herrera
Date:
On 2022-Jul-30, Dilip Kumar wrote:

> On Thu, Jul 28, 2022 at 9:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> > Please don't do this; rather use %llu and cast to (long long).
> > Otherwise the string becomes mangled for translation.
> 
> Okay, actually I did not understand the clear logic of when to use
> %llu and to use (U)INT64_FORMAT.  They are both used for 64-bit
> integers.  So do you think it is fine to replace all INT64_FORMAT in
> my patch with %llu?

The point here is that there are two users of the source code: one is
the compiler, and the other is gettext, which extracts the string for
the translation catalog.  The compiler is OK with UINT64_FORMAT, of
course (because the preprocessor deals with it).  But gettext is quite
stupid and doesn't understand that UINT64_FORMAT expands to some
specifier, so it truncates the string at the double quote sign just
before; in other words, it just doesn't work.  So whenever you have a
string that ends up in a translation catalog, you must not use
UINT64_FORMAT or any other preprocessor macro; it has to be a straight
specifier in the format string.

We have found that the most convenient notation is to use %llu in the
string and cast the argument to (unsigned long long), so our convention
is to use that.

For strings that do not end up in a translation catalog, there's no
reason to use %llu-and-cast; UINT64_FORMAT is okay.
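
For example, the relfilenumber message discussed upthread would, under this convention, be written as follows (a sketch; MAX_RELFILENUMBER and the surrounding check come from the patch):

    if (relfilenumber < 0 || relfilenumber > MAX_RELFILENUMBER)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("relfilenumber %llu is out of range",
                        (unsigned long long) relfilenumber)));

Here gettext sees the complete format string, and the cast guarantees that the argument matches %llu on every platform.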

> > > @@ -2378,7 +2378,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
> > >               if (memcmp(replay_image_masked, primary_image_masked, BLCKSZ) != 0)
> > >               {
> > >                       elog(FATAL,
> > > -                              "inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
> > > +                              "inconsistent page found, rel %u/%u/" INT64_FORMAT ", forknum %u, blkno %u",
> > >                                rlocator.spcOid, rlocator.dbOid, rlocator.relNumber,
> > >                                forknum, blkno);
> >
> > Should this one be an ereport, and thus you do need to change it to that
> > and handle it like that?
> 
> Okay, so you mean irrespective of this patch should this be converted
> to ereport?

Yes, I think this should be an ereport with errcode(ERRCODE_DATA_CORRUPTED).

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Sat, Jul 30, 2022 at 1:35 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > On 2022-Jul-29, Robert Haas wrote:
> >> Yeah, if we think it's OK to pass around structs, then that seems like
> >> the right solution. Otherwise functions that take RelFileLocator
> >> should be changed to take const RelFileLocator * and we should adjust
> >> elsewhere accordingly.
>
> > We do that in other places.  See get_object_address() for another
> > example.  Now, I don't see *why* they do it.
>
> If it's a big struct then avoiding copying it is good; but RelFileLocator
> isn't that big.
>
> While researching that statement I did happen to notice that no one has
> bothered to update the comment immediately above struct RelFileLocator,
> and it is something that absolutely does require attention if there
> are plans to make RelFileNumber something other than 32 bits.

I think we need to update this comment in the patch where we are
making RelFileNumber 64 bits wide.  But as such I do not see a problem
in using RelFileLocator directly as a key, because if we make
RelFileNumber 64 bits then the structure will be 8-byte aligned and
there should not be any padding.  However, if we use some other
structure as the key which contains RelFileLocator, i.e.
RelFileLocatorBackend, then there will be a problem.  So to handle
that issue, while computing the key size (wherever we have
RelFileLocatorBackend as the key) I have excluded the padding bytes
by introducing this new macro [1].

[1]
#define SizeOfRelFileLocatorBackend \
(offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId))
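
A hedged sketch of how that key size would be used when creating a hash table keyed by RelFileLocatorBackend (the entry type and table name here are hypothetical; HASH_BLOBS means the key is compared with memcmp over exactly keysize bytes, which is why the trailing padding must be excluded):

static HTAB *
create_demo_hash(void)
{
    HASHCTL     ctl;

    ctl.keysize = SizeOfRelFileLocatorBackend;
    ctl.entrysize = sizeof(DemoHashEntry);  /* hypothetical entry embedding the key */

    return hash_create("demo RelFileLocatorBackend hash", 128, &ctl,
                       HASH_ELEM | HASH_BLOBS);
}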

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Fri, Jul 29, 2022 at 10:55 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jul 28, 2022 at 10:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > I have done some cleanup in 0002 as well, basically, earlier we were
> > > storing the result of the BufTagGetRelFileLocator() in a separate
> > > variable which is not required everywhere.  So wherever possible I
> > > have avoided using the intermediate variable.
> >
> > I'll have a look at this next.
>
> I was taught that when programming in C one should avoid returning a
> struct type, as BufTagGetRelFileLocator does. I would have expected it
> to return void and take an argument of type RelFileLocator * into
> which it writes the results. On the other hand, I was also taught that
> one should avoid passing a struct type as an argument, and smgropen()
> has been doing that since Tom Lane committed
> 87bd95638552b8fc1f5f787ce5b862bb6fc2eb80 all the way back in 2004. So
> maybe this isn't that relevant any more on modern compilers? Or maybe
> for small structs it doesn't matter much? I dunno.
>
> Other than that, I think your 0002 looks fine.

Generally I try to avoid it, but I see that the current code also
does it this way when the structure is small and directly returning
the structure makes the other code easier [1].  The reasons I wanted
to do it this way are: a) if we pass it as an argument then I have to
use an extra variable, which makes some code more complicated; it's
not a big issue, in fact I had it that way in the previous version but
simplified it in one of the recent versions.  b) if I allocate memory
and return a pointer then I also need to store that address and free
it later.

[1]
static inline ForEachState
for_each_from_setup(const List *lst, int N)
{
ForEachState r = {lst, N};

Assert(N >= 0);
return r;
}

static inline FullTransactionId
FullTransactionIdFromEpochAndXid(uint32 epoch, TransactionId xid)
{
FullTransactionId result;

result.value = ((uint64) epoch) << 32 | xid;

return result;
}


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Fri, Jul 29, 2022 at 8:02 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
>
> +                   ereport(ERROR,
> +                           (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                            errmsg("relfilenode" INT64_FORMAT " is too large to be represented as an OID",
> +                                   fctx->record[i].relfilenumber),
> +                            errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE")));
>
> I think it would be good to recommend that users upgrade to the latest version instead of just saying to upgrade
> pg_buffercache using ALTER EXTENSION ....
 

This error would be hit if the relfilenumber is out of the OID range,
which means the user is using a new cluster but an old pg_buffercache
extension.  So this errhint is suggesting to upgrade the
extension.

> ==
>
> --- a/contrib/pg_walinspect/sql/pg_walinspect.sql
> +++ b/contrib/pg_walinspect/sql/pg_walinspect.sql
> @@ -39,10 +39,10 @@ SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal(:'wal_lsn1');
>  -- Test for filtering out WAL records of a particular table
>  -- ===================================================================
>
> -SELECT oid AS sample_tbl_oid FROM pg_class WHERE relname = 'sample_tbl' \gset
> +SELECT relfilenode AS sample_tbl_relfilenode FROM pg_class WHERE relname = 'sample_tbl' \gset
>
> Is this change required? The original query is just trying to fetch the table OID, not the relfilenode, and AFAIK we
> haven't changed anything about table OIDs.
 

If you look at the complete test, you will see that
sample_tbl_oid is used for verification in
pg_get_wal_records_info().  Earlier it was okay to use the OID
instead of the relfilenode, because this test case just creates a
table, does some DML, and verifies the OID in the WAL, which used to
be the same as the relfilenode; but that is no longer true.  So we
have to check the relfilenode, which was the actual intention of the
test.


>
> +    * Generate a new relfilenumber.  We cannot reuse the old relfilenumber
> +    * because of the possibility that that relation will be moved back to the
>
> that that relation -> that relation
>

I think this is a grammatically correct sentence.

I have fixed the other comments, and also addressed Alvaro's comments
by using %lld instead of INT64_FORMAT inside ereport and wherever else
he suggested.

I haven't yet changed MAX_RELFILENUMBER to be expressed in hex,
because then we would have to change the filenames as well.  I think
there is no conclusion yet on whether we want to keep it as it is or
switch to hex.  And there is another suggestion to change one of the
existing elogs to an ereport, so I will share a separate patch for
that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Sat, Jul 30, 2022 at 1:59 AM Robert Haas <robertmhaas@gmail.com> wrote:

> One solution to all this is to do as Dilip proposes here: for system
> relations, keep assigning the OID as the initial relfilenumber.
> Actually, we really only need to do this for pg_largeobject; all the
> other relfilenumber values could be assigned from a counter, as long
> as they're assigned from a range distinct from what we use for user
> relations.
>
> But I don't really like that, because I feel like the whole thing
> where we start out with relfilenumber=oid is a recipe for hidden bugs.
> I believe we'd be better off if we decouple those concepts more
> thoroughly. So here's another idea: what if we set the
> next-relfilenumber counter for the new cluster to the value from the
> old cluster, and then rewrote all the (thus-far-empty) system tables?

You mean that in a new cluster we start the next-relfilenumber counter
from the highest relfilenode/OID value in the old cluster, right?  Yeah,
if we start next-relfilenumber after the range of the old cluster then
we can also avoid the SetNextRelFileNumber() logic during upgrade.

My very initial idea around this was to start the next-relfilenumber
directly from 4 billion in the new cluster, so there cannot be any
conflict and we don't even need to identify the highest relfilenode
value used in the old cluster.  In fact we don't need to rewrite the
system tables before upgrading, I think.  So what do we lose with this?
Just 4 billion relfilenumbers?  Does that really matter, given the range
we get with a 56-bit relfilenumber?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Thu, Aug 4, 2022 at 5:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Jul 30, 2022 at 1:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > One solution to all this is to do as Dilip proposes here: for system
> > relations, keep assigning the OID as the initial relfilenumber.
> > Actually, we really only need to do this for pg_largeobject; all the
> > other relfilenumber values could be assigned from a counter, as long
> > as they're assigned from a range distinct from what we use for user
> > relations.
> >
> > But I don't really like that, because I feel like the whole thing
> > where we start out with relfilenumber=oid is a recipe for hidden bugs.
> > I believe we'd be better off if we decouple those concepts more
> > thoroughly. So here's another idea: what if we set the
> > next-relfilenumber counter for the new cluster to the value from the
> > old cluster, and then rewrote all the (thus-far-empty) system tables?
>
> You mean in a new cluster start the next-relfilenumber counter from
> the highest relfilenode/Oid value in the old cluster right?.  Yeah, if
> we start next-relfilenumber after the range of the old cluster then we
> can also avoid the logic of SetNextRelFileNumber() during upgrade.
>
> My very initial idea around this was to start the next-relfilenumber
> directly from the 4 billion in the new cluster so there can not be any
> conflict and we don't even need to identify the highest value of used
> relfilenode in the old cluster.  In fact we don't need to rewrite the
> system table before upgrading I think.  So what do we lose with this?
> just 4 billion relfilenode? does that really matter provided the range
> we get with the 56 bits relfilenumber.

I think even if we start the range from 4 billion we cannot avoid
keeping two separate ranges for system and user tables; otherwise the
next upgrade, where the old and new clusters both have 56-bit
relfilenumbers, will get conflicting files.  And, for the same reason,
we still have to call SetNextRelFileNumber() during upgrade.

So the idea is that we will have two ranges for relfilenumbers: the
system range will start at 4 billion and the user range maybe somewhere
around 4.1 billion (I think we can keep the gap very small though, just
reserve 50k relfilenumbers for the system for future expansion and
start the user range from there).

So now the system tables have no issues and the user tables from the
old cluster have no issues either.  But pg_largeobject might get a
conflict when both the old and new clusters are using 56-bit
relfilenumbers, because it is possible that in the new cluster some
other system table gets the relfilenumber that pg_largeobject used in
the old cluster.

This could be resolved if we allocate pg_largeobject's relfilenumber
from the user range, meaning this relfilenumber will always be the
first value of the user range.  Then, if the old and new clusters are
both using 56-bit relfilenumbers, pg_largeobject in both clusters will
have gotten the same relfilenumber; and if the old cluster is using
the current 32-bit relfilenode system, the whole range of the new
cluster is completely different from that of the old cluster.
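
To make the proposed split concrete, a purely hypothetical sketch of the boundaries involved (the constant names and exact values are illustrative only, not from the patch; 2^32 is used so the system range sits just above anything an old 32-bit cluster could have used):

/* hypothetical boundaries for the two-range proposal */
#define FIRST_SYSTEM_RELFILENUMBER  UINT64CONST(0x100000000)    /* 2^32 */
#define FIRST_USER_RELFILENUMBER    (FIRST_SYSTEM_RELFILENUMBER + 50000)

/* pg_largeobject would always take the first value of the user range */
#define LARGEOBJECT_RELFILENUMBER   FIRST_USER_RELFILENUMBER

static inline bool
relfilenumber_in_system_range(uint64 relfilenumber)
{
    return relfilenumber >= FIRST_SYSTEM_RELFILENUMBER &&
           relfilenumber < FIRST_USER_RELFILENUMBER;
}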

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Robert Haas
Date:
On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I think even if we start the range from the 4 billion we can not avoid
> keeping two separate ranges for system and user tables otherwise the
> next upgrade where old and new clusters both have 56 bits
> relfilenumber will get conflicting files.  And, for the same reason we
> still have to call SetNextRelFileNumber() during upgrade.

Well, my proposal to move everything from the new cluster up to higher
numbers would address this without requiring two ranges.

> So the idea is, we will be having 2 ranges for relfilenumbers, system
> range will start from 4 billion and user range maybe something around
> 4.1 (I think we can keep it very small though, just reserve 50k
> relfilenumber for system for future expansion and start user range
> from there).

A disadvantage of this is that it basically means all the file names
in new clusters are going to be 10 characters long. That's not a big
disadvantage, but it's not wonderful. File names that are only 5-7
characters long are common today, and easier to remember.

> So now system tables have no issues and also the user tables from the
> old cluster have no issues.  But pg_largeobject might get conflict
> when both old and new cluster are using 56 bits relfilenumber, because
> it is possible that in the new cluster some other system table gets
> that relfilenumber which is used by pg_largeobject in the old cluster.
>
> This could be resolved if we allocate pg_largeobject's relfilenumber
> from the user range, that means this relfilenumber will always be the
> first value from the user range.  So now if the old and new cluster
> both are using 56bits relfilenumber then pg_largeobject in both
> cluster would have got the same relfilenumber and if the old cluster
> is using the current 32 bits relfilenode system then the whole range
> of the new cluster is completely different than that of the old
> cluster.

I think this can work, but it does rely to some extent on the fact
that there are no other tables which need to be treated like
pg_largeobject. If there were others, they'd need fixed starting
RelFileNumber assignments, or some other trick, like renumbering them
twice in the cluster, first to a known-unused value and then back to
the proper value. You'd have trouble if in the other cluster
pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new
cluster the reverse, without some hackery.

I do feel like your idea here has some advantages - my proposal
requires rewriting all the catalogs in the new cluster before we do
anything else, and that's going to take some time even though they
should be small. But I also feel like it has some disadvantages: it
seems to rely on complicated reasoning and special cases more than I'd
like.

What do other people think?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Tue, Aug 9, 2022 at 8:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I think even if we start the range from the 4 billion we can not avoid
> > keeping two separate ranges for system and user tables otherwise the
> > next upgrade where old and new clusters both have 56 bits
> > relfilenumber will get conflicting files.  And, for the same reason we
> > still have to call SetNextRelFileNumber() during upgrade.
>
> Well, my proposal to move everything from the new cluster up to higher
> numbers would address this without requiring two ranges.
>
> > So the idea is, we will be having 2 ranges for relfilenumbers, system
> > range will start from 4 billion and user range maybe something around
> > 4.1 (I think we can keep it very small though, just reserve 50k
> > relfilenumber for system for future expansion and start user range
> > from there).
>
> A disadvantage of this is that it basically means all the file names
> in new clusters are going to be 10 characters long. That's not a big
> disadvantage, but it's not wonderful. File names that are only 5-7
> characters long are common today, and easier to remember.

That's correct.

> > So now system tables have no issues and also the user tables from the
> > old cluster have no issues.  But pg_largeobject might get conflict
> > when both old and new cluster are using 56 bits relfilenumber, because
> > it is possible that in the new cluster some other system table gets
> > that relfilenumber which is used by pg_largeobject in the old cluster.
> >
> > This could be resolved if we allocate pg_largeobject's relfilenumber
> > from the user range, that means this relfilenumber will always be the
> > first value from the user range.  So now if the old and new cluster
> > both are using 56bits relfilenumber then pg_largeobject in both
> > cluster would have got the same relfilenumber and if the old cluster
> > is using the current 32 bits relfilenode system then the whole range
> > of the new cluster is completely different than that of the old
> > cluster.
>
> I think this can work, but it does rely to some extent on the fact
> that there are no other tables which need to be treated like
> pg_largeobject. If there were others, they'd need fixed starting
> RelFileNumber assignments, or some other trick, like renumbering them
> twice in the cluster, first two a known-unused value and then back to
> the proper value. You'd have trouble if in the other cluster
> pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new
> cluster the reverse, without some hackery.

Agreed, if there are more catalogs like pg_largeobject then it would
require some hacking.

> I do feel like your idea here has some advantages - my proposal
> requires rewriting all the catalogs in the new cluster before we do
> anything else, and that's going to take some time even though they
> should be small. But I also feel like it has some disadvantages: it
> seems to rely on complicated reasoning and special cases more than I'd
> like.

One other advantage of your approach is that we start the
"nextrelfilenumber" after the old cluster's relfilenumber range.
So we only need to set the "nextrelfilenumber" once at the beginning;
after that, while upgrading each object, we don't need to set the
nextrelfilenumber every time because it is already higher than the
complete old cluster range.  In the other two approaches we would have
to set the nextrelfilenumber every time we preserve a relfilenumber
during the upgrade.

Other than these two approaches, we have another approach (what the
patch set is already doing) where we keep the system relfilenumber
range the same as the OID.  I know you have already pointed out that
this might have some hidden bugs, but one advantage of this approach is
that it is simple compared to the above two approaches, in the sense
that it doesn't need to maintain two ranges and it also doesn't need to
rewrite all the system tables in the new cluster.  So I think it would
be good if we can get others' opinions on all these three approaches.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Thu, Aug 11, 2022 at 10:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 9, 2022 at 8:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > I think even if we start the range from the 4 billion we can not avoid
> > > keeping two separate ranges for system and user tables otherwise the
> > > next upgrade where old and new clusters both have 56 bits
> > > relfilenumber will get conflicting files.  And, for the same reason we
> > > still have to call SetNextRelFileNumber() during upgrade.
> >
> > Well, my proposal to move everything from the new cluster up to higher
> > numbers would address this without requiring two ranges.
> >
> > > So the idea is, we will be having 2 ranges for relfilenumbers, system
> > > range will start from 4 billion and user range maybe something around
> > > 4.1 (I think we can keep it very small though, just reserve 50k
> > > relfilenumber for system for future expansion and start user range
> > > from there).
> >
> > A disadvantage of this is that it basically means all the file names
> > in new clusters are going to be 10 characters long. That's not a big
> > disadvantage, but it's not wonderful. File names that are only 5-7
> > characters long are common today, and easier to remember.
>
> That's correct.
>
> > > So now system tables have no issues and also the user tables from the
> > > old cluster have no issues.  But pg_largeobject might get conflict
> > > when both old and new cluster are using 56 bits relfilenumber, because
> > > it is possible that in the new cluster some other system table gets
> > > that relfilenumber which is used by pg_largeobject in the old cluster.
> > >
> > > This could be resolved if we allocate pg_largeobject's relfilenumber
> > > from the user range, that means this relfilenumber will always be the
> > > first value from the user range.  So now if the old and new cluster
> > > both are using 56bits relfilenumber then pg_largeobject in both
> > > cluster would have got the same relfilenumber and if the old cluster
> > > is using the current 32 bits relfilenode system then the whole range
> > > of the new cluster is completely different than that of the old
> > > cluster.
> >
> > I think this can work, but it does rely to some extent on the fact
> > that there are no other tables which need to be treated like
> > pg_largeobject. If there were others, they'd need fixed starting
> > RelFileNumber assignments, or some other trick, like renumbering them
> > twice in the cluster, first two a known-unused value and then back to
> > the proper value. You'd have trouble if in the other cluster
> > pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new
> > cluster the reverse, without some hackery.
>
> Agree, if it has more catalog like pg_largeobject then it would
> require some hacking.
>
> > I do feel like your idea here has some advantages - my proposal
> > requires rewriting all the catalogs in the new cluster before we do
> > anything else, and that's going to take some time even though they
> > should be small. But I also feel like it has some disadvantages: it
> > seems to rely on complicated reasoning and special cases more than I'd
> > like.
>
> One other advantage with your approach is that since we are starting
> the "nextrelfilenumber" after the old cluster's relfilenumber range.
> So only at the beginning we need to set the "nextrelfilenumber" but
> after that while upgrading each object we don't need to set the
> nextrelfilenumber every time because that is already higher than the
> complete old cluster range.  In other 2 approaches we will have to try
> to set the nextrelfilenumber everytime we preserve the relfilenumber
> during upgrade.

I was also thinking about whether we will get the max "relfilenumber"
from the old cluster at the cluster level or per database.  I mean, if
we want it per database we can run a simple query on pg_class and get
it, but there we will also need to see how to handle mapped relations
if they have been rewritten.  I don't think we can get the max
relfilenumber from the old cluster at the cluster level.  Maybe in
newer versions we can expose a function from the server to just return
the NextRelFileNumber, and that would be the max relfilenumber, but
I'm not sure how to do that in the old version.
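
For what it's worth, a hedged sketch of what such a per-database query could look like from pg_upgrade, using the existing pg_relation_filenode() function so that mapped relations are resolved through the relmapper (the helper itself is hypothetical and error handling is omitted):

/* Hypothetical helper: largest relfilenode in use in one database. */
static uint64
get_max_relfilenumber(PGconn *conn)
{
    PGresult   *res;
    uint64      max_relfilenumber;

    res = PQexec(conn,
                 "SELECT max(pg_relation_filenode(oid)) FROM pg_class "
                 "WHERE relkind IN ('r', 'i', 't', 'm', 'S')");

    max_relfilenumber = strtoull(PQgetvalue(res, 0, 0), NULL, 10);
    PQclear(res);
    return max_relfilenumber;
}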

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Amit Kapila
Date:
On Sat, Jul 30, 2022 at 1:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jul 20, 2022 at 7:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > There was also an issue where the user table from the old cluster's
> > relfilenode could conflict with the system table of the new cluster.
> > As a solution currently for system table object (while creating
> > storage first time) we are keeping the low range of relfilenumber,
> > basically we are using the same relfilenumber as OID so that during
> > upgrade the normal user table from the old cluster will not conflict
> > with the system tables in the new cluster.  But with this solution
> > Robert told me (in off list chat) a problem that in future if we want
> > to make relfilenumber completely unique within a cluster by
> > implementing the CREATEDB differently then we can not do that as we
> > have created fixed relfilenodes for the system tables.
> >
> > I am not sure what exactly we can do to avoid that because even if we
> > do something  to avoid that in the new cluster the old cluster might
> > be already using the non-unique relfilenode so after upgrading the new
> > cluster will also get those non-unique relfilenode.
>
> I think this aspect of the patch could use some more discussion.
>
> To recap, the problem is that pg_upgrade mustn't discover that a
> relfilenode that is being migrated from the old cluster is being used
> for some other table in the new cluster. Since the new cluster should
> only contain system tables that we assume have never been rewritten,
> they'll all have relfilenodes equal to their OIDs, and thus less than
> 16384. On the other hand all the user tables from the old cluster will
> have relfilenodes greater than 16384, so we're fine. pg_largeobject,
> which also gets migrated, is a special case. Since we don't change OID
> assignments from version to version, it should have either the same
> relfilenode value in the old and new clusters, if never rewritten, or
> else the value in the old cluster will be greater than 16384, in which
> case no conflict is possible.
>
> But if we just assign all relfilenode values from a central counter,
> then we have got trouble. If the new version has more system catalog
> tables than the old version, then some value that got used for a user
> table in the old version might get used for a system table in the new
> version, which is a problem. One idea for fixing this is to have two
> RelFileNumber ranges: a system range (small values) and a user range.
> System tables get values in the system range initially, and in the
> user range when first rewritten. User tables always get values in the
> user range. Everything works fine in this scenario except maybe for
> pg_largeobject: what if it gets one value from the system range in the
> old cluster, and a different value from the system range in the new
> cluster, but some other system table in the new cluster gets the value
> that pg_largeobject had in the old cluster? Then we've got trouble.
>

To solve that problem, how about rewriting the system table in the new
cluster which has a conflicting relfilenode? I think we can probably
do this conflict checking before processing the tables from the old
cluster.

-- 
With Regards,
Amit Kapila.



Re: making relfilenodes 56 bits

From
Amit Kapila
Date:
On Mon, Aug 22, 2022 at 1:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jul 30, 2022 at 1:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Wed, Jul 20, 2022 at 7:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > There was also an issue where the user table from the old cluster's
> > > relfilenode could conflict with the system table of the new cluster.
> > > As a solution currently for system table object (while creating
> > > storage first time) we are keeping the low range of relfilenumber,
> > > basically we are using the same relfilenumber as OID so that during
> > > upgrade the normal user table from the old cluster will not conflict
> > > with the system tables in the new cluster.  But with this solution
> > > Robert told me (in off list chat) a problem that in future if we want
> > > to make relfilenumber completely unique within a cluster by
> > > implementing the CREATEDB differently then we can not do that as we
> > > have created fixed relfilenodes for the system tables.
> > >
> > > I am not sure what exactly we can do to avoid that because even if we
> > > do something  to avoid that in the new cluster the old cluster might
> > > be already using the non-unique relfilenode so after upgrading the new
> > > cluster will also get those non-unique relfilenode.
> >
> > I think this aspect of the patch could use some more discussion.
> >
> > To recap, the problem is that pg_upgrade mustn't discover that a
> > relfilenode that is being migrated from the old cluster is being used
> > for some other table in the new cluster. Since the new cluster should
> > only contain system tables that we assume have never been rewritten,
> > they'll all have relfilenodes equal to their OIDs, and thus less than
> > 16384. On the other hand all the user tables from the old cluster will
> > have relfilenodes greater than 16384, so we're fine. pg_largeobject,
> > which also gets migrated, is a special case. Since we don't change OID
> > assignments from version to version, it should have either the same
> > relfilenode value in the old and new clusters, if never rewritten, or
> > else the value in the old cluster will be greater than 16384, in which
> > case no conflict is possible.
> >
> > But if we just assign all relfilenode values from a central counter,
> > then we have got trouble. If the new version has more system catalog
> > tables than the old version, then some value that got used for a user
> > table in the old version might get used for a system table in the new
> > version, which is a problem. One idea for fixing this is to have two
> > RelFileNumber ranges: a system range (small values) and a user range.
> > System tables get values in the system range initially, and in the
> > user range when first rewritten. User tables always get values in the
> > user range. Everything works fine in this scenario except maybe for
> > pg_largeobject: what if it gets one value from the system range in the
> > old cluster, and a different value from the system range in the new
> > cluster, but some other system table in the new cluster gets the value
> > that pg_largeobject had in the old cluster? Then we've got trouble.
> >
>
> To solve that problem, how about rewriting the system table in the new
> cluster which has a conflicting relfilenode? I think we can probably
> do this conflict checking before processing the tables from the old
> cluster.
>

I think while rewriting a system table during the upgrade, we need to
ensure that it gets a relfilenumber from the system range; otherwise, if
we allocate it from the user range, there will be a chance of a conflict
with the user tables from the old cluster. Another way could be to set
the next-relfilenumber counter for the new cluster to the value from
the old cluster as mentioned by Robert in his previous email [1].

[1] -
https://www.postgresql.org/message-id/CA%2BTgmoYsNiF8JGZ%2BKp7Zgcct67Qk%2B%2BYAp%2B1ybOQ0qomUayn%2B7A%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: making relfilenodes 56 bits

From
Robert Haas
Date:
On Mon, Aug 22, 2022 at 3:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> To solve that problem, how about rewriting the system table in the new
> cluster which has a conflicting relfilenode? I think we can probably
> do this conflict checking before processing the tables from the old
> cluster.

Thanks for chiming in.

Right now, there are two parts to the relfilenumber preservation
system, and this scheme doesn't quite fit into either of them. First,
the dump includes commands to set pg_class.relfilenode in the new
cluster to the same value that it had in the old cluster. The dump
can't include any SQL commands that depend on what's happening in the
new cluster because pg_dump(all) only connects to a single cluster,
which in this case is the old cluster. Second, pg_upgrade itself
copies the files from the old cluster to the new cluster. This doesn't
involve a database connection at all. So there's no part of the
current relfilenode preservation mechanism that can look at the old
AND the new database and decide on some SQL to execute against the new
database.

I thought for a while that we could use the information that's already
gathered by get_rel_infos() to do what you're suggesting here, but it
doesn't quite work, because that function excludes system tables, and
we can't afford to do that here. We'd either need to modify that query
to include system tables - at least for the new cluster - or run a
separate one to gather information about system tables in the new
cluster. Then, we could put all the pg_class.relfilenode values we
found in the new cluster into a hash table, loop over the list of rels
this function found in the old cluster, and for each one, probe into
the hash table. If we find a match, that's a system table that needs
to be moved out of the way before calling create_new_objects(), or
maybe inside that function but before it runs pg_restore.
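
A rough sketch of that check in pg_upgrade-style C (the set type and the helper functions are hypothetical, not existing pg_upgrade APIs; the loop structure follows pg_upgrade's existing ClusterInfo/DbInfo/RelInfo arrays):

/* Hypothetical: move conflicting new-cluster system tables out of the way. */
static void
resolve_relfilenumber_conflicts(void)
{
    /* hypothetical set of every relfilenode used by the new cluster */
    RelFileNumberSet *used = collect_new_cluster_relfilenumbers(&new_cluster);

    for (int dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
    {
        RelInfoArr *rel_arr = &old_cluster.dbarr.dbs[dbnum].rel_arr;

        for (int relnum = 0; relnum < rel_arr->nrels; relnum++)
        {
            RelInfo    *rel = &rel_arr->rels[relnum];

            if (relfilenumber_set_contains(used, rel->relfilenumber))
                /* a new-cluster system table is in the way; rewrite it first */
                schedule_system_table_rewrite(&new_cluster, rel->relfilenumber);
        }
    }
}

relfilenumber_set_contains() could simply probe a hash table built from the new cluster's pg_class.relfilenode values, as described above.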

That doesn't seem too crazy, I think. It's a little bit of new
mechanism, but it doesn't sound horrific. It's got the advantage of
being significantly cheaper than my proposal of moving everything out
of the way unconditionally, and at the same time it retains one of the
key advantages of that proposal - IMV, anyway - which is that we don't
need separate relfilenode ranges for user and system objects any more.
So I guess on balance I kind of like it, but maybe I'm missing
something.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From
Dilip Kumar
Date:
On Tue, Aug 23, 2022 at 1:46 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Aug 22, 2022 at 3:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > To solve that problem, how about rewriting the system table in the new
> > cluster which has a conflicting relfilenode? I think we can probably
> > do this conflict checking before processing the tables from the old
> > cluster.
>
> Thanks for chiming in.
>
> Right now, there are two parts to the relfilenumber preservation
> system, and this scheme doesn't quite fit into either of them. First,
> the dump includes commands to set pg_class.relfilenode in the new
> cluster to the same value that it had in the old cluster. The dump
> can't include any SQL commands that depend on what's happening in the
> new cluster because pg_dump(all) only connects to a single cluster,
> which in this case is the old cluster. Second, pg_upgrade itself
> copies the files from the old cluster to the new cluster. This doesn't
> involve a database connection at all. So there's no part of the
> current relfilenode preservation mechanism that can look at the old
> AND the new database and decide on some SQL to execute against the new
> database.
>
> I thought for a while that we could use the information that's already
> gathered by get_rel_infos() to do what you're suggesting here, but it
> doesn't quite work, because that function excludes system tables, and
> we can't afford to do that here. We'd either need to modify that query
> to include system tables - at least for the new cluster - or run a
> separate one to gather information about system tables in the new
> cluster. Then, we could put all the pg_class.relfilenode values we
> found in the new cluster into a hash table, loop over the list of rels
> this function found in the old cluster, and for each one, probe into
> the hash table. If we find a match, that's a system table that needs
> to be moved out of the way before calling create_new_objects(), or
> maybe inside that function but before it runs pg_restore.
>
> That doesn't seem too crazy, I think. It's a little bit of new
> mechanism, but it doesn't sound horrific. It's got the advantage of
> being significantly cheaper than my proposal of moving everything out
> of the way unconditionally, and at the same time it retains one of the
> key advantages of that proposal - IMV, anyway - which is that we don't
> need separate relfilenode ranges for user and system objects any more.
> So I guess on balance I kind of like it, but maybe I'm missing
> something.

Okay, so this seems exactly the same as your previous proposal but
instead of unconditionally rewriting all the system tables we will
rewrite only those that conflict with a user table or pg_largeobject from
the previous cluster.  Although it might have additional
implementation complexity on the pg_upgrade side, it seems cheaper
than rewriting everything.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Tue, Aug 23, 2022 at 8:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 23, 2022 at 1:46 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Mon, Aug 22, 2022 at 3:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > To solve that problem, how about rewriting the system table in the new
> > > cluster which has a conflicting relfilenode? I think we can probably
> > > do this conflict checking before processing the tables from the old
> > > cluster.
> >
> > Thanks for chiming in.
> >
> > Right now, there are two parts to the relfilenumber preservation
> > system, and this scheme doesn't quite fit into either of them. First,
> > the dump includes commands to set pg_class.relfilenode in the new
> > cluster to the same value that it had in the old cluster. The dump
> > can't include any SQL commands that depend on what's happening in the
> > new cluster because pg_dump(all) only connects to a single cluster,
> > which in this case is the old cluster. Second, pg_upgrade itself
> > copies the files from the old cluster to the new cluster. This doesn't
> > involve a database connection at all. So there's no part of the
> > current relfilenode preservation mechanism that can look at the old
> > AND the new database and decide on some SQL to execute against the new
> > database.
> >
> > I thought for a while that we could use the information that's already
> > gathered by get_rel_infos() to do what you're suggesting here, but it
> > doesn't quite work, because that function excludes system tables, and
> > we can't afford to do that here. We'd either need to modify that query
> > to include system tables - at least for the new cluster - or run a
> > separate one to gather information about system tables in the new
> > cluster. Then, we could put all the pg_class.relfilenode values we
> > found in the new cluster into a hash table, loop over the list of rels
> > this function found in the old cluster, and for each one, probe into
> > the hash table. If we find a match, that's a system table that needs
> > to be moved out of the way before calling create_new_objects(), or
> > maybe inside that function but before it runs pg_restore.
> >
> > That doesn't seem too crazy, I think. It's a little bit of new
> > mechanism, but it doesn't sound horrific. It's got the advantage of
> > being significantly cheaper than my proposal of moving everything out
> > of the way unconditionally, and at the same time it retains one of the
> > key advantages of that proposal - IMV, anyway - which is that we don't
> > need separate relfilenode ranges for user and system objects any more.
> > So I guess on balance I kind of like it, but maybe I'm missing
> > something.
>
> Okay, so this seems exactly the same as your previous proposal but
> instead of unconditionally rewriting all the system tables we will
> rewrite only those that conflict with a user table or pg_largeobject from
> the previous cluster.  Although it might have additional
> implementation complexity on the pg_upgrade side, it seems cheaper
> than rewriting everything.

OTOH, if we keep the two separate ranges for the user and system table
then we don't need all this complex logic of conflict checking.  From
the old cluster, we can just remember the relfilenumber of the
pg_largeobject, and in the new cluster before trying to restore we can
just query the new cluster pg_class and find out whether it is used by
any system table and if so then we can just rewrite that system table.
And I think using two ranges might not be that complicated because as
soon as we are done with the initdb we can just set NextRelFileNumber
to the first user-range relfilenumber, so I think this could be the
simplest solution.  And I think what Amit is suggesting is something
along these lines?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Amit Kapila
Дата:
On Tue, Aug 23, 2022 at 11:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 23, 2022 at 8:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Aug 23, 2022 at 1:46 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On Mon, Aug 22, 2022 at 3:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > To solve that problem, how about rewriting the system table in the new
> > > > cluster which has a conflicting relfilenode? I think we can probably
> > > > do this conflict checking before processing the tables from the old
> > > > cluster.
> > >
> > > Thanks for chiming in.
> > >
> > > Right now, there are two parts to the relfilenumber preservation
> > > system, and this scheme doesn't quite fit into either of them. First,
> > > the dump includes commands to set pg_class.relfilenode in the new
> > > cluster to the same value that it had in the old cluster. The dump
> > > can't include any SQL commands that depend on what's happening in the
> > > new cluster because pg_dump(all) only connects to a single cluster,
> > > which in this case is the old cluster. Second, pg_upgrade itself
> > > copies the files from the old cluster to the new cluster. This doesn't
> > > involve a database connection at all. So there's no part of the
> > > current relfilenode preservation mechanism that can look at the old
> > > AND the new database and decide on some SQL to execute against the new
> > > database.
> > >
> > > I thought for a while that we could use the information that's already
> > > gathered by get_rel_infos() to do what you're suggesting here, but it
> > > doesn't quite work, because that function excludes system tables, and
> > > we can't afford to do that here. We'd either need to modify that query
> > > to include system tables - at least for the new cluster - or run a
> > > separate one to gather information about system tables in the new
> > > cluster. Then, we could put all the pg_class.relfilenode values we
> > > found in the new cluster into a hash table, loop over the list of rels
> > > this function found in the old cluster, and for each one, probe into
> > > the hash table. If we find a match, that's a system table that needs
> > > to be moved out of the way before calling create_new_objects(), or
> > > maybe inside that function but before it runs pg_restore.
> > >
> > > That doesn't seem too crazy, I think. It's a little bit of new
> > > mechanism, but it doesn't sound horrific. It's got the advantage of
> > > being significantly cheaper than my proposal of moving everything out
> > > of the way unconditionally, and at the same time it retains one of the
> > > key advantages of that proposal - IMV, anyway - which is that we don't
> > > need separate relfilenode ranges for user and system objects any more.
> > > So I guess on balance I kind of like it, but maybe I'm missing
> > > something.
> >
> > Okay, so this seems exactly the same as your previous proposal but
> > instead of unconditionally rewriting all the system tables we will
> > rewrite only those that conflict with a user table or pg_largeobject from
> > the previous cluster.  Although it might have additional
> > implementation complexity on the pg_upgrade side, it seems cheaper
> > than rewriting everything.
>
> OTOH, if we keep the two separate ranges for the user and system table
> then we don't need all this complex logic of conflict checking.  From
> the old cluster, we can just remember the relfilenumber of the
> pg_largeobject, and in the new cluster before trying to restore we can
> just query the new cluster pg_class and find out whether it is used by
> any system table and if so then we can just rewrite that system table.
>

Before re-write of that system table, I think we need to set
NextRelFileNumber to a number greater than the max relfilenumber from
the old cluster, otherwise, it can later conflict with some user
table.

> And I think using two ranges might not be that complicated because as
> soon as we are done with the initdb we can just set NextRelFileNumber
> to the first user range relfilenumber so I think this could be the
> simplest solution.  And I think what Amit is suggesting is something
> on this line?
>

Yeah, I had thought of checking only pg_largeobject. I think the
advantage of having separate ranges is that we have a somewhat simpler
logic in the upgrade but OTOH the other scheme has the advantage of
having a single allocation scheme. Do we see any other pros/cons of
one over the other?

One more thing we may want to think about is what if there are tables
created by extension? For example, I think BDR creates some tables
like node_group, conflict_history, etc. Now, I think if such an
extension is created by default, both old and new clusters will have
those tables. Isn't there a chance of relfilenumber conflict in such
cases?

-- 
With Regards,
Amit Kapila.



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Tue, Aug 23, 2022 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > OTOH, if we keep the two separate ranges for the user and system table
> > then we don't need all this complex logic of conflict checking.  From
> > the old cluster, we can just remember the relfilenumber of the
> > pg_largeobject, and in the new cluster before trying to restore we can
> > just query the new cluster pg_class and find out whether it is used by
> > any system table and if so then we can just rewrite that system table.
> >
>
> Before re-write of that system table, I think we need to set
> NextRelFileNumber to a number greater than the max relfilenumber from
> the old cluster, otherwise, it can later conflict with some user
> table.

Yes we will need to do that.

> > And I think using two ranges might not be that complicated because as
> > soon as we are done with the initdb we can just set NextRelFileNumber
> > to the first user range relfilenumber so I think this could be the
> > simplest solution.  And I think what Amit is suggesting is something
> > on this line?
> >
>
> Yeah, I had thought of checking only pg_largeobject. I think the
> advantage of having separate ranges is that we have a somewhat simpler
> logic in the upgrade but OTOH the other scheme has the advantage of
> having a single allocation scheme. Do we see any other pros/cons of
> one over the other?

I feel having a separate range is not much different from having a
single allocation scheme: after cluster initialization we just have to
set NextRelFileNumber to something like FirstNormalRelFileNumber, which
looks fine to me.
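
Just to illustrate what I mean, a minimal sketch (the name
FirstNormalRelFileNumber and its value are assumptions, nothing like
this is in the posted patch yet):

#define FirstNormalRelFileNumber	((RelFileNumber) 100000)

/*
 * At the end of bootstrap/initdb: leave a gap after the catalog
 * relfilenumbers and start all later allocations from the user range.
 */
if (ShmemVariableCache->nextRelFileNumber < FirstNormalRelFileNumber)
    ShmemVariableCache->nextRelFileNumber = FirstNormalRelFileNumber;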

> One more thing we may want to think about is what if there are tables
> created by extension? For example, I think BDR creates some tables
> like node_group, conflict_history, etc. Now, I think if such an
> extension is created by default, both old and new clusters will have
> those tables. Isn't there a chance of relfilenumber conflict in such
> cases?

Shouldn't they behave like normal user tables?  Before the upgrade the
new cluster cannot have any tables other than system tables anyway, and
the tables created by an extension should be restored the same way
other user tables are.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Tue, Aug 23, 2022 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> OTOH, if we keep the two separate ranges for the user and system table
> then we don't need all this complex logic of conflict checking.

True. That's the downside. The question is whether it's worth adding
some complexity to avoid needing separate ranges.

Honestly, if we don't care about having separate ranges, we can do
something even simpler and just make the starting relfilenumber for
system tables same as the OID. Then we don't have to do anything at
all, outside of not changing the OID assigned to pg_largeobject in a
future release. Then as long as pg_upgrade is targeting a new cluster
with completely fresh databases that have not had any system table
rewrites so far, there can't be any conflict.
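
In other words, the initial assignment would boil down to something
roughly like this (the condition and helpers are illustrative only, not
a concrete proposal):

/* hypothetical sketch of initial relfilenumber assignment */
if (IsCatalogRelationOid(relid))
    relfilenumber = relid;                  /* system table: reuse the OID */
else
    relfilenumber = GetNewRelFileNumber();  /* user table: 56-bit counter */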

And perhaps that is the best solution after all, but while it is
simple in terms of code, I feel it's a bit complicated for human
beings. It's very simple to understand the scheme that Amit proposed:
if there's anything in the new cluster that would conflict, we move it
out of the way. We don't have to assume the new cluster hasn't had any
table rewrites. We don't have to nail down starting relfilenumber
assignments for system tables. We don't have to worry about
relfilenumber or OID assignments changing between releases.
pg_largeobject is not a special case. There are no special ranges of
OIDs or relfilenumbers required. It just straight up works -- all the
time, no matter what, end of story.

The other schemes we're talking about here all require a bunch of
assumptions about stuff like what I just mentioned. We can certainly
do it that way, and maybe it's even for the best. But I feel like it's
a little bit fragile. Maybe some future change gets blocked because it
would break one of the assumptions that the system relies on, or maybe
someone doesn't even realize there's an issue and changes something
that introduces a bug into this system. Or on the other hand maybe
not. But I think there's at least some value in considering whether
adding a little more code might actually make things simpler to reason
about, and whether that might be a good enough reason to do it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Amit Kapila
Дата:
On Tue, Aug 23, 2022 at 8:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Aug 23, 2022 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > OTOH, if we keep the two separate ranges for the user and system table
> > then we don't need all this complex logic of conflict checking.
>
> True. That's the downside. The question is whether it's worth adding
> some complexity to avoid needing separate ranges.
>
> Honestly, if we don't care about having separate ranges, we can do
> something even simpler and just make the starting relfilenumber for
> system tables same as the OID. Then we don't have to do anything at
> all, outside of not changing the OID assigned to pg_largeobject in a
> future release. Then as long as pg_upgrade is targeting a new cluster
> with completely fresh databases that have not had any system table
> rewrites so far, there can't be any conflict.
>
> And perhaps that is the best solution after all, but while it is
> simple in terms of code, I feel it's a bit complicated for human
> beings. It's very simple to understand the scheme that Amit proposed:
> if there's anything in the new cluster that would conflict, we move it
> out of the way. We don't have to assume the new cluster hasn't had any
> table rewrites. We don't have to nail down starting relfilenumber
> assignments for system tables. We don't have to worry about
> relfilenumber or OID assignments changing between releases.
> pg_largeobject is not a special case. There are no special ranges of
> OIDs or relfilenumbers required. It just straight up works -- all the
> time, no matter what, end of story.
>

This sounds simple to understand. It seems we always create new system
tables in the new cluster before the upgrade, so I think it is safe to
assume there won't be any table rewrite in it. OTOH, if the
relfilenumber allocation scheme is robust enough to deal with table rewrites
then we probably don't need to worry about this assumption changing in
the future.

-- 
With Regards,
Amit Kapila.



Re: making relfilenodes 56 bits

От
Amit Kapila
Дата:
On Tue, Aug 23, 2022 at 3:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 23, 2022 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > One more thing we may want to think about is what if there are tables
> > created by extension? For example, I think BDR creates some tables
> > like node_group, conflict_history, etc. Now, I think if such an
> > extension is created by default, both old and new clusters will have
> > those tables. Isn't there a chance of relfilenumber conflict in such
> > cases?
>
> Shouldn't they behave as a normal user table? because before upgrade
> anyway new cluster can not have any table other than system tables and
> those tables created by an extension should also be restored as other
> user table does.
>

Right.

-- 
With Regards,
Amit Kapila.



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Mon, Aug 1, 2022 at 7:57 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have fixed other comments, and also fixed comments from Alvaro to
> use %lld instead of INT64_FORMAT inside the ereport and wherever he
> suggested.

Notwithstanding the ongoing discussion about the exact approach for
the main patch, it seemed OK to push the preparatory patch you posted
here, so I have now done that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Tue, Aug 23, 2022 at 8:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Aug 23, 2022 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > OTOH, if we keep the two separate ranges for the user and system table
> > then we don't need all this complex logic of conflict checking.
>
> True. That's the downside. The question is whether it's worth adding
> some complexity to avoid needing separate ranges.

Other than the complexity, we will have to probe the relfilenumbers of
all the user tables from the old cluster against the hash built on the
new cluster's relfilenumbers; isn't that extra overhead if there are a
lot of user tables?  But since we are already restoring all those
tables in the new cluster, compared to that it will be very small.

> Honestly, if we don't care about having separate ranges, we can do
> something even simpler and just make the starting relfilenumber for
> system tables same as the OID. Then we don't have to do anything at
> all, outside of not changing the OID assigned to pg_largeobject in a
> future release. Then as long as pg_upgrade is targeting a new cluster
> with completely fresh databases that have not had any system table
> rewrites so far, there can't be any conflict.

I think having the OID-based system and having two ranges are not
exactly the same.  If we use OID-based relfilenumber allocation for
system tables (initially) and then later allocate from the
nextRelFileNumber counter, it seems like a mix of the old system (where
OID and relfilenumber are tightly connected) and the new system (where
nextRelFileNumber is a completely independent counter).  OTOH, having
two ranges means we are logically not depending on the OID at all; we
are just allocating from a central counter, but after catalog
initialization we leave some gap and start from a new range.  So I
don't think this system is hard to explain.

> And perhaps that is the best solution after all, but while it is
> simple in terms of code, I feel it's a bit complicated for human
> beings. It's very simple to understand the scheme that Amit proposed:
> if there's anything in the new cluster that would conflict, we move it
> out of the way. We don't have to assume the new cluster hasn't had any
> table rewrites. We don't have to nail down starting relfilenumber
> assignments for system tables. We don't have to worry about
> relfilenumber or OID assignments changing between releases.
> pg_largeobject is not a special case. There are no special ranges of
> OIDs or relfilenumbers required. It just straight up works -- all the
> time, no matter what, end of story.

I agree that this system is easy to explain (we just rewrite anything
that conflicts), so it looks more future-proof.  Okay, I will try this
solution and post the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Thu, Aug 25, 2022 at 5:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> I agree on this that this system is easy to explain that we just
> rewrite anything that conflicts so looks more future-proof.  Okay, I
> will try this solution and post the patch.

While working on this solution I noticed one issue. Basically, the
problem is that during binary upgrade, when we try to rewrite a heap,
we expect that “binary_upgrade_next_heap_pg_class_oid” and
“binary_upgrade_next_heap_pg_class_relfilenumber” are already set for
creating the new heap. But we are not preserving anything at that
point, so we don't have those values. One option is to first start the
postmaster in non-binary-upgrade mode, perform all the conflict
checking and rewriting, and stop the postmaster.  Then start the
postmaster again and perform the restore as we do now.  Although we
would have to start/stop the postmaster one extra time, we would have a
solution.

But while thinking about this, it occurred to me that since we are now
completely decoupling the concepts of Oid and relfilenumber, logically
during a REWRITE we should only be allocating a new relfilenumber; we
don’t really need to allocate a new Oid at all.  We can do that if,
inside make_new_heap(), we pass OIDOldHeap to
heap_create_with_catalog(); then it will just create new storage (a new
relfilenumber) but not a new Oid.  But the problem is that the
ATRewriteTable() and finish_heap_swap() functions are completely based
on the relation cache.  So if we only create a new relfilenumber but
not a new Oid, we will have to change this infrastructure to copy at
the smgr level.

Thoughts?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Fri, Aug 26, 2022 at 7:01 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> While working on this solution I noticed one issue. Basically, the
> problem is that during binary upgrade when we try to rewrite a heap we
> would expect that “binary_upgrade_next_heap_pg_class_oid” and
> “binary_upgrade_next_heap_pg_class_relfilenumber” are already set for
> creating a new heap. But we are not preserving anything so we don't
> have those values. One option to this problem is that we can first
> start the postmaster in non-binary upgrade mode perform all conflict
> checking and rewrite and stop the postmaster.  Then start postmaster
> again and perform the restore as we are doing now.  Although we will
> have to start/stop the postmaster one extra time we have a solution.

Yeah, that seems OK. Or we could add a new function, like
binary_upgrade_allow_relation_oid_and_relfilenode_assignment(bool).
Not sure which way is better.
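
If we went the new-function route, I'm imagining something as simple
as this (purely illustrative; the flag it sets is made up and nothing
like it exists in the patch):

Datum
binary_upgrade_allow_relation_oid_and_relfilenode_assignment(PG_FUNCTION_ARGS)
{
    bool        allow = PG_GETARG_BOOL(0);

    CHECK_IS_BINARY_UPGRADE;

    /* hypothetical backend-local flag consulted by the assignment code */
    binary_upgrade_allow_assignment = allow;

    PG_RETURN_VOID();
}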

> But while thinking about this I started to think that since now we are
> completely decoupling the concept of Oid and relfilenumber then
> logically during REWRITE we should only be allocating new
> relfilenumber but we don’t really need to allocate the Oid at all.
> Yeah, we can do that if inside make_new_heap() if we pass the
> OIDOldHeap to heap_create_with_catalog(), then it will just create new
> storage(relfilenumber) but not a new Oid. But the problem is that the
> ATRewriteTable() and finish_heap_swap() functions are completely based
> on the relation cache.  So now if we only create a new relfilenumber
> but not a new Oid then we will have to change this infrastructure to
> copy at smgr level.

I think it would be a good idea to continue preserving the OIDs. If
nothing else, it makes debugging way easier, but also, there might be
user-defined regclass columns or something. Note the comments in
check_for_reg_data_type_usage().

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Fri, Aug 26, 2022 at 9:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Aug 26, 2022 at 7:01 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > While working on this solution I noticed one issue. Basically, the
> > problem is that during binary upgrade when we try to rewrite a heap we
> > would expect that “binary_upgrade_next_heap_pg_class_oid” and
> > “binary_upgrade_next_heap_pg_class_relfilenumber” are already set for
> > creating a new heap. But we are not preserving anything so we don't
> > have those values. One option to this problem is that we can first
> > start the postmaster in non-binary upgrade mode perform all conflict
> > checking and rewrite and stop the postmaster.  Then start postmaster
> > again and perform the restore as we are doing now.  Although we will
> > have to start/stop the postmaster one extra time we have a solution.
>
> Yeah, that seems OK. Or we could add a new function, like
> binary_upgrade_allow_relation_oid_and_relfilenode_assignment(bool).
> Not sure which way is better.

I have found one more issue with this approach of rewriting the
conflicting tables.  Earlier I thought we could do the conflict
checking and rewriting inside create_new_objects(), right before the
restore command.  But after implementing this (while testing it) I
realized that we DROP and CREATE the database while restoring the dump,
which means it will again generate the conflicting system tables.  So
theoretically the rewriting should go in between the CREATE DATABASE
and the restore of the objects, but as of now both the create database
and the restore of the other objects are part of a single dump file.  I
haven't yet analyzed how feasible it is to generate the dump in two
parts, the first part just to create the database and the second part
to restore the rest of the objects.

Thoughts?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Tue, Aug 30, 2022 at 8:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have found one more issue with this approach of rewriting the
> conflicting table.  Earlier I thought we could do the conflict
> checking and rewriting inside create_new_objects() right before the
> restore command.  But after implementing (while testing) this I
> realized that we DROP and CREATE the database while restoring the dump
> that means it will again generate the conflicting system tables.  So
> theoretically the rewriting should go in between the CREATE DATABASE
> and restoring the object but as of now both create database and
> restoring other objects are part of a single dump file.  I haven't yet
> analyzed how feasible it is to generate the dump in two parts, first
> part just to create the database and in second part restore the rest
> of the object.
>
> Thoughts?

Well, that's very awkward. It doesn't seem like it would be very
difficult to teach pg_upgrade to call pg_restore without --clean and
just do the drop database itself, but that doesn't really help,
because pg_restore will in any event be creating the new database.
That doesn't seem like something we can practically refactor out,
because only pg_dump knows what properties to use when creating the
new database. What we could do is have the dump include a command like
SELECT pg_binary_upgrade_move_things_out_of_the_way(some_arguments_here),
but that doesn't really help very much, because passing the whole list
of relfilenode values from the old database seems pretty certain to be
a bad idea. The whole idea here was that we'd be able to build a hash
table on the new database's system table OIDs, and it seems like
that's not going to work.

We could try to salvage some portion of the idea by making
pg_binary_upgrade_move_things_out_of_the_way() take a more restricted
set of arguments, like the smallest and largest relfilenode values
from the old database, and then we'd just need to move things that
overlap. But that feels pretty hit-or-miss to me as to whether it
actually avoids any work, and
pg_binary_upgrade_move_things_out_of_the_way() might also be annoying
to write. So perhaps we have to go back to the drawing board here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Tue, Aug 30, 2022 at 9:23 PM Robert Haas <robertmhaas@gmail.com> wrote:

> Well, that's very awkward. It doesn't seem like it would be very
> difficult to teach pg_upgrade to call pg_restore without --clean and
> just do the drop database itself, but that doesn't really help,
> because pg_restore will in any event be creating the new database.
> That doesn't seem like something we can practically refactor out,
> because only pg_dump knows what properties to use when creating the
> new database. What we could do is have the dump include a command like
> SELECT pg_binary_upgrade_move_things_out_of_the_way(some_arguments_here),
> but that doesn't really help very much, because passing the whole list
> of relfilenode values from the old database seems pretty certain to be
> a bad idea. The whole idea here was that we'd be able to build a hash
> table on the new database's system table OIDs, and it seems like
> that's not going to work.

Right.

> We could try to salvage some portion of the idea by making
> pg_binary_upgrade_move_things_out_of_the_way() take a more restricted
> set of arguments, like the smallest and largest relfilenode values
> from the old database, and then we'd just need to move things that
> overlap. But that feels pretty hit-or-miss to me as to whether it
> actually avoids any work, and
> pg_binary_upgrade_move_things_out_of_the_way() might also be annoying
> to write. So perhaps we have to go back to the drawing board here.

So as of now we have two open options: 1) the current approach, which
the patch follows, of using the Oid as the relfilenode for system
tables when they are initially created; 2) calling
pg_binary_upgrade_move_things_out_of_the_way(), which forces a rewrite
of all the system tables.

Another idea, though I am not very sure how feasible it is: can we
change the dump such that in binary upgrade mode it will not use
template0 as the template database (in the CREATE DATABASE command) but
instead some new database, e.g. template-XYZ?  Then, for conflict
checking, we would create this template-XYZ database on the new
cluster, perform all the conflict checks (against all the databases of
the old cluster) and do the rewrite operations on that database.  Later
all the databases would be created using template-XYZ as the template,
so all the rewriting we have done would remain intact.  The problems I
can think of are 1) only for a binary upgrade we would have to change
pg_dump, and 2) we would have to reserve another database name, but
what if that name is already in use in the old cluster?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Sat, Sep 3, 2022 at 1:50 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 30, 2022 at 9:23 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > Well, that's very awkward. It doesn't seem like it would be very
> > difficult to teach pg_upgrade to call pg_restore without --clean and
> > just do the drop database itself, but that doesn't really help,
> > because pg_restore will in any event be creating the new database.
> > That doesn't seem like something we can practically refactor out,
> > because only pg_dump knows what properties to use when creating the
> > new database. What we could do is have the dump include a command like
> > SELECT pg_binary_upgrade_move_things_out_of_the_way(some_arguments_here),
> > but that doesn't really help very much, because passing the whole list
> > of relfilenode values from the old database seems pretty certain to be
> > a bad idea. The whole idea here was that we'd be able to build a hash
> > table on the new database's system table OIDs, and it seems like
> > that's not going to work.
>
> Right.
>
> > We could try to salvage some portion of the idea by making
> > pg_binary_upgrade_move_things_out_of_the_way() take a more restricted
> > set of arguments, like the smallest and largest relfilenode values
> > from the old database, and then we'd just need to move things that
> > overlap. But that feels pretty hit-or-miss to me as to whether it
> > actually avoids any work, and
> > pg_binary_upgrade_move_things_out_of_the_way() might also be annoying
> > to write. So perhaps we have to go back to the drawing board here.
>
> So as of now, we have two open options 1) the current approach and
> what patch is following to use Oid as relfilenode for the system
> tables when initially created.  2) call
> pg_binary_upgrade_move_things_out_of_the_way() which force rewrite all
> the system tables.
>
> Another idea that I am not very sure how feasible is. Can we change
> the dump such that in binary upgrade mode it will not use template0 as
> a template database (in creating database command) but instead some
> new database as a template e.g. template-XYZ?   And later for conflict
> checking, we will create this template-XYZ database on the new cluster
> and then we will perform all the conflict check (from all the
> databases of the old cluster) and rewrite operations on this database.
> And later all the databases will be created using template-XYZ as the
> template and all the rewriting stuff we have done is still intact.
> The problems I could think of are 1) only for a binary upgrade we will
> have to change the pg_dump.  2) we will have to use another database
> name as the reserved database name but what if that name is already in
> use in the previous cluster?

While we are still thinking about this issue, I have rebased the patch on
the latest head and fixed a couple of minor issues.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От
Amit Kapila
Дата:
On Tue, Aug 30, 2022 at 6:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Aug 26, 2022 at 9:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Fri, Aug 26, 2022 at 7:01 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > While working on this solution I noticed one issue. Basically, the
> > > problem is that during binary upgrade when we try to rewrite a heap we
> > > would expect that “binary_upgrade_next_heap_pg_class_oid” and
> > > “binary_upgrade_next_heap_pg_class_relfilenumber” are already set for
> > > creating a new heap. But we are not preserving anything so we don't
> > > have those values. One option to this problem is that we can first
> > > start the postmaster in non-binary upgrade mode perform all conflict
> > > checking and rewrite and stop the postmaster.  Then start postmaster
> > > again and perform the restore as we are doing now.  Although we will
> > > have to start/stop the postmaster one extra time we have a solution.
> >
> > Yeah, that seems OK. Or we could add a new function, like
> > binary_upgrade_allow_relation_oid_and_relfilenode_assignment(bool).
> > Not sure which way is better.
>
> I have found one more issue with this approach of rewriting the
> conflicting table.  Earlier I thought we could do the conflict
> checking and rewriting inside create_new_objects() right before the
> restore command.  But after implementing (while testing) this I
> realized that we DROP and CREATE the database while restoring the dump
> that means it will again generate the conflicting system tables.  So
> theoretically the rewriting should go in between the CREATE DATABASE
> and restoring the object but as of now both create database and
> restoring other objects are part of a single dump file.  I haven't yet
> analyzed how feasible it is to generate the dump in two parts, first
> part just to create the database and in second part restore the rest
> of the object.
>

Isn't this happening because we are passing the "--clean
--create"/"--create" options to pg_restore in create_new_objects()? If
so, then I think one idea to decouple this would be to not use those
options. Perform the drop/create separately via commands (for create,
we need to generate the command the same way we generate it for the
dump in custom format), then rewrite the conflicting tables, and
finally restore the dump.
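
To make that sequence concrete, a rough sketch of the per-database
flow (rewrite_conflicting_catalogs() and the variables here are
placeholders, not working code):

PGconn	   *conn = connectToServer(&new_cluster, "template1");

/* 1. drop and re-create the database ourselves, instead of letting
 *    pg_restore do it via --clean/--create */
executeQueryOrDie(conn, "DROP DATABASE %s", quoted_dbname);
executeQueryOrDie(conn, "CREATE DATABASE %s %s",
                  quoted_dbname, db_create_options);

/* 2. rewrite any new-cluster system table whose relfilenumber conflicts
 *    with a relfilenumber coming from the old cluster */
rewrite_conflicting_catalogs(&new_cluster, quoted_dbname);

/* 3. finally run pg_restore on the dump, now invoked without
 *    --clean/--create */

PQfinish(conn);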

--
With Regards,
Amit Kapila.



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Sat, Sep 3, 2022 at 5:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> > I have found one more issue with this approach of rewriting the
> > conflicting table.  Earlier I thought we could do the conflict
> > checking and rewriting inside create_new_objects() right before the
> > restore command.  But after implementing (while testing) this I
> > realized that we DROP and CREATE the database while restoring the dump
> > that means it will again generate the conflicting system tables.  So
> > theoretically the rewriting should go in between the CREATE DATABASE
> > and restoring the object but as of now both create database and
> > restoring other objects are part of a single dump file.  I haven't yet
> > analyzed how feasible it is to generate the dump in two parts, first
> > part just to create the database and in second part restore the rest
> > of the object.
> >
>
> Isn't this happening because we are passing "--clean
> --create"/"--create" options to pg_restore in create_new_objects()? If
> so, then I think one idea to decouple would be to not use those
> options. Perform drop/create separately via commands (for create, we
> need to generate the command as we are generating while generating the
> dump in custom format), then rewrite the conflicting tables, and
> finally restore the dump.

Hmm, you are right.  So I think something like this is possible to do;
I will explore this more. Thanks for the idea.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Sun, Sep 4, 2022 at 9:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Sep 3, 2022 at 5:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> > Isn't this happening because we are passing "--clean
> > --create"/"--create" options to pg_restore in create_new_objects()? If
> > so, then I think one idea to decouple would be to not use those
> > options. Perform drop/create separately via commands (for create, we
> > need to generate the command as we are generating while generating the
> > dump in custom format), then rewrite the conflicting tables, and
> > finally restore the dump.
>
> Hmm, you are right.  So I think something like this is possible to do,
> I will explore this more. Thanks for the idea.

I have explored this area more and also tried to come up with a
working prototype.  While working on this I realized that we would have
to execute almost all of the code that gets generated as part of
dumpDatabase() and dumpACL(), which is basically:

1. UPDATE pg_catalog.pg_database SET datistemplate = false
2. DROP DATABASE
3. CREATE DATABASE with all the database properties like ENCODING,
LOCALE_PROVIDER, LOCALE, LC_COLLATE, LC_CTYPE, ICU_LOCALE,
COLLATION_VERSION, TABLESPACE
4. COMMENT ON DATABASE
5. Logic inside dumpACL()

I feel duplicating logic like this is really error-prone, but I do not
see any clear way to reuse the code, as dumpDatabase() depends heavily
on the Archive handle and on generating the dump file.

So currently I have implemented most of this logic except for a few
pieces, e.g. dumpACL(), comments on the database, etc.  Before we go
too far in this direction I wanted to get the opinions of others.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Tue, Sep 6, 2022 at 4:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have explored this area more and also tried to come up with a
> working prototype, so while working on this I realized that we would
> have almost to execute all the code which is getting generated as part
> of the dumpDatabase() and dumpACL() which is basically,
>
> 1. UPDATE pg_catalog.pg_database SET datistemplate = false
> 2. DROP DATABASE
> 3. CREATE DATABASE with all the database properties like ENCODING,
> LOCALE_PROVIDER, LOCALE, LC_COLLATE, LC_CTYPE, ICU_LOCALE,
> COLLATION_VERSION, TABLESPACE
> 4. COMMENT ON DATABASE
> 5. Logic inside dumpACL()
>
> I feel duplicating logic like this is really error-prone, but I do not
> find any clear way to reuse the code as dumpDatabase() has a high
> dependency on the Archive handle and generating the dump file.

Yeah, I don't think this is the way to go at all. The duplicated logic
is likely to get broken, and is also likely to annoy the next person
who has to maintain it.

I suggest that for now we fall back on making the initial
RelFileNumber for a system table equal to pg_class.oid. I don't really
love that system and I think maybe we should change it at some point
in the future, but all the alternatives seem too complicated to cram
them into the current patch.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Tue, Sep 6, 2022 at 11:07 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Sep 6, 2022 at 4:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have explored this area more and also tried to come up with a
> > working prototype, so while working on this I realized that we would
> > have almost to execute all the code which is getting generated as part
> > of the dumpDatabase() and dumpACL() which is basically,
> >
> > 1. UPDATE pg_catalog.pg_database SET datistemplate = false
> > 2. DROP DATABASE
> > 3. CREATE DATABASE with all the database properties like ENCODING,
> > LOCALE_PROVIDER, LOCALE, LC_COLLATE, LC_CTYPE, ICU_LOCALE,
> > COLLATION_VERSION, TABLESPACE
> > 4. COMMENT ON DATABASE
> > 5. Logic inside dumpACL()
> >
> > I feel duplicating logic like this is really error-prone, but I do not
> > find any clear way to reuse the code as dumpDatabase() has a high
> > dependency on the Archive handle and generating the dump file.
>
> Yeah, I don't think this is the way to go at all. The duplicated logic
> is likely to get broken, and is also likely to annoy the next person
> who has to maintain it.

Right.

> I suggest that for now we fall back on making the initial
> RelFileNumber for a system table equal to pg_class.oid. I don't really
> love that system and I think maybe we should change it at some point
> in the future, but all the alternatives seem too complicated to cram
> them into the current patch.

That makes sense.

On a separate note, while reviewing the latest patch I see some risk of
using an unflushed relfilenumber in the GetNewRelFileNumber() function.
Basically, in the current code the flushing logic is tightly coupled
with the logic that logs new relfilenumbers, and that might not work
with all values of VAR_RELNUMBER_NEW_XLOG_THRESHOLD.  So the idea is
that we need to keep the flushing logic separate from the logging.  I
am working on this and will post the patch soon.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Thu, Sep 8, 2022 at 4:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> On a separate note, while reviewing the latest patch I see there is
> some risk of using the unflushed relfilenumber in GetNewRelFileNumber()
> function.  Basically, in the current code, the flushing logic is
> tightly coupled with the logging new relfilenumber logic and that might
> not work with all the values of the VAR_RELNUMBER_NEW_XLOG_THRESHOLD.
> So the idea is we need to keep the flushing logic separate from the
> logging, I am working on the idea and I will post the patch soon.

I have fixed the issue, so now we track nextRelFileNumber,
loggedRelFileNumber and flushedRelFileNumber.  Whenever
nextRelFileNumber gets within VAR_RELNUMBER_NEW_XLOG_THRESHOLD of
loggedRelFileNumber, we log VAR_RELNUMBER_PER_XLOG more
relfilenumbers.  And whenever nextRelFileNumber reaches
flushedRelFileNumber, we do an XLogFlush of the WAL up to the last
loggedRelFileNumber record.  Ideally flushedRelFileNumber should always
be VAR_RELNUMBER_PER_XLOG numbers behind loggedRelFileNumber, so we
could avoid tracking flushedRelFileNumber, but I feel keeping track of
it explicitly looks cleaner and is easier to understand.  For more
details refer to the code in GetNewRelFileNumber().
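
In outline the logic looks something like this (a simplified sketch of
the idea rather than the exact patch code; the handling of the
remembered LSN is reduced to a comment):

RelFileNumber	relnumber;

LWLockAcquire(RelFileNumberGenLock, LW_EXCLUSIVE);

relnumber = ShmemVariableCache->nextRelFileNumber;

/* getting close to the last logged value? log another batch in advance */
if (ShmemVariableCache->loggedRelFileNumber - relnumber <=
    VAR_RELNUMBER_NEW_XLOG_THRESHOLD)
{
    ShmemVariableCache->loggedRelFileNumber += VAR_RELNUMBER_PER_XLOG;
    /* XLogInsert() the new loggedRelFileNumber here and remember its LSN */
}

/* never hand out a value whose WAL record might not be flushed yet */
if (relnumber >= ShmemVariableCache->flushedRelFileNumber)
{
    XLogFlush(logged_relnumber_lsn);    /* LSN remembered above */
    ShmemVariableCache->flushedRelFileNumber =
        ShmemVariableCache->loggedRelFileNumber;
}

ShmemVariableCache->nextRelFileNumber = relnumber + 1;

LWLockRelease(RelFileNumberGenLock);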

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От
Amul Sul
Дата:
On Fri, Sep 9, 2022 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Sep 8, 2022 at 4:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > On a separate note, while reviewing the latest patch I see there is
> > some risk of using the unflushed relfilenumber in
> > GetNewRelFileNumber() function.  Basically, in the current code, the
> > flushing logic is tightly coupled with the logging new relfilenumber
> > logic and that might not work with all the values of the
> > VAR_RELNUMBER_NEW_XLOG_THRESHOLD.  So the idea is we need to keep the
> > flushing logic separate from the logging, I am working on the idea
> > and I will post the patch soon.
>
> I have fixed the issue, so now we will track nextRelFileNumber,
> loggedRelFileNumber and flushedRelFileNumber.  So whenever
> nextRelFileNumber is just VAR_RELNUMBER_NEW_XLOG_THRESHOLD behind the
> loggedRelFileNumber we will log VAR_RELNUMBER_PER_XLOG more
> relfilenumbers.  And whenever nextRelFileNumber reaches the
> flushedRelFileNumber then we will do XlogFlush for WAL upto the last
> loggedRelFileNumber.  Ideally flushedRelFileNumber should always be
> VAR_RELNUMBER_PER_XLOG number behind the loggedRelFileNumber so we can
> avoid tracking the flushedRelFileNumber.  But I feel keeping track of
> the flushedRelFileNumber looks cleaner and easier to understand.  For
> more details refer to the code in GetNewRelFileNumber().
>

Here are a few minor suggestions I came across while reading this
patch, might be useful:

+#ifdef USE_ASSERT_CHECKING
+
+   {

Unnecessary space after USE_ASSERT_CHECKING.
--

+               return InvalidRelFileNumber;    /* placate compiler */

I don't think we need this after the error, on the latest branches.
--

+   LWLockAcquire(RelFileNumberGenLock, LW_SHARED);
+   if (shutdown)
+       checkPoint.nextRelFileNumber = ShmemVariableCache->nextRelFileNumber;
+   else
+       checkPoint.nextRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
+
+   LWLockRelease(RelFileNumberGenLock);

This is done for a good reason, I think, but it should have a comment
describing why checkPoint.nextRelFileNumber needs a different
assignment in each case, from the crash-recovery perspective.
--

+#define SizeOfRelFileLocatorBackend \
+   (offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId))

Could we append empty parentheses "()" to the macro name so that it
looks like a function call where it is used, or else change the macro
name to uppercase?
--

 +   if (val < 0 || val > MAX_RELFILENUMBER)
..
 if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \

How about adding a macro for this condition as RelFileNumberIsValid()?
We can replace all the checks referring to MAX_RELFILENUMBER with this.
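
For instance, something along these lines (just a sketch of what I
have in mind):

#define RelFileNumberIsValid(relfilenumber) \
    ((relfilenumber) >= 0 && (relfilenumber) <= MAX_RELFILENUMBER)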

Regards,
Amul



Re: making relfilenodes 56 bits

От
Robert Haas
Дата:
On Fri, Sep 9, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> [ new patch ]

+typedef pg_int64 RelFileNumber;

This seems really random to me. First, why isn't this an unsigned
type? OID is unsigned and I don't see a reason to change to a signed
type. But even if we were going to change to a signed type, why
pg_int64? That is declared like this:

/* Define a signed 64-bit integer type for use in client API declarations. */
typedef PG_INT64_TYPE pg_int64;

Surely this is not a client API declaration....

Note that if we change this a lot of references to INT64_FORMAT will
need to become UINT64_FORMAT.

I think we should use int64 at the SQL level, because we don't have an
unsigned 64-bit SQL type, and a signed 64-bit type can hold 56 bits.
So it would still be Int64GetDatum((int64) rd_rel->relfilenode) or
similar. But internally I think using unsigned is cleaner.

+ * RelFileNumber is unique within a cluster.

Not really, because of CREATE DATABASE. Probably just drop this line.
Or else expand it: we never assign the same RelFileNumber twice within
the lifetime of the same cluster, but there can be multiple relations
with the same RelFileNumber e.g. because CREATE DATABASE duplicates
the RelFileNumber values from the template database. But maybe we
don't need this here, as it's already explained in relfilelocator.h.

+    ret = (int8) (tag->relForkDetails[0] >> BUFTAG_RELNUM_HIGH_BITS);

Why not declare ret as ForkNumber instead of casting twice?

+    uint64      relnum;
+
+    Assert(relnumber <= MAX_RELFILENUMBER);
+    Assert(forknum <= MAX_FORKNUM);
+
+    relnum = relnumber;

Perhaps it'd be better to write uint64 relnum = relnumber instead of
initializing on a separate line.

+#define RELNUMBERCHARS  20      /* max chars printed by %llu */

Maybe instead of %llu we should say UINT64_FORMAT (or INT64_FORMAT if
there's some reason to stick with a signed type).

+        elog(ERROR, "relfilenumber is out of bound");

It would have to be "out of bounds", with an "s". But maybe "is too
large" would be better.

+    nextRelFileNumber = ShmemVariableCache->nextRelFileNumber;
+    loggedRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
+    flushedRelFileNumber = ShmemVariableCache->flushedRelFileNumber;

Maybe it would be a good idea to assert that next <= flushed and
flushed <= logged?

+#ifdef USE_ASSERT_CHECKING
+
+    {
+        RelFileLocatorBackend rlocator;
+        char       *rpath;

Let's add a comment here, like "Because the RelFileNumber counter only
ever increases and never wraps around, it should be impossible for the
newly-allocated RelFileNumber to already be in use. But, if Asserts
are enabled, double check that there's no main-fork relation file with
the new RelFileNumber already on disk."

+        elog(ERROR, "cannot forward RelFileNumber during recovery");

forward -> set (or advance)

+    if (relnumber >= ShmemVariableCache->loggedRelFileNumber)

It probably doesn't make any difference, but to me it seems better to
test flushedRelFileNumber rather than logRelFileNumber here. What do
you think?

     /*
      * We set up the lockRelId in case anything tries to lock the dummy
-     * relation.  Note that this is fairly bogus since relNumber may be
-     * different from the relation's OID.  It shouldn't really matter though.
-     * In recovery, we are running by ourselves and can't have any lock
-     * conflicts.  While syncing, we already hold AccessExclusiveLock.
+     * relation.  Note we are setting relId to just FirstNormalObjectId which
+     * is completely bogus.  It shouldn't really matter though. In recovery,
+     * we are running by ourselves and can't have any lock conflicts.  While
+     * syncing, we already hold AccessExclusiveLock.
      */
     rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid;
-    rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber;
+    rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId;

Boy, this makes me uncomfortable. The existing logic is pretty bogus,
and we're replacing it with some other bogus thing. Do we know whether
anything actually does try to use this for locking?

One notable difference between the existing logic and your change is
that, with the existing logic, we use a bogus value that will differ
from one relation to the next, whereas with this change, it will
always be the same value. Perhaps el->rd_lockInfo.lockRelId.relId =
(Oid) rlocator.relNumber would be a more natural adaptation?

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber)                \
+do {                                                                \
+    if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+        ereport(ERROR,                                              \
+                errcode(ERRCODE_INVALID_PARAMETER_VALUE),           \
+                errmsg("relfilenumber %lld is out of range",    \
+                        (long long) (relfilenumber))); \
+} while (0)

Here, you take the approach of casting the relfilenumber to long long
and then using %lld. But elsewhere, you use
INT64_FORMAT/UINT64_FORMAT. If we're going to use this technique, we
ought to use it everywhere.

 typedef struct
 {
-    Oid         reltablespace;
-    RelFileNumber relfilenumber;
-} RelfilenumberMapKey;
-
-typedef struct
-{
-    RelfilenumberMapKey key;    /* lookup key - must be first */
+    RelFileNumber relfilenumber;    /* lookup key - must be first */
     Oid         relid;          /* pg_class.oid */
 } RelfilenumberMapEntry;

This feels like a bold change. Are you sure it's safe? i.e. Are you
certain that there's no way that a relfilenumber could repeat within a
database? If we're going to bank on that, we could adapt this more
heavily, e.g. RelidByRelfilenumber() could lose the reltablespace
parameter. I think maybe we should push this change into an 0002 patch
(or later) and have 0001 just do a minimal adaptation for the changed
data type.

 Datum
 pg_control_checkpoint(PG_FUNCTION_ARGS)
 {
-    Datum       values[18];
-    bool        nulls[18];
+    Datum       values[19];
+    bool        nulls[19];

A documentation update is needed.

-Note that while a table's filenode often matches its OID, this is
-<emphasis>not</emphasis> necessarily the case; some operations, like
+Note that table's filenode are completely different than its OID. Although for
+system catalogs initial filenode matches with its OID, but some
operations, like
 <command>TRUNCATE</command>, <command>REINDEX</command>,
<command>CLUSTER</command> and some forms
 of <command>ALTER TABLE</command>, can change the filenode while
preserving the OID.
-Avoid assuming that filenode and table OID are the same.

Suggest: Note that a table's filenode will normally be different than
the OID. For system tables, the initial filenode will be equal to the
table OID, but it will be different if the table has ever been
subjected to a rewriting operation, such as TRUNCATE, REINDEX,
CLUSTER, or some forms of ALTER TABLE. For user tables, even the
initial filenode will be different than the table OID.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

От
Dilip Kumar
Дата:
On Tue, Sep 20, 2022 at 10:44 PM Robert Haas <robertmhaas@gmail.com> wrote:

Thanks for the review.  Please see my responses inline for some of the
comments; the rest are all accepted.

> On Fri, Sep 9, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > [ new patch ]
>
> +typedef pg_int64 RelFileNumber;
>
> This seems really random to me. First, why isn't this an unsigned
> type? OID is unsigned and I don't see a reason to change to a signed
> type. But even if we were going to change to a signed type, why
> pg_int64? That is declared like this:
>
> /* Define a signed 64-bit integer type for use in client API declarations. */
> typedef PG_INT64_TYPE pg_int64;
>
> Surely this is not a client API declaration....
>
> Note that if we change this a lot of references to INT64_FORMAT will
> need to become UINT64_FORMAT.
>
> I think we should use int64 at the SQL level, because we don't have an
> unsigned 64-bit SQL type, and a signed 64-bit type can hold 56 bits.
> So it would still be Int64GetDatum((int64) rd_rel->relfilenode) or
> similar. But internally I think using unsigned is cleaner.

Yeah you are right we can make it uint64.  With respect to this, we
can not directly use uint64 because that is declared in c.h and that
can not be used in
postgres_ext.h IIUC.  So what are the other options?  Maybe we can
typedef RelFileNumber similar to what c.h does for uint64, i.e.

#ifdef HAVE_LONG_INT_64
typedef unsigned long int uint64;
#elif defined(HAVE_LONG_LONG_INT_64)
typedef unsigned long long int uint64;
#endif

And maybe the same for UINT64CONST?

I am not liking duplicating this logic but is there any better
alternative for doing this?  Can we move the existing definitions from
c.h file to some common file (common for client and server)?
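
For example, something like this, just to make the idea concrete
(purely illustrative; the header placement and the exact names are open
questions):

#ifdef HAVE_LONG_INT_64
typedef unsigned long int RelFileNumber;
#elif defined(HAVE_LONG_LONG_INT_64)
typedef unsigned long long int RelFileNumber;
#endif

#define InvalidRelFileNumber ((RelFileNumber) 0)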

>
> +    if (relnumber >= ShmemVariableCache->loggedRelFileNumber)
>
> It probably doesn't make any difference, but to me it seems better to
> test flushedRelFileNumber rather than logRelFileNumber here. What do
> you think?

Actually, based on this condition we decide whether to log more, so it
makes more sense to check against loggedRelFileNumber.  But OTOH,
technically we are not supposed to use a relfilenumber without flushing
the log, so it also makes sense to test flushedRelFileNumber.  But since
both are the same here, I am fine with flushedRelFileNumber.

>      /*
>       * We set up the lockRelId in case anything tries to lock the dummy
> -     * relation.  Note that this is fairly bogus since relNumber may be
> -     * different from the relation's OID.  It shouldn't really matter though.
> -     * In recovery, we are running by ourselves and can't have any lock
> -     * conflicts.  While syncing, we already hold AccessExclusiveLock.
> +     * relation.  Note we are setting relId to just FirstNormalObjectId which
> +     * is completely bogus.  It shouldn't really matter though. In recovery,
> +     * we are running by ourselves and can't have any lock conflicts.  While
> +     * syncing, we already hold AccessExclusiveLock.
>       */
>      rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid;
> -    rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber;
> +    rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId;
>
> Boy, this makes me uncomfortable. The existing logic is pretty bogus,
> and we're replacing it with some other bogus thing. Do we know whether
> anything actually does try to use this for locking?
>
> One notable difference between the existing logic and your change is
> that, with the existing logic, we use a bogus value that will differ
> from one relation to the next, whereas with this change, it will
> always be the same value. Perhaps el->rd_lockInfo.lockRelId.relId =
> (Oid) rlocator.relNumber would be a more natural adaptation?
>
> +#define CHECK_RELFILENUMBER_RANGE(relfilenumber)                \
> +do {                                                                \
> +    if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
> +        ereport(ERROR,                                              \
> +                errcode(ERRCODE_INVALID_PARAMETER_VALUE),           \
> +                errmsg("relfilenumber %lld is out of range",    \
> +                        (long long) (relfilenumber))); \
> +} while (0)
>
> Here, you take the approach of casting the relfilenumber to long long
> and then using %lld. But elsewhere, you use
> INT64_FORMAT/UINT64_FORMAT. If we're going to use this technique, we
> ought to use it everywhere.

Based on the discussion [1], it seems we can not use
INT64_FORMAT/UINT64_FORMAT while using ereport.  But in all other
places I am using INT64_FORMAT/UINT64_FORMAT.  Does this make sense?

[1] https://www.postgresql.org/message-id/20220730113922.qd7qmenwcmzyacje%40alvherre.pgsql
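
To make that convention concrete, here is a minimal sketch of the two
styles (not taken from the patch; "relfilenumber" stands for any 64-bit
RelFileNumber value).  The translatable errmsg() casts to long long and
uses %lld, while the untranslated elog() can use the format macro:

    ereport(ERROR,
            errcode(ERRCODE_INVALID_PARAMETER_VALUE),
            errmsg("relfilenumber %lld is out of range",
                   (long long) relfilenumber));

    elog(DEBUG1, "new relfilenumber " UINT64_FORMAT, relfilenumber);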

>  typedef struct
>  {
> -    Oid         reltablespace;
> -    RelFileNumber relfilenumber;
> -} RelfilenumberMapKey;
> -
> -typedef struct
> -{
> -    RelfilenumberMapKey key;    /* lookup key - must be first */
> +    RelFileNumber relfilenumber;    /* lookup key - must be first */
>      Oid         relid;          /* pg_class.oid */
>  } RelfilenumberMapEntry;
>
> This feels like a bold change. Are you sure it's safe? i.e. Are you
> certain that there's no way that a relfilenumber could repeat within a
> database?

IIUC, as of now, CREATE DATABASE is the only operation which can create
a duplicate relfilenumber, but that would be in a different database.
So based on that theory I think it should be safe.

> If we're going to bank on that, we could adapt this more
> heavily, e.g. RelidByRelfilenumber() could lose the reltablespace
> parameter.

Yeah we might, although we need a bool to identify whether it is
shared relation or not.

> I think maybe we should push this change into an 0002 patch
> (or later) and have 0001 just do a minimal adaptation for the changed
> data type.

Yeah, that makes sense.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From:
Dilip Kumar
Date:
On Wed, Sep 21, 2022 at 3:39 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> Yeah you are right we can make it uint64.  With respect to this, we
> can not directly use uint64 because that is declared in c.h and that
> can not be used in
> postgres_ext.h IIUC.  So what are the other options?  Maybe we can
> typedef RelFileNumber similar to what c.h does for uint64, i.e.
>
> #ifdef HAVE_LONG_INT_64
> typedef unsigned long int uint64;
> #elif defined(HAVE_LONG_LONG_INT_64)
> typedef unsigned long long int uint64;
> #endif
>
> I am not liking duplicating this logic but is there any better
> alternative for doing this?  Can we move the existing definitions from
> c.h file to some common file (common for client and server)?

Here is the updated patch, which fixes all the agreed comments except
this one, which needs more thought; for now I have used unsigned long
int.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: making relfilenodes 56 bits

From:
Dilip Kumar
Date:
On Tue, Sep 20, 2022 at 7:46 PM Amul Sul <sulamul@gmail.com> wrote:
>

Thanks for the review

> Here are a few minor suggestions I came across while reading this
> patch; they might be useful:
>
> +#ifdef USE_ASSERT_CHECKING
> +
> +   {
>
> Unnecessary space after USE_ASSERT_CHECKING.

Changed

>
> +               return InvalidRelFileNumber;    /* placate compiler */
>
> I don't think we needed this after the error on the latest branches.
> --

Changed

> +   LWLockAcquire(RelFileNumberGenLock, LW_SHARED);
> +   if (shutdown)
> +       checkPoint.nextRelFileNumber = ShmemVariableCache->nextRelFileNumber;
> +   else
> +       checkPoint.nextRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
> +
> +   LWLockRelease(RelFileNumberGenLock);
>
> This is done for a good reason, I think, but it should have a comment
> describing why checkPoint.nextRelFileNumber is assigned differently in
> the two cases, and the crash recovery implications.
> --

Done

> +#define SizeOfRelFileLocatorBackend \
> +   (offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId))
>
> Can we append empty parentheses "()" to the macro name, so it looks
> like a function call where it is used, or else change the macro name
> to uppercase?
> --

Yeah, we could, but SizeOfXXX macros are a general practice I see used
everywhere in the Postgres code, so I left it as it is.

>  +   if (val < 0 || val > MAX_RELFILENUMBER)
> ..
>  if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
>
> How about adding a macro for this condition as RelFileNumberIsValid()?
> We can replace all the checks referring to MAX_RELFILENUMBER with this.

Actually, RelFileNumberIsValid is used just to check whether it is the
InvalidRelFileNumber value, i.e. 0.  Maybe for this we could introduce
RelFileNumberInValidRange(), but I am not sure whether it would be
cleaner than what we have now, so I left it as it is for now.
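
For reference, a sketch of what that suggested helper might look like
(not adopted here; it just mirrors the range check in
CHECK_RELFILENUMBER_RANGE, assuming a signed representation, and is
distinct from RelFileNumberIsValid(), which only tests for
InvalidRelFileNumber):

#define RelFileNumberInValidRange(relfilenumber) \
    ((relfilenumber) >= 0 && (relfilenumber) <= MAX_RELFILENUMBER)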


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From:
Robert Haas
Date:
On Wed, Sep 21, 2022 at 6:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Yeah you are right we can make it uint64.  With respect to this, we
> can not directly use uint64 because that is declared in c.h and that
> can not be used in
> postgres_ext.h IIUC.

Ugh.

> Can we move the existing definitions from
> c.h file to some common file (common for client and server)?

Yeah, I think that would be a good idea. Here's a quick patch that
moves them to common/relpath.h, which seems like a possibly-reasonable
choice, though perhaps you or someone else will have a better idea.

> Based on the discussion [1], it seems we can not use
> INT64_FORMAT/UINT64_FORMAT while using ereport.  But in all other
> places I am using INT64_FORMAT/UINT64_FORMAT.  Does this make sense?
>
> [1] https://www.postgresql.org/message-id/20220730113922.qd7qmenwcmzyacje%40alvherre.pgsql

Oh, hmm. So you're saying if the string is not translated then use
(U)INT64_FORMAT but if it is translated then cast? I guess that makes
sense. It feels a bit strange to have the style dependent on the
context like that, but maybe it's fine. I'll reread with that idea in
mind.

> > If we're going to bank on that, we could adapt this more
> > heavily, e.g. RelidByRelfilenumber() could lose the reltablespace
> > parameter.
>
> Yeah we might, although we need a bool to identify whether it is
> shared relation or not.

Why?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachments

Re: making relfilenodes 56 bits

From:
Dilip Kumar
Date:
On Mon, Sep 26, 2022 at 9:56 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
>
> > Can we move the existing definitions from
> > c.h file to some common file (common for client and server)?
>
> Yeah, I think that would be a good idea. Here's a quick patch that
> moves them to common/relpath.h, which seems like a possibly-reasonable
> choice, though perhaps you or someone else will have a better idea.

Looks fine to me.

> > Based on the discussion [1], it seems we can not use
> > INT64_FORMAT/UINT64_FORMAT while using ereport.  But in all other
> > places I am using INT64_FORMAT/UINT64_FORMAT.  Does this make sense?
> >
> > [1] https://www.postgresql.org/message-id/20220730113922.qd7qmenwcmzyacje%40alvherre.pgsql
>
> Oh, hmm. So you're saying if the string is not translated then use
> (U)INT64_FORMAT but if it is translated then cast?

Right

> I guess that makes
> sense. It feels a bit strange to have the style dependent on the
> context like that, but maybe it's fine. I'll reread with that idea in
> mind.

Ok

> > > If we're going to bank on that, we could adapt this more
> > > heavily, e.g. RelidByRelfilenumber() could lose the reltablespace
> > > parameter.
> >
> > Yeah we might, although we need a bool to identify whether it is
> > shared relation or not.
>
> Why?

Because if the entry is not in the cache then we need to look into the
relmapper, and for that we need to know whether it is a shared relation
or not.  And I don't think we can identify that just by looking at the
relfilenumber.
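
Roughly like this, with hypothetical helper names just to illustrate
the shape of the lookup (not actual patch code):

static Oid
relid_by_relfilenumber_sketch(RelFileNumber relfilenumber, bool is_shared)
{
    Oid         relid;

    /* Most relations can be resolved through pg_class. */
    relid = lookup_relid_in_pg_class(relfilenumber);    /* hypothetical */

    /* Mapped relations: consult the shared or per-database relmapper. */
    if (!OidIsValid(relid))
        relid = lookup_relid_in_relmapper(relfilenumber, is_shared);   /* hypothetical */

    return relid;
}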


Another open comment which I missed in the last reply:

>      /*
>       * We set up the lockRelId in case anything tries to lock the dummy
> -     * relation.  Note that this is fairly bogus since relNumber may be
> -     * different from the relation's OID.  It shouldn't really matter though.
> -     * In recovery, we are running by ourselves and can't have any lock
> -     * conflicts.  While syncing, we already hold AccessExclusiveLock.
> +     * relation.  Note we are setting relId to just FirstNormalObjectId which
> +     * is completely bogus.  It shouldn't really matter though. In recovery,
> +     * we are running by ourselves and can't have any lock conflicts.  While
> +     * syncing, we already hold AccessExclusiveLock.
>       */
>      rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid;
> -    rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber;
> +    rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId;
>
> Boy, this makes me uncomfortable. The existing logic is pretty bogus,
> and we're replacing it with some other bogus thing. Do we know whether
> anything actually does try to use this for locking?

Looking at the code, it seems it is not used for locking.  I also
tested this by setting a special value for relid in
CreateFakeRelcacheEntry() and validating that the id is never used for
locking in SET_LOCKTAG_RELATION.  I ran check-world and could not see
us ever trying to create a lock tag using a fake relcache entry.
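
For context, if anything ever did try to lock the fake entry, the lock
tag would be built from (dbId, relId) roughly like this, which is the
only place the bogus relId could matter (existing macro, shown purely
for illustration):

    LOCKTAG     tag;

    SET_LOCKTAG_RELATION(tag,
                         rel->rd_lockInfo.lockRelId.dbId,
                         rel->rd_lockInfo.lockRelId.relId);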

> One notable difference between the existing logic and your change is
> that, with the existing logic, we use a bogus value that will differ
> from one relation to the next, whereas with this change, it will
> always be the same value. Perhaps el->rd_lockInfo.lockRelId.relId =
> (Oid) rlocator.relNumber would be a more natural adaptation?

I agree, so changed it this way.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: making relfilenodes 56 bits

From:
Robert Haas
Date:
On Tue, Sep 27, 2022 at 2:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Looks fine to me.

OK, committed. I also committed the 0002 patch with some wordsmithing,
and I removed a < 0 test on an unsigned value because my compiler
complained about it. 0001 turned out to make headerscheck sad, so I
just pushed a fix for that, too.

I'm not too sure about 0003. I think if we need an is_shared flag
maybe we might as well just pass the tablespace OID. The is_shared
flag seems to just make things a bit complicated for the callers for
no real benefit.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From:
Thomas Munro
Date:
Hi Dilip,

I am very happy to see these commits.  Here's some belated review for
the tombstone-removal patch.

> v7-0004-Don-t-delay-removing-Tombstone-file-until-next.patch

More things you can remove:

 * sync_unlinkfiletag in struct SyncOps
 * the macro UNLINKS_PER_ABSORB
 * global variable pendingUnlinks

This comment after the question mark is obsolete:

                * XXX should we CHECK_FOR_INTERRUPTS in this loop?  Escaping with an
                * error in the case of SYNC_UNLINK_REQUEST would leave the
                * no-longer-used file still present on disk, which would be bad, so
                * I'm inclined to assume that the checkpointer will always empty the
                * queue soon.

(I think if the answer to the question is now yes, then we should
replace the stupid sleep with a condition variable sleep, but there's
another thread about that somewhere).

In a couple of places in dbcommands.c you could now make this change:

        /*
-        * Force a checkpoint before starting the copy. This will
force all dirty
-        * buffers, including those of unlogged tables, out to disk, to ensure
-        * source database is up-to-date on disk for the copy.
-        * FlushDatabaseBuffers() would suffice for that, but we also want to
-        * process any pending unlink requests. Otherwise, if a checkpoint
-        * happened while we're copying files, a file might be deleted just when
-        * we're about to copy it, causing the lstat() call in copydir() to fail
-        * with ENOENT.
+        * Force all dirty buffers, including those of unlogged tables, out to
+        * disk, to ensure source database is up-to-date on disk for the copy.
         */
-       RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
-                                         CHECKPOINT_WAIT |
CHECKPOINT_FLUSH_ALL);
+       FlushDatabaseBuffers(src_dboid);

More obsolete comments you could change:

         * If we were copying database at block levels then drop pages for the
         * destination database that are in the shared buffer cache.  And tell
-->      * checkpointer to forget any pending fsync and unlink
requests for files

-->     * Tell checkpointer to forget any pending fsync and unlink requests for
        * files in the database; else the fsyncs will fail at next
checkpoint, or
        * worse, it will delete file

In tablespace.c I think you could now make this change:

        if (!destroy_tablespace_directories(tablespaceoid, false))
        {
-               /*
-                * Not all files deleted?  However, there can be
lingering empty files
-                * in the directories, left behind by for example DROP
TABLE, that
-                * have been scheduled for deletion at next checkpoint
(see comments
-                * in mdunlink() for details).  We could just delete
them immediately,
-                * but we can't tell them apart from important data
files that we
-                * mustn't delete.  So instead, we force a checkpoint
which will clean
-                * out any lingering files, and try again.
-                */
-               RequestCheckpoint(CHECKPOINT_IMMEDIATE |
CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
+#ifdef WIN32
                /*
                 * On Windows, an unlinked file persists in the
directory listing
                 * until no process retains an open handle for the
file.  The DDL
@@ -523,6 +513,7 @@ DropTableSpace(DropTableSpaceStmt *stmt)

                /* And now try again. */
                if (!destroy_tablespace_directories(tablespaceoid, false))
+#endif
                {
                        /* Still not empty, the files must be important then */
                        ereport(ERROR,



Re: making relfilenodes 56 bits

From:
Dilip Kumar
Date:
On Wed, Sep 28, 2022 at 9:23 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Hi Dilip,
>
> I am very happy to see these commits.  Here's some belated review for
> the tombstone-removal patch.
>
> > v7-0004-Don-t-delay-removing-Tombstone-file-until-next.patch
>
> More things you can remove:
>
>  * sync_unlinkfiletag in struct SyncOps
>  * the macro UNLINKS_PER_ABSORB
>  * global variable pendingUnlinks
>
> This comment after the question mark is obsolete:
>
>                 * XXX should we CHECK_FOR_INTERRUPTS in this loop?  Escaping with an
>                 * error in the case of SYNC_UNLINK_REQUEST would leave the
>                 * no-longer-used file still present on disk, which would be bad, so
>                 * I'm inclined to assume that the checkpointer will always empty the
>                 * queue soon.
>
> (I think if the answer to the question is now yes, then we should
> replace the stupid sleep with a condition variable sleep, but there's
> another thread about that somewhere).
>
> In a couple of places in dbcommands.c you could now make this change:
>
>         /*
> -        * Force a checkpoint before starting the copy. This will
> force all dirty
> -        * buffers, including those of unlogged tables, out to disk, to ensure
> -        * source database is up-to-date on disk for the copy.
> -        * FlushDatabaseBuffers() would suffice for that, but we also want to
> -        * process any pending unlink requests. Otherwise, if a checkpoint
> -        * happened while we're copying files, a file might be deleted just when
> -        * we're about to copy it, causing the lstat() call in copydir() to fail
> -        * with ENOENT.
> +        * Force all dirty buffers, including those of unlogged tables, out to
> +        * disk, to ensure source database is up-to-date on disk for the copy.
>          */
> -       RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
> -                                         CHECKPOINT_WAIT |
> CHECKPOINT_FLUSH_ALL);
> +       FlushDatabaseBuffers(src_dboid);
>
> More obsolete comments you could change:
>
>          * If we were copying database at block levels then drop pages for the
>          * destination database that are in the shared buffer cache.  And tell
> -->      * checkpointer to forget any pending fsync and unlink
> requests for files
>
> -->     * Tell checkpointer to forget any pending fsync and unlink requests for
>         * files in the database; else the fsyncs will fail at next
> checkpoint, or
>         * worse, it will delete file
>
> In tablespace.c I think you could now make this change:
>
>         if (!destroy_tablespace_directories(tablespaceoid, false))
>         {
> -               /*
> -                * Not all files deleted?  However, there can be
> lingering empty files
> -                * in the directories, left behind by for example DROP
> TABLE, that
> -                * have been scheduled for deletion at next checkpoint
> (see comments
> -                * in mdunlink() for details).  We could just delete
> them immediately,
> -                * but we can't tell them apart from important data
> files that we
> -                * mustn't delete.  So instead, we force a checkpoint
> which will clean
> -                * out any lingering files, and try again.
> -                */
> -               RequestCheckpoint(CHECKPOINT_IMMEDIATE |
> CHECKPOINT_FORCE | CHECKPOINT_WAIT);
> -
> +#ifdef WIN32
>                 /*
>                  * On Windows, an unlinked file persists in the
> directory listing
>                  * until no process retains an open handle for the
> file.  The DDL
> @@ -523,6 +513,7 @@ DropTableSpace(DropTableSpaceStmt *stmt)
>
>                 /* And now try again. */
>                 if (!destroy_tablespace_directories(tablespaceoid, false))
> +#endif
>                 {
>                         /* Still not empty, the files must be important then */
>                         ereport(ERROR,

Thanks, Thomas, these all look fine to me.  So far we have committed
the patch to make relfilenode 56 bits wide.  The tombstone file
removal patch is still pending, so when I rebase that patch I will
accommodate all these comments in it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From:
Thomas Munro
Date:
On Wed, Sep 28, 2022 at 9:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Thanks, Thomas, these all look fine to me.  So far we have committed
> the patch to make relfilenode 56 bits wide.  The Tombstone file
> removal patch is still pending to be committed, so when I will rebase
> that patch I will accommodate all these comments in that patch.

I noticed that your new unlinking algorithm goes like this:

stat("x")
stat("x.1")
stat("x.2")
stat("x.3") -> ENOENT /* now we know how many segments there are */
truncate("x.2")
unlink("x.2")
truncate("x.1")
unlink("x.1")
truncate("x")
unlink("x")

Could you say what problem this solves, and, guessing that it's just
that you want the 0 file to be "in the way" until the other files are
gone (at least while we're running; who knows what'll be left if you
power-cycle), could you do it like this instead?

truncate("x")
truncate("x.1")
truncate("x.2")
truncate("x.3") -> ENOENT /* now we know how many segments there are */
unlink("x.2")
unlink("x.1")
unlink("x")
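
A minimal sketch of that ordering, with a hypothetical helper name and
with fork/segment naming details and proper error reporting left out:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

#define SKETCH_MAXPATH 1024     /* stand-in for MAXPGPATH */

static void
build_segment_path(char *buf, size_t len, const char *path, int segno)
{
    if (segno == 0)
        snprintf(buf, len, "%s", path);
    else
        snprintf(buf, len, "%s.%d", path, segno);
}

static void
unlink_all_segments_sketch(const char *path)
{
    char        segpath[SKETCH_MAXPATH];
    int         nsegments = 0;

    /* Truncate forward until ENOENT tells us how many segments exist. */
    for (;;)
    {
        build_segment_path(segpath, sizeof(segpath), path, nsegments);
        if (truncate(segpath, 0) < 0)
        {
            if (errno == ENOENT)
                break;
            return;             /* real code would report the error */
        }
        nsegments++;
    }

    /* Unlink in reverse, so segment 0 stays "in the way" until last. */
    for (int segno = nsegments - 1; segno >= 0; segno--)
    {
        build_segment_path(segpath, sizeof(segpath), path, segno);
        unlink(segpath);
    }
}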



Re: making relfilenodes 56 bits

From:
Maxim Orlov
Date:
Hi!

I'm not in the context of this thread, but I've noticed something strange while attempting to rebase my patch set from the 64XID thread.
As far as I'm aware, this patch set adds "relfilenumber".  So, in pg_control_checkpoint, we have the following changes:

diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 781f8b8758..d441cd97e2 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -79,8 +79,8 @@ pg_control_system(PG_FUNCTION_ARGS)
 Datum
 pg_control_checkpoint(PG_FUNCTION_ARGS)
 {
-       Datum           values[18];
-       bool            nulls[18];
+       Datum           values[19];
+       bool            nulls[19];
        TupleDesc       tupdesc;
        HeapTuple       htup;
        ControlFileData *ControlFile;
@@ -129,6 +129,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
                                           XIDOID, -1, 0);
        TupleDescInitEntry(tupdesc, (AttrNumber) 18, "checkpoint_time",
                                           TIMESTAMPTZOID, -1, 0);
+       TupleDescInitEntry(tupdesc, (AttrNumber) 19, "next_relfilenumber",
+                                          INT8OID, -1, 0);
        tupdesc = BlessTupleDesc(tupdesc);

        /* Read the control file. */

In other words, we have 19 attributes. But tupdesc here is constructed for 18 elements:
tupdesc = CreateTemplateTupleDesc(18);

Is that normal or not? Again, I'm not in this thread and if that is completely ok, I'm sorry about the noise.

--
Best regards,
Maxim Orlov.

Re: making relfilenodes 56 bits

From:
Robert Haas
Date:
On Thu, Sep 29, 2022 at 10:50 AM Maxim Orlov <orlovmg@gmail.com> wrote:
> In other words, we have 19 attributes. But tupdesc here is constructed for 18 elements:
> tupdesc = CreateTemplateTupleDesc(18);
>
> Is that normal or not? Again, I'm not in this thread and if that is completely ok, I'm sorry about the noise.

I think that's a mistake. Thanks for the report.
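
Presumably the fix is simply to size the template for all 19 attributes
as well; a guess based on the hunk quoted above:

    tupdesc = CreateTemplateTupleDesc(19);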

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: making relfilenodes 56 bits

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Sep 29, 2022 at 10:50 AM Maxim Orlov <orlovmg@gmail.com> wrote:
>> In other words, we have 19 attributes. But tupdesc here is constructed for 18 elements:
>> tupdesc = CreateTemplateTupleDesc(18);

> I think that's a mistake. Thanks for the report.

The assertions in TupleDescInitEntry would have caught that,
if only utils/misc/pg_controldata.c had more than zero test coverage.
Seems like somebody ought to do something about that.

            regards, tom lane



Re: making relfilenodes 56 bits

From:
Michael Paquier
Date:
On Thu, Sep 29, 2022 at 02:39:44PM -0400, Tom Lane wrote:
> The assertions in TupleDescInitEntry would have caught that,
> if only utils/misc/pg_controldata.c had more than zero test coverage.
> Seems like somebody ought to do something about that.

While passing by, I have noticed this thread.  We don't really care
about the contents returned by these functions, and one simple trick
to check their execution is SELECT FROM.  Like in the attached, for
example.
--
Michael

Attachments

Re: making relfilenodes 56 bits

From:
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> While passing by, I have noticed this thread.  We don't really care
> about the contents returned by these functions, and one simple trick
> to check their execution is SELECT FROM.  Like in the attached, for
> example.

Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
any actual checks on the sanity of the output?  I realize that the
output's far from static, but still ...

            regards, tom lane



Re: making relfilenodes 56 bits

From:
Michael Paquier
Date:
On Thu, Sep 29, 2022 at 09:23:38PM -0400, Tom Lane wrote:
> Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
> any actual checks on the sanity of the output?  I realize that the
> output's far from static, but still ...

Honestly, checking all the fields is not that exciting, but the
maximum I can think of that would be portable enough is something like
the attached.  No arithmetic operators for xid limits things a bit,
but at least that's something.

Thoughts?
--
Michael

Attachments

Re: making relfilenodes 56 bits

From:
vignesh C
Date:
On Fri, 21 Oct 2022 at 11:31, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Sep 29, 2022 at 09:23:38PM -0400, Tom Lane wrote:
> > Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
> > any actual checks on the sanity of the output?  I realize that the
> > output's far from static, but still ...
>
> Honestly, checking all the fields is not that exciting, but the
> maximum I can think of that would be portable enough is something like
> the attached.  No arithmetic operators for xid limits things a bit,
> but at least that's something.
>
> Thoughts?

The patch does not apply on top of HEAD, as noted in [1]; please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID
33ab0a2a527e3af5beee3a98fc07201e555d6e45 ===
=== applying patch ./controldata-regression-2.patch
patching file src/test/regress/expected/misc_functions.out
Hunk #1 succeeded at 642 with fuzz 2 (offset 48 lines).
patching file src/test/regress/sql/misc_functions.sql
Hunk #1 FAILED at 223.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/sql/misc_functions.sql.rej

[1] - http://cfbot.cputube.org/patch_41_3711.log

Regards,
Vignesh



Re: making relfilenodes 56 bits

From:
Dilip Kumar
Date:
On Wed, Jan 4, 2023 at 5:45 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 21 Oct 2022 at 11:31, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, Sep 29, 2022 at 09:23:38PM -0400, Tom Lane wrote:
> > > Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
> > > any actual checks on the sanity of the output?  I realize that the
> > > output's far from static, but still ...
> >
> > Honestly, checking all the fields is not that exciting, but the
> > maximum I can think of that would be portable enough is something like
> > the attached.  No arithmetic operators for xid limits things a bit,
> > but at least that's something.
> >
> > Thoughts?
>
> The patch does not apply on top of HEAD, as noted in [1]; please post a rebased patch:
>

Because of the extra WAL overhead, we are not continuing with the
patch; I will withdraw it.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com