Обсуждение: Re: Make relfile tombstone files conditional on WAL level

Поиск

Список

Период

Сортировка

Re: Make relfile tombstone files conditional on WAL level

От

Heikki Linnakangas

Дата:

10 июня 2021 г., 13:47:49

On 05/03/2021 00:02, Thomas Munro wrote:
> Hi,
> 
> I'm starting a new thread for this patch that originated as a
> side-discussion in [1], to give it its own CF entry in the next cycle.
> This is a WIP with an open question to research: what could actually
> break if we did this?

I don't see a problem.

It would indeed be nice to have some other mechanism to prevent the 
issue with wal_level=minimal, the tombstone files feel hacky and 
complicated. Maybe a new shared memory hash table to track the 
relfilenodes of dropped tables.

- Heikki

Re: Make relfile tombstone files conditional on WAL level

От

Robert Haas

Дата:

02 августа 2021 г., 23:03:31

On Thu, Jun 10, 2021 at 6:47 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> It would indeed be nice to have some other mechanism to prevent the
> issue with wal_level=minimal, the tombstone files feel hacky and
> complicated. Maybe a new shared memory hash table to track the
> relfilenodes of dropped tables.

Just to summarize the issue here as I understand it, if a relfilenode
is used for two unrelated relations during the same checkpoint cycle
with wal_level=minimal, and if the WAL-skipping optimization is
applied to the second of those but not to the first, then crash
recovery will lose our only copy of the new relation's data, because
we'll replay the removal of the old relfilenode but will not have
logged the new data. Furthermore, we've wondered about writing an
end-of-recovery record in all cases rather than sometimes writing an
end-of-recovery record and sometimes a checkpoint record. That would
allow another version of this same problem, since a single checkpoint
cycle could now span multiple server lifetimes. At present, we dodge
all this by keeping the first segment of the main fork around as a
zero-length file for the rest of the checkpoint cycle, which I think
prevents the problem in both cases. Now, apparently that caused some
problem with the AIO patch set so Thomas is curious about getting rid
of it, and Heikki concurs that it's a hack.

I guess my concern about this patch is that it just seems to be
reducing the number of cases where that hack is used without actually
getting rid of it. Rarely-taken code paths are more likely to have
undiscovered bugs, and that seems particularly likely in this case,
because this is a low-probability scenario to begin with. A lot of
clusters probably never have an OID counter wraparound ever, and even
in those that do, getting an OID collision with just the right timing
followed by a crash before a checkpoint can intervene has got to be
super-unlikely. Even as things are today, if this mechanism has subtle
bugs, it seems entirely possible that they could have escaped notice
up until now.

So I spent some time thinking about the question of getting rid of
tombstone files altogether. I don't think that Heikki's idea of a
shared memory hash table to track dropped relfilenodes can work. The
hash table will have to be of some fixed size N, and whatever the
value of N, the approach will break down if N+1 relfilenodes are
dropped in the same checkpoint cycle.

The two most principled solutions to this problem that I can see are
(1) remove wal_level=minimal and (2) use 64-bit relfilenodes. I have
been reluctant to support #1 because it's hard for me to believe that
there aren't cases where being able to skip a whole lot of WAL-logging
doesn't work out to a nice performance win, but I realize opinions on
that topic vary. And I'm pretty sure that Andres, at least, will hate
#2 because he's unhappy with the width of buffer tags already. So I
don't really have a good idea. I agree this tombstone system is a bit
of a wart, but I'm not sure that this patch really makes anything any
better, and I'm not really seeing another idea that seems better
either.

Maybe I am missing something...

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Make relfile tombstone files conditional on WAL level

От

Andres Freund

Дата:

03 августа 2021 г., 01:38:19

Hi,

On 2021-08-02 16:03:31 -0400, Robert Haas wrote:
> The two most principled solutions to this problem that I can see are
> (1) remove wal_level=minimal and

I'm personally not opposed to this. It's not practically relevant and makes a
lot of stuff more complicated. We imo should rather focus on optimizing the
things wal_level=minimal accelerates a lot than adding complications for
wal_level=minimal. Such optimizations would have practical relevance, and
there's plenty low hanging fruits.

> (2) use 64-bit relfilenodes. I have
> been reluctant to support #1 because it's hard for me to believe that
> there aren't cases where being able to skip a whole lot of WAL-logging
> doesn't work out to a nice performance win, but I realize opinions on
> that topic vary. And I'm pretty sure that Andres, at least, will hate
> #2 because he's unhappy with the width of buffer tags already.

Yep :/

I guess there's a somewhat hacky way to get somewhere without actually
increasing the size. We could take 3 bytes from the fork number and use that
to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
everyone.

It's not like we can use those bytes in a useful way, due to alignment
requirements. Declaring that the high 7 bytes are for the relNode portion and
the low byte for the fork would still allow efficient comparisons and doesn't
seem too ugly.

> So I don't really have a good idea. I agree this tombstone system is a
> bit of a wart, but I'm not sure that this patch really makes anything
> any better, and I'm not really seeing another idea that seems better
> either.

> Maybe I am missing something...

What I proposed in the past was to have a new shared table that tracks
relfilenodes. I still think that's a decent solution for just the problem at
hand. But it'd also potentially be the way to redesign relation forks and even
slim down buffer tags:

Right now a buffer tag is:
- 4 byte tablespace oid
- 4 byte database oid
- 4 byte "relfilenode oid" (don't think we have a good name for this)
- 4 byte fork number
- 4 byte block number

If we had such a shared table we could put at least tablespace, fork number
into that table mapping them to an 8 byte "new relfilenode". That'd only make
the "new relfilenode" unique within a database, but that'd be sufficient for
our purposes.  It'd give use a buffertag consisting out of the following:
- 4 byte database oid
- 8 byte "relfilenode"
- 4 byte block number

Of course, it'd add some complexity too, because a buffertag alone wouldn't be
sufficient to read data (as you'd need the tablespace oid from elsewhere). But
that's probably ok, I think all relevant places would have that information.

It's probably possible to remove the database oid from the tag as well, but
it'd make CREATE DATABASE tricker - we'd need to change the filenames of
tables as we copy, to adjust them to the differing oid.

Greetings,

Andres Freund

Re: Make relfile tombstone files conditional on WAL level

От

Robert Haas

Дата:

03 августа 2021 г., 18:22:31

On Mon, Aug 2, 2021 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:
> What I proposed in the past was to have a new shared table that tracks
> relfilenodes. I still think that's a decent solution for just the problem at
> hand.

It's not really clear to me what problem is at hand. The problems that
the tombstone system created for the async I/O stuff weren't really
explained properly, IMHO. And I don't think the current system is all
that ugly. it's not the most beautiful thing in the world but we have
lots of way worse hacks. And, it's easy to understand, requires very
little code, and has few moving parts that can fail. As hacks go it's
a quality hack, I would say.

> But it'd also potentially be the way to redesign relation forks and even
> slim down buffer tags:
>
> Right now a buffer tag is:
> - 4 byte tablespace oid
> - 4 byte database oid
> - 4 byte "relfilenode oid" (don't think we have a good name for this)
> - 4 byte fork number
> - 4 byte block number
>
> If we had such a shared table we could put at least tablespace, fork number
> into that table mapping them to an 8 byte "new relfilenode". That'd only make
> the "new relfilenode" unique within a database, but that'd be sufficient for
> our purposes.  It'd give use a buffertag consisting out of the following:
> - 4 byte database oid
> - 8 byte "relfilenode"
> - 4 byte block number

Yep. I think this is a good direction.

> Of course, it'd add some complexity too, because a buffertag alone wouldn't be
> sufficient to read data (as you'd need the tablespace oid from elsewhere). But
> that's probably ok, I think all relevant places would have that information.

I think the thing to look at would be the places that call
relpathperm() or relpathbackend(). I imagine this can be worked out,
but it might require some adjustment.

> It's probably possible to remove the database oid from the tag as well, but
> it'd make CREATE DATABASE tricker - we'd need to change the filenames of
> tables as we copy, to adjust them to the differing oid.

Yeah, I'm not really sure that works out to a win. I tend to think
that we should be trying to make databases within the same cluster
more rather than less independent of each other. If we switch to using
a radix tree for the buffer mapping table as you have previously
proposed, then presumably each backend can cache a pointer to the
second level, after the database OID has been resolved. Then you have
no need to compare database OIDs for every lookup. That might turn out
to be better for performance than shoving everything into the buffer
tag anyway, because then backends in different databases would be
accessing distinct parts of the buffer mapping data structure instead
of contending with one another.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Make relfile tombstone files conditional on WAL level

От

Thomas Munro

Дата:

29 сентября 2021 г., 06:07:32

On Fri, Mar 5, 2021 at 11:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> This is a WIP with an open question to research: what could actually
> break if we did this?

I thought this part of bgwriter.c might be a candidate:

       if (FirstCallSinceLastCheckpoint())
       {
           /*
            * After any checkpoint, close all smgr files.  This is so we
            * won't hang onto smgr references to deleted files indefinitely.
            */
            smgrcloseall();
       }

Hmm, on closer inspection, isn't the lack of real interlocking with
checkpoints a bit suspect already?  What stops bgwriter from writing
to the previous relfilenode generation's fd, if a relfilenode is
recycled while BgBufferSync() is running?  Not sinval, and not the
above code that only runs between BgBufferSync() invocations.

Re: Make relfile tombstone files conditional on WAL level

От

Thomas Munro

Дата:

29 сентября 2021 г., 06:29:16

On Wed, Aug 4, 2021 at 3:22 AM Robert Haas <robertmhaas@gmail.com> wrote:
> It's not really clear to me what problem is at hand. The problems that
> the tombstone system created for the async I/O stuff weren't really
> explained properly, IMHO. And I don't think the current system is all
> that ugly. it's not the most beautiful thing in the world but we have
> lots of way worse hacks. And, it's easy to understand, requires very
> little code, and has few moving parts that can fail. As hacks go it's
> a quality hack, I would say.

It's not really an AIO problem.  It's just that while testing the AIO
stuff across a lot of operating systems, we had tests failing on
Windows because the extra worker processes you get if you use
io_method=worker were holding cached descriptors and causing stuff
like DROP TABLESPACE to fail.  AFAIK every problem we discovered in
that vein is a current live bug in all versions of PostgreSQL for
Windows (it just takes other backends or the bgwriter to hold an fd at
the wrong moment).  The solution I'm proposing to that general class
of problem is https://commitfest.postgresql.org/34/2962/ .

In the course of thinking about that, it seemed natural to look into
the possibility of getting rid of the tombstones, so that at least
Unix systems don't find themselves having to suffer through a
CHECKPOINT just to drop a tablespace that happens to contain a
tombstone.

Re: Make relfile tombstone files conditional on WAL level

От

Thomas Munro

Дата:

30 сентября 2021 г., 13:32:20

On Wed, Sep 29, 2021 at 4:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Hmm, on closer inspection, isn't the lack of real interlocking with
> checkpoints a bit suspect already?  What stops bgwriter from writing
> to the previous relfilenode generation's fd, if a relfilenode is
> recycled while BgBufferSync() is running?  Not sinval, and not the
> above code that only runs between BgBufferSync() invocations.

I managed to produce a case where live data is written to an unlinked
file and lost, with a couple of tweaks to get the right timing and
simulate OID wraparound.  See attached.  If you run the following
commands repeatedly with shared_buffers=256kB and
bgwriter_lru_multiplier=10, you should see a number lower than 10,000
from the last query in some runs, depending on timing.

create extension if not exists chaos;
create extension if not exists pg_prewarm;

drop table if exists t1, t2;
checkpoint;
vacuum pg_class;

select clobber_next_oid(200000);
create table t1 as select 42 i from generate_series(1, 10000);
select pg_prewarm('t1'); -- fill buffer pool with t1
update t1 set i = i; -- dirty t1 buffers so bgwriter writes some
select pg_sleep(2); -- give bgwriter some time

drop table t1;
checkpoint;
vacuum pg_class;

select clobber_next_oid(200000);
create table t2 as select 0 i from generate_series(1, 10000);
select pg_prewarm('t2'); -- fill buffer pool with t2
update t2 set i = 1 where i = 0; -- dirty t2 buffers so bgwriter writes some
select pg_sleep(2); -- give bgwriter some time

select pg_prewarm('pg_attribute'); -- evict all clean t2 buffers
select sum(i) as t2_sum_should_be_10000 from t2; -- have any updates been lost?

On Thu, Jan 6, 2022 at 7:22 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Another problem is that relfilenodes are normally allocated with
> GetNewOidWithIndex(), and initially match a relation's OID. We'd need
> a new allocator, and they won't be able to match the OID in general
> (while we have 32 bit OIDs at least).

Personally I'm not sad about that. Values that are the same in simple
cases but diverge in more complex cases are kind of a trap for the
unwary. There's no real reason to have them ever match. Yeah, in
theory, it makes it easier to tell which file matches which relation,
but in practice, you always have to double-check in case the table has
ever been rewritten. It doesn't seem worth continuing to contort the
code for a property we can't guarantee anyway.

Make sense, I have started working on this idea, I will try to post the first version by early next week.

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Make relfile tombstone files conditional on WAL level

От

Dilip Kumar

Дата:

28 января 2022 г., 17:40:15

On Wed, Jan 19, 2022 at 10:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jan 6, 2022 at 7:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> > Another problem is that relfilenodes are normally allocated with
>> > GetNewOidWithIndex(), and initially match a relation's OID.  We'd need
>> > a new allocator, and they won't be able to match the OID in general
>> > (while we have 32 bit OIDs at least).
>>
>> Personally I'm not sad about that. Values that are the same in simple
>> cases but diverge in more complex cases are kind of a trap for the
>> unwary. There's no real reason to have them ever match. Yeah, in
>> theory, it makes it easier to tell which file matches which relation,
>> but in practice, you always have to double-check in case the table has
>> ever been rewritten. It doesn't seem worth continuing to contort the
>> code for a property we can't guarantee anyway.
>
>
> Make sense, I have started working on this idea, I will try to post the first version by early next week.

Here is the first working patch, with that now we don't need to
maintain the TombStone file until the next checkpoint.  This is still
a WIP patch with this I can see my problem related to ALTER DATABASE
SET TABLESPACE WAL-logged problem is solved which Robert reported a
couple of mails above in the same thread.

General idea of the patch:
- Change the RelFileNode.relNode to be 64bit wide, out of which 8 bits
for fork number and 56 bits for the relNode as shown below. [1]
- GetNewRelFileNode() will just generate a new unique relfilenode and
check the file existence and if it already exists then throw an error,
so no loop.  We also need to add the logic for preserving the
nextRelNode across restart and also WAL logging it but that is similar
to the preserving nextOid.
- mdunlinkfork, will directly forget the relfilenode, so we get rid of
all unlinking code from the code.
- Now, we don't need any post checkpoint unlinking activity.

[1]
/*
* RelNodeId:
*
* this is a storage type for RelNode. The reasoning behind using this is same
* as using the BlockId so refer comment atop BlockId.
*/
typedef struct RelNodeId
{
      uint32 rn_hi;
      uint32 rn_lo;
} RelNodeId;
typedef struct RelFileNode
{
   Oid spcNode; /* tablespace */
   Oid dbNode; /* database */
   RelNodeId relNode; /* relation */
} RelFileNode;

TODO:

There are a couple of TODOs and FIXMEs which I am planning to improve
by next week.  I am also planning to do the testing where relfilenode
consumes more than 32 bits, maybe for that we can set the
FirstNormalRelfileNode to higher value for the testing purpose.  And,
Improve comments.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

v1-WIP-0001-Don-t-wait-for-next-checkpoint-to-remove-unwanted.patch

Re: Make relfile tombstone files conditional on WAL level

От

Dilip Kumar

Дата:

31 января 2022 г., 08:29:33

On Fri, Jan 28, 2022 at 8:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jan 19, 2022 at 10:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >

>
> TODO:
>
> There are a couple of TODOs and FIXMEs which I am planning to improve
> by next week.  I am also planning to do the testing where relfilenode
> consumes more than 32 bits, maybe for that we can set the
> FirstNormalRelfileNode to higher value for the testing purpose.  And,
> Improve comments.
>

I have fixed most of TODO and FIXMEs but there are still a few which I
could not decide, the main one currently we do not have uint8 data
type only int8 is there so I have used int8 for storing relfilenode +
forknumber.  Although this is sufficient because I don't think we will
ever get more than 128 fork numbers.  But my question is should we
think for adding uint8 as new data type or infect make RelNode itself
as new data type like we have Oid.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

v2-WIP-0001-Don-t-delay-removing-Tombstone-file-until-next-ch.patch

Re: Make relfile tombstone files conditional on WAL level

От

Robert Haas

Дата:

31 января 2022 г., 17:04:41

On Mon, Jan 31, 2022 at 12:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> the main one currently we do not have uint8 data
> type only int8 is there so I have used int8 for storing relfilenode +
> forknumber.

I'm confused. We use int8 in tons of places, so I feel like it must exist.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Make relfile tombstone files conditional on WAL level

От

Robert Haas

Дата:

31 января 2022 г., 17:06:33

On Mon, Jan 31, 2022 at 9:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 31, 2022 at 12:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > the main one currently we do not have uint8 data
> > type only int8 is there so I have used int8 for storing relfilenode +
> > forknumber.
>
> I'm confused. We use int8 in tons of places, so I feel like it must exist.

Rather, we use uint8 in tons of places, so I feel like it must exist.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Make relfile tombstone files conditional on WAL level

От

Dilip Kumar

Дата:

31 января 2022 г., 17:37:17

On Mon, Jan 31, 2022 at 7:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jan 31, 2022 at 9:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Mon, Jan 31, 2022 at 12:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > the main one currently we do not have uint8 data
> > > type only int8 is there so I have used int8 for storing relfilenode +
> > > forknumber.
> >
> > I'm confused. We use int8 in tons of places, so I feel like it must exist.
>
> Rather, we use uint8 in tons of places, so I feel like it must exist.

Hmm, at least pg_type doesn't have anything with a name like uint8.

postgres[101702]=# select oid, typname from pg_type where typname like '%int8';
 oid  | typname
------+---------
   20 | int8
 1016 | _int8
(2 rows)

postgres[101702]=# select oid, typname from pg_type where typname like '%uint%';
 oid | typname
-----+---------
(0 rows)

I agree that we are using 8 bytes unsigned int multiple places in code
as uint64.  But I don't see it as an exposed data type and not used as
part of any exposed function.  But we will have to use the relfilenode
in the exposed c function e.g.
binary_upgrade_set_next_heap_relfilenode().

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Make relfile tombstone files conditional on WAL level

От

Robert Haas

Дата:

02 февраля 2022 г., 16:27:28

On Mon, Jan 31, 2022 at 9:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I agree that we are using 8 bytes unsigned int multiple places in code
> as uint64.  But I don't see it as an exposed data type and not used as
> part of any exposed function.  But we will have to use the relfilenode
> in the exposed c function e.g.
> binary_upgrade_set_next_heap_relfilenode().

Oh, I thought we were talking about the C data type uint8 i.e. an
8-bit unsigned integer. Which in retrospect was a dumb thought because
you said you wanted to store the relfilenode AND the fork number
there, which only make sense if you were talking about SQL data types
rather than C data types. It is confusing that we have an SQL data
type called int8 and a C data type called int8 and they're not the
same.

But if you're talking about SQL data types, why? pg_class only stores
the relfilenode and not the fork number currently, and I don't see why
that would change. I think that the data type for the relfilenode
column would change to a 64-bit signed integer (i.e. bigint or int8)
that only ever uses the low-order 56 bits, and then when you need to
store a relfilenode and a fork number in the same 8-byte quantity
you'd do that using either a struct with bit fields or by something
like combined = ((uint64) signed_representation_of_relfilenode) |
(((int) forknumber) << 56);

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Make relfile tombstone files conditional on WAL level

От

Dilip Kumar

Дата:

02 февраля 2022 г., 16:39:15

On Wed, Feb 2, 2022 at 6:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jan 31, 2022 at 9:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I agree that we are using 8 bytes unsigned int multiple places in code
> > as uint64.  But I don't see it as an exposed data type and not used as
> > part of any exposed function.  But we will have to use the relfilenode
> > in the exposed c function e.g.
> > binary_upgrade_set_next_heap_relfilenode().
>
> Oh, I thought we were talking about the C data type uint8 i.e. an
> 8-bit unsigned integer. Which in retrospect was a dumb thought because
> you said you wanted to store the relfilenode AND the fork number
> there, which only make sense if you were talking about SQL data types
> rather than C data types. It is confusing that we have an SQL data
> type called int8 and a C data type called int8 and they're not the
> same.
>
> But if you're talking about SQL data types, why? pg_class only stores
> the relfilenode and not the fork number currently, and I don't see why
> that would change. I think that the data type for the relfilenode
> column would change to a 64-bit signed integer (i.e. bigint or int8)
> that only ever uses the low-order 56 bits, and then when you need to
> store a relfilenode and a fork number in the same 8-byte quantity
> you'd do that using either a struct with bit fields or by something
> like combined = ((uint64) signed_representation_of_relfilenode) |
> (((int) forknumber) << 56);

Yeah you're right.  I think whenever we are using combined then we can
use uint64 C type and in pg_class we can keep it as int64 because that
is only representing the relfilenode part.  I think I was just
confused and tried to use the same data type everywhere whether it is
combined with fork number or not.  Thanks for your input, I will
change this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Make relfile tombstone files conditional on WAL level

От

Dilip Kumar

Дата:

07 февраля 2022 г., 08:26:17

On Wed, Feb 2, 2022 at 7:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Feb 2, 2022 at 6:57 PM Robert Haas <robertmhaas@gmail.com> wrote:

I have splitted the patch into multiple patches which can be
independently committable and easy to review. I have explained the
purpose and scope of each patch in the respective commit messages.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Thu, Jan 6, 2022 at 1:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

 2) GetNewRelFileNode() will not loop for checking the file existence
> and retry with other relfilenode.

While working on this I realized that even if we make the relfilenode
56 bits we can not remove the loop inside GetNewRelFileNode() for
checking the file existence.  Because it is always possible that the
file reaches to the disk even before the WAL for advancing the next
relfilenode and if the system crashes in between that then we might
generate the duplicate relfilenode right?

I think the second paragraph in XLogPutNextOid() function explain this
issue and now even after we get the wider relfilenode we will have
this issue.  Correct?

I am also attaching the latest set of patches for reference, these
patches fix the review comments given by Robert about moving the
dbOid, tbsOid and RelNode directly into the buffer tag.

Open Issues- there are currently 2 open issues in the patch 1) Issue
as discussed above about removing the loop, so currently in this patch
the loop is removed.  2) During upgrade from the previous version we
need to advance the nextrelfilenode to the current relfilenode we are
setting for the object in order to avoid the conflict.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: Make relfile tombstone files conditional on WAL level

От

Dilip Kumar

Дата:

04 марта 2022 г., 08:37:19

On Mon, Feb 21, 2022 at 1:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jan 6, 2022 at 1:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>  2) GetNewRelFileNode() will not loop for checking the file existence
> > and retry with other relfilenode.
>
>
> Open Issues- there are currently 2 open issues in the patch 1) Issue
> as discussed above about removing the loop, so currently in this patch
> the loop is removed.  2) During upgrade from the previous version we
> need to advance the nextrelfilenode to the current relfilenode we are
> setting for the object in order to avoid the conflict.

In this version I have fixed both of these issues.  Thanks Robert for
suggesting the solution for both of these problems in our offlist
discussion.  Basically, for the first problem we can flush the xlog
immediately because we are actually logging the WAL every time after
we allocate 64 relfilenode so this should not have much impact on the
performance and I have added the same in the comments.  And during
pg_upgrade, whenever we are assigning the relfilenode as part of the
upgrade we will set that relfilenode + 1 as nextRelFileNode to be
assigned so that we never generate the conflicting relfilenode.

The only part I do not like in the patch is that before this patch we
could directly access the buftag->rnode.  But since now we are not
having directly relfilenode as part of the buffertag and instead of
that we are keeping individual fields (i.e. dbOid, tbsOid and relNode)
in the buffer tag.  So if we have to directly get the relfilenode we
need to generate it.  However those changes are very limited to just 1
or 2 file so maybe not that bad.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Tue, Mar 8, 2022 at 10:11 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Mar 4, 2022 at 12:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > In this version I have fixed both of these issues.
>
> Here's a bit of review for these patches:
>
> - The whole relnode vs. relfilenode thing is really confusing. I
> realize that there is some precedent for calling the number that
> pertains to the file on disk "relnode" and that value when combined
> with the database and tablespace OIDs "relfilenode," but it's
> definitely not the most obvious thing, especially since
> pg_class.relfilenode is a prominent case where we don't even adhere to
> that convention. I'm kind of tempted to think that we should go the
> other way and rename the RelFileNode struct to something like
> RelFileLocator, and then maybe call the new data type RelFileNumber.
> And then we could work toward removing references to "filenode" and
> "relfilenode" in favor of either (rel)filelocator or (rel)filenumber.
> Now the question (even assuming other people like this general
> direction) is how far do we go with it? Renaming pg_class.relfilenode
> itself wouldn't be the worst compatibility break we've ever had, but
> it would definitely cause some pain. I'd be inclined to leave the
> user-visible catalog column alone and just push in this direction for
> internal stuff.

I have worked on this renaming stuff first and once we agree with that
then I will rebase the other patches on top of this and will also work
on the other review comments for those patches.
So basically in this patch
- The "RelFileNode" structure to "RelFileLocator" and also renamed
other internal member as below
typedef struct RelFileLocator
{
      Oid spcOid; /* tablespace */
      Oid dbOid; /* database */
      Oid relNumber; /* relation */
} RelFileLocator;
- All variables and internal functions which are using name as
relfilenode/rnode and referring to this structure are renamed to
relfilelocator/rlocator.
- relNode/relfilenode which are referring to the actual file name on
disk is renamed to relNumber/relfilenumber.
- Based on the new terminology, I have renamed the file names as well, e.g.
relfilenode.h -> relfilelocator.h
relfilenodemap.h -> relfilenumbermap.h

I haven't renamed the exposed catalog variable and exposed function
here is the high level list
- pg_class.relfilenode
- pg_catalog.pg_relation_filenode()
- All test cases variables referring to pg_class.relfilenode.
- exposed option for tool which are w.r.t pg_class relfilenode (e.g.
-f, --filenode=FILENODE)
- exposed functions
pg_catalog.binary_upgrade_set_next_heap_relfilenode() and friends
- pg_filenode.map file name, maybe we can rename this but this is used
by other tools so I left this alone.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

v1-0001-Rename-RelFileNode-to-RelFileLocator-and-relNode-.patch

making relfilenodes 56 bits

От

Robert Haas

Дата:

23 июня 2022 г., 23:06:21

[ changing subject line so nobody misses what's under discussion ]

For a quick summary of the overall idea being discussed here and some
discussion of the problems it solves, see
http://postgr.es/m/CA+TgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg@mail.gmail.com

For discussion of the proposed renaming of non-user-visible references
to relfilenode to either RelFileLocator or RelFileNumber as
preparatory refactoring work for that change, see
http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com

On Thu, Jun 23, 2022 at 3:55 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have worked on this renaming stuff first and once we agree with that
> then I will rebase the other patches on top of this and will also work
> on the other review comments for those patches.
> So basically in this patch
> - The "RelFileNode" structure to "RelFileLocator" and also renamed
> other internal member as below
> typedef struct RelFileLocator
> {
>       Oid spcOid; /* tablespace */
>       Oid dbOid; /* database */
>       Oid relNumber; /* relation */
> } RelFileLocator;

I like those structure member names fine, but I'd like to see this
preliminary patch also introduce the RelFileNumber typedef as an alias
for Oid. Then the main patch can change it to be uint64.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

24 июня 2022 г., 14:08:13

On Fri, Jun 24, 2022 at 1:36 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> [ changing subject line so nobody misses what's under discussion ]
>
> For a quick summary of the overall idea being discussed here and some
> discussion of the problems it solves, see
> http://postgr.es/m/CA+TgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg@mail.gmail.com
>
> For discussion of the proposed renaming of non-user-visible references
> to relfilenode to either RelFileLocator or RelFileNumber as
> preparatory refactoring work for that change, see
> http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
>
> On Thu, Jun 23, 2022 at 3:55 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have worked on this renaming stuff first and once we agree with that
> > then I will rebase the other patches on top of this and will also work
> > on the other review comments for those patches.
> > So basically in this patch
> > - The "RelFileNode" structure to "RelFileLocator" and also renamed
> > other internal member as below
> > typedef struct RelFileLocator
> > {
> >       Oid spcOid; /* tablespace */
> >       Oid dbOid; /* database */
> >       Oid relNumber; /* relation */
> > } RelFileLocator;
>
> I like those structure member names fine, but I'd like to see this
> preliminary patch also introduce the RelFileNumber typedef as an alias
> for Oid. Then the main patch can change it to be uint64.

I have changed that. PFA, the updated patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

v2-0001-Rename-RelFileNode-to-RelFileLocator-and-relNode-.patch

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

24 июня 2022 г., 17:59:25

On Fri, Jun 24, 2022 at 7:08 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have changed that. PFA, the updated patch.

Apart from one minor nitpick (see below) I don't see a problem with
this in isolation. It seems like a pretty clean renaming. So I think
we need to move onto the question of how clean the rest of the patch
series looks with this as a base.

A preliminary refactoring that was discussed in the past and was
originally in 0001 was to move the fields included in BufferTag via
RelFileNode/Locator directly into the struct. I think maybe it doesn't
make sense to include that in 0001 as you have it here, but maybe that
could be 0002 with the main patch to follow as 0003, or something like
that. I wonder if we can get by with redefining RelFileNode like this
in 0002:

typedef struct buftag
{
    Oid     spcOid;
    Oid     dbOid;
    RelFileNumber   fileNumber;
    ForkNumber  forkNum;
} BufferTag;

And then like this in 0003:

typedef struct buftag
{
    Oid     spcOid;
    Oid     dbOid;
    RelFileNumber   fileNumber:56;
    ForkNumber  forkNum:8;
} BufferTag;

- * from catalog OIDs to filenode numbers.  Each database has a map file for
+ * from catalog OIDs to filenumber.  Each database has a map file for

should be filenumbers

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Andres Freund

Дата:

25 июня 2022 г., 04:30:26

Hi,

On 2022-06-24 10:59:25 -0400, Robert Haas wrote:
> A preliminary refactoring that was discussed in the past and was
> originally in 0001 was to move the fields included in BufferTag via
> RelFileNode/Locator directly into the struct. I think maybe it doesn't
> make sense to include that in 0001 as you have it here, but maybe that
> could be 0002 with the main patch to follow as 0003, or something like
> that. I wonder if we can get by with redefining RelFileNode like this
> in 0002:
> 
> typedef struct buftag
> {
>     Oid     spcOid;
>     Oid     dbOid;
>     RelFileNumber   fileNumber;
>     ForkNumber  forkNum;
> } BufferTag;

If we "inline" RelFileNumber, it's probably worth reorder the members so that
the most distinguishing elements come first, to make it quicker to detect hash
collisions. It shows up in profiles today...

I guess it should be blockNum, fileNumber, forkNumber, dbOid, spcOid? I think
as long as blockNum, fileNumber are first, the rest doesn't matter much.

> And then like this in 0003:
> 
> typedef struct buftag
> {
>     Oid     spcOid;
>     Oid     dbOid;
>     RelFileNumber   fileNumber:56;
>     ForkNumber  forkNum:8;
> } BufferTag;

Probably worth checking the generated code / the performance effects of using
bitfields (vs manual maskery). I've seen some awful cases, but here it's at a
byte boundary, so it might be ok.

Greetings,

Andres Freund

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

25 июня 2022 г., 15:47:37

On Fri, Jun 24, 2022 at 9:30 PM Andres Freund <andres@anarazel.de> wrote:
> If we "inline" RelFileNumber, it's probably worth reorder the members so that
> the most distinguishing elements come first, to make it quicker to detect hash
> collisions. It shows up in profiles today...
>
> I guess it should be blockNum, fileNumber, forkNumber, dbOid, spcOid? I think
> as long as blockNum, fileNumber are first, the rest doesn't matter much.

Hmm, I guess we could do that. Possibly as a separate, very small patch.

> > And then like this in 0003:
> >
> > typedef struct buftag
> > {
> >     Oid     spcOid;
> >     Oid     dbOid;
> >     RelFileNumber   fileNumber:56;
> >     ForkNumber  forkNum:8;
> > } BufferTag;
>
> Probably worth checking the generated code / the performance effects of using
> bitfields (vs manual maskery). I've seen some awful cases, but here it's at a
> byte boundary, so it might be ok.

One advantage of using bitfields is that it might mean we don't need
to introduce accessor macros. Now, if that's going to lead to terrible
performance I guess we should go ahead and add the accessor macros -
Dilip had those in an earlier patch anyway. But it'd be nice if it
weren't necessary.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Simon Riggs

Дата:

28 июня 2022 г., 14:45:16

On Sat, 25 Jun 2022 at 02:30, Andres Freund <andres@anarazel.de> wrote:

> > And then like this in 0003:
> >
> > typedef struct buftag
> > {
> >     Oid     spcOid;
> >     Oid     dbOid;
> >     RelFileNumber   fileNumber:56;
> >     ForkNumber  forkNum:8;
> > } BufferTag;
>
> Probably worth checking the generated code / the performance effects of using
> bitfields (vs manual maskery). I've seen some awful cases, but here it's at a
> byte boundary, so it might be ok.

Another approach would be to condense spcOid and dbOid into a single
4-byte Oid-like number, since in most cases they are associated with
each other, and not often many of them anyway. So this new number
would indicate both the database and the tablespace. I know that we
want to be able to make file changes without doing catalog lookups,
but since the number of combinations is usually 1, but even then, low,
it can be cached easily in a smgr array and included in the checkpoint
record (or nearby) for ease of use.

typedef struct buftag
{
     Oid     db_spcOid;
     ForkNumber  uint32;
     RelFileNumber   uint64;
} BufferTag;

That way we could just have a simple 64-bit RelFileNumber, without
restriction, and probably some spare bytes on the ForkNumber, if we
needed them later.

-- 
Simon Riggs                http://www.EnterpriseDB.com/

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

28 июня 2022 г., 18:25:55

On Tue, Jun 28, 2022 at 7:45 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> Another approach would be to condense spcOid and dbOid into a single
> 4-byte Oid-like number, since in most cases they are associated with
> each other, and not often many of them anyway. So this new number
> would indicate both the database and the tablespace. I know that we
> want to be able to make file changes without doing catalog lookups,
> but since the number of combinations is usually 1, but even then, low,
> it can be cached easily in a smgr array and included in the checkpoint
> record (or nearby) for ease of use.
>
> typedef struct buftag
> {
>      Oid     db_spcOid;
>      ForkNumber  uint32;
>      RelFileNumber   uint64;
> } BufferTag;

I've thought about this before too, because it does seem like the DB
OID and tablespace OID are a poor use of bit space. You might not even
need to keep the db_spcOid value in any persistent place, because it
could just be an alias for buffer mapping lookups that might change on
every restart. That does have the problem that you now need a
secondary hash table - in theory of unbounded size - to store mappings
from <dboid,tsoid> to db_spcOid, and that seems complicated and hard
to get right. It might be possible, though. Alternatively, you could
imagine a durable mapping that also affects the on-disk structure, but
I don't quite see how to make that work: for example, pg_basebackup
wants to produce a tar file for each tablespace directory, and if the
pathnames no longer contain the tablespace OID but only the db_spcOid,
then that doesn't work any more.

But the primary problem we're trying to solve here is that right now
we sometimes reuse the same filename for a whole new file, and that
results in bugs that only manifest themselves in obscure
circumstances, e.g. see 4eb2176318d0561846c1f9fb3c68bede799d640f.
There are residual failure modes even now related to the "tombstone"
files that are created when you drop a relation: remove everything but
the first file from the main fork but then keep that file (only)
around until after the next checkpoint. OID wraparound is another
annoyance that has influenced the design of quite a bit of code over
the years and where we probably still have bugs. If we don't reuse
relfilenodes, we can avoid a lot of that pain. Combining the DB OID
and TS OID fields doesn't solve that problem.

> That way we could just have a simple 64-bit RelFileNumber, without
> restriction, and probably some spare bytes on the ForkNumber, if we
> needed them later.

In my personal opinion, the ForkNumber system is an ugly wart which
has nothing to recommend it except that the VM and FSM forks are
awesome. But if we could have those things without needing forks, I
think that would be way better. Forks add code complexity in tons of
places, and it's barely possible to scale it to the 4 forks we have
already, let alone any larger number. Furthermore, there are really
negative performance effects from creating 3 files per small relation
rather than 1, and we sure can't afford to have that number get any
bigger. I'd rather kill the ForkNumber system with fire that expand it
further, but even if we do expand it, we're not close to being able to
cope with more than 256 forks per relation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

28 июня 2022 г., 20:01:11

On Tue, Jun 28, 2022 at 11:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
> But the primary problem we're trying to solve here is that right now
> we sometimes reuse the same filename for a whole new file, and that
> results in bugs that only manifest themselves in obscure
> circumstances, e.g. see 4eb2176318d0561846c1f9fb3c68bede799d640f.
> There are residual failure modes even now related to the "tombstone"
> files that are created when you drop a relation: remove everything but
> the first file from the main fork but then keep that file (only)
> around until after the next checkpoint. OID wraparound is another
> annoyance that has influenced the design of quite a bit of code over
> the years and where we probably still have bugs. If we don't reuse
> relfilenodes, we can avoid a lot of that pain. Combining the DB OID
> and TS OID fields doesn't solve that problem.

Oh wait, I'm being stupid. You were going to combine those fields but
then also widen the relfilenode, so that would solve this problem
after all. Oops, I'm dumb.

I still think this is a lot more complicated though, to the point
where I'm not sure we can really make it work at all.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Matthias van de Meent

Дата:

28 июня 2022 г., 21:18:46

On Tue, 28 Jun 2022 at 13:45, Simon Riggs <simon.riggs@enterprisedb.com> wrote:
> but since the number of combinations is usually 1, but even then, low,
> it can be cached easily in a smgr array and included in the checkpoint
> record (or nearby) for ease of use.

I was reading the thread to keep up with storage-related prototypes
and patches, and this specifically doesn't sound quite right to me. I
do not know what values you considered to be 'low' or what 'can be
cached easily', so here's some field data:

I have seen PostgreSQL clusters that utilized the relative isolation
of seperate databases within the same cluster (instance / postmaster)
to provide higher guarantees of data access isolation while still
being able to share a resource pool, which resulted in several
clusters containing upwards of 100 databases.

I will be the first to admit that it is quite unlikely to be common
practise, but this workload increases the number of dbOid+spcOid
combinations to 100s (even while using only a single tablespace),
which in my opinion requires some more thought than just handwaving it
into an smgr array and/or checkpoint records.

Kind regards,

Matthias van de Meent

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

29 июня 2022 г., 12:15:09

On Fri, Jun 24, 2022 at 8:29 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jun 24, 2022 at 7:08 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have changed that. PFA, the updated patch.
>
> Apart from one minor nitpick (see below) I don't see a problem with
> this in isolation. It seems like a pretty clean renaming. So I think
> we need to move onto the question of how clean the rest of the patch
> series looks with this as a base.
>

PFA, the remaining set of patches.   It might need to fix some
indentation but lets first see how is the overall idea then we can
work on it.  I have fixed all the open review comment from the
previous thread except this comment from Robert.

>- It looks to me like you need to give significantly more thought to
> the proper way of adjusting the relfilenode-related test cases in
> alter_table.out.

It seems to me that this test case is just testing whether the
table/child table are rewritten or not after the alter table.  And for
that it is comparing the oid with the relfilenode, now that is not
possible so I think it's quite reasonable to just compare the current
relfilenode with the old relfilenode and if they are same the table is
not rewritten.  So I am not sure why the original test case had two
cases 'own' and 'orig'.  With respect to this test case they both have
the same meaning, in fact comparing old relfilenode with current
relfilenode is better way of testing than comparing the oid with
relfilenode.

diff --git a/src/test/regress/expected/alter_table.out
b/src/test/regress/expected/alter_table.out
index 5ede56d..80af97e 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -2164,7 +2164,6 @@ select relname,
   c.oid = oldoid as orig_oid,
   case relfilenode
     when 0 then 'none'
-    when c.oid then 'own'
     when oldfilenode then 'orig'
     else 'OTHER'
     end as storage,
@@ -2175,10 +2174,10 @@ select relname,
            relname            | orig_oid | storage |     desc
 ------------------------------+----------+---------+---------------
  at_partitioned               | t        | none    |
- at_partitioned_0             | t        | own     |
- at_partitioned_0_id_name_key | t        | own     | child 0 index
- at_partitioned_1             | t        | own     |
- at_partitioned_1_id_name_key | t        | own     | child 1 index
+ at_partitioned_0             | t        | orig    |
+ at_partitioned_0_id_name_key | t        | orig    | child 0 index
+ at_partitioned_1             | t        | orig    |
+ at_partitioned_1_id_name_key | t        | orig    | child 1 index
  at_partitioned_id_name_key   | t        | none    | parent index
 (6 rows)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Fri, Jul 1, 2022 at 12:54 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 29, 2022 at 5:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > PFA, the remaining set of patches.   It might need to fix some
> > indentation but lets first see how is the overall idea then we can
> > work on it
>
> So just playing around with this patch set, and also looking at the
> code a bit, here are a few random observations:
>
> - The patch assigns relfilenumbers starting with 1. I don't see any
> specific problem with that, but I wonder if it would be a good idea to
> start with a random larger value just in case we ever need some fixed
> values for some purpose or other. Maybe we should start with 100000 or
> something?

Yeah we can do that, I have changed to 100000.

> - If I use ALTER TABLE .. SET TABLESPACE to move a table around, then
> the relfilenode changes each time, but if I use ALTER DATABASE .. SET
> TABLESPACE to move a database around, the relfilenodes don't change.
> So, what this guarantees is that if the same filename is used twice,
> it will be for the same relation and not some unrelated relation.
> That's enough to avoid the hazard described in the comments for
> mdunlink(), because that scenario intrinsically involves confusion
> caused by two relations using the same filename after an OID
> wraparound. And it also means that if we pursue the idea of using an
> end-of-recovery record in all cases, we don't need to start creating
> tombstones during crash recovery. The forced checkpoint at the end of
> crash recovery means we don't currently need to do that, but if we
> change that, then the same hazard would exist there as we already have
> in normal running, and this fixes it. However, I don't find it
> entirely obvious that there are no hazards of any kind stemming from
> repeated use of ALTER DATABASE .. SET TABLESPACE resulting in
> filenames getting reused. On the other hand avoiding filename reuse
> completely would be more work, not closely related to what the rest of
> the patch set does, probably somewhat controversial in terms of what
> it would have to do, and I'm not sure that we really need it. It does
> seem like it would be quite a bit easier to reason about, though,
> because the current guarantee is suspiciously similar to "we don't do
> X, except when we do." This is not really so much a review comment for
> Dilip as a request for input from others ... thoughts?

Yeah that can be done, but maybe as a separate patch.  One option is
that when we will support the WAL method for the ALTER TABLE .. SET
TABLESPACE like we did for CREATE DATABASE, as part of that we will
generate the new relfilenumber.

> - Again, not a review comment for this patch specifically, but I'm
> wondering if we could use this as infrastructure for a tool to clean
> orphaned files out of the data directory. Suppose we create a file for
> a new relation and then crash, leaving a potentially large file on
> disk that will never be removed. Well, if the relfilenumber as it
> exists on disk is not in pg_class and old enough that a transaction
> inserting into pg_class can't still be running, then it must be safe
> to remove that file. Maybe that's safe even today, but it's a little
> hard to reason about it in the face of a possible OID wraparound that
> might result in reusing the same numbers over again. It feels like
> this makes easier to identify which files are old stuff that can never
> again be touched.

Correct.

> - I might be missing something here, but this isn't actually making
> the relfilenode 56 bits, is it? The reason to do that is to make the
> BufferTag smaller, so I expected to see that BufferTag either used
> bitfields like RelFileNumber relNumber:56 and ForkNumber forkNum:8, or
> else that it just declared a single field for both as uint64 and used
> accessor macros or static inlines to separate them out. But it doesn't
> seem to do either of those things, which seems like it can't be right.
> On a related note, I think it would be better to declare RelFileNumber
> as an unsigned type even though we have no use for the high bit; we
> have, equally, no use for negative values. It's easier to reason about
> bit-shifting operations with unsigned types.

Opps, somehow missed to merge that change in the patch.  Changed that
like below and adjusted the macros.
typedef struct buftag
{
Oid spcOid; /* tablespace oid. */
Oid dbOid; /* database oid. */
uint32 relNumber_low; /* relfilenumber 32 lower bits */
uint32 relNumber_hi:24; /* relfilenumber 24 high bits */
uint32 forkNum:8; /* fork number */
BlockNumber blockNum; /* blknum relative to begin of reln */
} BufferTag;

I think we need to break like this to keep the BufferTag 4 byte
aligned otherwise the size of the structure will be increased.

> - I also think that the cross-version compatibility stuff in
> pg_buffercache isn't quite right. It does values[1] =
> ObjectIdGetDatum(fctx->record[i].relfilenumber). But I think what it
> ought to do is dependent on the output type. If the output type is
> int8, then it ought to do values[1] = Int64GetDatum((int64)
> fctx->record[i].relfilenumber), and if it's OID, then it ought to do
> values[1] = ObjectIdGetDatum((Oid) fctx->record[i].relfilenumber)).
> The  macro that you use needs to be based on the output SQL type, not
> the C data type.

Fixed

> - I think it might be a good idea to allocate RelFileNumbers in much
> smaller batches than we do OIDs. 8192 feels wasteful to me. It
> shouldn't practically matter, because if we have 56 bits of bit space
> and so even if we repeatedly allocate 2^13 RelFileNumbers and then
> crash, we can still crash 2^41 times before we completely run out of
> numbers, and 2 trillion crashes ought to be enough for anyone. But I
> see little benefit from being so profligate. You can allocate an OID
> as an identifier for a catalog tuple or a TOAST chunk, but a
> RelFileNumber requires a filesystem operation, so the amount of work
> that is needed to use up 8192 RelFileNumbers is a lot bigger than the
> amount of work required to use up 8192 OIDs. If we dropped this down
> to 128, or 64, or 256, would anything bad happen?

This makes sense so I have changed to 64.

> - Do we really want GetNewRelFileNumber() to call access() just for a
> can't-happen scenario? Can't we catch this problem later when we
> actually go to create the files on disk?

Yeah we don't need to, actually we can completely get rid of
GetNewRelFileNumber() function and we can directly call
GenerateNewRelFileNumber() and in fact we can rename
GenerateNewRelFileNumber() to GetNewRelFileNumber().  So I have done
these changes.

> - The patch updates the comments in XLogPrefetcherNextBlock to talk
> about relfilenumbers being reused rather than relfilenodes being
> reused, which is fine except that we're sorta kinda not doing that any
> more as noted above. I don't really know what these comments ought to
> say instead but perhaps more than a mechanical update is in order.

Changed

> This applies, even more, to the comments above mdunlink(). Apart from
> updating the existing comments, I think that the patch needs a good
> explanation of the new scheme someplace, and what it does and doesn't
> guarantee, which relates to the point above about making sure we know
> exactly what we're guaranteeing and why. I don't know where exactly
> this text should be positioned yet, or what it should say, but it
> needs to go someplace. This is a fairly significant change and needs
> to be talked about somewhere.

For now, in v4_0004**, I have removed the comment which is explaining
why we need to keep the Tombstone file and added some note that why we
do not need to keep those files from PG16 onwards.

> - I think there's still a bit of a terminology problem here. With the
> patch set, we use RelFileNumber to refer to a single, 56-bit integer
> and RelFileLocator to refer to that integer combined with the DB and
> TS OIDs. But sometimes in the comments we want to talk about the
> logical sequence of files that is identified by a RelFileLocator, and
> that's not quite the same as either of those things. For example, in
> tableam.h we currently say "This callback needs to create a new
> relation filenode for `rel`" and how should that be changed in this
> new naming? We're not creating a new RelFileNumber - those would need
> to be allocated, not created, as all the numbers in the universe exist
> already. Neither are we creating a new locator; that sounds like it
> means assembling it from pieces. What we're doing is creating the
> first of what may end up being a series of similarly-named files on
> disk. I'm not exactly sure how we can refer to that in a way that is
> clear, but it's a problem that arises here and here throughout the
> patch.

I think the comment can say
"This callback needs to create a new relnumber file for 'rel' " ?

I have not modified this yet, I will check other places where we have
such terminology issues.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Sun, Jul 3, 2022 at 8:02 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Jul 2, 2022 at 3:29 PM Andres Freund <andres@anarazel.de> wrote:
> > Why did you choose a quite small value for VAR_RFN_PREFETCH? VAR_OID_PREFETCH
> > is 8192, but you chose 64 for VAR_RFN_PREFETCH?
>
> As Dilip mentioned, I suggested a lower value. If that's too low, we
> can go higher, but I think there is value in not making this
> excessively large. Somebody somewhere is going to have a database
> that's crash-restarting like mad, and I don't want that person to run
> through an insane number of relfilenodes for no reason. I don't think
> there are going to be a lot of people creating thousands upon
> thousands of relations in a short period of time, and I'm not sure
> that it's a big deal if those who do end up having to wait for a few
> extra xlog flushes.

Here is the updated version of the patch.

Patch 0001-0003 are the same with review comments fixes given by
Andres, 0004 as an extra assert patch suggested by Andres, this can be
merged with 0003.  Basically, during recovery we add asserts checking
"relfilenumbers aren't above ->nextRelFileNumber," and also the assert
for checking that after we allocate a new relfile number the file
should not already exist on the disk so that once we are sure that
this assertion is not hitting then maybe we are safe for removing the
TombStone files immediately what we were doing in 0005.

In 0005 I fixed the file delete order so now we are deleting in
descending order, for that first we need to count the number of
segments by doing stat() on each file and after that we need to go
ahead and unlink in the descending order.

The VAR_RELFILENUMBER_PREFETCH is still 64 as we have not yet
concluded on that, and as discussed I will test some performance to
see whether we have some obvious impact with different values of this.
Maybe I will start with some very small numbers so that we have some
impact.

I thought about this comment from Robert
> that's not quite the same as either of those things. For example, in
> tableam.h we currently say "This callback needs to create a new
> relation filenode for `rel`" and how should that be changed in this
> new naming? We're not creating a new RelFileNumber - those would need
> to be allocated, not created, as all the numbers in the universe exist
> already. Neither are we creating a new locator; that sounds like it
> means assembling it from pieces.

I think that "This callback needs to create a new relation storage
for `rel`" looks better.

I have again reviewed 0001 and 0003 and found some discrepancies in
usage of relfilenumber vs relfilelocator and fixed those, also some
places InvalidOid were use instead of InvalidRelFileNumber.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Wed, Jul 6, 2022 at 2:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jul 5, 2022 at 4:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I thought about this comment from Robert
> > > that's not quite the same as either of those things. For example, in
> > > tableam.h we currently say "This callback needs to create a new
> > > relation filenode for `rel`" and how should that be changed in this
> > > new naming? We're not creating a new RelFileNumber - those would need
> > > to be allocated, not created, as all the numbers in the universe exist
> > > already. Neither are we creating a new locator; that sounds like it
> > > means assembling it from pieces.
> >
> > I think that "This callback needs to create a new relation storage
> > for `rel`" looks better.
>
> I like the idea, but it would sound better to say "create new relation
> storage" rather than "create a new relation storage."

Okay, changed that and changed a few more occurrences in 0001 which
were on similar lines.  I also tested the performance of pg_bench
where concurrently I am running the script which creates/drops
relation but I do not see any regression with fairly small values of
VAR_RELNUMBER_PREFETCH, the smallest value I tried was 8.  That
doesn't mean I am suggesting this small value but I think we can keep
the value something like 512 or 1024 without worrying much about the
performance, so changed to 512 in the latest patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Thu, Jul 7, 2022 at 2:54 AM Robert Haas <robertmhaas@gmail.com> wrote:

Thanks for committing the 0001.

> On Wed, Jul 6, 2022 at 11:57 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I think 0002 and 0003 need more work yet; I'll try to write a review
> > of those soon.
>
> Regarding 0002:
>
> I don't particularly like the names BufTagCopyRelFileLocator and
> BufTagRelFileLocatorEquals. My suggestion is to rename
> BufTagRelFileLocatorEquals to BufTagMatchesRelFileLocator, because it
> doesn't really make sense to me to talk about equality between values
> of different data types. Instead of BufTagCopyRelFileLocator I would
> prefer BufTagGetRelFileLocator. That would make it more similar to
> BufTagGetFileNumber and BufTagSetFileNumber, which I think would be a
> good thing.
>
> Other than that I think 0002 seems fine.

Changed as suggested.  Although I feel BufTagCopyRelFileLocator is
actually copying the relfilelocator from buffer tag to an input
variable, I am fine with BufTagGetRelFileLocator so that it is similar
to the other names.

Changed some other macro names as below because field name they are
getting/setting is relNumber
BufTagSetFileNumber -> BufTagSetRelNumber
BufTagGetFileNumber -> BufTagGetRelNumber

> Regarding 0003:

> I'm worried that this might not be correct. The comment changes here
> (and I think also in some other plces) imply that we've eliminated
> relfilenode ruse, but I think that's not true. createdb() and movedb()
> don't seem to be modified, so I think it's possible to just copy a
> template database over without change, which means that relfilenumbers
> and even relfilelocators could be reused. So I feel like maybe this
> and similar places shouldn't be modified in this way. Am I
> misunderstanding?

I think you are right, so I changed it.

>         /*
> -        * Relfilenumbers are not unique in databases across
> tablespaces, so we need
> -        * to allocate a new one in the new tablespace.
> +        * Generate a new relfilenumber. Although relfilenumber are
> unique within a
> +        * cluster, we are unable to use the old relfilenumber since unused
> +        * relfilenumber are not unlinked until commit.  So if within a
> +        * transaction, if we set the old tablespace again, we will
> get conflicting
> +        * relfilenumber file.
>          */
> -       newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL,
> -
>             rel->rd_rel->relpersistence);
> +       newrelfilenumber = GetNewRelFileNumber();
>
> I can't clearly understand this comment. Is it saying that the code
> which follows is broken and needs to be fixed by a future patch before
> things are OK again? If so, that's not good.

No it is not broken in this patch.  Basically, before our patch the
reason for allocating the new relfilenumber was that if we create the
file with oldrelfilenumber in new tablespace then it is possible that
in the new tablespace file with same name exist because relfilenumber
was unique in databases across tablespaces so there could be conflict.
But now that is not the case but still we can not reuse the old
relfilenumber because from the old tablespace the old relfilenumber
file is not removed until the next checkpoint so if we move the table
back to the old tablespace again then there could be conflict.  And
even after we get the final patch of removing the tombstone file on
commit then also we can not reuse the old relfilenumber because within
a transaction we can switch between the tablespaces multiple times and
the relfilenumber file from the old tablespace will be removed only on
commit.  This is what I am trying to explain in the comment.

Now I have modified the comment slightly, such that in 0002 I am
saying files are not removed until the next checkpoint and in 0004 I
am modifying that and saying not removed until commit.

> - * callers should be GetNewOidWithIndex() and GetNewRelFileNumber() in
> - * catalog/catalog.c.
> + * callers should be GetNewOidWithIndex() in catalog/catalog.c.
>
> If there is only one, it should say "caller", not "callers".
>
>  Orphan files are harmless --- at worst they waste a bit of disk space ---
> -because we check for on-disk collisions when allocating new relfilenumber
> -OIDs.  So cleaning up isn't really necessary.
> +because relfilenumber is 56 bit wide so logically there should not be any
> +collisions.  So cleaning up isn't really necessary.
>
> I don't agree that orphaned files are harmless, but changing that is
> beyond the scope of this patch. I think that the way you've ended the
> sentence isn't sufficiently clear and correct even if we accept the
> principle that orphaned files are harmless. What I think we should
> stay instead is "because the relfilenode counter is monotonically
> increasing. The maximum value is 2^56-1, and there is no provision for
> wraparound."

Done

> +       /*
> +        * Check if we set the new relfilenumber then do we run out of
> the logged
> +        * relnumber, if so then we need to WAL log again.  Otherwise,
> just adjust
> +        * the relnumbercount.
> +        */
> +       relnumbercount = relnumber - ShmemVariableCache->nextRelFileNumber;
> +       if (ShmemVariableCache->relnumbercount <= relnumbercount)
> +       {
> +               LogNextRelFileNumber(relnumber + VAR_RELNUMBER_PREFETCH);
> +               ShmemVariableCache->relnumbercount = VAR_RELNUMBER_PREFETCH;
> +       }
> +       else
> +               ShmemVariableCache->relnumbercount -= relnumbercount;
>
> Would it be clearer, here and elsewhere, if VariableCacheData tracked
> nextRelFileNumber and nextUnloggedRelFileNumber instead of
> nextRelFileNumber and relnumbercount? I'm not 100% sure, but the idea
> seems worth considering.

I think it is in line with oidCount, what do you think?

>
> +        * Flush xlog record to disk before returning.  To protect against file
> +        * system changes reaching the disk before the
> XLOG_NEXT_RELFILENUMBER log.
>
> The way this is worded, you would need it to be just one sentence,
> like "Flush xlog record to disk before returning to protect
> against...". Or else add "this is," like "This is to protect
> against..."
>
> But I'm thinking maybe we could reword it a little more, perhaps
> something like this: "Flush xlog record to disk before returning. We
> want to be sure that the in-memory nextRelFileNumber value is always
> larger than any relfilenumber that is already in use on disk. To
> maintain that invariant, we must make sure that the record we just
> logged reaches the disk before any new files are created."

Done

> This isn't a full review, I think, but I'm kind of out of time and
> energy for today.

I have updated some other comments as well.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

07 июля 2022 г., 20:26:29

Trying to compile with 0001 and 0002 applied and -Wall -Werror in use, I get:

buf_init.c:119:4: error: implicit truncation from 'int' to bit-field
changes value from -1 to 255 [-Werror,-Wbitfield-constant-conversion]
                        CLEAR_BUFFERTAG(buf->tag);
                        ^~~~~~~~~~~~~~~~~~~~~~~~~
../../../../src/include/storage/buf_internals.h:122:14: note: expanded
from macro 'CLEAR_BUFFERTAG'
        (a).forkNum = InvalidForkNumber, \
                    ^ ~~~~~~~~~~~~~~~~~
1 error generated.

More review comments:

In pg_buffercache_pages_internal(), I suggest that we add an error
check. If fctx->record[i].relfilenumber is greater than the largest
value that can be represented as an OID, then let's do something like:

ERROR: relfilenode is too large to be represented as an OID
HINT: Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE

That way, instead of confusing people by giving them an incorrect
answer, we'll push them toward a step that they may have overlooked.

In src/backend/access/transam/README, I think the sentence "So
cleaning up isn't really necessary." isn't too helpful. I suggest
replacing it with "Thus, on-disk collisions aren't possible."

> I think it is in line with oidCount, what do you think?

Oh it definitely is, and maybe it's OK the way you have it. But the
OID stuff has wraparound to worry about, and this doesn't; and this
has the SetNextRelFileNumber and that doesn't; so it is not
necessarily the case that the design which is best for that case is
also best for this case.

I believe that the persistence model for SetNextRelFileNumber needs
more thought. Right now I believe it's relying on the fact that, after
we try to restore the dump, we'll try to perform a clean shutdown of
the server before doing anything important, and that will persist the
final value, whatever it ends up being. However, there's no comment
explaining that theory of operation, and it seems pretty fragile
anyway. What if things don't go as planned? Suppose the power goes out
halfway through restoring the dump, and the user for some reason then
gives up on running pg_upgrade and just tries to do random things with
that server? Then I think there will be trouble, because nothing has
updated the nextrelfilenumber value and yet there are potentially new
files on disk. Maybe that's a stretch since I think other things might
also break if you do that, but I'm also not sure that's the only
scenario to worry about, especially if you factor in the possibility
of future code changes, like changes to the timing of when we shut
down and restart the server during pg_upgrade, or other uses of
binary-upgrade mode, or whatever. I don't know. Perhaps it's not
actually broken but I'm inclined to think it should be logging its
changes.

A related thought is that I don't think this patch has as many
cross-checks as it could have. For instance, suppose that when we
replay a WAL record that creates relation storage, we cross-check that
the value is less than the counter. I think you have a check in there
someplace that will error out if there is an actual collision --
although I can't find it at the moment, and possibly we want to add
some comments there even if it's in existing code -- but this kind of
thing would detect bugs that could lead to collisions even if no
collision actually occurs, e.g. because a duplicate relfilenumber is
used but in a different database or tablespace. It might be worth
spending some time thinking about other possible cross-checks too.
We're trying to create a system where the relfilenumber counter is
always ahead of all the relfilenumbers used on disk, but the coupling
between the relfilenumber-advancement machinery and the
make-files-on-disk machinery is pretty loose, and so there is a risk
that bugs could escape detection. Whatever we can do to increase the
probability of noticing when things have gone wrong, and/or to notice
it quicker, will be good.

+       if (!IsBinaryUpgrade)
+               elog(ERROR, "the RelFileNumber can be set only during
binary upgrade");

I think you should remove the word "the". Primary error messages are
written telegram-style and "the" is usually omitted, especially at the
beginning of the message.

+        * This should not impact the performance, since we are not WAL logging
+        * it for every allocation, but only after allocating 512 RelFileNumber.

I think this claim is overly bold, and it would be better if the
current value of the constant weren't encoded in the comment. I'm not
sure we really need this part of the comment at all, but if we do,
maybe it should be reworded to something like: This is potentially a
somewhat expensive operation, but fortunately we only need to do it
for every VAR_RELNUMBER_PREFETCH new relfilenodes. Or maybe it's
better to put this explanation in GetNewRelFileNumber instead, e.g.
"If we run out of logged RelFileNumbers, then we must log more, and
also wait for the xlog record to be flushed to disk. This is somewhat
expensive, but hopefully VAR_RELNUMBER_PREFETCH is large enough that
this doesn't slow things down too much."

One thing that isn't great about this whole scheme is that it can lead
to lock pile-ups. Once somebody is waiting for an
XLOG_NEXT_RELFILENUMBER record to reach the disk, any other backend
that tries to get a new relfilenumber is going to block waiting for
RelFileNumberGenLock. I wonder whether this effect is observable in
practice: suppose we just create relations in a tight loop from inside
a stored procedure, and do that simultaneously in multiple backends?
What does the wait event distribution look like? Can we observe a lot
of RelFileNumberGenLock events or not really? I guess if we reduce
VAR_RELNUMBER_PREFETCH enough we can probably create a problem, but
how small a value is needed?

One thing we could think about doing here is try to stagger the xlog
and the flush. When we've used VAR_RELNUMBER_PREFETCH/2
relfilenumbers, log a record reserving VAR_RELNUMBER_PREFETCH from
where we are now, and remember the LSN. When we've used up our entire
previous allocation, XLogFlush() that record before allowing the
additional values to be used. The bookkeeping would be a bit more
complicated than currently, but I don't think it would be too bad. I'm
not sure how much it would actually help, though, or whether we need
it. If new relfilenumbers are being used up really quickly, then maybe
the record won't get flushed into the background before we run out of
available numbers anyway, and if they aren't, then maybe it doesn't
matter. On the other hand, even one transaction commit between when
the record is logged and when we run out of the previous allocation is
enough to force a flush, at least with synchronous_commit=on, so maybe
the chances of being able to piggyback on an existing flush are not so
bad after all. I'm not sure.

+        * Generate a new relfilenumber.  We can not reuse the old relfilenumber
+        * because the unused relfilenumber files are not unlinked
until the next
+        * checkpoint.  So if move the relation to the old tablespace again, we
+        * will get the conflicting relfilenumber file.

This is much clearer now but the grammar has some issues, e.g. "the
unused relfilenumber" should be just "unused relfilenumber" and "So if
move" is not right either. I suggest: We cannot reuse the old
relfilenumber because of the possibility that that relation will be
moved back to the original tablespace before the next checkpoint. At
that point, the first segment of the main fork won't have been
unlinked yet, and an attempt to create new relation storage with that
same relfilenumber will fail."

In theory I suppose there's another way we could solve this problem:
keep using the same relfilenumber, and if the scenario described here
occurs, just reuse the old file. The reason why we can't do that today
is because we could be running with wal_level=minimal and replace a
relation with one whose contents aren't logged. If WAL replay then
replays the drop, we're in trouble. But if the only time we reuse a
relfilenumber for new relation storage is when relations are moved
around, then I think that scenario can't happen. However, I think
assigning a new relfilenumber is probably better, because it gets us
closer to a world in which relfilenumbers are never reused at all. It
doesn't get us all the way there because of createdb() and movedb(),
but it gets us closer and I prefer that.

+ * XXX although this all was true when the relfilenumbers were 32 bits wide but
+ * now the relfilenumbers are 56 bits wide so we don't have risk of
+ * relfilenumber being reused so in future we can immediately unlink the first
+ * segment as well.  Although we can reuse the relfilenumber during createdb()
+ * using file copy method or during movedb() but the above scenario is only
+ * applicable when we create a new relation.

Here is an edited version:

XXX. Although all of this was true when relfilenumbers were 32 bits wide, they
are now 56 bits wide and do not wrap around, so in the future we can change
the code to immediately unlink the first segment of the relation along
with all the
others. We still do reuse relfilenumbers when createdb() is performed using the
file-copy method or during movedb(), but the scenario described above can only
happen when creating a new relation.

I think that pg_filenode_relation,
binary_upgrade_set_next_heap_relfilenode, and other functions that are
now going to be accepting a RelFileNode using the SQL int8 datatype
should bounds-check the argument. It could be <0 or >2^56, and I
believe it'd be best to throw an error for that straight off. The
three functions in pg_upgrade_support.c could share a static
subroutine for this, to avoid duplicating code.

This bounds-checking issue also applies to the -f argument to pg_checksums.

I notice that the patch makes no changes to relmapper.c, and I think
that's a problem. Notice in particular:

#define MAX_MAPPINGS            62  /* 62 * 8 + 16 = 512 */

I believe that making RelFileNumber into a 64-bit value will cause the
8 in the calculation above to change to 16, defeating the intention
that the size of the file ought to be the smallest imaginable size of
a disk sector. It does seem like it would have been smart to include a
StaticAssertStmt in this file someplace that checks that the data
structure has the expected size, and now might be a good time, perhaps
in a separate patch, to add one. If we do nothing fancy here, the
maximum number of mappings will have to be reduced from 62 to 31,
which is a problem because global/pg_filenode.map currently has 48
entries. We could try to arrange to squeeze padding out of the
RelMapping struct, which would let us use just 12 bytes per mapping,
which would increase the limit to 41, but that's still less than we're
using already, never mind leaving room for future growth.

I don't know what to do about this exactly. I believe it's been
previously suggested that the actual minimum sector size on reasonably
modern hardware is never as small as 512 bytes, so maybe the file size
can just be increased to 1kB or something. If that idea is judged
unsafe, I can think of two other possible approaches offhand. One is
that we could move away from the idea of storing the OIDs in the file
along with the RelFileNodes, and instead store the offset for a given
RelFileNode at a fixed offset in the file. That would require either
hard-wiring offset tables into the code someplace, or generating them
as part of the build process, with separate tables for shared and
database-local relation map files. The other is that we could have
multiple 512-byte sectors and try to arrange for each relation to be
in the same sector with the indexes of that relation, since the
comments in relmapper.c say this:

 * aborts.  An important factor here is that the indexes and toast table of
 * a mapped catalog must also be mapped, so that the rewrites/relocations of
 * all these files commit in a single map file update rather than being tied
 * to transaction commit.

This suggests that atomicity is required across a table and its
indexes, but that it's needed across arbitrary sets of entries in the
file.

Whatever we do, we shouldn't forget to bump RELMAPPER_FILEMAGIC.

--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -34,6 +34,13 @@ CATALOG(pg_class,1259,RelationRelationId)
BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
        /* oid */
        Oid                     oid;

+       /* access method; 0 if not a table / index */
+       Oid                     relam BKI_DEFAULT(heap) BKI_LOOKUP_OPT(pg_am);
+
+       /* identifier of physical storage file */
+       /* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
+       int64           relfilenode BKI_DEFAULT(0);
+
        /* class name */

        NameData        relname;

@@ -49,13 +56,6 @@ CATALOG(pg_class,1259,RelationRelationId)
BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
        /* class owner */
        Oid                     relowner BKI_DEFAULT(POSTGRES)
BKI_LOOKUP(pg_authid);

-       /* access method; 0 if not a table / index */
-       Oid                     relam BKI_DEFAULT(heap) BKI_LOOKUP_OPT(pg_am);
-
-       /* identifier of physical storage file */
-       /* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
-       Oid                     relfilenode BKI_DEFAULT(0);
-
        /* identifier of table space for relation (0 means default for
database) */
        Oid                     reltablespace BKI_DEFAULT(0)
BKI_LOOKUP_OPT(pg_tablespace);

As Andres said elsewhere, this stinks. Not sure what the resolution of
the discussion over on the "AIX support" thread is going to be yet,
but hopefully not this.

+       uint32          relNumber_low;  /* relfilenumber 32 lower bits */
+       uint32          relNumber_hi:24;        /* relfilenumber 24 high bits */
+       uint32          forkNum:8;              /* fork number */

I still think we'd be better off with something like uint32
relForkDetails[2]. The bitfields would be nice if they meant that we
didn't have to do bit-shifting and masking operations ourselves, but
with the field split this way, we do anyway. So what's the point in
mixing the approaches?

  * relNumber identifies the specific relation.  relNumber corresponds to
  * pg_class.relfilenode (NOT pg_class.oid, because we need to be able
  * to assign new physical files to relations in some situations).
- * Notice that relNumber is only unique within a database in a particular
- * tablespace.
+ * Notice that relNumber is unique within a cluster.

I think this paragraph would benefit from more revision. I think that
we should just nuke the parenthesized part altogether, since we'll now
never use pg_class.oid as relNumber, and to suggest otherwise is just
confusing. As for the last sentence, "Notice that relNumber is unique
within a cluster." isn't wrong, but I think we could be more precise
and informative. Perhaps: "relNumber values are assigned by
GetNewRelFileNumber(), which will only ever assign the same value once
during the lifetime of a cluster. However, since CREATE DATABASE
duplicates the relfilenumbers of the template database, the values are
in practice only unique within a database, not globally."

That's all I've got for now.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

11 июля 2022 г., 14:39:11

On Thu, Jul 7, 2022 at 10:56 PM Robert Haas <robertmhaas@gmail.com> wrote:

I have accepted all the suggestion, find my inline replies where we
need more thoughts.

> buf_init.c:119:4: error: implicit truncation from 'int' to bit-field
> changes value from -1 to 255 [-Werror,-Wbitfield-constant-conversion]
>                         CLEAR_BUFFERTAG(buf->tag);
>                         ^~~~~~~~~~~~~~~~~~~~~~~~~
> ../../../../src/include/storage/buf_internals.h:122:14: note: expanded
> from macro 'CLEAR_BUFFERTAG'
>         (a).forkNum = InvalidForkNumber, \
>                     ^ ~~~~~~~~~~~~~~~~~
> 1 error generated.

Hmm so now we are using an unsigned int field so IMHO we can make
InvalidForkNumber to 255 instead of -1?


> > I think it is in line with oidCount, what do you think?
>
> Oh it definitely is, and maybe it's OK the way you have it. But the
> OID stuff has wraparound to worry about, and this doesn't; and this
> has the SetNextRelFileNumber and that doesn't; so it is not
> necessarily the case that the design which is best for that case is
> also best for this case.

Yeah right, but now with the latest changes for piggybacking the
XlogFlush I think it is cleaner to have the count.

> I believe that the persistence model for SetNextRelFileNumber needs
> more thought. Right now I believe it's relying on the fact that, after
> we try to restore the dump, we'll try to perform a clean shutdown of
> the server before doing anything important, and that will persist the
> final value, whatever it ends up being. However, there's no comment
> explaining that theory of operation, and it seems pretty fragile
> anyway. What if things don't go as planned? Suppose the power goes out
> halfway through restoring the dump, and the user for some reason then
> gives up on running pg_upgrade and just tries to do random things with
> that server? Then I think there will be trouble, because nothing has
> updated the nextrelfilenumber value and yet there are potentially new
> files on disk. Maybe that's a stretch since I think other things might
> also break if you do that, but I'm also not sure that's the only
> scenario to worry about, especially if you factor in the possibility
> of future code changes, like changes to the timing of when we shut
> down and restart the server during pg_upgrade, or other uses of
> binary-upgrade mode, or whatever. I don't know. Perhaps it's not
> actually broken but I'm inclined to think it should be logging its
> changes.

But we are already logging this if we are setting the relfilenumber
which is out of the already logged range, am I missing something?
Check this change.
+    relnumbercount = relnumber - ShmemVariableCache->nextRelFileNumber;
+    if (ShmemVariableCache->relnumbercount <= relnumbercount)
+    {
+        LogNextRelFileNumber(relnumber + VAR_RELNUMBER_PREFETCH, NULL);
+        ShmemVariableCache->relnumbercount = VAR_RELNUMBER_PREFETCH;
+    }
+    else
+        ShmemVariableCache->relnumbercount -= relnumbercount;

> A related thought is that I don't think this patch has as many
> cross-checks as it could have. For instance, suppose that when we
> replay a WAL record that creates relation storage, we cross-check that
> the value is less than the counter. I think you have a check in there
> someplace that will error out if there is an actual collision --
> although I can't find it at the moment, and possibly we want to add
> some comments there even if it's in existing code -- but this kind of
> thing would detect bugs that could lead to collisions even if no
> collision actually occurs, e.g. because a duplicate relfilenumber is
> used but in a different database or tablespace. It might be worth
> spending some time thinking about other possible cross-checks too.
> We're trying to create a system where the relfilenumber counter is
> always ahead of all the relfilenumbers used on disk, but the coupling
> between the relfilenumber-advancement machinery and the
> make-files-on-disk machinery is pretty loose, and so there is a risk
> that bugs could escape detection. Whatever we can do to increase the
> probability of noticing when things have gone wrong, and/or to notice
> it quicker, will be good.

I had those changes in v7-0003, now I have merged with 0002.  This has
assert check while replaying the WAL for smgr create and smgr
truncate, and while during normal path when allocating the new
relfilenumber we are asserting for any existing file.

> One thing that isn't great about this whole scheme is that it can lead
> to lock pile-ups. Once somebody is waiting for an
> XLOG_NEXT_RELFILENUMBER record to reach the disk, any other backend
> that tries to get a new relfilenumber is going to block waiting for
> RelFileNumberGenLock. I wonder whether this effect is observable in
> practice: suppose we just create relations in a tight loop from inside
> a stored procedure, and do that simultaneously in multiple backends?
> What does the wait event distribution look like? Can we observe a lot
> of RelFileNumberGenLock events or not really? I guess if we reduce
> VAR_RELNUMBER_PREFETCH enough we can probably create a problem, but
> how small a value is needed?

I have done some performance tests, with very small values I can see a
lot of wait events for RelFileNumberGen but with bigger numbers like
256 or 512 it is not really bad.  See results at the end of the
mail[1]

> One thing we could think about doing here is try to stagger the xlog
> and the flush. When we've used VAR_RELNUMBER_PREFETCH/2
> relfilenumbers, log a record reserving VAR_RELNUMBER_PREFETCH from
> where we are now, and remember the LSN. When we've used up our entire
> previous allocation, XLogFlush() that record before allowing the
> additional values to be used. The bookkeeping would be a bit more
> complicated than currently, but I don't think it would be too bad. I'm
> not sure how much it would actually help, though, or whether we need
> it. If new relfilenumbers are being used up really quickly, then maybe
> the record won't get flushed into the background before we run out of
> available numbers anyway, and if they aren't, then maybe it doesn't
> matter. On the other hand, even one transaction commit between when
> the record is logged and when we run out of the previous allocation is
> enough to force a flush, at least with synchronous_commit=on, so maybe
> the chances of being able to piggyback on an existing flush are not so
> bad after all. I'm not sure.

I have done these changes during GetNewRelFileNumber() this required
to track the last logged record pointer as well but I think this looks
clean.  With this I can see some reduction in RelFileNumberGen wait
event[1]

> In theory I suppose there's another way we could solve this problem:
> keep using the same relfilenumber, and if the scenario described here
> occurs, just reuse the old file. The reason why we can't do that today
> is because we could be running with wal_level=minimal and replace a
> relation with one whose contents aren't logged. If WAL replay then
> replays the drop, we're in trouble. But if the only time we reuse a
> relfilenumber for new relation storage is when relations are moved
> around, then I think that scenario can't happen. However, I think
> assigning a new relfilenumber is probably better, because it gets us
> closer to a world in which relfilenumbers are never reused at all. It
> doesn't get us all the way there because of createdb() and movedb(),
> but it gets us closer and I prefer that.

I agree with you.

> I notice that the patch makes no changes to relmapper.c, and I think
> that's a problem. Notice in particular:
>
> #define MAX_MAPPINGS            62  /* 62 * 8 + 16 = 512 */
>
> I believe that making RelFileNumber into a 64-bit value will cause the
> 8 in the calculation above to change to 16, defeating the intention
> that the size of the file ought to be the smallest imaginable size of
> a disk sector. It does seem like it would have been smart to include a
> StaticAssertStmt in this file someplace that checks that the data
> structure has the expected size, and now might be a good time, perhaps
> in a separate patch, to add one. If we do nothing fancy here, the
> maximum number of mappings will have to be reduced from 62 to 31,
> which is a problem because global/pg_filenode.map currently has 48
> entries. We could try to arrange to squeeze padding out of the
> RelMapping struct, which would let us use just 12 bytes per mapping,
> which would increase the limit to 41, but that's still less than we're
> using already, never mind leaving room for future growth.
>
> I don't know what to do about this exactly. I believe it's been
> previously suggested that the actual minimum sector size on reasonably
> modern hardware is never as small as 512 bytes, so maybe the file size
> can just be increased to 1kB or something. If that idea is judged
> unsafe, I can think of two other possible approaches offhand. One is
> that we could move away from the idea of storing the OIDs in the file
> along with the RelFileNodes, and instead store the offset for a given
> RelFileNode at a fixed offset in the file. That would require either
> hard-wiring offset tables into the code someplace, or generating them
> as part of the build process, with separate tables for shared and
> database-local relation map files. The other is that we could have
> multiple 512-byte sectors and try to arrange for each relation to be
> in the same sector with the indexes of that relation, since the
> comments in relmapper.c say this:
>
>  * aborts.  An important factor here is that the indexes and toast table of
>  * a mapped catalog must also be mapped, so that the rewrites/relocations of
>  * all these files commit in a single map file update rather than being tied
>  * to transaction commit.
>
> This suggests that atomicity is required across a table and its
> indexes, but that it's needed across arbitrary sets of entries in the
> file.
>
> Whatever we do, we shouldn't forget to bump RELMAPPER_FILEMAGIC.

I am not sure what is the best solution here, but I agree that most of
the modern hardware will have bigger sector size than 512 so we can
just change file size of 1024.

The current value of RELMAPPER_FILEMAGIC is 0x592717, I am not sure
how this version ID is decide is this some random magic number or
based on some logic?

>
> +       uint32          relNumber_low;  /* relfilenumber 32 lower bits */
> +       uint32          relNumber_hi:24;        /* relfilenumber 24 high bits */
> +       uint32          forkNum:8;              /* fork number */
>
> I still think we'd be better off with something like uint32
> relForkDetails[2]. The bitfields would be nice if they meant that we
> didn't have to do bit-shifting and masking operations ourselves, but
> with the field split this way, we do anyway. So what's the point in
> mixing the approaches?

Actually with this we were able to access the forkNum directly, but I
also think changing as relForkDetails[2] is cleaner so done that.  And
as part of the related changes in 0001 I have removed the direct
access to the forkNum.

[1] Wait event details

Procedure:
CREATE OR REPLACE FUNCTION create_table(count int) RETURNS void AS $$
DECLARE
  relname varchar;
  pid int;
  i   int;
BEGIN
  SELECT pg_backend_pid() INTO pid;
  relname := 'test_' || pid;
  FOR i IN 1..count LOOP
    EXECUTE format('CREATE TABLE %s(a int)', relname);

    EXECUTE format('DROP TABLE %s', relname);
  END LOOP;
END;

Target test: Executed "select create_table(100);" query from pgbench
with 32 concurrent backends.

VAR_RELNUMBER_PREFETCH = 8

    905  LWLock          | LockManager
    346  LWLock          | RelFileNumberGen
    192
    190  Activity        | WalWriterMain

VAR_RELNUMBER_PREFETCH=128
   1187  LWLock          | LockManager
    247  LWLock          | RelFileNumberGen
    139  Activity        | CheckpointerMain

VAR_RELNUMBER_PREFETCH=256

   1029  LWLock          | LockManager
    158  LWLock          | BufferContent
    134  Activity        | CheckpointerMain
    134  Activity        | AutoVacuumMain
    133  Activity        | BgWriterMain
    132  Activity        | WalWriterMain
    130  Activity        | LogicalLauncherMain
    123  LWLock          | RelFileNumberGen

VAR_RELNUMBER_PREFETCH=512

  1174  LWLock          | LockManager
    136  Activity        | CheckpointerMain
    136  Activity        | BgWriterMain
    136  Activity        | AutoVacuumMain
    134  Activity        | WalWriterMain
    134  Activity        | LogicalLauncherMain
     99  LWLock          | BufferContent
     35  LWLock          | RelFileNumberGen

VAR_RELNUMBER_PREFETCH=2048
   1070  LWLock          | LockManager
    160  LWLock          | BufferContent
    156  Activity        | CheckpointerMain
    156
    155  Activity        | BgWriterMain
    154  Activity        | AutoVacuumMain
    153  Activity        | WalWriterMain
    149  Activity        | LogicalLauncherMain
     31  LWLock          | RelFileNumberGen
     28  Timeout         | VacuumDelay


VAR_RELNUMBER_PREFETCH=4096
Note, no wait event for RelFileNumberGen at value 4096

New patch with piggybacking XLogFlush()

VAR_RELNUMBER_PREFETCH = 8

  1105  LWLock          | LockManager
    143  LWLock          | BufferContent
    140  Activity        | CheckpointerMain
    140  Activity        | BgWriterMain
    139  Activity        | WalWriterMain
    138  Activity        | AutoVacuumMain
    137  Activity        | LogicalLauncherMain
    115  LWLock          | RelFileNumberGen

VAR_RELNUMBER_PREFETCH = 256
   1130  LWLock          | LockManager
    141  Activity        | CheckpointerMain
    139  Activity        | BgWriterMain
    137  Activity        | AutoVacuumMain
    136  Activity        | LogicalLauncherMain
    135  Activity        | WalWriterMain
     69  LWLock          | BufferContent
     31  LWLock          | RelFileNumberGen

VAR_RELNUMBER_PREFETCH = 1024
Note: no wait event for RelFileNumberGen at value 1024


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Mon, Jul 11, 2022 at 3:34 PM Andres Freund <andres@anarazel.de> wrote:
> Seems pretty simple to do. Have write_relmapper_file() write to a .tmp file
> first (likely adding O_TRUNC to flags), use durable_rename() to rename it into
> place.  The tempfile should probably be written out before the XLogInsert(),
> the durable_rename() after, although I think it'd also be correct to more
> closely approximate the current sequence.

Something like this?

I chose not to use durable_rename() here, because that allowed me to
do more of the work before starting the critical section, and it's
probably slightly more efficient this way, too. That could be changed,
though, if you really want to stick with durable_rename().

I haven't done anything about actually making the file variable-length
here, either, which I think is what we would want to do. If this seems
more or less all right, I can work on that next.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Вложения

v1-0001-rough-draft-of-removing-relmap-size-restriction.patch

Re: making relfilenodes 56 bits

От

Andres Freund

Дата:

12 июля 2022 г., 02:21:57

On 2022-07-11 16:11:53 -0400, Robert Haas wrote:
> On Mon, Jul 11, 2022 at 3:34 PM Andres Freund <andres@anarazel.de> wrote:
> > Seems pretty simple to do. Have write_relmapper_file() write to a .tmp file
> > first (likely adding O_TRUNC to flags), use durable_rename() to rename it into
> > place.  The tempfile should probably be written out before the XLogInsert(),
> > the durable_rename() after, although I think it'd also be correct to more
> > closely approximate the current sequence.
> 
> Something like this?

Yea. I've not looked carefully, but on a quick skim it looks good.

> I chose not to use durable_rename() here, because that allowed me to
> do more of the work before starting the critical section, and it's
> probably slightly more efficient this way, too. That could be changed,
> though, if you really want to stick with durable_rename().

I guess I'm not enthused in duplicating the necessary knowledge in evermore
places. We've forgotten one of the magic incantations in the past, and needing
to find all the places that need to be patched is a bit bothersome.

Perhaps we could add extract helpers out of durable_rename()?

OTOH, I don't really see what we gain by keeping things out of the critical
section? It does seem good to have the temp-file creation/truncation and write
separately, but after that I don't think it's worth much to avoid a
PANIC. What legitimate issue does it avoid?

Greetings,

Andres Freund

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

12 июля 2022 г., 16:51:12

On Mon, Jul 11, 2022 at 7:22 PM Andres Freund <andres@anarazel.de> wrote:
> I guess I'm not enthused in duplicating the necessary knowledge in evermore
> places. We've forgotten one of the magic incantations in the past, and needing
> to find all the places that need to be patched is a bit bothersome.
>
> Perhaps we could add extract helpers out of durable_rename()?
>
> OTOH, I don't really see what we gain by keeping things out of the critical
> section? It does seem good to have the temp-file creation/truncation and write
> separately, but after that I don't think it's worth much to avoid a
> PANIC. What legitimate issue does it avoid?

OK, so then I think we should just use durable_rename(). Here's a
patch that does it that way. I briefly considered the idea of
extracting helpers, but it doesn't seem worthwhile to me. There's not
that much code in durable_rename() in the first place.

In this version, I also removed the struct padding, changed the limit
on the number of entries to a nice round 64, and made some comment
updates. I considered trying to go further and actually make the file
variable-size, so that we never again need to worry about the limit on
the number of entries, but I don't actually think that's a good idea.
It would require substantially more changes to the code in this file,
and that means there's more risk of introducing bugs, and I don't see
that there's much value anyway, because if we ever do hit the current
limit, we can just raise the limit.

If we were going to split up durable_rename(), the only intelligible
split I can see would be to have a second version of the function, or
a flag to the existing function, that caters to the situation where
the old file is already known to have been fsync()'d.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Вложения

v2-0001-Remove-the-restriction-that-the-relmap-must-be-51.patch

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

12 июля 2022 г., 18:26:08

On Mon, Jul 11, 2022 at 9:49 PM Robert Haas <robertmhaas@gmail.com> wrote:
>

> It also makes me wonder why we're using macros rather than static
> inline functions in buf_internals.h. I wonder whether we could do
> something like this, for example, and keep InvalidForkNumber as -1:
>
> static inline ForkNumber
> BufTagGetForkNum(BufferTag *tagPtr)
> {
>     int8 ret;
>
>     StaticAssertStmt(MAX_FORKNUM <= INT8_MAX);
>     ret = (int8) ((tagPtr->relForkDetails[0] >> BUFFERTAG_RELNUMBER_BITS);
>     return (ForkNumber) ret;
> }
>
> Even if we don't use that particular trick, I think we've generally
> been moving toward using static inline functions rather than macros,
> because it provides better type-safety and the code is often easier to
> read. Maybe we should also approach it that way here. Or even commit a
> preparatory patch replacing the existing macros with inline functions.
> Or maybe it's best to leave it alone, not sure.

I think it make sense to convert existing macros as well, I have
attached a patch for the same,
>
> > I had those changes in v7-0003, now I have merged with 0002.  This has
> > assert check while replaying the WAL for smgr create and smgr
> > truncate, and while during normal path when allocating the new
> > relfilenumber we are asserting for any existing file.
>
> I think a test-and-elog might be better. Most users won't be running
> assert-enabled builds, but this seems worth checking regardless.

IMHO the recovery time asserts we can convert to elog but one which we
are doing after each GetNewRelFileNumber is better to keep as an
assert as we are doing the file access so it can be costly?

> > I have done some performance tests, with very small values I can see a
> > lot of wait events for RelFileNumberGen but with bigger numbers like
> > 256 or 512 it is not really bad.  See results at the end of the
> > mail[1]
>
> It's a little hard to interpret these results because you don't say
> how often you were checking the wait events, or how often the
> operation took to complete. I suppose we can guess the relative time
> scale from the number of Activity events: if there were 190
> WalWriterMain events observed, then the time to complete the operation
> is probably 190 times how often you were checking the wait events, but
> was that every second or every half second or every tenth of a second?

I am executing it after every 0.5 sec using below script in psql
\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5

And running test for 60 sec
./pgbench -c 32 -j 32 -T 60 -f create_script.sql -p 54321  postgres

$ cat create_script.sql
select create_table(100);

// function body 'create_table'
CREATE OR REPLACE FUNCTION create_table(count int) RETURNS void AS $$
DECLARE
  relname varchar;
  pid int;
  i   int;
BEGIN
  SELECT pg_backend_pid() INTO pid;
  relname := 'test_' || pid;
  FOR i IN 1..count LOOP
    EXECUTE format('CREATE TABLE %s(a int)', relname);

    EXECUTE format('DROP TABLE %s', relname);
  END LOOP;
END;
$$ LANGUAGE plpgsql;



> > I have done these changes during GetNewRelFileNumber() this required
> > to track the last logged record pointer as well but I think this looks
> > clean.  With this I can see some reduction in RelFileNumberGen wait
> > event[1]
>
> I find the code you wrote here a little bit magical. I believe it
> depends heavily on choosing to issue the new WAL record when we've
> exhausted exactly 50% of the available space. I suggest having two
> constants, one of which is the number of relfilenumber values per WAL
> record, and the other of which is the threshold for issuing a new WAL
> record. Maybe something like RFN_VALUES_PER_XLOG and
> RFN_NEW_XLOG_THRESHOLD, or something. And then work code that works
> correctly for any value of RFN_NEW_XLOG_THRESHOLD between 0 (don't log
> new RFNs until old allocation is completely exhausted) and
> RFN_VALUES_PER_XLOG - 1 (log new RFNs after using just 1 item from the
> previous allocation). That way, if in the future someone decides to
> change the constant values, they can do that and the code still works.

ok



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

v1-0001-Convert-buf_internal.h-macros-to-static-inline-fu.patch

Re: making relfilenodes 56 bits

От

Andres Freund

Дата:

12 июля 2022 г., 20:09:44

Hi,

On 2022-07-12 09:51:12 -0400, Robert Haas wrote:
> On Mon, Jul 11, 2022 at 7:22 PM Andres Freund <andres@anarazel.de> wrote:
> > I guess I'm not enthused in duplicating the necessary knowledge in evermore
> > places. We've forgotten one of the magic incantations in the past, and needing
> > to find all the places that need to be patched is a bit bothersome.
> >
> > Perhaps we could add extract helpers out of durable_rename()?
> >
> > OTOH, I don't really see what we gain by keeping things out of the critical
> > section? It does seem good to have the temp-file creation/truncation and write
> > separately, but after that I don't think it's worth much to avoid a
> > PANIC. What legitimate issue does it avoid?
> 
> OK, so then I think we should just use durable_rename(). Here's a
> patch that does it that way. I briefly considered the idea of
> extracting helpers, but it doesn't seem worthwhile to me. There's not
> that much code in durable_rename() in the first place.

Cool.


> In this version, I also removed the struct padding, changed the limit
> on the number of entries to a nice round 64, and made some comment
> updates.

What does currently happen if we exceed that?

I wonder if we should just reference a new define generated by genbki.pl
documenting the number of relations that need to be tracked. Then we don't
need to maintain this manually going forward.


> I considered trying to go further and actually make the file
> variable-size, so that we never again need to worry about the limit on
> the number of entries, but I don't actually think that's a good idea.

Yea, I don't really see what we'd gain. For this stuff to change we need to
recompile anyway.


> If we were going to split up durable_rename(), the only intelligible
> split I can see would be to have a second version of the function, or
> a flag to the existing function, that caters to the situation where
> the old file is already known to have been fsync()'d.

I was thinking of something like durable_rename_prep() that'd fsync the
file/directories under their old names, and then durable_rename_exec() that
actually renames and then fsyncs.  But without a clear usecase...


> +    /* Write new data to the file. */
> +    pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_WRITE);
> +    if (write(fd, newmap, sizeof(RelMapFile)) != sizeof(RelMapFile))
...
> +    pgstat_report_wait_end();
> +

Not for this patch, but we eventually should move this sequence into a
wrapper. Perhaps combined with retry handling for short writes, the ENOSPC
stuff and an error message when the write fails. It's a bit insane how many
copies of this we have.


> diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
> index b578e2ec75..5d3775ccde 100644
> --- a/src/include/utils/wait_event.h
> +++ b/src/include/utils/wait_event.h
> @@ -193,7 +193,7 @@ typedef enum
>      WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
>      WAIT_EVENT_LOGICAL_REWRITE_WRITE,
>      WAIT_EVENT_RELATION_MAP_READ,
> -    WAIT_EVENT_RELATION_MAP_SYNC,
> +    WAIT_EVENT_RELATION_MAP_RENAME,

Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
since it includes fsync etc?

Greetings,

Andres Freund

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

12 июля 2022 г., 23:35:46

On Tue, Jul 12, 2022 at 1:09 PM Andres Freund <andres@anarazel.de> wrote:
> What does currently happen if we exceed that?

elog

> > diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
> > index b578e2ec75..5d3775ccde 100644
> > --- a/src/include/utils/wait_event.h
> > +++ b/src/include/utils/wait_event.h
> > @@ -193,7 +193,7 @@ typedef enum
> >       WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
> >       WAIT_EVENT_LOGICAL_REWRITE_WRITE,
> >       WAIT_EVENT_RELATION_MAP_READ,
> > -     WAIT_EVENT_RELATION_MAP_SYNC,
> > +     WAIT_EVENT_RELATION_MAP_RENAME,
>
> Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
> since it includes fsync etc?

Sure, I had it that way for a while and changed it at the last minute.
I can change it back.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Hannu Krosing

Дата:

13 июля 2022 г., 00:00:22

Re: staticAssertStmt(MAX_FORKNUM <= INT8_MAX);

Have you really thought through making the ForkNum 8-bit ?

For example this would limit a columnar storage with each column
stored in it's own fork (which I'd say is not entirely unreasonable)
to having just about ~250 columns.

And there can easily be other use cases where we do not want to limit
number of forks so much

Cheers
Hannu

On Tue, Jul 12, 2022 at 10:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jul 12, 2022 at 1:09 PM Andres Freund <andres@anarazel.de> wrote:
> > What does currently happen if we exceed that?
>
> elog
>
> > > diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
> > > index b578e2ec75..5d3775ccde 100644
> > > --- a/src/include/utils/wait_event.h
> > > +++ b/src/include/utils/wait_event.h
> > > @@ -193,7 +193,7 @@ typedef enum
> > >       WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
> > >       WAIT_EVENT_LOGICAL_REWRITE_WRITE,
> > >       WAIT_EVENT_RELATION_MAP_READ,
> > > -     WAIT_EVENT_RELATION_MAP_SYNC,
> > > +     WAIT_EVENT_RELATION_MAP_RENAME,
> >
> > Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
> > since it includes fsync etc?
>
> Sure, I had it that way for a while and changed it at the last minute.
> I can change it back.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>
>

Re: making relfilenodes 56 bits

От

Andres Freund

Дата:

13 июля 2022 г., 01:02:53

Hi,

Please don't top quote - as mentioned a couple times recently.

On 2022-07-12 23:00:22 +0200, Hannu Krosing wrote:
> Re: staticAssertStmt(MAX_FORKNUM <= INT8_MAX);
> 
> Have you really thought through making the ForkNum 8-bit ?

MAX_FORKNUM is way lower right now. And hardcoded. So this doesn't imply a new
restriction. As we iterate over 0..MAX_FORKNUM in a bunch of places (with
filesystem access each time), it's not feasible to make that number large.

Greetings,

Andres Freund

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

13 июля 2022 г., 01:30:58

On Tue, Jul 12, 2022 at 6:02 PM Andres Freund <andres@anarazel.de> wrote:
> MAX_FORKNUM is way lower right now. And hardcoded. So this doesn't imply a new
> restriction. As we iterate over 0..MAX_FORKNUM in a bunch of places (with
> filesystem access each time), it's not feasible to make that number large.

Yeah. TBH, what I'd really like to do is kill the entire fork system
with fire and replace it with something more scalable, which would
maybe permit the sort of thing Hannu suggests here. With the current
system, forget it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

13 июля 2022 г., 07:05:54

On Tue, Jul 12, 2022 at 7:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
>

> In this version, I also removed the struct padding, changed the limit
> on the number of entries to a nice round 64, and made some comment
> updates. I considered trying to go further and actually make the file
> variable-size, so that we never again need to worry about the limit on
> the number of entries, but I don't actually think that's a good idea.
> It would require substantially more changes to the code in this file,
> and that means there's more risk of introducing bugs, and I don't see
> that there's much value anyway, because if we ever do hit the current
> limit, we can just raise the limit.
>
> If we were going to split up durable_rename(), the only intelligible
> split I can see would be to have a second version of the function, or
> a flag to the existing function, that caters to the situation where
> the old file is already known to have been fsync()'d.

The patch looks good except one minor comment

+ * corruption.  Since the file might be more tha none standard-size disk
+ * sector in size, we cannot rely on overwrite-in-place. Instead, we generate

typo "more tha none" -> "more than one"

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

14 июля 2022 г., 14:48:32

On Wed, Jul 13, 2022 at 9:35 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jul 12, 2022 at 7:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
>
> > In this version, I also removed the struct padding, changed the limit
> > on the number of entries to a nice round 64, and made some comment
> > updates. I considered trying to go further and actually make the file
> > variable-size, so that we never again need to worry about the limit on
> > the number of entries, but I don't actually think that's a good idea.
> > It would require substantially more changes to the code in this file,
> > and that means there's more risk of introducing bugs, and I don't see
> > that there's much value anyway, because if we ever do hit the current
> > limit, we can just raise the limit.
> >
> > If we were going to split up durable_rename(), the only intelligible
> > split I can see would be to have a second version of the function, or
> > a flag to the existing function, that caters to the situation where
> > the old file is already known to have been fsync()'d.
>
> The patch looks good except one minor comment
>
> + * corruption.  Since the file might be more tha none standard-size disk
> + * sector in size, we cannot rely on overwrite-in-place. Instead, we generate
>
> typo "more tha none" -> "more than one"
>
I have fixed this and included this change in the new patch series.

Apart from this I have fixed all the pending issues that includes

- Change existing macros to inline functions done in 0001.
- Change pg_class index from (tbspc, relfilenode) to relfilenode and
also change RelidByRelfilenumber().  In RelidByRelfilenumber I have
changed the hash to maintain based on just the relfilenumber but we
still need to pass the tablespace to identify whether it is a shared
relation or not.  If we want we can make it bool but I don't think
that is really needed here.
- Changed logic of GetNewRelFileNumber() based on what Robert
described, and instead of tracking the pending logged relnumbercount
now I am tracking last loggedRelNumber, which help little bit in
SetNextRelFileNumber in making code cleaner, but otherwise it doesn't
make much difference.
- Some new asserts in buf_internal inline function to validate value
of computed/input relfilenumber.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

18 июля 2022 г., 14:21:00

On Thu, Jul 14, 2022 at 5:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> Apart from this I have fixed all the pending issues that includes
>
> - Change existing macros to inline functions done in 0001.
> - Change pg_class index from (tbspc, relfilenode) to relfilenode and
> also change RelidByRelfilenumber().  In RelidByRelfilenumber I have
> changed the hash to maintain based on just the relfilenumber but we
> still need to pass the tablespace to identify whether it is a shared
> relation or not.  If we want we can make it bool but I don't think
> that is really needed here.
> - Changed logic of GetNewRelFileNumber() based on what Robert
> described, and instead of tracking the pending logged relnumbercount
> now I am tracking last loggedRelNumber, which help little bit in
> SetNextRelFileNumber in making code cleaner, but otherwise it doesn't
> make much difference.
> - Some new asserts in buf_internal inline function to validate value
> of computed/input relfilenumber.

I was doing some more testing by setting the FirstNormalRelFileNumber
to a high value(more than 32 bits) I have noticed a couple of problems
there e.g. relpath is still using OIDCHARS macro which says max
relfilenumber file name can be only 10 character long which is no
longer true.  So there we need to change this value to 20 and also
need to carefully rename the macros and other variable names used for
this purpose.

Similarly there was some issue in macro in buf_internal.h while
fetching the relfilenumber.  So I will relook into all those issues
and repost the patch soon.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

20 июля 2022 г., 14:26:47

On Mon, Jul 18, 2022 at 4:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I was doing some more testing by setting the FirstNormalRelFileNumber
> to a high value(more than 32 bits) I have noticed a couple of problems
> there e.g. relpath is still using OIDCHARS macro which says max
> relfilenumber file name can be only 10 character long which is no
> longer true.  So there we need to change this value to 20 and also
> need to carefully rename the macros and other variable names used for
> this purpose.
>
> Similarly there was some issue in macro in buf_internal.h while
> fetching the relfilenumber.  So I will relook into all those issues
> and repost the patch soon.

I have fixed these existing issues and there was also some issue in
pg_dump.c which was creating problems in upgrading to the same version
while using a higher range of the relfilenumber.

There was also an issue where the user table from the old cluster's
relfilenode could conflict with the system table of the new cluster.
As a solution currently for system table object (while creating
storage first time) we are keeping the low range of relfilenumber,
basically we are using the same relfilenumber as OID so that during
upgrade the normal user table from the old cluster will not conflict
with the system tables in the new cluster.  But with this solution
Robert told me (in off list chat) a problem that in future if we want
to make relfilenumber completely unique within a cluster by
implementing the CREATEDB differently then we can not do that as we
have created fixed relfilenodes for the system tables.

I am not sure what exactly we can do to avoid that because even if we
do something  to avoid that in the new cluster the old cluster might
be already using the non-unique relfilenode so after upgrading the new
cluster will also get those non-unique relfilenode.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Hi,

As oid and relfilenumber are linked with each other, I still see that if the oid value reaches the threshold limit, we are unable to create a table with storage. For example I set FirstNormalObjectId to 4294967294 (one value less than the range limit of 2^32 -1 = 4294967295). Now when I try to create a table, the CREATE TABLE command gets stuck because it is unable to find the OID for the comp type although it can find a new relfilenumber.

postgres=# create table t1(a int);
CREATE TABLE

postgres=# select oid, reltype, relfilenode from pg_class where relname = 't1';
oid | reltype | relfilenode
------------+------------+-------------
4294967295 | 4294967294 | 100000
(1 row)

postgres=# create table t2(a int);
^CCancel request sent
ERROR: canceling statement due to user request

creation of t2 table gets stuck as it is unable to find a new oid. Basically the point that I am trying to put here is even though we will be able to find the new relfile number by increasing the relfilenumber size but still the commands like above will not execute if the oid value (of 32 bits) has reached the threshold limit.

With Regards,

Ashutosh Sharma.

On Wed, Jul 20, 2022 at 4:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 18, 2022 at 4:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I was doing some more testing by setting the FirstNormalRelFileNumber
> to a high value(more than 32 bits) I have noticed a couple of problems
> there e.g. relpath is still using OIDCHARS macro which says max
> relfilenumber file name can be only 10 character long which is no
> longer true. So there we need to change this value to 20 and also
> need to carefully rename the macros and other variable names used for
> this purpose.
>
> Similarly there was some issue in macro in buf_internal.h while
> fetching the relfilenumber. So I will relook into all those issues
> and repost the patch soon.

I have fixed these existing issues and there was also some issue in
pg_dump.c which was creating problems in upgrading to the same version
while using a higher range of the relfilenumber.

There was also an issue where the user table from the old cluster's
relfilenode could conflict with the system table of the new cluster.
As a solution currently for system table object (while creating
storage first time) we are keeping the low range of relfilenumber,
basically we are using the same relfilenumber as OID so that during
upgrade the normal user table from the old cluster will not conflict
with the system tables in the new cluster. But with this solution
Robert told me (in off list chat) a problem that in future if we want
to make relfilenumber completely unique within a cluster by
implementing the CREATEDB differently then we can not do that as we
have created fixed relfilenodes for the system tables.

I am not sure what exactly we can do to avoid that because even if we
do something to avoid that in the new cluster the old cluster might
be already using the non-unique relfilenode so after upgrading the new
cluster will also get those non-unique relfilenode.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Amul Sul

Дата:

26 июля 2022 г., 07:35:08

On Fri, Jul 22, 2022 at 4:21 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, Jul 20, 2022 at 4:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 18, 2022 at 4:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I was doing some more testing by setting the FirstNormalRelFileNumber
> > > to a high value(more than 32 bits) I have noticed a couple of problems
> > > there e.g. relpath is still using OIDCHARS macro which says max
> > > relfilenumber file name can be only 10 character long which is no
> > > longer true.  So there we need to change this value to 20 and also
> > > need to carefully rename the macros and other variable names used for
> > > this purpose.
> > >
> > > Similarly there was some issue in macro in buf_internal.h while
> > > fetching the relfilenumber.  So I will relook into all those issues
> > > and repost the patch soon.
> >
> > I have fixed these existing issues and there was also some issue in
> > pg_dump.c which was creating problems in upgrading to the same version
> > while using a higher range of the relfilenumber.
> >
> > There was also an issue where the user table from the old cluster's
> > relfilenode could conflict with the system table of the new cluster.
> > As a solution currently for system table object (while creating
> > storage first time) we are keeping the low range of relfilenumber,
> > basically we are using the same relfilenumber as OID so that during
> > upgrade the normal user table from the old cluster will not conflict
> > with the system tables in the new cluster.  But with this solution
> > Robert told me (in off list chat) a problem that in future if we want
> > to make relfilenumber completely unique within a cluster by
> > implementing the CREATEDB differently then we can not do that as we
> > have created fixed relfilenodes for the system tables.
> >
> > I am not sure what exactly we can do to avoid that because even if we
> > do something  to avoid that in the new cluster the old cluster might
> > be already using the non-unique relfilenode so after upgrading the new
> > cluster will also get those non-unique relfilenode.
>
> Thanks for the patch, my comments from the initial review:
> 1) Since we have changed the macros to inline functions, should we
> change the function names similar to the other inline functions in the
> same file like: ClearBufferTag, InitBufferTag & BufferTagsEqual:
> -#define BUFFERTAGS_EQUAL(a,b) \
> -( \
> -       RelFileLocatorEquals((a).rlocator, (b).rlocator) && \
> -       (a).blockNum == (b).blockNum && \
> -       (a).forkNum == (b).forkNum \
> -)
> +static inline void
> +CLEAR_BUFFERTAG(BufferTag *tag)
> +{
> +       tag->rlocator.spcOid = InvalidOid;
> +       tag->rlocator.dbOid = InvalidOid;
> +       tag->rlocator.relNumber = InvalidRelFileNumber;
> +       tag->forkNum = InvalidForkNumber;
> +       tag->blockNum = InvalidBlockNumber;
> +}
>
> 2) We could move this macros along with the other macros at the top of the file:
> +/*
> + * The freeNext field is either the index of the next freelist entry,
> + * or one of these special values:
> + */
> +#define FREENEXT_END_OF_LIST   (-1)
> +#define FREENEXT_NOT_IN_LIST   (-2)
>
> 3) typo thn should be then:
> + * can raise it as necessary if we end up with more mapped relations. For
> + * now, we just pick a round number that is modestly larger thn the expected
> + * number of mappings.
> + */
>

Few more typos in 0004 patch as well:

the a value
interger
previosly
currenly

> 4) There is one whitespace issue:
> git am v10-0004-Widen-relfilenumber-from-32-bits-to-56-bits.patch
> Applying: Widen relfilenumber from 32 bits to 56 bits
> .git/rebase-apply/patch:1500: space before tab in indent.
> (relfilenumber)))); \
> warning: 1 line adds whitespace errors.
>
> Regards,
> Vignesh
>

Regards,
Amul

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

26 июля 2022 г., 08:57:28

On Mon, Jul 25, 2022 at 9:51 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi,
>
> As oid and relfilenumber are linked with each other, I still see that if the oid value reaches the threshold limit,
weare unable to create a table with storage. For example I set FirstNormalObjectId to 4294967294 (one value less than
therange limit of 2^32 -1 = 4294967295). Now when I try to create a table, the CREATE TABLE command gets stuck because
itis unable to find the OID for the comp type although it can find a new relfilenumber. 
>

First of all if the OID value reaches to max oid then it should wrap
around to the FirstNormalObjectId and find a new non conflicting OID.
Since in your case the first normaloid is 4294967294 and max oid is
42949672945 there is no scope of wraparound because in this case you
can create at most one object and once you created that then there is
no more unused oid left and with the current patch we are not at all
trying do anything about this.

Now come to the problem we are trying to solve with 56bits
relfilenode.  Here we are not trying to extend the limit of the system
to create more than 4294967294 objects.  What we are trying to solve
is to not to reuse the same disk filenames for different objects.  And
also notice that the relfilenodes can get consumed really faster than
oid so chances of wraparound is more, I mean you can truncate/rewrite
the same relation multiple times so that relation will have the same
oid but will consume multiple relfilenodes.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

26 июля 2022 г., 08:59:23

On Thu, Jul 21, 2022 at 9:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > [v10 patch set]
>
> Hi Dilip, I'm experimenting with these patches and will hopefully have
> more to say soon, but I just wanted to point out that this builds with
> warnings and failed on 3/4 of the CI OSes on cfbot's last run.  Maybe
> there is the good kind of uninitialised data on Linux, and the bad
> kind of uninitialised data on those other pesky systems?

Thanks, I have figured out the issue, I will post the patch soon.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

26 июля 2022 г., 09:07:33

On Fri, Jul 22, 2022 at 4:21 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, Jul 20, 2022 at 4:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >

> Thanks for the patch, my comments from the initial review:
> 1) Since we have changed the macros to inline functions, should we
> change the function names similar to the other inline functions in the
> same file like: ClearBufferTag, InitBufferTag & BufferTagsEqual:

I have thought about it while doing so but I am not sure whether it is
a good idea or not, because before my change these all were macros
with 2 naming conventions so I just changed to inline function so why
to change the name.

> -#define BUFFERTAGS_EQUAL(a,b) \
> -( \
> -       RelFileLocatorEquals((a).rlocator, (b).rlocator) && \
> -       (a).blockNum == (b).blockNum && \
> -       (a).forkNum == (b).forkNum \
> -)
> +static inline void
> +CLEAR_BUFFERTAG(BufferTag *tag)
> +{
> +       tag->rlocator.spcOid = InvalidOid;
> +       tag->rlocator.dbOid = InvalidOid;
> +       tag->rlocator.relNumber = InvalidRelFileNumber;
> +       tag->forkNum = InvalidForkNumber;
> +       tag->blockNum = InvalidBlockNumber;
> +}
>
> 2) We could move this macros along with the other macros at the top of the file:
> +/*
> + * The freeNext field is either the index of the next freelist entry,
> + * or one of these special values:
> + */
> +#define FREENEXT_END_OF_LIST   (-1)
> +#define FREENEXT_NOT_IN_LIST   (-2)

Yeah we can do that.

> 3) typo thn should be then:
> + * can raise it as necessary if we end up with more mapped relations. For
> + * now, we just pick a round number that is modestly larger thn the expected
> + * number of mappings.
> + */
>
> 4) There is one whitespace issue:
> git am v10-0004-Widen-relfilenumber-from-32-bits-to-56-bits.patch
> Applying: Widen relfilenumber from 32 bits to 56 bits
> .git/rebase-apply/patch:1500: space before tab in indent.
> (relfilenumber)))); \
> warning: 1 line adds whitespace errors.

Okay, I will fix it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

26 июля 2022 г., 09:09:13

On Tue, Jul 26, 2022 at 10:05 AM Amul Sul <sulamul@gmail.com> wrote:
>
> Few more typos in 0004 patch as well:
>
> the a value
> interger
> previosly
> currenly
>

Thanks for the review, I will fix it in the next version.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

26 июля 2022 г., 11:01:38

On Thu, Jul 21, 2022 at 9:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > [v10 patch set]
>
> Hi Dilip, I'm experimenting with these patches and will hopefully have
> more to say soon, but I just wanted to point out that this builds with
> warnings and failed on 3/4 of the CI OSes on cfbot's last run.  Maybe
> there is the good kind of uninitialised data on Linux, and the bad
> kind of uninitialised data on those other pesky systems?

Here is the patch to fix the issue, basically, while asserting for the
file existence it was not setting the relfilenumber in the
relfilelocator before generating the path so it was just checking for
the existence of the random path so it was asserting randomly.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От

Ashutosh Sharma

Дата:

26 июля 2022 г., 15:35:46

/*
* If relfilenumber is unspecified by the caller then create storage
- * with oid same as relid.
+ * with relfilenumber same as relid if it is a system table otherwise
+ * allocate a new relfilenumber. For more details read comments atop
+ * FirstNormalRelFileNumber declaration.
*/
if (!RelFileNumberIsValid(relfilenumber))
- relfilenumber = relid;
+ {
+ relfilenumber = relid < FirstNormalObjectId ?
+ relid : GetNewRelFileNumber();

Above code says that in the case of system table we want relfilenode to be the same as object id. This technically means that the relfilenode or oid for the system tables would not be exceeding 16383. However in the below lines of code added in the patch, it says there is some chance for the storage path of the user tables from the old cluster conflicting with the storage path of the system tables in the new cluster. Assuming that the OIDs for the user tables on the old cluster would start with 16384 (the first object ID), I see no reason why there would be a conflict.

+/* ----------
+ * RelFileNumber zero is InvalidRelFileNumber.
+ *
+ * For the system tables (OID < FirstNormalObjectId) the initial storage
+ * will be created with the relfilenumber same as their oid. And, later for
+ * any storage the relfilenumber allocated by GetNewRelFileNumber() will start
+ * at 100000. Thus, when upgrading from an older cluster, the relation storage
+ * path for the user table from the old cluster will not conflict with the
+ * relation storage path for the system table from the new cluster. Anyway,
+ * the new cluster must not have any user tables while upgrading, so we needn't
+ * worry about them.
+ * ----------
+ */
+#define FirstNormalRelFileNumber ((RelFileNumber) 100000)

==

When WAL logging the next object id we have the chosen the xlog threshold value as 8192 whereas for relfilenode it is 512. Any reason for choosing this low arbitrary value in case of relfilenumber?

With Regards,

Ashutosh Sharma.

On Tue, Jul 26, 2022 at 1:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jul 21, 2022 at 9:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > [v10 patch set]
>
> Hi Dilip, I'm experimenting with these patches and will hopefully have
> more to say soon, but I just wanted to point out that this builds with
> warnings and failed on 3/4 of the CI OSes on cfbot's last run. Maybe
> there is the good kind of uninitialised data on Linux, and the bad
> kind of uninitialised data on those other pesky systems?

Here is the patch to fix the issue, basically, while asserting for the
file existence it was not setting the relfilenumber in the
relfilelocator before generating the path so it was just checking for
the existence of the random path so it was asserting randomly.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

26 июля 2022 г., 17:02:42

On Tue, Jul 26, 2022 at 6:06 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi,
Note: please avoid top posting.

>         /*
>          * If relfilenumber is unspecified by the caller then create storage
> -        * with oid same as relid.
> +        * with relfilenumber same as relid if it is a system table otherwise
> +        * allocate a new relfilenumber.  For more details read comments atop
> +        * FirstNormalRelFileNumber declaration.
>          */
>         if (!RelFileNumberIsValid(relfilenumber))
> -           relfilenumber = relid;
> +       {
> +           relfilenumber = relid < FirstNormalObjectId ?
> +               relid : GetNewRelFileNumber();
>
> Above code says that in the case of system table we want relfilenode to be the same as object id. This technically
meansthat the relfilenode or oid for the system tables would not be exceeding 16383. However in the below lines of code
addedin the patch, it says there is some chance for the storage path of the user tables from the old cluster
conflictingwith the storage path of the system tables in the new cluster. Assuming that the OIDs for the user tables on
theold cluster would start with 16384 (the first object ID), I see no reason why there would be a conflict. 

Basically, the above comment says that the initial system table
storage will be created with the same relfilenumber as Oid so you are
right that will not exceed 16383.  And below code is explaining the
reason that in order to avoid the conflict with the user table from
the older cluster we do it this way.  Otherwise, in the new design, we
have no intention to keep the relfilenode same as Oid.  But during an
upgrade from the older cluster which is not following this new design
might have user table relfilenode which can conflict with the system
table in the new cluster so we have to ensure that with the new design
also when creating the initial cluster we keep the system table
relfilenode in low range and directly using Oid is the best idea for
this purpose instead of defining the completely new range and
maintaining a separate counter for that.

> +/* ----------
> + * RelFileNumber zero is InvalidRelFileNumber.
> + *
> + * For the system tables (OID < FirstNormalObjectId) the initial storage
> + * will be created with the relfilenumber same as their oid.  And, later for
> + * any storage the relfilenumber allocated by GetNewRelFileNumber() will start
> + * at 100000.  Thus, when upgrading from an older cluster, the relation storage
> + * path for the user table from the old cluster will not conflict with the
> + * relation storage path for the system table from the new cluster.  Anyway,
> + * the new cluster must not have any user tables while upgrading, so we needn't
> + * worry about them.
> + * ----------
> + */
> +#define FirstNormalRelFileNumber   ((RelFileNumber) 100000)
>
> ==
>
> When WAL logging the next object id we have the chosen the xlog threshold value as 8192 whereas for relfilenode it is
512.Any reason for choosing this low arbitrary value in case of relfilenumber? 

For Oid when we cross the max value we will wraparound, whereas for
relfilenumber we can not expect the wraparound for cluster lifetime.
So it is better not to log forward a really large number of
relfilenumber as we do for Oid.  OTOH if we make it really low like 64
then we can is RelFIleNumberGenLock in wait event in very high
concurrency where from 32 backends we are continuously
creating/dropping tables.  So we thought of choosing this number 512
so that it is not very low that can create the lock contention and it
is not very high so that we need to worry about wasting those many
relfilenumbers on the crash.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Ashutosh Sharma

Дата:

26 июля 2022 г., 20:06:44

Thanks Dilip. Here are few comments that could find upon quickly reviewing the v11 patch:

/*
+ * Similar to the XLogPutNextOid but instead of writing NEXTOID log record it
+ * writes a NEXT_RELFILENUMBER log record. If '*prevrecptr' is a valid
+ * XLogRecPtrthen flush the wal upto this record pointer otherwise flush upto

XLogRecPtrthen -> XLogRecPtr then

==

+ switch (relpersistence)
+ {
+ case RELPERSISTENCE_TEMP:
+ backend = BackendIdForTempRelations();
+ break;
+ case RELPERSISTENCE_UNLOGGED:
+ case RELPERSISTENCE_PERMANENT:
+ backend = InvalidBackendId;
+ break;
+ default:
+ elog(ERROR, "invalid relpersistence: %c", relpersistence);
+ return InvalidRelFileNumber; /* placate compiler */
+ }

I think the above check should be added at the beginning of the function for the reason that if we come to the default switch case we won't be acquiring the lwlock and do other stuff to get a new relfilenumber.

==

- newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL,
+ * Generate a new relfilenumber. We cannot reuse the old relfilenumber
+ * because of the possibility that that relation will be moved back to the

that that relation -> that relation.

==

+ * option_parse_relfilenumber
+ *
+ * Parse relfilenumber value for an option. If the parsing is successful,
+ * returns; if parsing fails, returns false.
+ */

If parsing is successful, returns true;

--

With Regards,

Ashutosh Sharma.

On Tue, Jul 26, 2022 at 7:33 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jul 26, 2022 at 6:06 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi,
Note: please avoid top posting.

> /*
> * If relfilenumber is unspecified by the caller then create storage
> - * with oid same as relid.
> + * with relfilenumber same as relid if it is a system table otherwise
> + * allocate a new relfilenumber. For more details read comments atop
> + * FirstNormalRelFileNumber declaration.
> */
> if (!RelFileNumberIsValid(relfilenumber))
> - relfilenumber = relid;
> + {
> + relfilenumber = relid < FirstNormalObjectId ?
> + relid : GetNewRelFileNumber();
>
> Above code says that in the case of system table we want relfilenode to be the same as object id. This technically means that the relfilenode or oid for the system tables would not be exceeding 16383. However in the below lines of code added in the patch, it says there is some chance for the storage path of the user tables from the old cluster conflicting with the storage path of the system tables in the new cluster. Assuming that the OIDs for the user tables on the old cluster would start with 16384 (the first object ID), I see no reason why there would be a conflict.

Basically, the above comment says that the initial system table
storage will be created with the same relfilenumber as Oid so you are
right that will not exceed 16383. And below code is explaining the
reason that in order to avoid the conflict with the user table from
the older cluster we do it this way. Otherwise, in the new design, we
have no intention to keep the relfilenode same as Oid. But during an
upgrade from the older cluster which is not following this new design
might have user table relfilenode which can conflict with the system
table in the new cluster so we have to ensure that with the new design
also when creating the initial cluster we keep the system table
relfilenode in low range and directly using Oid is the best idea for
this purpose instead of defining the completely new range and
maintaining a separate counter for that.

> +/* ----------
> + * RelFileNumber zero is InvalidRelFileNumber.
> + *
> + * For the system tables (OID < FirstNormalObjectId) the initial storage
> + * will be created with the relfilenumber same as their oid. And, later for
> + * any storage the relfilenumber allocated by GetNewRelFileNumber() will start
> + * at 100000. Thus, when upgrading from an older cluster, the relation storage
> + * path for the user table from the old cluster will not conflict with the
> + * relation storage path for the system table from the new cluster. Anyway,
> + * the new cluster must not have any user tables while upgrading, so we needn't
> + * worry about them.
> + * ----------
> + */
> +#define FirstNormalRelFileNumber ((RelFileNumber) 100000)
>
> ==
>
> When WAL logging the next object id we have the chosen the xlog threshold value as 8192 whereas for relfilenode it is 512. Any reason for choosing this low arbitrary value in case of relfilenumber?

For Oid when we cross the max value we will wraparound, whereas for
relfilenumber we can not expect the wraparound for cluster lifetime.
So it is better not to log forward a really large number of
relfilenumber as we do for Oid. OTOH if we make it really low like 64
then we can is RelFIleNumberGenLock in wait event in very high
concurrency where from 32 backends we are continuously
creating/dropping tables. So we thought of choosing this number 512
so that it is not very low that can create the lock contention and it
is not very high so that we need to worry about wasting those many
relfilenumbers on the crash.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

26 июля 2022 г., 21:37:19

On Tue, Jul 26, 2022 at 2:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have thought about it while doing so but I am not sure whether it is
> a good idea or not, because before my change these all were macros
> with 2 naming conventions so I just changed to inline function so why
> to change the name.

Well, the reason to change the name would be for consistency. It feels
weird to have some NAMES_LIKETHIS() and other NamesLikeThis().

Now, an argument against that is that it will make back-patching more
annoying, if any code using these functions/macros is touched. But
since the calling sequence is changing anyway (you now have to pass a
pointer rather than the object itself) that argument doesn't really
carry any weight. So I would favor ClearBufferTag(), InitBufferTag(),
etc.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

26 июля 2022 г., 22:19:45

On Tue, Jul 12, 2022 at 4:35 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
> > since it includes fsync etc?
>
> Sure, I had it that way for a while and changed it at the last minute.
> I can change it back.

Committed that way, also with the fix for the typo Dilip found.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Ashutosh Sharma

Дата:

27 июля 2022 г., 10:54:07

Some more comments:

Shouldn't we retry for the new relfilenumber if "ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER". There can be a cases where some of the tables are dropped by the user and relfilenumber of those tables can be reused for which we would need to find the relfilenumber that can be resued. For e.g. consider below example:

postgres=# create table t1(a int);
CREATE TABLE

postgres=# create table t2(a int);
CREATE TABLE

postgres=# create table t3(a int);
ERROR: relfilenumber is out of bound

postgres=# drop table t1, t2;
DROP TABLE

postgres=# checkpoint;
CHECKPOINT

postgres=# vacuum;
VACUUM

Now if I try to recreate table t3, it should succeed, shouldn't it? But it doesn't because we simply error out by seeing the nextRelFileNumber saved in the shared memory.

postgres=# create table t1(a int);
ERROR: relfilenumber is out of bound

I think, above should have worked.

==

<caution>
<para>
Note that while a table's filenode often matches its OID, this is
<emphasis>not</emphasis> necessarily the case; some operations, like
<command>TRUNCATE</command>, <command>REINDEX</command>, <command>CLUSTER</command> and some forms
of <command>ALTER TABLE</command>, can change the filenode while preserving the OID.

I think this note needs some improvement in storage.sgml. It says the table's relfilenode mostly matches its OID, but it doesn't. This will happen only in case of system table and maybe never in case of user table.

==

postgres=# create table t2(a int);
ERROR: relfilenumber is out of bound

Since this is a user-visible error, I think it would be good to mention relfilenode instead of relfilenumber. Elsewhere (including the user manual) we refer to this as a relfilenode.

With Regards,

Ashutosh Sharma.

On Tue, Jul 26, 2022 at 10:36 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Thanks Dilip. Here are few comments that could find upon quickly reviewing the v11 patch:

/*
+ * Similar to the XLogPutNextOid but instead of writing NEXTOID log record it
+ * writes a NEXT_RELFILENUMBER log record. If '*prevrecptr' is a valid
+ * XLogRecPtrthen flush the wal upto this record pointer otherwise flush upto

XLogRecPtrthen -> XLogRecPtr then

==

+ switch (relpersistence)
+ {
+ case RELPERSISTENCE_TEMP:
+ backend = BackendIdForTempRelations();
+ break;
+ case RELPERSISTENCE_UNLOGGED:
+ case RELPERSISTENCE_PERMANENT:
+ backend = InvalidBackendId;
+ break;
+ default:
+ elog(ERROR, "invalid relpersistence: %c", relpersistence);
+ return InvalidRelFileNumber; /* placate compiler */
+ }

I think the above check should be added at the beginning of the function for the reason that if we come to the default switch case we won't be acquiring the lwlock and do other stuff to get a new relfilenumber.

==

- newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL,
+ * Generate a new relfilenumber. We cannot reuse the old relfilenumber
+ * because of the possibility that that relation will be moved back to the

that that relation -> that relation.

==

+ * option_parse_relfilenumber
+ *
+ * Parse relfilenumber value for an option. If the parsing is successful,
+ * returns; if parsing fails, returns false.
+ */

If parsing is successful, returns true;

--
With Regards,
Ashutosh Sharma.

On Tue, Jul 26, 2022 at 7:33 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Jul 26, 2022 at 6:06 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi,
Note: please avoid top posting.

> /*
> * If relfilenumber is unspecified by the caller then create storage
> - * with oid same as relid.
> + * with relfilenumber same as relid if it is a system table otherwise
> + * allocate a new relfilenumber. For more details read comments atop
> + * FirstNormalRelFileNumber declaration.
> */
> if (!RelFileNumberIsValid(relfilenumber))
> - relfilenumber = relid;
> + {
> + relfilenumber = relid < FirstNormalObjectId ?
> + relid : GetNewRelFileNumber();
>
> Above code says that in the case of system table we want relfilenode to be the same as object id. This technically means that the relfilenode or oid for the system tables would not be exceeding 16383. However in the below lines of code added in the patch, it says there is some chance for the storage path of the user tables from the old cluster conflicting with the storage path of the system tables in the new cluster. Assuming that the OIDs for the user tables on the old cluster would start with 16384 (the first object ID), I see no reason why there would be a conflict.

Basically, the above comment says that the initial system table
storage will be created with the same relfilenumber as Oid so you are
right that will not exceed 16383. And below code is explaining the
reason that in order to avoid the conflict with the user table from
the older cluster we do it this way. Otherwise, in the new design, we
have no intention to keep the relfilenode same as Oid. But during an
upgrade from the older cluster which is not following this new design
might have user table relfilenode which can conflict with the system
table in the new cluster so we have to ensure that with the new design
also when creating the initial cluster we keep the system table
relfilenode in low range and directly using Oid is the best idea for
this purpose instead of defining the completely new range and
maintaining a separate counter for that.

> +/* ----------
> + * RelFileNumber zero is InvalidRelFileNumber.
> + *
> + * For the system tables (OID < FirstNormalObjectId) the initial storage
> + * will be created with the relfilenumber same as their oid. And, later for
> + * any storage the relfilenumber allocated by GetNewRelFileNumber() will start
> + * at 100000. Thus, when upgrading from an older cluster, the relation storage
> + * path for the user table from the old cluster will not conflict with the
> + * relation storage path for the system table from the new cluster. Anyway,
> + * the new cluster must not have any user tables while upgrading, so we needn't
> + * worry about them.
> + * ----------
> + */
> +#define FirstNormalRelFileNumber ((RelFileNumber) 100000)
>
> ==
>
> When WAL logging the next object id we have the chosen the xlog threshold value as 8192 whereas for relfilenode it is 512. Any reason for choosing this low arbitrary value in case of relfilenumber?

For Oid when we cross the max value we will wraparound, whereas for
relfilenumber we can not expect the wraparound for cluster lifetime.
So it is better not to log forward a really large number of
relfilenumber as we do for Oid. OTOH if we make it really low like 64
then we can is RelFIleNumberGenLock in wait event in very high
concurrency where from 32 backends we are continuously
creating/dropping tables. So we thought of choosing this number 512
so that it is not very low that can create the lock contention and it
is not very high so that we need to worry about wasting those many
relfilenumbers on the crash.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

27 июля 2022 г., 11:11:16

On Wed, Jul 27, 2022 at 1:24 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Some more comments:

Note: Please don't top post.

> ==
>
> Shouldn't we retry for the new relfilenumber if "ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER". There
canbe a cases where some of the tables are dropped by the user and relfilenumber of those tables can be reused for
whichwe would need to find the relfilenumber that can be resued. For e.g. consider below example: 
>
> postgres=# create table t1(a int);
> CREATE TABLE
>
> postgres=# create table t2(a int);
> CREATE TABLE
>
> postgres=# create table t3(a int);
> ERROR:  relfilenumber is out of bound
>
> postgres=# drop table t1, t2;
> DROP TABLE
>
> postgres=# checkpoint;
> CHECKPOINT
>
> postgres=# vacuum;
> VACUUM
>
> Now if I try to recreate table t3, it should succeed, shouldn't it? But it doesn't because we simply error out by
seeingthe nextRelFileNumber saved in the shared memory. 
>
> postgres=# create table t1(a int);
> ERROR:  relfilenumber is out of bound
>
> I think, above should have worked.

No, it should not, the whole point of this design is not to reuse the
relfilenumber ever within a cluster lifetime.  You might want to read
this mail[1] that by the time we use 2^56 relfilenumbers the cluster
will anyway reach its lifetime by other factors.

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2BZrDms7gSjckme8YV2tzxgZ0KVfGcsjaFoKyzQX_f_Mw%40mail.gmail.com

> ==
>
> <caution>
> <para>
> Note that while a table's filenode often matches its OID, this is
> <emphasis>not</emphasis> necessarily the case; some operations, like
> <command>TRUNCATE</command>, <command>REINDEX</command>, <command>CLUSTER</command> and some forms
> of <command>ALTER TABLE</command>, can change the filenode while preserving the OID.
>
> I think this note needs some improvement in storage.sgml. It says the table's relfilenode mostly matches its OID, but
itdoesn't. This will happen only in case of system table and maybe never in case of user table. 

Yes, this should be changed.

> postgres=# create table t2(a int);
> ERROR:  relfilenumber is out of bound
>
> Since this is a user-visible error, I think it would be good to mention relfilenode instead of relfilenumber.
Elsewhere(including the user manual) we refer to this as a relfilenode. 

No this is expected to be an internal error because in general during
the cluster lifetime ideally, we should never reach this number.  So
we are putting this check so that it should not reach this number due
to some other computational/programming mistake.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

vignesh C

Дата:

27 июля 2022 г., 12:57:11

On Tue, Jul 26, 2022 at 1:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jul 21, 2022 at 9:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > [v10 patch set]
> >
> > Hi Dilip, I'm experimenting with these patches and will hopefully have
> > more to say soon, but I just wanted to point out that this builds with
> > warnings and failed on 3/4 of the CI OSes on cfbot's last run.  Maybe
> > there is the good kind of uninitialised data on Linux, and the bad
> > kind of uninitialised data on those other pesky systems?
>
> Here is the patch to fix the issue, basically, while asserting for the
> file existence it was not setting the relfilenumber in the
> relfilelocator before generating the path so it was just checking for
> the existence of the random path so it was asserting randomly.

Thanks for the updated patch, Few comments:
1) The format specifier should be changed from %u to INT64_FORMAT
autoprewarm.c -> apw_load_buffers
...............
if (fscanf(file, "%u,%u,%u,%u,%u\n", &blkinfo[i].database,
   &blkinfo[i].tablespace, &blkinfo[i].filenumber,
   &forknum, &blkinfo[i].blocknum) != 5)
...............

2) The format specifier should be changed from %u to INT64_FORMAT
autoprewarm.c -> apw_dump_now
...............
ret = fprintf(file, "%u,%u,%u,%u,%u\n",
  block_info_array[i].database,
  block_info_array[i].tablespace,
  block_info_array[i].filenumber,
  (uint32) block_info_array[i].forknum,
  block_info_array[i].blocknum);
...............

3) should the comment "entry point for old extension version" be on
top of pg_buffercache_pages, as the current version will use
pg_buffercache_pages_v1_4
+
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+       return pg_buffercache_pages_internal(fcinfo, OIDOID);
+}
+
+/* entry point for old extension version */
+Datum
+pg_buffercache_pages_v1_4(PG_FUNCTION_ARGS)
+{
+       return pg_buffercache_pages_internal(fcinfo, INT8OID);
+}

4) we could use the new style or ereport by removing the brackets
around errcode:
+                               if (fctx->record[i].relfilenumber > OID_MAX)
+                                       ereport(ERROR,
+
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+
errmsg("relfilenode" INT64_FORMAT " is too large to be represented as
an OID",
+
 fctx->record[i].relfilenumber),
+
errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache
UPDATE")));

like:
ereport(ERROR,

errcode(ERRCODE_INVALID_PARAMETER_VALUE),

errmsg("relfilenode" INT64_FORMAT " is too large to be represented as
an OID",

fctx->record[i].relfilenumber),

errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache
UPDATE"));

5) Similarly in the below code too:
+       /* check whether the relfilenumber is within a valid range */
+       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
+               ereport(ERROR,
+                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                errmsg("relfilenumber " INT64_FORMAT
" is out of range",
+                                               (relfilenumber))));


6) Similarly in the below code too:
+#define CHECK_RELFILENUMBER_RANGE(relfilenumber)
         \
+do {
                                                         \
+       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+               ereport(ERROR,
                                                 \
+
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),                      \
+                                errmsg("relfilenumber " INT64_FORMAT
" is out of range",       \
+                                               (relfilenumber)))); \
+} while (0)
+


7) This error code looks similar to CHECK_RELFILENUMBER_RANGE, can
this macro be used here too:
pg_filenode_relation(PG_FUNCTION_ARGS)
 {
        Oid                     reltablespace = PG_GETARG_OID(0);
-       RelFileNumber relfilenumber = PG_GETARG_OID(1);
+       RelFileNumber relfilenumber = PG_GETARG_INT64(1);
        Oid                     heaprel;

+       /* check whether the relfilenumber is within a valid range */
+       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
+               ereport(ERROR,
+                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                errmsg("relfilenumber " INT64_FORMAT
" is out of range",
+                                               (relfilenumber))));


8) I felt this include is not required:
diff --git a/src/backend/access/transam/varsup.c
b/src/backend/access/transam/varsup.c
index 849a7ce..a2f0d35 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -13,12 +13,16 @@

 #include "postgres.h"

+#include <unistd.h>
+
 #include "access/clog.h"
 #include "access/commit_ts.h"

9) should we change elog to ereport to use the New-style error reporting API
+       /* safety check, we should never get this far in a HS standby */
+       if (RecoveryInProgress())
+               elog(ERROR, "cannot assign RelFileNumber during recovery");
+
+       if (IsBinaryUpgrade)
+               elog(ERROR, "cannot assign RelFileNumber during binary
upgrade");

10) Here nextRelFileNumber is protected by RelFileNumberGenLock, the
comment stated OidGenLock. It should be slightly adjusted.
typedef struct VariableCacheData
{
/*
* These fields are protected by OidGenLock.
*/
Oid nextOid; /* next OID to assign */
uint32 oidCount; /* OIDs available before must do XLOG work */
RelFileNumber nextRelFileNumber; /* next relfilenumber to assign */
RelFileNumber loggedRelFileNumber; /* last logged relfilenumber */
XLogRecPtr loggedRelFileNumberRecPtr; /* xlog record pointer w.r.t.
* loggedRelFileNumber */

Regards,
Vignesh

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

27 июля 2022 г., 15:32:01

On Wed, Jul 27, 2022 at 3:27 PM vignesh C <vignesh21@gmail.com> wrote:
>

> Thanks for the updated patch, Few comments:
> 1) The format specifier should be changed from %u to INT64_FORMAT
> autoprewarm.c -> apw_load_buffers
> ...............
> if (fscanf(file, "%u,%u,%u,%u,%u\n", &blkinfo[i].database,
>    &blkinfo[i].tablespace, &blkinfo[i].filenumber,
>    &forknum, &blkinfo[i].blocknum) != 5)
> ...............
>
> 2) The format specifier should be changed from %u to INT64_FORMAT
> autoprewarm.c -> apw_dump_now
> ...............
> ret = fprintf(file, "%u,%u,%u,%u,%u\n",
>   block_info_array[i].database,
>   block_info_array[i].tablespace,
>   block_info_array[i].filenumber,
>   (uint32) block_info_array[i].forknum,
>   block_info_array[i].blocknum);
> ...............
>
> 3) should the comment "entry point for old extension version" be on
> top of pg_buffercache_pages, as the current version will use
> pg_buffercache_pages_v1_4
> +
> +Datum
> +pg_buffercache_pages(PG_FUNCTION_ARGS)
> +{
> +       return pg_buffercache_pages_internal(fcinfo, OIDOID);
> +}
> +
> +/* entry point for old extension version */
> +Datum
> +pg_buffercache_pages_v1_4(PG_FUNCTION_ARGS)
> +{
> +       return pg_buffercache_pages_internal(fcinfo, INT8OID);
> +}
>
> 4) we could use the new style or ereport by removing the brackets
> around errcode:
> +                               if (fctx->record[i].relfilenumber > OID_MAX)
> +                                       ereport(ERROR,
> +
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +
> errmsg("relfilenode" INT64_FORMAT " is too large to be represented as
> an OID",
> +
>  fctx->record[i].relfilenumber),
> +
> errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache
> UPDATE")));
>
> like:
> ereport(ERROR,
>
> errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>
> errmsg("relfilenode" INT64_FORMAT " is too large to be represented as
> an OID",
>
> fctx->record[i].relfilenumber),
>
> errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache
> UPDATE"));
>
> 5) Similarly in the below code too:
> +       /* check whether the relfilenumber is within a valid range */
> +       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
> +               ereport(ERROR,
> +                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                                errmsg("relfilenumber " INT64_FORMAT
> " is out of range",
> +                                               (relfilenumber))));
>
>
> 6) Similarly in the below code too:
> +#define CHECK_RELFILENUMBER_RANGE(relfilenumber)
>          \
> +do {
>                                                          \
> +       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
> +               ereport(ERROR,
>                                                  \
> +
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),                      \
> +                                errmsg("relfilenumber " INT64_FORMAT
> " is out of range",       \
> +                                               (relfilenumber)))); \
> +} while (0)
> +
>
>
> 7) This error code looks similar to CHECK_RELFILENUMBER_RANGE, can
> this macro be used here too:
> pg_filenode_relation(PG_FUNCTION_ARGS)
>  {
>         Oid                     reltablespace = PG_GETARG_OID(0);
> -       RelFileNumber relfilenumber = PG_GETARG_OID(1);
> +       RelFileNumber relfilenumber = PG_GETARG_INT64(1);
>         Oid                     heaprel;
>
> +       /* check whether the relfilenumber is within a valid range */
> +       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
> +               ereport(ERROR,
> +                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                                errmsg("relfilenumber " INT64_FORMAT
> " is out of range",
> +                                               (relfilenumber))));
>
>
> 8) I felt this include is not required:
> diff --git a/src/backend/access/transam/varsup.c
> b/src/backend/access/transam/varsup.c
> index 849a7ce..a2f0d35 100644
> --- a/src/backend/access/transam/varsup.c
> +++ b/src/backend/access/transam/varsup.c
> @@ -13,12 +13,16 @@
>
>  #include "postgres.h"
>
> +#include <unistd.h>
> +
>  #include "access/clog.h"
>  #include "access/commit_ts.h"
>
> 9) should we change elog to ereport to use the New-style error reporting API
> +       /* safety check, we should never get this far in a HS standby */
> +       if (RecoveryInProgress())
> +               elog(ERROR, "cannot assign RelFileNumber during recovery");
> +
> +       if (IsBinaryUpgrade)
> +               elog(ERROR, "cannot assign RelFileNumber during binary
> upgrade");
>
> 10) Here nextRelFileNumber is protected by RelFileNumberGenLock, the
> comment stated OidGenLock. It should be slightly adjusted.
> typedef struct VariableCacheData
> {
> /*
> * These fields are protected by OidGenLock.
> */
> Oid nextOid; /* next OID to assign */
> uint32 oidCount; /* OIDs available before must do XLOG work */
> RelFileNumber nextRelFileNumber; /* next relfilenumber to assign */
> RelFileNumber loggedRelFileNumber; /* last logged relfilenumber */
> XLogRecPtr loggedRelFileNumberRecPtr; /* xlog record pointer w.r.t.
> * loggedRelFileNumber */

Thanks for the review I have fixed these except,
> 9) should we change elog to ereport to use the New-style error reporting API
I think this is internal error so if we use ereport we need to give
error code and all and I think for internal that is not necessary?

> 8) I felt this include is not required:
it is using access API so we do need <unistd.h>

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

27 июля 2022 г., 19:19:38

On Wed, Jul 27, 2022 at 12:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jul 26, 2022 at 2:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have thought about it while doing so but I am not sure whether it is
> > a good idea or not, because before my change these all were macros
> > with 2 naming conventions so I just changed to inline function so why
> > to change the name.
>
> Well, the reason to change the name would be for consistency. It feels
> weird to have some NAMES_LIKETHIS() and other NamesLikeThis().
>
> Now, an argument against that is that it will make back-patching more
> annoying, if any code using these functions/macros is touched. But
> since the calling sequence is changing anyway (you now have to pass a
> pointer rather than the object itself) that argument doesn't really
> carry any weight. So I would favor ClearBufferTag(), InitBufferTag(),
> etc.

Okay, so I have renamed these 2 functions and BUFFERTAGS_EQUAL as well
to BufferTagEqual().

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

27 июля 2022 г., 19:37:16

On Wed, 27 Jul 2022 at 9:49 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jul 27, 2022 at 12:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jul 26, 2022 at 2:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have thought about it while doing so but I am not sure whether it is
> > a good idea or not, because before my change these all were macros
> > with 2 naming conventions so I just changed to inline function so why
> > to change the name.
>
> Well, the reason to change the name would be for consistency. It feels
> weird to have some NAMES_LIKETHIS() and other NamesLikeThis().
>
> Now, an argument against that is that it will make back-patching more
> annoying, if any code using these functions/macros is touched. But
> since the calling sequence is changing anyway (you now have to pass a
> pointer rather than the object itself) that argument doesn't really
> carry any weight. So I would favor ClearBufferTag(), InitBufferTag(),
> etc.

Okay, so I have renamed these 2 functions and BUFFERTAGS_EQUAL as well
to BufferTagEqual().

Just realised that this should have been BufferTagsEqual instead of BufferTagEqual

I will modify this and send an updated patch tomorrow.

—

Dilip

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

27 июля 2022 г., 21:09:19

On Wed, Jul 27, 2022 at 12:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Just realised that this should have been BufferTagsEqual instead of BufferTagEqual
>
> I will modify this and send an updated patch tomorrow.

I changed it and committed.

What was formerly 0002 will need minor rebasing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

28 июля 2022 г., 14:31:47

On Wed, Jul 27, 2022 at 11:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jul 27, 2022 at 12:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Just realised that this should have been BufferTagsEqual instead of BufferTagEqual
> >
> > I will modify this and send an updated patch tomorrow.
>
> I changed it and committed.
>
> What was formerly 0002 will need minor rebasing.

Thanks, I have rebased other patches,  actually, there is a new 0001
patch now.  It seems during renaming relnode related Oid to
RelFileNumber, some of the references were missed and in the last
patch set I kept it as part of main patch 0003, but I think it's
better to keep it separate.  So took out those changes and created
0001, but you think this can be committed as part of 0003 only then
also it's fine with me.

I have done some cleanup in 0002 as well, basically, earlier we were
storing the result of the BufTagGetRelFileLocator() in a separate
variable which is not required everywhere.  So wherever possible I
have avoided using the intermediate variable.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Thu, Jul 28, 2022 at 9:52 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jul 28, 2022 at 11:59 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I do wonder why do we keep relfilenodes limited to decimal digits. Why
> not use hex digits? Then we know the limit is 14 chars, as in
> 0x00FFFFFFFFFFFFFF in the MAX_RELFILENUMBER definition.

Hmm, but surely we want the error messages to be printed using the
same format that we use for the actual filenames. We could make the
filenames use hex characters too, but I'm not wild about changing
user-visible details like that.

From a DBA perspective this would be a regression in usability.

Founder - https://commandprompt.com/ - 24x7x365 Postgres since 1997
Founder and Co-Chair - https://postgresconf.org/
Founder - https://postgresql.us - United States PostgreSQL
Public speaker, published author, postgresql expert, and people believer.
Host - More than a refresh: A podcast about data and the people who wrangle it.

Re: making relfilenodes 56 bits

От

Alvaro Herrera

Дата:

29 июля 2022 г., 12:36:33

On 2022-Jul-28, Robert Haas wrote:

> On Thu, Jul 28, 2022 at 11:59 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > I do wonder why do we keep relfilenodes limited to decimal digits.  Why
> > not use hex digits?  Then we know the limit is 14 chars, as in
> > 0x00FFFFFFFFFFFFFF in the MAX_RELFILENUMBER definition.
> 
> Hmm, but surely we want the error messages to be printed using the
> same format that we use for the actual filenames.

Of course.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as entropy
 sources, so let's move on."                               (Nathaniel Smith)

Re: making relfilenodes 56 bits

От

Ashutosh Sharma

Дата:

29 июля 2022 г., 15:56:29

On Thu, Jul 28, 2022 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

+/* ----------
+ * RelFileNumber zero is InvalidRelFileNumber.
+ *
+ * For the system tables (OID < FirstNormalObjectId) the initial storage

Above comment says that RelFileNumber zero is invalid which is technically correct because we don't have any relation file in disk with zero number. But the point is that if someone reads below definition of CHECK_RELFILENUMBER_RANGE he/she might get confused because as per this definition relfilenumber zero is valid.

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber) \
+do { \
+ if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+ ereport(ERROR, \
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE), \
+ errmsg("relfilenumber " INT64_FORMAT " is out of range", \
+ (relfilenumber)))); \
+} while (0)
+

+ RelFileNumber relfilenumber = PG_GETARG_INT64(0);
+ CHECK_RELFILENUMBER_RANGE(relfilenumber);

It seems like the relfilenumber in above definition represents relfilenode value in pg_class which can hold zero value which actually means it's a mapped relation. I think it would be good to provide some clarity here.

With Regards,

Ashutosh Sharma.

Re: making relfilenodes 56 bits

От

Ashutosh Sharma

Дата:

29 июля 2022 г., 17:32:08

On Fri, Jul 29, 2022 at 6:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

On Thu, Jul 28, 2022 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

+/* ----------
+ * RelFileNumber zero is InvalidRelFileNumber.
+ *
+ * For the system tables (OID < FirstNormalObjectId) the initial storage

Above comment says that RelFileNumber zero is invalid which is technically correct because we don't have any relation file in disk with zero number. But the point is that if someone reads below definition of CHECK_RELFILENUMBER_RANGE he/she might get confused because as per this definition relfilenumber zero is valid.

Please ignore the above comment shared in my previous email. It is a little over-thinking on my part that generated this comment in my mind. Sorry for that. Here are the other comments I have:

+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_buffercache DROP VIEW pg_buffercache;
+ALTER EXTENSION pg_buffercache DROP FUNCTION pg_buffercache_pages();
+
+/* Then we can drop them */
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+/* Now redefine */
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages_v1_4'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages() AS P
+ (bufferid integer, relfilenode int8, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);

As we are dropping the function and view I think it would be good if we *don't* use the "OR REPLACE" keyword when re-defining them.

==

+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("relfilenode" INT64_FORMAT " is too large to be represented as an OID",
+ fctx->record[i].relfilenumber),
+ errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE")));

I think it would be good to recommend users to upgrade to the latest version instead of just saying upgrade the pg_buffercache using ALTER EXTENSION ....

==

--- a/contrib/pg_walinspect/sql/pg_walinspect.sql
+++ b/contrib/pg_walinspect/sql/pg_walinspect.sql
@@ -39,10 +39,10 @@ SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal(:'wal_lsn1');
-- Test for filtering out WAL records of a particular table
-- ===================================================================

-SELECT oid AS sample_tbl_oid FROM pg_class WHERE relname = 'sample_tbl' \gset
+SELECT relfilenode AS sample_tbl_relfilenode FROM pg_class WHERE relname = 'sample_tbl' \gset

Is this change required? The original query is just trying to fetch table oid not relfilenode and AFAIK we haven't changed anything in table oid.

==

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber) \
+do { \
+ if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+ ereport(ERROR, \
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE), \
+ errmsg("relfilenumber " INT64_FORMAT " is out of range", \
+ (relfilenumber)))); \
+} while (0)
+

I think we can shift this macro to some header file and reuse it at several places.

==

+ * Generate a new relfilenumber. We cannot reuse the old relfilenumber
+ * because of the possibility that that relation will be moved back to the

that that relation -> that relation

With Regards,

Ashutosh Sharma.

Re: making relfilenodes 56 bits

От

vignesh C

Дата:

29 июля 2022 г., 18:44:08

On Wed, Jul 27, 2022 at 6:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jul 27, 2022 at 3:27 PM vignesh C <vignesh21@gmail.com> wrote:
> >
>
> > Thanks for the updated patch, Few comments:
> > 1) The format specifier should be changed from %u to INT64_FORMAT
> > autoprewarm.c -> apw_load_buffers
> > ...............
> > if (fscanf(file, "%u,%u,%u,%u,%u\n", &blkinfo[i].database,
> >    &blkinfo[i].tablespace, &blkinfo[i].filenumber,
> >    &forknum, &blkinfo[i].blocknum) != 5)
> > ...............
> >
> > 2) The format specifier should be changed from %u to INT64_FORMAT
> > autoprewarm.c -> apw_dump_now
> > ...............
> > ret = fprintf(file, "%u,%u,%u,%u,%u\n",
> >   block_info_array[i].database,
> >   block_info_array[i].tablespace,
> >   block_info_array[i].filenumber,
> >   (uint32) block_info_array[i].forknum,
> >   block_info_array[i].blocknum);
> > ...............
> >
> > 3) should the comment "entry point for old extension version" be on
> > top of pg_buffercache_pages, as the current version will use
> > pg_buffercache_pages_v1_4
> > +
> > +Datum
> > +pg_buffercache_pages(PG_FUNCTION_ARGS)
> > +{
> > +       return pg_buffercache_pages_internal(fcinfo, OIDOID);
> > +}
> > +
> > +/* entry point for old extension version */
> > +Datum
> > +pg_buffercache_pages_v1_4(PG_FUNCTION_ARGS)
> > +{
> > +       return pg_buffercache_pages_internal(fcinfo, INT8OID);
> > +}
> >
> > 4) we could use the new style or ereport by removing the brackets
> > around errcode:
> > +                               if (fctx->record[i].relfilenumber > OID_MAX)
> > +                                       ereport(ERROR,
> > +
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +
> > errmsg("relfilenode" INT64_FORMAT " is too large to be represented as
> > an OID",
> > +
> >  fctx->record[i].relfilenumber),
> > +
> > errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache
> > UPDATE")));
> >
> > like:
> > ereport(ERROR,
> >
> > errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> >
> > errmsg("relfilenode" INT64_FORMAT " is too large to be represented as
> > an OID",
> >
> > fctx->record[i].relfilenumber),
> >
> > errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache
> > UPDATE"));
> >
> > 5) Similarly in the below code too:
> > +       /* check whether the relfilenumber is within a valid range */
> > +       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
> > +               ereport(ERROR,
> > +                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +                                errmsg("relfilenumber " INT64_FORMAT
> > " is out of range",
> > +                                               (relfilenumber))));
> >
> >
> > 6) Similarly in the below code too:
> > +#define CHECK_RELFILENUMBER_RANGE(relfilenumber)
> >          \
> > +do {
> >                                                          \
> > +       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
> > +               ereport(ERROR,
> >                                                  \
> > +
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),                      \
> > +                                errmsg("relfilenumber " INT64_FORMAT
> > " is out of range",       \
> > +                                               (relfilenumber)))); \
> > +} while (0)
> > +
> >
> >
> > 7) This error code looks similar to CHECK_RELFILENUMBER_RANGE, can
> > this macro be used here too:
> > pg_filenode_relation(PG_FUNCTION_ARGS)
> >  {
> >         Oid                     reltablespace = PG_GETARG_OID(0);
> > -       RelFileNumber relfilenumber = PG_GETARG_OID(1);
> > +       RelFileNumber relfilenumber = PG_GETARG_INT64(1);
> >         Oid                     heaprel;
> >
> > +       /* check whether the relfilenumber is within a valid range */
> > +       if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
> > +               ereport(ERROR,
> > +                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +                                errmsg("relfilenumber " INT64_FORMAT
> > " is out of range",
> > +                                               (relfilenumber))));
> >
> >
> > 8) I felt this include is not required:
> > diff --git a/src/backend/access/transam/varsup.c
> > b/src/backend/access/transam/varsup.c
> > index 849a7ce..a2f0d35 100644
> > --- a/src/backend/access/transam/varsup.c
> > +++ b/src/backend/access/transam/varsup.c
> > @@ -13,12 +13,16 @@
> >
> >  #include "postgres.h"
> >
> > +#include <unistd.h>
> > +
> >  #include "access/clog.h"
> >  #include "access/commit_ts.h"
> >
> > 9) should we change elog to ereport to use the New-style error reporting API
> > +       /* safety check, we should never get this far in a HS standby */
> > +       if (RecoveryInProgress())
> > +               elog(ERROR, "cannot assign RelFileNumber during recovery");
> > +
> > +       if (IsBinaryUpgrade)
> > +               elog(ERROR, "cannot assign RelFileNumber during binary
> > upgrade");
> >
> > 10) Here nextRelFileNumber is protected by RelFileNumberGenLock, the
> > comment stated OidGenLock. It should be slightly adjusted.
> > typedef struct VariableCacheData
> > {
> > /*
> > * These fields are protected by OidGenLock.
> > */
> > Oid nextOid; /* next OID to assign */
> > uint32 oidCount; /* OIDs available before must do XLOG work */
> > RelFileNumber nextRelFileNumber; /* next relfilenumber to assign */
> > RelFileNumber loggedRelFileNumber; /* last logged relfilenumber */
> > XLogRecPtr loggedRelFileNumberRecPtr; /* xlog record pointer w.r.t.
> > * loggedRelFileNumber */
>
> Thanks for the review I have fixed these except,
> > 9) should we change elog to ereport to use the New-style error reporting API
> I think this is internal error so if we use ereport we need to give
> error code and all and I think for internal that is not necessary?

Ok, Sounds reasonable.

> > 8) I felt this include is not required:
> it is using access API so we do need <unistd.h>

Ok, It worked for me because I had not used the ASSERT enabled flag
while compilation.

Regards,
Vignesh

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

29 июля 2022 г., 20:24:50

On Thu, Jul 28, 2022 at 10:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I have done some cleanup in 0002 as well, basically, earlier we were
> > storing the result of the BufTagGetRelFileLocator() in a separate
> > variable which is not required everywhere.  So wherever possible I
> > have avoided using the intermediate variable.
>
> I'll have a look at this next.

I was taught that when programming in C one should avoid returning a
struct type, as BufTagGetRelFileLocator does. I would have expected it
to return void and take an argument of type RelFileLocator * into
which it writes the results. On the other hand, I was also taught that
one should avoid passing a struct type as an argument, and smgropen()
has been doing that since Tom Lane committed
87bd95638552b8fc1f5f787ce5b862bb6fc2eb80 all the way back in 2004. So
maybe this isn't that relevant any more on modern compilers? Or maybe
for small structs it doesn't matter much? I dunno.

Other than that, I think your 0002 looks fine.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Alvaro Herrera

Дата:

29 июля 2022 г., 21:12:53

On 2022-Jul-29, Robert Haas wrote:

> I was taught that when programming in C one should avoid returning a
> struct type, as BufTagGetRelFileLocator does.

Doing it like that helps RelFileLocatorSkippingWAL, which takes a bare
RelFileLocator as argument.  With this coding you can call one function
with the other function as its argument.

However, with the current definition of relpathbackend() and siblings,
it looks quite disastrous -- BufTagGetRelFileLocator is being called
three times.  You could argue that a solution would be to turn those
macros into static inline functions.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"I'm impressed how quickly you are fixing this obscure issue. I came from 
MS SQL and it would be hard for me to put into words how much of a better job
you all are doing on [PostgreSQL]."
 Steve Midgley, http://archives.postgresql.org/pgsql-sql/2008-08/msg00000.php

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

29 июля 2022 г., 21:41:29

On Fri, Jul 29, 2022 at 2:12 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Jul-29, Robert Haas wrote:
> > I was taught that when programming in C one should avoid returning a
> > struct type, as BufTagGetRelFileLocator does.
>
> Doing it like that helps RelFileLocatorSkippingWAL, which takes a bare
> RelFileLocator as argument.  With this coding you can call one function
> with the other function as its argument.
>
> However, with the current definition of relpathbackend() and siblings,
> it looks quite disastrous -- BufTagGetRelFileLocator is being called
> three times.  You could argue that a solution would be to turn those
> macros into static inline functions.

Yeah, if we think it's OK to pass around structs, then that seems like
the right solution. Otherwise functions that take RelFileLocator
should be changed to take const RelFileLocator * and we should adjust
elsewhere accordingly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Alvaro Herrera

Дата:

29 июля 2022 г., 22:18:18

On 2022-Jul-29, Robert Haas wrote:

> Yeah, if we think it's OK to pass around structs, then that seems like
> the right solution. Otherwise functions that take RelFileLocator
> should be changed to take const RelFileLocator * and we should adjust
> elsewhere accordingly.

We do that in other places.  See get_object_address() for another
example.  Now, I don't see *why* they do it.  I suppose there's
notational convenience; for get_object_address() I think it'd be uglier
with another out argument (it already has *relp).  For smgropen() it's
not clear at all that there is any.

For the new function, there's at least a couple of places that the
calling convention makes simpler, so I don't see why you wouldn't use it
that way.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

29 июля 2022 г., 22:57:17

On Fri, Jul 29, 2022 at 3:18 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Jul-29, Robert Haas wrote:
> > Yeah, if we think it's OK to pass around structs, then that seems like
> > the right solution. Otherwise functions that take RelFileLocator
> > should be changed to take const RelFileLocator * and we should adjust
> > elsewhere accordingly.
>
> We do that in other places.  See get_object_address() for another
> example.  Now, I don't see *why* they do it.  I suppose there's
> notational convenience; for get_object_address() I think it'd be uglier
> with another out argument (it already has *relp).  For smgropen() it's
> not clear at all that there is any.
>
> For the new function, there's at least a couple of places that the
> calling convention makes simpler, so I don't see why you wouldn't use it
> that way.

All right, perhaps it's fine as Dilip has it, then.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Tom Lane

Дата:

29 июля 2022 г., 23:05:34

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2022-Jul-29, Robert Haas wrote:
>> Yeah, if we think it's OK to pass around structs, then that seems like
>> the right solution. Otherwise functions that take RelFileLocator
>> should be changed to take const RelFileLocator * and we should adjust
>> elsewhere accordingly.

> We do that in other places.  See get_object_address() for another
> example.  Now, I don't see *why* they do it.

If it's a big struct then avoiding copying it is good; but RelFileLocator
isn't that big.

While researching that statement I did happen to notice that no one has
bothered to update the comment immediately above struct RelFileLocator,
and it is something that absolutely does require attention if there
are plans to make RelFileNumber something other than 32 bits.

 * Note: various places use RelFileLocator in hashtable keys.  Therefore,
 * there *must not* be any unused padding bytes in this struct.  That
 * should be safe as long as all the fields are of type Oid.
 */
typedef struct RelFileLocator
{
    Oid            spcOid;            /* tablespace */
    Oid            dbOid;             /* database */
    RelFileNumber  relNumber;         /* relation */
} RelFileLocator;

            regards, tom lane

Re: making relfilenodes 56 bits

От

Tom Lane

Дата:

29 июля 2022 г., 23:08:21

Robert Haas <robertmhaas@gmail.com> writes:
> I was taught that when programming in C one should avoid returning a
> struct type, as BufTagGetRelFileLocator does.

FWIW, I think that was invalid pre-ANSI-C, and maybe even in C89.
C99 and later requires it.  But it is pass-by-value and you have
to think twice about whether you want the struct to be copied.

            regards, tom lane

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

29 июля 2022 г., 23:29:28

On Wed, Jul 20, 2022 at 7:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> There was also an issue where the user table from the old cluster's
> relfilenode could conflict with the system table of the new cluster.
> As a solution currently for system table object (while creating
> storage first time) we are keeping the low range of relfilenumber,
> basically we are using the same relfilenumber as OID so that during
> upgrade the normal user table from the old cluster will not conflict
> with the system tables in the new cluster.  But with this solution
> Robert told me (in off list chat) a problem that in future if we want
> to make relfilenumber completely unique within a cluster by
> implementing the CREATEDB differently then we can not do that as we
> have created fixed relfilenodes for the system tables.
>
> I am not sure what exactly we can do to avoid that because even if we
> do something  to avoid that in the new cluster the old cluster might
> be already using the non-unique relfilenode so after upgrading the new
> cluster will also get those non-unique relfilenode.

I think this aspect of the patch could use some more discussion.

To recap, the problem is that pg_upgrade mustn't discover that a
relfilenode that is being migrated from the old cluster is being used
for some other table in the new cluster. Since the new cluster should
only contain system tables that we assume have never been rewritten,
they'll all have relfilenodes equal to their OIDs, and thus less than
16384. On the other hand all the user tables from the old cluster will
have relfilenodes greater than 16384, so we're fine. pg_largeobject,
which also gets migrated, is a special case. Since we don't change OID
assignments from version to version, it should have either the same
relfilenode value in the old and new clusters, if never rewritten, or
else the value in the old cluster will be greater than 16384, in which
case no conflict is possible.

But if we just assign all relfilenode values from a central counter,
then we have got trouble. If the new version has more system catalog
tables than the old version, then some value that got used for a user
table in the old version might get used for a system table in the new
version, which is a problem. One idea for fixing this is to have two
RelFileNumber ranges: a system range (small values) and a user range.
System tables get values in the system range initially, and in the
user range when first rewritten. User tables always get values in the
user range. Everything works fine in this scenario except maybe for
pg_largeobject: what if it gets one value from the system range in the
old cluster, and a different value from the system range in the new
cluster, but some other system table in the new cluster gets the value
that pg_largeobject had in the old cluster? Then we've got trouble. It
doesn't help if we assign pg_largeobject a starting relfilenode from
the user range, either: now a relfilenode that needs to end up
containing the some user table from the old cluster might find itself
blocked by pg_largeobject in the new cluster.

One solution to all this is to do as Dilip proposes here: for system
relations, keep assigning the OID as the initial relfilenumber.
Actually, we really only need to do this for pg_largeobject; all the
other relfilenumber values could be assigned from a counter, as long
as they're assigned from a range distinct from what we use for user
relations.

But I don't really like that, because I feel like the whole thing
where we start out with relfilenumber=oid is a recipe for hidden bugs.
I believe we'd be better off if we decouple those concepts more
thoroughly. So here's another idea: what if we set the
next-relfilenumber counter for the new cluster to the value from the
old cluster, and then rewrote all the (thus-far-empty) system tables?
Then every system relation in the new cluster has a relfilenode value
greater than any in use in the old cluster, so we can afterwards
migrate over every relfilenode from the old cluster with no risk of
conflicting with anything. Then all the special cases go away. We
don't need system and user ranges for relfilenodes, and
pg_largeobject's not a special case, either. We can assign relfilenode
values to system relations in exactly the same we do for user
relations: assign a value from the global counter and forget about it.
If this cluster happens to be the "new cluster" for a pg_upgrade
attempt, the procedure described at the beginning of this paragraph
will move everything that might conflict out of the way.

One thing to perhaps not like about this is that it's a little more
expensive: clustering every system table in every database on a new
cluster isn't completely free. Perhaps it's not expensive enough to be
a big problem, though.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Thomas Munro

Дата:

30 июля 2022 г., 00:11:12

On Sat, Jul 30, 2022 at 8:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I was taught that when programming in C one should avoid returning a
> > struct type, as BufTagGetRelFileLocator does.
>
> FWIW, I think that was invalid pre-ANSI-C, and maybe even in C89.
> C99 and later requires it.  But it is pass-by-value and you have
> to think twice about whether you want the struct to be copied.

C89 had that.

As for what it actually does in a non-inlined function: on all modern
Unix-y systems, 128 bit first arguments and return values are
transferred in register pairs[1].  So if you define a struct that
holds uint32_t, uint32_t, uint64_t and compile a function that takes
one and returns it, you see the struct being transferred directly from
input registers to output registers:

   0x0000000000000000 <+0>:    mov    %rdi,%rax
   0x0000000000000003 <+3>:    mov    %rsi,%rdx
   0x0000000000000006 <+6>:    ret

Similar on ARM64.  There it's an empty function, so it must be using
the same register in and out[2].

The MSVC calling convention is different and doesn't seem to be able
to pass it through registers, so it schleps it out to memory at a
return address[3].  But that's pretty similar to the proposed
alternative anyway, so surely no worse.  *shrug*  And of course those
"constructor"-like functions are inlined anyway.

[1] https://en.wikipedia.org/wiki/X86_calling_conventions#System_V_AMD64_ABI
[2] https://gcc.godbolt.org/z/qfPzhW7YM
[3] https://gcc.godbolt.org/z/WqvYz6xjs

Re: making relfilenodes 56 bits

От

Thomas Munro

Дата:

30 июля 2022 г., 00:17:08

On Sat, Jul 30, 2022 at 9:11 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> on all modern Unix-y systems,

(I meant to write AMD64 there)

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

30 июля 2022 г., 06:45:03

On Thu, Jul 28, 2022 at 9:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Not a full review, just a quick skim of 0003.

Thanks for the review

> > +     if (!shutdown)
> > +     {
> > +             if (ShmemVariableCache->loggedRelFileNumber < checkPoint.nextRelFileNumber)
> > +                     elog(ERROR, "nextRelFileNumber can not go backward from " INT64_FORMAT "to" INT64_FORMAT,
> > +                              checkPoint.nextRelFileNumber, ShmemVariableCache->loggedRelFileNumber);
> > +
> > +             checkPoint.nextRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
> > +     }
>
> Please don't do this; rather use %llu and cast to (long long).
> Otherwise the string becomes mangled for translation.  I think there are
> many uses of this sort of pattern in strings, but not all of them are
> translatable so maybe we don't care -- for example contrib doesn't have
> translations.  And the rmgrdesc routines don't translate either, so we
> probably don't care about it there; and nothing that uses elog either.
> But this one in particular I think should be an ereport, not an elog.
> There are several other ereports in various places of the patch also.

Okay, actually I did not understand the clear logic of when to use
%llu and to use (U)INT64_FORMAT.  They are both used for 64-bit
integers.  So do you think it is fine to replace all INT64_FORMAT in
my patch with %llu?

> > @@ -2378,7 +2378,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
> >               if (memcmp(replay_image_masked, primary_image_masked, BLCKSZ) != 0)
> >               {
> >                       elog(FATAL,
> > -                              "inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
> > +                              "inconsistent page found, rel %u/%u/" INT64_FORMAT ", forknum %u, blkno %u",
> >                                rlocator.spcOid, rlocator.dbOid, rlocator.relNumber,
> >                                forknum, blkno);
>
> Should this one be an ereport, and thus you do need to change it to that
> and handle it like that?

Okay, so you mean irrespective of this patch should this be converted
to ereport?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Alvaro Herrera

Дата:

30 июля 2022 г., 14:39:22

On 2022-Jul-30, Dilip Kumar wrote:

> On Thu, Jul 28, 2022 at 9:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> > Please don't do this; rather use %llu and cast to (long long).
> > Otherwise the string becomes mangled for translation.
> 
> Okay, actually I did not understand the clear logic of when to use
> %llu and to use (U)INT64_FORMAT.  They are both used for 64-bit
> integers.  So do you think it is fine to replace all INT64_FORMAT in
> my patch with %llu?

The point here is that there are two users of the source code: one is
the compiler, and the other is gettext, which extracts the string for
the translation catalog.  The compiler is OK with UINT64_FORMAT, of
course (because the preprocessor deals with it).  But gettext is quite
stupid and doesn't understand that UINT64_FORMAT expands to some
specifier, so it truncates the string at the double quote sign just
before; in other words, it just doesn't work.  So whenever you have a
string that ends up in a translation catalog, you must not use
UINT64_FORMAT or any other preprocessor macro; it has to be a straight
specifier in the format string.

We have found that the most convenient notation is to use %llu in the
string and cast the argument to (unsigned long long), so our convention
is to use that.

For strings that do not end up in a translation catalog, there's no
reason to use %llu-and-cast; UINT64_FORMAT is okay.

> > > @@ -2378,7 +2378,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
> > >               if (memcmp(replay_image_masked, primary_image_masked, BLCKSZ) != 0)
> > >               {
> > >                       elog(FATAL,
> > > -                              "inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
> > > +                              "inconsistent page found, rel %u/%u/" INT64_FORMAT ", forknum %u, blkno %u",
> > >                                rlocator.spcOid, rlocator.dbOid, rlocator.relNumber,
> > >                                forknum, blkno);
> >
> > Should this one be an ereport, and thus you do need to change it to that
> > and handle it like that?
> 
> Okay, so you mean irrespective of this patch should this be converted
> to ereport?

Yes, I think this should be an ereport with errcode(ERRCODE_DATA_CORRUPTED).

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

01 августа 2022 г., 08:20:59

On Sat, Jul 30, 2022 at 1:35 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > On 2022-Jul-29, Robert Haas wrote:
> >> Yeah, if we think it's OK to pass around structs, then that seems like
> >> the right solution. Otherwise functions that take RelFileLocator
> >> should be changed to take const RelFileLocator * and we should adjust
> >> elsewhere accordingly.
>
> > We do that in other places.  See get_object_address() for another
> > example.  Now, I don't see *why* they do it.
>
> If it's a big struct then avoiding copying it is good; but RelFileLocator
> isn't that big.
>
> While researching that statement I did happen to notice that no one has
> bothered to update the comment immediately above struct RelFileLocator,
> and it is something that absolutely does require attention if there
> are plans to make RelFileNumber something other than 32 bits.

I think we need to update this comment in the patch where we are
making RelFileNumber 64 bits wide.  But as such I do not see a problem
in using RelFileLocator directly as key because if we make
RelFileNumber 64 bits then its structure will already be 8 byte
aligned so there should not be any padding.  However, if we use some
other structure as key which contain RelFileLocator i.e.
RelFileLocatorBackend then there will be a problem.  So for handling
that issue while computing the key size (wherever we have
RelFileLocatorBackend as key) I have avoided the padding bytes in size
by introducing this new macro[1].

[1]
#define SizeOfRelFileLocatorBackend \
(offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId))

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

01 августа 2022 г., 08:34:09

On Fri, Jul 29, 2022 at 10:55 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jul 28, 2022 at 10:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > I have done some cleanup in 0002 as well, basically, earlier we were
> > > storing the result of the BufTagGetRelFileLocator() in a separate
> > > variable which is not required everywhere.  So wherever possible I
> > > have avoided using the intermediate variable.
> >
> > I'll have a look at this next.
>
> I was taught that when programming in C one should avoid returning a
> struct type, as BufTagGetRelFileLocator does. I would have expected it
> to return void and take an argument of type RelFileLocator * into
> which it writes the results. On the other hand, I was also taught that
> one should avoid passing a struct type as an argument, and smgropen()
> has been doing that since Tom Lane committed
> 87bd95638552b8fc1f5f787ce5b862bb6fc2eb80 all the way back in 2004. So
> maybe this isn't that relevant any more on modern compilers? Or maybe
> for small structs it doesn't matter much? I dunno.
>
> Other than that, I think your 0002 looks fine.

Generally, I try to avoid it, but I see in current code also if the
structure is small and by directly returning the structure it makes
the other code easy then we are doing this way[1].  I wanted to do
this way is a) if we pass as an argument then I will have to use an
extra variable which makes some code complicated, it's not a big
issue, infact I had it that way in the previous version but simplified
in one of the recent versions.  b) If I allocate memory and return
pointer then also I need to store that address and later free that.

[1]
static inline ForEachState
for_each_from_setup(const List *lst, int N)
{
ForEachState r = {lst, N};

Assert(N >= 0);
return r;
}

static inline FullTransactionId
FullTransactionIdFromEpochAndXid(uint32 epoch, TransactionId xid)
{
FullTransactionId result;

result.value = ((uint64) epoch) << 32 | xid;

return result;
}


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

01 августа 2022 г., 14:57:01

On Fri, Jul 29, 2022 at 8:02 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
>
> +                   ereport(ERROR,
> +                           (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                            errmsg("relfilenode" INT64_FORMAT " is too large to be represented as an OID",
> +                                   fctx->record[i].relfilenumber),
> +                            errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE")));
>
> I think it would be good to recommend users to upgrade to the latest version instead of just saying upgrade the
pg_buffercacheusing ALTER EXTENSION ....

This error would be hit if the relfilenumber is out of OID range that
means the user is using a new cluster but old pg_buffercache
extension.  So this errhint is about suggesting to upgrade the
extension.

> ==
>
> --- a/contrib/pg_walinspect/sql/pg_walinspect.sql
> +++ b/contrib/pg_walinspect/sql/pg_walinspect.sql
> @@ -39,10 +39,10 @@ SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal(:'wal_lsn1');
>  -- Test for filtering out WAL records of a particular table
>  -- ===================================================================
>
> -SELECT oid AS sample_tbl_oid FROM pg_class WHERE relname = 'sample_tbl' \gset
> +SELECT relfilenode AS sample_tbl_relfilenode FROM pg_class WHERE relname = 'sample_tbl' \gset
>
> Is this change required? The original query is just trying to fetch table oid not relfilenode and AFAIK we haven't
changedanything in table oid.

If you notice the complete test, then you will realize that
sample_tbl_oid are used for verifying that in
pg_get_wal_records_info(), so earlier it was okay if we were using oid
instead of relfilenode because this test case is just creating table
doing some DML and verifying oid in WAL so that will be same as
relfilenode, but that is no longer true.  So we will have to check the
relfilenode that was the actual intention of the test.

>
> +    * Generate a new relfilenumber.  We cannot reuse the old relfilenumber
> +    * because of the possibility that that relation will be moved back to the
>
> that that relation -> that relation
>

I think this is a grammatically correct sentence .

I have fixed other comments, and also fixed comments from Alvaro to
use %lld instead of INT64_FORMAT inside the ereport and wherever he
suggested.

I haven't yet changed MAX_RELFILENUMBER to represent the hex
characters because then we will have to change the filename as well.
So I think there is no conclusion on this yet whether we want to keep
it as it is or in hex.  And there is another suggestion to change one
of the existing elog to an ereport, so for that I will share a
separate patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Sat, Sep 3, 2022 at 1:50 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 30, 2022 at 9:23 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > Well, that's very awkward. It doesn't seem like it would be very
> > difficult to teach pg_upgrade to call pg_restore without --clean and
> > just do the drop database itself, but that doesn't really help,
> > because pg_restore will in any event be creating the new database.
> > That doesn't seem like something we can practically refactor out,
> > because only pg_dump knows what properties to use when creating the
> > new database. What we could do is have the dump include a command like
> > SELECT pg_binary_upgrade_move_things_out_of_the_way(some_arguments_here),
> > but that doesn't really help very much, because passing the whole list
> > of relfilenode values from the old database seems pretty certain to be
> > a bad idea. The whole idea here was that we'd be able to build a hash
> > table on the new database's system table OIDs, and it seems like
> > that's not going to work.
>
> Right.
>
> > We could try to salvage some portion of the idea by making
> > pg_binary_upgrade_move_things_out_of_the_way() take a more restricted
> > set of arguments, like the smallest and largest relfilenode values
> > from the old database, and then we'd just need to move things that
> > overlap. But that feels pretty hit-or-miss to me as to whether it
> > actually avoids any work, and
> > pg_binary_upgrade_move_things_out_of_the_way() might also be annoying
> > to write. So perhaps we have to go back to the drawing board here.
>
> So as of now, we have two open options 1) the current approach and
> what patch is following to use Oid as relfilenode for the system
> tables when initially created.  2) call
> pg_binary_upgrade_move_things_out_of_the_way() which force rewrite all
> the system tables.
>
> Another idea that I am not very sure how feasible is. Can we change
> the dump such that in binary upgrade mode it will not use template0 as
> a template database (in creating database command) but instead some
> new database as a template e.g. template-XYZ?   And later for conflict
> checking, we will create this template-XYZ database on the new cluster
> and then we will perform all the conflict check (from all the
> databases of the old cluster) and rewrite operations on this database.
> And later all the databases will be created using template-XYZ as the
> template and all the rewriting stuff we have done is still intact.
> The problems I could think of are 1) only for a binary upgrade we will
> have to change the pg_dump.  2) we will have to use another database
> name as the reserved database name but what if that name is already in
> use in the previous cluster?

While we are still thinking on this issue, I have rebased the patch on
the latest head and fixed a couple of minor issues.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

v16-0001-Widen-relfilenumber-from-32-bits-to-56-bits.patch

Re: making relfilenodes 56 bits

От

Amit Kapila

Дата:

03 сентября 2022 г., 14:41:33

On Tue, Aug 30, 2022 at 6:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Aug 26, 2022 at 9:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Fri, Aug 26, 2022 at 7:01 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > While working on this solution I noticed one issue. Basically, the
> > > problem is that during binary upgrade when we try to rewrite a heap we
> > > would expect that “binary_upgrade_next_heap_pg_class_oid” and
> > > “binary_upgrade_next_heap_pg_class_relfilenumber” are already set for
> > > creating a new heap. But we are not preserving anything so we don't
> > > have those values. One option to this problem is that we can first
> > > start the postmaster in non-binary upgrade mode perform all conflict
> > > checking and rewrite and stop the postmaster.  Then start postmaster
> > > again and perform the restore as we are doing now.  Although we will
> > > have to start/stop the postmaster one extra time we have a solution.
> >
> > Yeah, that seems OK. Or we could add a new function, like
> > binary_upgrade_allow_relation_oid_and_relfilenode_assignment(bool).
> > Not sure which way is better.
>
> I have found one more issue with this approach of rewriting the
> conflicting table.  Earlier I thought we could do the conflict
> checking and rewriting inside create_new_objects() right before the
> restore command.  But after implementing (while testing) this I
> realized that we DROP and CREATE the database while restoring the dump
> that means it will again generate the conflicting system tables.  So
> theoretically the rewriting should go in between the CREATE DATABASE
> and restoring the object but as of now both create database and
> restoring other objects are part of a single dump file.  I haven't yet
> analyzed how feasible it is to generate the dump in two parts, first
> part just to create the database and in second part restore the rest
> of the object.
>

Isn't this happening because we are passing "--clean
--create"/"--create" options to pg_restore in create_new_objects()? If
so, then I think one idea to decouple would be to not use those
options. Perform drop/create separately via commands (for create, we
need to generate the command as we are generating while generating the
dump in custom format), then rewrite the conflicting tables, and
finally restore the dump.

--
With Regards,
Amit Kapila.

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

04 сентября 2022 г., 06:57:44

On Sat, Sep 3, 2022 at 5:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> > I have found one more issue with this approach of rewriting the
> > conflicting table.  Earlier I thought we could do the conflict
> > checking and rewriting inside create_new_objects() right before the
> > restore command.  But after implementing (while testing) this I
> > realized that we DROP and CREATE the database while restoring the dump
> > that means it will again generate the conflicting system tables.  So
> > theoretically the rewriting should go in between the CREATE DATABASE
> > and restoring the object but as of now both create database and
> > restoring other objects are part of a single dump file.  I haven't yet
> > analyzed how feasible it is to generate the dump in two parts, first
> > part just to create the database and in second part restore the rest
> > of the object.
> >
>
> Isn't this happening because we are passing "--clean
> --create"/"--create" options to pg_restore in create_new_objects()? If
> so, then I think one idea to decouple would be to not use those
> options. Perform drop/create separately via commands (for create, we
> need to generate the command as we are generating while generating the
> dump in custom format), then rewrite the conflicting tables, and
> finally restore the dump.

Hmm, you are right.  So I think something like this is possible to do,
I will explore this more. Thanks for the idea.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

06 сентября 2022 г., 11:40:28

On Sun, Sep 4, 2022 at 9:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Sep 3, 2022 at 5:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> > Isn't this happening because we are passing "--clean
> > --create"/"--create" options to pg_restore in create_new_objects()? If
> > so, then I think one idea to decouple would be to not use those
> > options. Perform drop/create separately via commands (for create, we
> > need to generate the command as we are generating while generating the
> > dump in custom format), then rewrite the conflicting tables, and
> > finally restore the dump.
>
> Hmm, you are right.  So I think something like this is possible to do,
> I will explore this more. Thanks for the idea.

I have explored this area more and also tried to come up with a
working prototype, so while working on this I realized that we would
have almost to execute all the code which is getting generated as part
of the dumpDatabase() and dumpACL() which is basically,

1. UPDATE pg_catalog.pg_database SET datistemplate = false
2. DROP DATABASE
3. CREATE DATABASE with all the database properties like ENCODING,
LOCALE_PROVIDER, LOCALE, LC_COLLATE, LC_CTYPE, ICU_LOCALE,
COLLATION_VERSION, TABLESPACE
4. COMMENT ON DATABASE
5. Logic inside dumpACL()

I feel duplicating logic like this is really error-prone, but I do not
find any clear way to reuse the code as dumpDatabase() has a high
dependency on the Archive handle and generating the dump file.

So currently I have implemented most of this logic except for a few
e.g. dumpACL(), comments on the database, etc.  So before we go too
far in this direction I wanted to know the opinions of others.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

06 сентября 2022 г., 20:37:13

On Tue, Sep 6, 2022 at 4:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have explored this area more and also tried to come up with a
> working prototype, so while working on this I realized that we would
> have almost to execute all the code which is getting generated as part
> of the dumpDatabase() and dumpACL() which is basically,
>
> 1. UPDATE pg_catalog.pg_database SET datistemplate = false
> 2. DROP DATABASE
> 3. CREATE DATABASE with all the database properties like ENCODING,
> LOCALE_PROVIDER, LOCALE, LC_COLLATE, LC_CTYPE, ICU_LOCALE,
> COLLATION_VERSION, TABLESPACE
> 4. COMMENT ON DATABASE
> 5. Logic inside dumpACL()
>
> I feel duplicating logic like this is really error-prone, but I do not
> find any clear way to reuse the code as dumpDatabase() has a high
> dependency on the Archive handle and generating the dump file.

Yeah, I don't think this is the way to go at all. The duplicated logic
is likely to get broken, and is also likely to annoy the next person
who has to maintain it.

I suggest that for now we fall back on making the initial
RelFileNumber for a system table equal to pg_class.oid. I don't really
love that system and I think maybe we should change it at some point
in the future, but all the alternatives seem too complicated to cram
them into the current patch.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

08 сентября 2022 г., 13:40:28

On Tue, Sep 6, 2022 at 11:07 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 6, 2022 at 4:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have explored this area more and also tried to come up with a
> working prototype, so while working on this I realized that we would
> have almost to execute all the code which is getting generated as part
> of the dumpDatabase() and dumpACL() which is basically,
>
> 1. UPDATE pg_catalog.pg_database SET datistemplate = false
> 2. DROP DATABASE
> 3. CREATE DATABASE with all the database properties like ENCODING,
> LOCALE_PROVIDER, LOCALE, LC_COLLATE, LC_CTYPE, ICU_LOCALE,
> COLLATION_VERSION, TABLESPACE
> 4. COMMENT ON DATABASE
> 5. Logic inside dumpACL()
>
> I feel duplicating logic like this is really error-prone, but I do not
> find any clear way to reuse the code as dumpDatabase() has a high
> dependency on the Archive handle and generating the dump file.

Yeah, I don't think this is the way to go at all. The duplicated logic
is likely to get broken, and is also likely to annoy the next person
who has to maintain it.

Right

I suggest that for now we fall back on making the initial
RelFileNumber for a system table equal to pg_class.oid. I don't really
love that system and I think maybe we should change it at some point
in the future, but all the alternatives seem too complicated to cram
them into the current patch.

That makes sense.

On a separate note, while reviewing the latest patch I see there is some risk of using the unflushed relfilenumber in GetNewRelFileNumber() function. Basically, in the current code, the flushing logic is tightly coupled with the logging new relfilenumber logic and that might not work with all the values of the VAR_RELNUMBER_NEW_XLOG_THRESHOLD. So the idea is we need to keep the flushing logic separate from the logging, I am working on the idea and I will post the patch soon.

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

09 сентября 2022 г., 13:02:17

On Thu, Sep 8, 2022 at 4:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> On a separate note, while reviewing the latest patch I see there is some risk of using the unflushed relfilenumber in
GetNewRelFileNumber()function.  Basically, in the current code, the flushing logic is tightly coupled with the logging
newrelfilenumber logic and that might not work with all the values of the VAR_RELNUMBER_NEW_XLOG_THRESHOLD.  So the
ideais we need to keep the flushing logic separate from the logging, I am working on the idea and I will post the patch
soon.

I have fixed the issue, so now we will track nextRelFileNumber,
loggedRelFileNumber and flushedRelFileNumber.  So whenever
nextRelFileNumber is just VAR_RELNUMBER_NEW_XLOG_THRESHOLD behind the
loggedRelFileNumber we will log VAR_RELNUMBER_PER_XLOG more
relfilenumbers.  And whenever nextRelFileNumber reaches the
flushedRelFileNumber then we will do XlogFlush for WAL upto the last
loggedRelFileNumber.  Ideally flushedRelFileNumber should always be
VAR_RELNUMBER_PER_XLOG number behind the loggedRelFileNumber so we can
avoid tracking the flushedRelFileNumber.  But I feel keeping track of
the flushedRelFileNumber looks cleaner and easier to understand.  For
more details refer to the code in GetNewRelFileNumber().

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

v17-0001-Widen-relfilenumber-from-32-bits-to-56-bits.patch

Re: making relfilenodes 56 bits

От

Amul Sul

Дата:

20 сентября 2022 г., 17:16:20

On Fri, Sep 9, 2022 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Sep 8, 2022 at 4:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > On a separate note, while reviewing the latest patch I see there is some risk of using the unflushed relfilenumber
inGetNewRelFileNumber() function.  Basically, in the current code, the flushing logic is tightly coupled with the
loggingnew relfilenumber logic and that might not work with all the values of the VAR_RELNUMBER_NEW_XLOG_THRESHOLD.  So
theidea is we need to keep the flushing logic separate from the logging, I am working on the idea and I will post the
patchsoon. 
>
> I have fixed the issue, so now we will track nextRelFileNumber,
> loggedRelFileNumber and flushedRelFileNumber.  So whenever
> nextRelFileNumber is just VAR_RELNUMBER_NEW_XLOG_THRESHOLD behind the
> loggedRelFileNumber we will log VAR_RELNUMBER_PER_XLOG more
> relfilenumbers.  And whenever nextRelFileNumber reaches the
> flushedRelFileNumber then we will do XlogFlush for WAL upto the last
> loggedRelFileNumber.  Ideally flushedRelFileNumber should always be
> VAR_RELNUMBER_PER_XLOG number behind the loggedRelFileNumber so we can
> avoid tracking the flushedRelFileNumber.  But I feel keeping track of
> the flushedRelFileNumber looks cleaner and easier to understand.  For
> more details refer to the code in GetNewRelFileNumber().
>

Here are a few minor suggestions I came across while reading this
patch, might be useful:

+#ifdef USE_ASSERT_CHECKING
+
+   {

Unnecessary space after USE_ASSERT_CHECKING.
--

+               return InvalidRelFileNumber;    /* placate compiler */

I don't think we needed this after the error on the latest branches.
--

+   LWLockAcquire(RelFileNumberGenLock, LW_SHARED);
+   if (shutdown)
+       checkPoint.nextRelFileNumber = ShmemVariableCache->nextRelFileNumber;
+   else
+       checkPoint.nextRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
+
+   LWLockRelease(RelFileNumberGenLock);

This is done for the good reason, I think, it should have a comment
describing different checkPoint.nextRelFileNumber assignment
need and crash recovery perspective.
--

+#define SizeOfRelFileLocatorBackend \
+   (offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId))

Can append empty parenthesis "()" to the macro name, to look like a
function call at use or change the macro name to uppercase?
--

 +   if (val < 0 || val > MAX_RELFILENUMBER)
..
 if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \

How about adding a macro for this condition as RelFileNumberIsValid()?
We can replace all the checks referring to MAX_RELFILENUMBER with this.

Regards,
Amul

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

20 сентября 2022 г., 20:14:46

On Fri, Sep 9, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> [ new patch ]

+typedef pg_int64 RelFileNumber;

This seems really random to me. First, why isn't this an unsigned
type? OID is unsigned and I don't see a reason to change to a signed
type. But even if we were going to change to a signed type, why
pg_int64? That is declared like this:

/* Define a signed 64-bit integer type for use in client API declarations. */
typedef PG_INT64_TYPE pg_int64;

Surely this is not a client API declaration....

Note that if we change this a lot of references to INT64_FORMAT will
need to become UINT64_FORMAT.

I think we should use int64 at the SQL level, because we don't have an
unsigned 64-bit SQL type, and a signed 64-bit type can hold 56 bits.
So it would still be Int64GetDatum((int64) rd_rel->relfilenode) or
similar. But internally I think using unsigned is cleaner.

+ * RelFileNumber is unique within a cluster.

Not really, because of CREATE DATABASE. Probably just drop this line.
Or else expand it: we never assign the same RelFileNumber twice within
the lifetime of the same cluster, but there can be multiple relations
with the same RelFileNumber e.g. because CREATE DATABASE duplicates
the RelFileNumber values from the template database. But maybe we
don't need this here, as it's already explained in relfilelocator.h.

+    ret = (int8) (tag->relForkDetails[0] >> BUFTAG_RELNUM_HIGH_BITS);

Why not declare ret as ForkNumber instead of casting twice?

+    uint64      relnum;
+
+    Assert(relnumber <= MAX_RELFILENUMBER);
+    Assert(forknum <= MAX_FORKNUM);
+
+    relnum = relnumber;

Perhaps it'd be better to write uint64 relnum = relnumber instead of
initializing on a separate line.

+#define RELNUMBERCHARS  20      /* max chars printed by %llu */

Maybe instead of %llu we should say UINT64_FORMAT (or INT64_FORMAT if
there's some reason to stick with a signed type).

+        elog(ERROR, "relfilenumber is out of bound");

It would have to be "out of bounds", with an "s". But maybe "is too
large" would be better.

+    nextRelFileNumber = ShmemVariableCache->nextRelFileNumber;
+    loggedRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
+    flushedRelFileNumber = ShmemVariableCache->flushedRelFileNumber;

Maybe it would be a good idea to asset that next <= flushed and
flushed <= logged?

+#ifdef USE_ASSERT_CHECKING
+
+    {
+        RelFileLocatorBackend rlocator;
+        char       *rpath;

Let's add a comment here, like "Because the RelFileNumber counter only
ever increases and never wraps around, it should be impossible for the
newly-allocated RelFileNumber to already be in use. But, if Asserts
are enabled, double check that there's no main-fork relation file with
the new RelFileNumber already on disk."

+        elog(ERROR, "cannot forward RelFileNumber during recovery");

forward -> set (or advance)

+    if (relnumber >= ShmemVariableCache->loggedRelFileNumber)

It probably doesn't make any difference, but to me it seems better to
test flushedRelFileNumber rather than logRelFileNumber here. What do
you think?

     /*
      * We set up the lockRelId in case anything tries to lock the dummy
-     * relation.  Note that this is fairly bogus since relNumber may be
-     * different from the relation's OID.  It shouldn't really matter though.
-     * In recovery, we are running by ourselves and can't have any lock
-     * conflicts.  While syncing, we already hold AccessExclusiveLock.
+     * relation.  Note we are setting relId to just FirstNormalObjectId which
+     * is completely bogus.  It shouldn't really matter though. In recovery,
+     * we are running by ourselves and can't have any lock conflicts.  While
+     * syncing, we already hold AccessExclusiveLock.
      */
     rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid;
-    rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber;
+    rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId;

Boy, this makes me uncomfortable. The existing logic is pretty bogus,
and we're replacing it with some other bogus thing. Do we know whether
anything actually does try to use this for locking?

One notable difference between the existing logic and your change is
that, with the existing logic, we use a bogus value that will differ
from one relation to the next, whereas with this change, it will
always be the same value. Perhaps el->rd_lockInfo.lockRelId.relId =
(Oid) rlocator.relNumber would be a more natural adaptation?

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber)                \
+do {                                                                \
+    if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+        ereport(ERROR,                                              \
+                errcode(ERRCODE_INVALID_PARAMETER_VALUE),           \
+                errmsg("relfilenumber %lld is out of range",    \
+                        (long long) (relfilenumber))); \
+} while (0)

Here, you take the approach of casting the relfilenumber to long long
and then using %lld. But elsewhere, you use
INT64_FORMAT/UINT64_FORMAT. If we're going to use this technique, we
ought to use it everywhere.

 typedef struct
 {
-    Oid         reltablespace;
-    RelFileNumber relfilenumber;
-} RelfilenumberMapKey;
-
-typedef struct
-{
-    RelfilenumberMapKey key;    /* lookup key - must be first */
+    RelFileNumber relfilenumber;    /* lookup key - must be first */
     Oid         relid;          /* pg_class.oid */
 } RelfilenumberMapEntry;

This feels like a bold change. Are you sure it's safe? i.e. Are you
certain that there's no way that a relfilenumber could repeat within a
database? If we're going to bank on that, we could adapt this more
heavily, e.g. RelidByRelfilenumber() could lose the reltablespace
parameter. I think maybe we should push this change into an 0002 patch
(or later) and have 0001 just do a minimal adaptation for the changed
data type.

 Datum
 pg_control_checkpoint(PG_FUNCTION_ARGS)
 {
-    Datum       values[18];
-    bool        nulls[18];
+    Datum       values[19];
+    bool        nulls[19];

Documentation updated is needed.

-Note that while a table's filenode often matches its OID, this is
-<emphasis>not</emphasis> necessarily the case; some operations, like
+Note that table's filenode are completely different than its OID. Although for
+system catalogs initial filenode matches with its OID, but some
operations, like
 <command>TRUNCATE</command>, <command>REINDEX</command>,
<command>CLUSTER</command> and some forms
 of <command>ALTER TABLE</command>, can change the filenode while
preserving the OID.
-Avoid assuming that filenode and table OID are the same.

Suggest: Note that a table's filenode will normally be different than
the OID. For system tables, the initial filenode will be equal to the
table OID, but it will be different if the table has ever been
subjected to a rewriting operation, such as TRUNCATE, REINDEX,
CLUSTER, or some forms of ALTER TABLE. For user tables, even the
initial filenode will be different than the table OID.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

21 сентября 2022 г., 13:09:19

On Tue, Sep 20, 2022 at 10:44 PM Robert Haas <robertmhaas@gmail.com> wrote:

Thanks for the review, please see my response inline for some of the
comments, rest all are accepted.

> On Fri, Sep 9, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > [ new patch ]
>
> +typedef pg_int64 RelFileNumber;
>
> This seems really random to me. First, why isn't this an unsigned
> type? OID is unsigned and I don't see a reason to change to a signed
> type. But even if we were going to change to a signed type, why
> pg_int64? That is declared like this:
>
> /* Define a signed 64-bit integer type for use in client API declarations. */
> typedef PG_INT64_TYPE pg_int64;
>
> Surely this is not a client API declaration....
>
> Note that if we change this a lot of references to INT64_FORMAT will
> need to become UINT64_FORMAT.
>
> I think we should use int64 at the SQL level, because we don't have an
> unsigned 64-bit SQL type, and a signed 64-bit type can hold 56 bits.
> So it would still be Int64GetDatum((int64) rd_rel->relfilenode) or
> similar. But internally I think using unsigned is cleaner.

Yeah you are right we can make it uint64.  With respect to this, we
can not directly use uint64 because that is declared in c.h and that
can not be used in
postgres_ext.h IIUC.  So what are the other option maybe we can
typedef the RelFIleNumber similar to what c.h done for uint64 i.e.

#ifdef HAVE_LONG_INT_64
typedef unsigned long int uint64;
#elif defined(HAVE_LONG_LONG_INT_64)
typedef long long int int64;
#endif

And maybe same for UINT64CONST ?

I am not liking duplicating this logic but is there any better
alternative for doing this?  Can we move the existing definitions from
c.h file to some common file (common for client and server)?

>
> +    if (relnumber >= ShmemVariableCache->loggedRelFileNumber)
>
> It probably doesn't make any difference, but to me it seems better to
> test flushedRelFileNumber rather than logRelFileNumber here. What do
> you think?

Actually based on this condition are logging more so it make more
sense to check w.r.t loggedRelFileNumber, but OTOH technically,
without flushing log we are not supposed to use the relfilenumber so
make more sense to test flushedRelFileNumber.  But since both are the
same I am fine with flushedRelFileNumber.

>      /*
>       * We set up the lockRelId in case anything tries to lock the dummy
> -     * relation.  Note that this is fairly bogus since relNumber may be
> -     * different from the relation's OID.  It shouldn't really matter though.
> -     * In recovery, we are running by ourselves and can't have any lock
> -     * conflicts.  While syncing, we already hold AccessExclusiveLock.
> +     * relation.  Note we are setting relId to just FirstNormalObjectId which
> +     * is completely bogus.  It shouldn't really matter though. In recovery,
> +     * we are running by ourselves and can't have any lock conflicts.  While
> +     * syncing, we already hold AccessExclusiveLock.
>       */
>      rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid;
> -    rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber;
> +    rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId;
>
> Boy, this makes me uncomfortable. The existing logic is pretty bogus,
> and we're replacing it with some other bogus thing. Do we know whether
> anything actually does try to use this for locking?
>
> One notable difference between the existing logic and your change is
> that, with the existing logic, we use a bogus value that will differ
> from one relation to the next, whereas with this change, it will
> always be the same value. Perhaps el->rd_lockInfo.lockRelId.relId =
> (Oid) rlocator.relNumber would be a more natural adaptation?
>
> +#define CHECK_RELFILENUMBER_RANGE(relfilenumber)                \
> +do {                                                                \
> +    if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
> +        ereport(ERROR,                                              \
> +                errcode(ERRCODE_INVALID_PARAMETER_VALUE),           \
> +                errmsg("relfilenumber %lld is out of range",    \
> +                        (long long) (relfilenumber))); \
> +} while (0)
>
> Here, you take the approach of casting the relfilenumber to long long
> and then using %lld. But elsewhere, you use
> INT64_FORMAT/UINT64_FORMAT. If we're going to use this technique, we
> ought to use it everywhere.

Based on the discussion [1], it seems we can not use
INT64_FORMAT/UINT64_FORMAT while using ereport.  But all other places
I am using INT64_FORMAT/UINT64_FORMAT.  Does this make sense?

[1] https://www.postgresql.org/message-id/20220730113922.qd7qmenwcmzyacje%40alvherre.pgsql

>  typedef struct
>  {
> -    Oid         reltablespace;
> -    RelFileNumber relfilenumber;
> -} RelfilenumberMapKey;
> -
> -typedef struct
> -{
> -    RelfilenumberMapKey key;    /* lookup key - must be first */
> +    RelFileNumber relfilenumber;    /* lookup key - must be first */
>      Oid         relid;          /* pg_class.oid */
>  } RelfilenumberMapEntry;
>
> This feels like a bold change. Are you sure it's safe? i.e. Are you
> certain that there's no way that a relfilenumber could repeat within a
> database?

IIUC, as of now, CREATE DATABASE is the only option which can create
the duplicate relfilenumber but that would be in different databases.
So based on that theory I think it should be safe.

If we're going to bank on that, we could adapt this more
> heavily, e.g. RelidByRelfilenumber() could lose the reltablespace
> parameter.

Yeah we might, although we need a bool to identify whether it is
shared relation or not.

I think maybe we should push this change into an 0002 patch
> (or later) and have 0001 just do a minimal adaptation for the changed
> data type.

Yeah that make sense.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

23 сентября 2022 г., 07:23:48

On Wed, Sep 21, 2022 at 3:39 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> Yeah you are right we can make it uint64.  With respect to this, we
> can not directly use uint64 because that is declared in c.h and that
> can not be used in
> postgres_ext.h IIUC.  So what are the other option maybe we can
> typedef the RelFIleNumber similar to what c.h done for uint64 i.e.
>
> #ifdef HAVE_LONG_INT_64
> typedef unsigned long int uint64;
> #elif defined(HAVE_LONG_LONG_INT_64)
> typedef long long int int64;
> #endif
>
> I am not liking duplicating this logic but is there any better
> alternative for doing this?  Can we move the existing definitions from
> c.h file to some common file (common for client and server)?

Here is the updated patch which fixes all the agreed comments. Except
this one which needs more thoughts, for now I have used unsigned long
int.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

23 сентября 2022 г., 07:27:48

On Tue, Sep 20, 2022 at 7:46 PM Amul Sul <sulamul@gmail.com> wrote:
>

Thanks for the review

> Here are a few minor suggestions I came across while reading this
> patch, might be useful:
>
> +#ifdef USE_ASSERT_CHECKING
> +
> +   {
>
> Unnecessary space after USE_ASSERT_CHECKING.

Changed

>
> +               return InvalidRelFileNumber;    /* placate compiler */
>
> I don't think we needed this after the error on the latest branches.
> --

Changed

> +   LWLockAcquire(RelFileNumberGenLock, LW_SHARED);
> +   if (shutdown)
> +       checkPoint.nextRelFileNumber = ShmemVariableCache->nextRelFileNumber;
> +   else
> +       checkPoint.nextRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
> +
> +   LWLockRelease(RelFileNumberGenLock);
>
> This is done for the good reason, I think, it should have a comment
> describing different checkPoint.nextRelFileNumber assignment
> need and crash recovery perspective.
> --

Done

> +#define SizeOfRelFileLocatorBackend \
> +   (offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId))
>
> Can append empty parenthesis "()" to the macro name, to look like a
> function call at use or change the macro name to uppercase?
> --

Yeah we could SizeOfXXX macros are general practice I see used
everywhere in Postgres code so left as it is.

>  +   if (val < 0 || val > MAX_RELFILENUMBER)
> ..
>  if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
>
> How about adding a macro for this condition as RelFileNumberIsValid()?
> We can replace all the checks referring to MAX_RELFILENUMBER with this.

Actually, RelFileNumberIsValid is used to just check whether it is
InvalidRelFileNumber value i.e. 0.  Maybe for this we can introduce
RelFileNumberInValidRange() but I am not sure whether it would be
cleaner than what we have now, so left as it is for now.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

26 сентября 2022 г., 19:26:17

On Wed, Sep 21, 2022 at 6:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Yeah you are right we can make it uint64.  With respect to this, we
> can not directly use uint64 because that is declared in c.h and that
> can not be used in
> postgres_ext.h IIUC.

Ugh.

> Can we move the existing definitions from
> c.h file to some common file (common for client and server)?

Yeah, I think that would be a good idea. Here's a quick patch that
moves them to common/relpath.h, which seems like a possibly-reasonable
choice, though perhaps you or someone else will have a better idea.

> Based on the discussion [1], it seems we can not use
> INT64_FORMAT/UINT64_FORMAT while using ereport.  But all other places
> I am using INT64_FORMAT/UINT64_FORMAT.  Does this make sense?
>
> [1] https://www.postgresql.org/message-id/20220730113922.qd7qmenwcmzyacje%40alvherre.pgsql

Oh, hmm. So you're saying if the string is not translated then use
(U)INT64_FORMAT but if it is translated then cast? I guess that makes
sense. It feels a bit strange to have the style dependent on the
context like that, but maybe it's fine. I'll reread with that idea in
mind.

> If we're going to bank on that, we could adapt this more
> > heavily, e.g. RelidByRelfilenumber() could lose the reltablespace
> > parameter.
>
> Yeah we might, although we need a bool to identify whether it is
> shared relation or not.

Why?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Вложения

move-relfilenumber-decls-v1.patch

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

27 сентября 2022 г., 09:33:18

On Mon, Sep 26, 2022 at 9:56 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
>
> > Can we move the existing definitions from
> > c.h file to some common file (common for client and server)?
>
> Yeah, I think that would be a good idea. Here's a quick patch that
> moves them to common/relpath.h, which seems like a possibly-reasonable
> choice, though perhaps you or someone else will have a better idea.

Looks fine to me.

> > Based on the discussion [1], it seems we can not use
> > INT64_FORMAT/UINT64_FORMAT while using ereport.  But all other places
> > I am using INT64_FORMAT/UINT64_FORMAT.  Does this make sense?
> >
> > [1] https://www.postgresql.org/message-id/20220730113922.qd7qmenwcmzyacje%40alvherre.pgsql
>
> Oh, hmm. So you're saying if the string is not translated then use
> (U)INT64_FORMAT but if it is translated then cast?

Right

I guess that makes
> sense. It feels a bit strange to have the style dependent on the
> context like that, but maybe it's fine. I'll reread with that idea in
> mind.

Ok

> > If we're going to bank on that, we could adapt this more
> > > heavily, e.g. RelidByRelfilenumber() could lose the reltablespace
> > > parameter.
> >
> > Yeah we might, although we need a bool to identify whether it is
> > shared relation or not.
>
> Why?

Because if entry is not in cache then we need to look into the
relmapper and for that we need to know whether it is a shared relation
or not.  And I don't think we can identify that just by looking at
relfilenumber.


Another open comment which I missed in last reply

>      /*
>       * We set up the lockRelId in case anything tries to lock the dummy
> -     * relation.  Note that this is fairly bogus since relNumber may be
> -     * different from the relation's OID.  It shouldn't really matter though.
> -     * In recovery, we are running by ourselves and can't have any lock
> -     * conflicts.  While syncing, we already hold AccessExclusiveLock.
> +     * relation.  Note we are setting relId to just FirstNormalObjectId which
> +     * is completely bogus.  It shouldn't really matter though. In recovery,
> +     * we are running by ourselves and can't have any lock conflicts.  While
> +     * syncing, we already hold AccessExclusiveLock.
>       */
>      rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid;
> -    rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber;
> +    rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId;
>
> Boy, this makes me uncomfortable. The existing logic is pretty bogus,
> and we're replacing it with some other bogus thing. Do we know whether
> anything actually does try to use this for locking?

Looking at the code it seems it is not used for locking.  I also test
by setting some special value for relid in
CreateFakeRelcacheEntry() and validating that id is never used for
locking in SET_LOCKTAG_RELATION.  And ran check-world so I could not
see we are ever trying to create lock tag using fake relcache entry.

> One notable difference between the existing logic and your change is
> that, with the existing logic, we use a bogus value that will differ
> from one relation to the next, whereas with this change, it will
> always be the same value. Perhaps el->rd_lockInfo.lockRelId.relId =
> (Oid) rlocator.relNumber would be a more natural adaptation?

I agree, so changed it this way.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Hi!

I'm not in the context of this thread, but I've notice something strange by attempting to rebase my patch set from 64XID thread.

As far as I'm aware, this patch set is adding "relfilenumber". So, in pg_control_checkpoint, we have next changes:

diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 781f8b8758..d441cd97e2 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -79,8 +79,8 @@ pg_control_system(PG_FUNCTION_ARGS)
Datum
pg_control_checkpoint(PG_FUNCTION_ARGS)
{
- Datum values[18];
- bool nulls[18];
+ Datum values[19];
+ bool nulls[19];
TupleDesc tupdesc;
HeapTuple htup;
ControlFileData *ControlFile;
@@ -129,6 +129,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
XIDOID, -1, 0);
TupleDescInitEntry(tupdesc, (AttrNumber) 18, "checkpoint_time",
TIMESTAMPTZOID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 19, "next_relfilenumber",
+ INT8OID, -1, 0);
tupdesc = BlessTupleDesc(tupdesc);

/* Read the control file. */

In other words, we have 19 attributes. But tupdesc here is constructed for 18 elements:

tupdesc = CreateTemplateTupleDesc(18);

Is that normal or not? Again, I'm not in this thread and if that is completely ok, I'm sorry about the noise.

Best regards,

Maxim Orlov.

Re: making relfilenodes 56 bits

От

Robert Haas

Дата:

29 сентября 2022 г., 17:57:32

On Thu, Sep 29, 2022 at 10:50 AM Maxim Orlov <orlovmg@gmail.com> wrote:
> In other words, we have 19 attributes. But tupdesc here is constructed for 18 elements:
> tupdesc = CreateTemplateTupleDesc(18);
>
> Is that normal or not? Again, I'm not in this thread and if that is completely ok, I'm sorry about the noise.

I think that's a mistake. Thanks for the report.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making relfilenodes 56 bits

От

Tom Lane

Дата:

29 сентября 2022 г., 21:39:44

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Sep 29, 2022 at 10:50 AM Maxim Orlov <orlovmg@gmail.com> wrote:
>> In other words, we have 19 attributes. But tupdesc here is constructed for 18 elements:
>> tupdesc = CreateTemplateTupleDesc(18);

> I think that's a mistake. Thanks for the report.

The assertions in TupleDescInitEntry would have caught that,
if only utils/misc/pg_controldata.c had more than zero test coverage.
Seems like somebody ought to do something about that.

            regards, tom lane

Re: making relfilenodes 56 bits

От

Michael Paquier

Дата:

30 сентября 2022 г., 03:12:56

On Thu, Sep 29, 2022 at 02:39:44PM -0400, Tom Lane wrote:
> The assertions in TupleDescInitEntry would have caught that,
> if only utils/misc/pg_controldata.c had more than zero test coverage.
> Seems like somebody ought to do something about that.

While passing by, I have noticed this thread.  We don't really care
about the contents returned by these functions, and one simple trick
to check their execution is SELECT FROM.  Like in the attached, for
example.
--
Michael

Вложения

Re: making relfilenodes 56 bits

От

Tom Lane

Дата:

30 сентября 2022 г., 04:23:38

Michael Paquier <michael@paquier.xyz> writes:
> While passing by, I have noticed this thread.  We don't really care
> about the contents returned by these functions, and one simple trick
> to check their execution is SELECT FROM.  Like in the attached, for
> example.

Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
any actual checks on the sanity of the output?  I realize that the
output's far from static, but still ...

            regards, tom lane

Re: making relfilenodes 56 bits

От

Michael Paquier

Дата:

21 октября 2022 г., 09:00:56

On Thu, Sep 29, 2022 at 09:23:38PM -0400, Tom Lane wrote:
> Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
> any actual checks on the sanity of the output?  I realize that the
> output's far from static, but still ...

Honestly, checking all the fields is not that exciting, but the
maximum I can think of that would be portable enough is something like
the attached.  No arithmetic operators for xid limits things a bit,
but at least that's something.

Thoughts?
--
Michael

Вложения

Re: making relfilenodes 56 bits

От

vignesh C

Дата:

04 января 2023 г., 15:15:41

On Fri, 21 Oct 2022 at 11:31, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Sep 29, 2022 at 09:23:38PM -0400, Tom Lane wrote:
> > Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
> > any actual checks on the sanity of the output?  I realize that the
> > output's far from static, but still ...
>
> Honestly, checking all the fields is not that exciting, but the
> maximum I can think of that would be portable enough is something like
> the attached.  No arithmetic operators for xid limits things a bit,
> but at least that's something.
>
> Thoughts?

The patch does not apply on top of HEAD as in [1], please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID
33ab0a2a527e3af5beee3a98fc07201e555d6e45 ===
=== applying patch ./controldata-regression-2.patch
patching file src/test/regress/expected/misc_functions.out
Hunk #1 succeeded at 642 with fuzz 2 (offset 48 lines).
patching file src/test/regress/sql/misc_functions.sql
Hunk #1 FAILED at 223.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/sql/misc_functions.sql.rej

[1] - http://cfbot.cputube.org/patch_41_3711.log

Regards,
Vignesh

Re: making relfilenodes 56 bits

От

Dilip Kumar

Дата:

06 января 2023 г., 09:13:43

On Wed, Jan 4, 2023 at 5:45 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 21 Oct 2022 at 11:31, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, Sep 29, 2022 at 09:23:38PM -0400, Tom Lane wrote:
> > > Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
> > > any actual checks on the sanity of the output?  I realize that the
> > > output's far from static, but still ...
> >
> > Honestly, checking all the fields is not that exciting, but the
> > maximum I can think of that would be portable enough is something like
> > the attached.  No arithmetic operators for xid limits things a bit,
> > but at least that's something.
> >
> > Thoughts?
>
> The patch does not apply on top of HEAD as in [1], please post a rebased patch:
>

Because of the extra WAL overhead, we are not continuing with the
patch, I will withdraw it.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Re: Make relfile tombstone files conditional on WAL level

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения