Discussion: Problems with >2GB tables on Linux 2.0


Problems with >2GB tables on Linux 2.0

From: gjerde@icebox.org
Date:
Hi,
I saw a message a couple of weeks ago from someone having problems with
larger than 2GB tables.  I have similar problems.

PostgreSQL: anon-cvs as of today (2/5/1999)
OS: Redhat Linux 5.2 (running 2.0.35)

I created a database called mcrl, and a table called mcrl3_1.
I copied in a set of 450MB of data twice(which comes to pg file size of
2.4GB or so).

When it hit 2GB I got this message:
  mcrl=> copy mcrl3_1 FROM '/home/gjerde/mcrl/MCR3_1.txt';
  ERROR:  mcrl3_1: cannot extend

The table file looks like this:
[postgres@snowman mcrl]$ ls -l mcrl*
-rw-------   1 postgres postgres 2147482624 Feb  5 16:49 mcrl3_1

It did NOT create the .1 file however, which I did see when I tried this
on 6.4.2(but still didn't work).

I looked around in the code(specifically src/backend/storage/smgr/*.c),
but couldn't figure too much of it out.  I'll have to figure out how
postgres handles the database files first..

Hope this helps,
Ole Gjerde



Ignore me :) Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: gjerde@icebox.org
Date:
I just found out that you guys have been discussing this problem.. oops..
I looked through the mailing-list archive, and didn't find any posts, but
February isn't in the archive yet :)

Ole Gjerde



Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Thomas Reinke
Date:
I may be dating myself really badly here, but isn't there a hard limit
on the file system at 2Gig? I thought the file size attribute in Unix is
represented as a 32 bit signed long, which happens to be a max value
of 2147483648. If I'm right, it means the problem is fundamentally
with the file system, not with PostGres, and you won't solve this
unless the os supports larger files.

gjerde@icebox.org wrote:
> 
> Hi,
> I saw a message a couple of weeks ago from someone having problems with
> larger than 2GB tables.  I have similar problems.
> 
> PostgreSQL: anon-cvs as of today (2/5/1999)
> OS: Redhat Linux 5.2 (running 2.0.35)
> 
> I created a database called mcrl, and a table called mcrl3_1.
> I copied in a set of 450MB of data twice(which comes to pg file size of
> 2.4GB or so).
> 
> When it hit 2GB I got this message:
>   mcrl=> copy mcrl3_1 FROM '/home/gjerde/mcrl/MCR3_1.txt';
>   ERROR:  mcrl3_1: cannot extend
> 
> The table file looks like this:
> [postgres@snowman mcrl]$ ls -l mcrl*
> -rw-------   1 postgres postgres 2147482624 Feb  5 16:49 mcrl3_1
> 
> It did NOT create the .1 file however, which I did see when I tried this
> on 6.4.2(but still didn't work).
> 
> I looked around in the code(specifically src/backend/storage/smgr/*.c),
> but couldn't figure too much of it out.  I'll have to figure out how
> postgres handles the database files first..
> 
> Hope this helps,
> Ole Gjerde

-- 
------------------------------------------------------------
Thomas Reinke                            Tel: (416) 460-7021
Director of Technology                   Fax: (416) 598-2319
E-Soft Inc.                         http://www.e-softinc.com


Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: The Hermit Hacker
Date:
On Sat, 6 Feb 1999, Thomas Reinke wrote:

> I may be dating myself really badly here, but isn't there a hard limit
> on
> the file system at 2Gig? I thought the file size attribute in Unix is
> represented as a 32 bit signed long, which happens to be a max value
> of 2147483648. If I'm right, it means the problem is fundamentally
> with the file system, not with PostGres, and you won't solve this
> unless the os supports larger files.

PostgreSQL has internal code that is supposed to automagically break up a
table into 2GB chunks so that this isn't a problem...
> 
> gjerde@icebox.org wrote:
> > 
> > Hi,
> > I saw a message a couple of weeks ago from someone having problems with
> > larger than 2GB tables.  I have similar problems.
> > 
> > PostgreSQL: anon-cvs as of today (2/5/1999)
> > OS: Redhat Linux 5.2 (running 2.0.35)
> > 
> > I created a database called mcrl, and a table called mcrl3_1.
> > I copied in a set of 450MB of data twice(which comes to pg file size of
> > 2.4GB or so).
> > 
> > When it hit 2GB I got this message:
> >   mcrl=> copy mcrl3_1 FROM '/home/gjerde/mcrl/MCR3_1.txt';
> >   ERROR:  mcrl3_1: cannot extend
> > 
> > The table file looks like this:
> > [postgres@snowman mcrl]$ ls -l mcrl*
> > -rw-------   1 postgres postgres 2147482624 Feb  5 16:49 mcrl3_1
> > 
> > It did NOT create the .1 file however, which I did see when I tried this
> > on 6.4.2(but still didn't work).
> > 
> > I looked around in the code(specifically src/backend/storage/smgr/*.c),
> > but couldn't figure too much of it out.  I'll have to figure out how
> > postgres handles the database files first..
> > 
> > Hope this helps,
> > Ole Gjerde
> 
> -- 
> ------------------------------------------------------------
> Thomas Reinke                            Tel: (416) 460-7021
> Director of Technology                   Fax: (416) 598-2319
> E-Soft Inc.                         http://www.e-softinc.com
> 

Marc G. Fournier                                
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Hannu Krosing
Date:
Thomas Reinke wrote:
> 
> I may be dating myself really badly here, but isn't there a hard limit
> on
> the file system at 2Gig? I thought the file size attribute in Unix is
> represented as a 32 bit signed long, which happens to be a max value
> of 2147483648. If I'm right, it means the problem is fundamentally
> with the file system, not with PostGres, and you won't solve this
> unless the os supports larger files.

There is logic inside PostgreSQL to overflow to the next file at 2GB, but
apparently this is currently broken.

AFAIK, there are people working on it now.

--------------
Hannu


Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Peter T Mount
Date:
On Sat, 6 Feb 1999, Hannu Krosing wrote:

> Thomas Reinke wrote:
> > 
> > I may be dating myself really badly here, but isn't there a hard limit
> > on
> > the file system at 2Gig? I thought the file size attribute in Unix is
> > represented as a 32 bit signed long, which happens to be a max value
> > of 2147483648. If I'm right, it means the problem is fundamentally
> > with the file system, not with PostGres, and you won't solve this
> > unless the os supports larger files.
> 
> There is logic inside PostgreSQL to overflow to the next file at 2GB, but
> apparently this is currently broken.
> 
> AFAIK, there are people working on it now.

Yes, me ;-)

I have an idea where the failure is occurring, but I'm still testing the
relevant parts of the code.

--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf



Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Peter T Mount
Date:
On Sun, 7 Feb 1999, Peter T Mount wrote:

> On Sat, 6 Feb 1999, Hannu Krosing wrote:
> 
> > Thomas Reinke wrote:
> > > 
> > > I may be dating myself really badly here, but isn't there a hard limit
> > > on
> > > the file system at 2Gig? I thought the file size attribute in Unix is
> > > represented as a 32 bit signed long, which happens to be a max value
> > > of 2147483648. If I'm right, it means the problem is fundamentally
> > > with the file system, not with PostGres, and you won't solve this
> > > unless the os supports larger files.
> > 
> > There is logic inside PostgreSQL to overflow to the next file at 2GB, but
> > apparently this is currently broken.
> > 
> > AFAIK, there are people working on it now.
> 
> Yes, me ;-)
> 
> I have an idea where the failure is occurring, but I'm still testing the
> relevant parts of the code.

Well, just now I think I know what's going on.

First, I've reduced the size at which postgres breaks the file to 2Mb (256
blocks). I then ran the test script that imports some large records into a
test table.

As expected, the splitting of the file works fine. So the code isn't
broken. What I think is happening is that the code extends the table, then
tests to see if it's at the 2Gig limit, and when it is, creates the next
file for that table.

However, I think the OS has problems with a file exactly 2Gb in size.

I've attached a patch that should reduce the max table size by 1 block.
This should prevent us from hitting the physical limit.

Note: I haven't tested this patch yet!

It compiles but, because the test takes 4 hours for my machine to reach
2Gb, and I have a few other things to do today, I'll run it overnight.

Hopefully, first thing tomorrow, we'll know if it works.

Peter

--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf

*** ./backend/storage/smgr/md.c.orig    Mon Feb  1 17:55:57 1999
--- ./backend/storage/smgr/md.c Sun Feb  7 14:48:35 1999
***************
*** 77,86 ****
  *
  * 19 Mar 98  darrenk
  *
  */
  
  #ifndef LET_OS_MANAGE_FILESIZE
! #define RELSEG_SIZE           ((8388608 / BLCKSZ) * 256)
  #endif
  
  /* routines declared here */
--- 77,91 ----
  *
  * 19 Mar 98  darrenk
  *
+  * After testing, we need to add one less block to the file, otherwise
+  * we extend beyond the 2-gig limit.
+  *
+  * 07 Feb 99  Peter Mount
+  *
  */
  
  #ifndef LET_OS_MANAGE_FILESIZE
! #define RELSEG_SIZE   (((8388608 / BLCKSZ) * 256)-BLCKSZ)
  #endif
  
  /* routines declared here */
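
For orientation, here is a stand-alone sketch of how a block number maps onto
segment files once a relation is split every RELSEG_SIZE blocks. It is
illustrative only, not the backend's md.c code; the block number and table
name are hypothetical.

    #include <stdio.h>

    #define BLCKSZ      8192
    #define RELSEG_SIZE ((8388608 / BLCKSZ) * 256)    /* 262144 blocks = 2GB */

    int
    main(void)
    {
        unsigned int blkno  = 300000;                 /* hypothetical block number */
        unsigned int segno  = blkno / RELSEG_SIZE;    /* 0 = base file, 1 = ".1", ... */
        unsigned int segblk = blkno % RELSEG_SIZE;    /* block within that segment */
        long long    offset = (long long) segblk * BLCKSZ;

        if (segno == 0)
            printf("block %u -> mcrl3_1, offset %lld\n", blkno, offset);
        else
            printf("block %u -> mcrl3_1.%u, offset %lld\n", blkno, segno, offset);
        return 0;
    }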




Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Tom Lane
Date:
Peter T Mount <peter@retep.org.uk> writes:
> As expected, the splitting of the file works fine. So the code isn't
> broken. What I think is happening is that the code extends the table, then
> tests to see if it's at the 2Gig limit, and when it is, creates the next
> file for that table.

> However, I think the OS has problems with a file exactly 2Gb in size.

Oh!  I didn't realize that we were trying to extend the file to
*exactly* 2Gb.  Indeed that's a very dangerous thing to do: the file
size in bytes will be 0x80000000, which will appear to be negative when
viewed as a signed 32-bit integer; unless your OS is very careful about
signed vs. unsigned arithmetic, it will break.

For that matter it's not impossible that our own code contains similar
problems, if it does much calculating with byte offsets into the file.
(The pushups that darrenk had to do in order to calculate RELSEG_SIZE
in the first place should have suggested to him that running right at
the overflow limit was not such a hot idea...)

I'd suggest setting the limit a good deal less than 2Gb to avoid any
risk of arithmetic overflow.  Maybe 200000 8K blocks, instead of 262144.

And change the comment while you're at it, not just the code ;-)
        regards, tom lane
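
To make the hazard concrete, a minimal stand-alone sketch (assumed values, not
PostgreSQL code): 262144 blocks of 8192 bytes is exactly 2^31 bytes, which does
not fit in a 32-bit signed integer.

    #include <stdio.h>

    int
    main(void)
    {
        long long bytes  = 262144LL * 8192;   /* 2147483648, i.e. 0x80000000 */
        int       as_int = (int) bytes;       /* assuming a 32-bit int, as on Linux
                                                 2.0/x86: implementation-defined,
                                                 typically comes out -2147483648 */

        printf("true byte offset:     %lld\n", bytes);
        printf("seen as a signed int: %d\n", as_int);
        return 0;
    }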


RE: [HACKERS] Problems with >2GB tables on Linux 2.0

From: "Stupor Genius"
Date:
> For that matter it's not impossible that our own code contains similar
> problems, if it does much calculating with byte offsets into the file.
> (The pushups that darrenk had to do in order to calculate RELSEG_SIZE
> in the first place should have suggested to him that running right at
> the overflow limit was not such a hot idea...)

Not my code to begin with...

RELSEG_SIZE was always there hard-coded to 262144 to assume the block
size would be 8k.  At the time of my changes, I didn't think through what
it was for; I only changed the code that was there to calculate it and
get the same value as before for variable disc block sizes.

I agree that running right at the limit is a Bad Thing, but analyzing
that wasn't my main area of concern with that patch.

darrenk


Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Peter T Mount
Date:
On Sun, 7 Feb 1999, Tom Lane wrote:

> Peter T Mount <peter@retep.org.uk> writes:
> > As expected, the splitting of the file works fine. So the code isn't
> > broken. What I think is happening is that the code extends the table, then
> > tests to see if it's at the 2Gig limit, and when it is, creates the next
> > file for that table.
> 
> > However, I think the OS has problems with a file exactly 2Gb in size.
> 
> Oh!  I didn't realize that we were trying to extend the file to
> *exactly* 2Gb.  Indeed that's a very dangerous thing to do: the file
> size in bytes will be 0x80000000, which will appear to be negative when
> viewed as a signed 32-bit integer; unless your OS is very careful about
> signed vs. unsigned arithmetic, it will break.
> 
> For that matter it's not impossible that our own code contains similar
> problems, if it does much calculating with byte offsets into the file.
> (The pushups that darrenk had to do in order to calculate RELSEG_SIZE
> in the first place should have suggested to him that running right at
> the overflow limit was not such a hot idea...)
> 
> I'd suggest setting the limit a good deal less than 2Gb to avoid any
> risk of arithmetic overflow.  Maybe 200000 8K blocks, instead of 262144.

That might be an idea.

I've just re-synced my copy of the cvs source, so I'll set it there, and
we'll know by the morning if it's worked or not.

> And change the comment while you're at it, not just the code ;-)

Will do :-)

Peter

--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf



RE: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Peter T Mount
Date:
On Sun, 7 Feb 1999, Stupor Genius wrote:

> > For that matter it's not impossible that our own code contains similar
> > problems, if it does much calculating with byte offsets into the file.
> > (The pushups that darrenk had to do in order to calculate RELSEG_SIZE
> > in the first place should have suggested to him that running right at
> > the overflow limit was not such a hot idea...)
> 
> Not my code to begin with...
> 
> RELSEG_SIZE was always there hard-coded to 262144 to assume the block
> size would be 8k.  At the time of my changes, I didn't think thru what
> it was for, I only changed the code that was there to calculate it and
> get the same value as before for variable disc block sizes.
> 
> I agree that running right at the limit is a Bad Thing, but analyzing
> that wasn't my main area of concern with that patch.

I agree with you. I think that the original error stemmed from when
RELSEG_SIZE was originally set.

Anyhow, I'm about to start the test, using RELSEG_SIZE set to 243968 which
works out to be 1.6Gb. That should stay well away from the overflow
problem.

Peter

--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf



RE: [HACKERS] Problems with >2GB tables on Linux 2.0

From: gjerde@icebox.org
Date:
On Sun, 7 Feb 1999, Peter T Mount wrote:
> Anyhow, I'm about to start the test, using RELSEG_SIZE set to 243968 which
> works out to be 1.6Gb. That should stay well away from the overflow
> problem.

Hi,
I just did a checkout of the cvs code, hardcoded RELSEG_SIZE to 243968,
and it works beautifully now!

I imported about 2.2GB of data(table file size) and it looks like this:
-rw-------   1 postgres postgres 1998585856 Feb  7 16:22 mcrl3_1
-rw-------   1 postgres postgres 219611136 Feb  7 16:49 mcrl3_1.1
-rw-------   1 postgres postgres 399368192 Feb  7 16:49 mcrl3_1_partnumber_index

And it works fine.. I did some selects on data that should have ended up
in the .1 file, and it works great.  The best thing about it, is that it
seems at least as fast as MSSQL on the same data, if not faster..

It did take like 45 minutes to create that index.. Isn't that a bit
long(AMD K6-2 350MHz)?  :)

Suggestion:  How hard would it be to make copy tablename FROM 'somefile'
give some feedback?  Either some kind of percentage or just print out
something after each 10k row chunks or something like that.

Thanks,
Ole Gjerde



RE: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Peter T Mount
Date:
On Sun, 7 Feb 1999 gjerde@icebox.org wrote:

> On Sun, 7 Feb 1999, Peter T Mount wrote:
> > Anyhow, I'm about to start the test, using RELSEG_SIZE set to 243968 which
> > works out to be 1.6Gb. That should stay well away from the overflow
> > problem.
> 
> Hi,
> I just did a checkout of the cvs code, hardcoded RELSEG_SIZE to 243968,
> and it works beautifully now!

Problem here is that RELSEG_SIZE is dependent on the block size. Seeing we
can increase the block size from 8k, this would break.

As I type, my machine is populating the test table.

> I imported about 2.2GB of data(table file size) and it looks like this:
> -rw-------   1 postgres postgres 1998585856 Feb  7 16:22 mcrl3_1
> -rw-------   1 postgres postgres 219611136 Feb  7 16:49 mcrl3_1.1
> -rw-------   1 postgres postgres 399368192 Feb  7 16:49
> mcrl3_1_partnumber_index
> 
> And it works fine.. I did some selects on data that should have ended up
> in the .1 file, and it works great.  The best thing about it, is that it
> seems at least as fast as MSSQL on the same data, if not faster..

This is what I got when I tested it using a reduced file size. It's what
made me decide to reduce the size by 1 in the patch I posted earlier.

However, I'm using John's suggestion of reducing the file size a lot more,
to ensure we don't hit any math errors, etc. So the max file size is about
1.6Gb.

> It did take like 45 minutes to create that index.. Isn't that a bit
> long(AMD K6-2 350MHz)?  :)

Well, it's taking my poor old P133 about 2 hours to hit 2Gb at the moment.

> Suggestion:  How hard would it be to make copy tablename FROM 'somefile'
> give some feedback?  Either some kind of percentage or just print out
> something after each 10k row chunks or something like that.

Attached is the test script I'm using, minus the data file.

Peter

--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf

Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Tom Lane
Date:
Peter T Mount <peter@retep.org.uk> writes:
>> I just did a checkout of the cvs code, hardcoded RELSEG_SIZE to 243968,
>> and it works beautifully now!

> Problem here is that RELSEG_SIZE is dependent on the block size. Seeing we
> can increase the block size from 8k, this would break.

Of course it should really be defined as

#define RELSEG_SIZE        (2000000000 / BLCKSZ)

for some suitable magic constant.
        regards, tom lane
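
For reference, a small stand-alone sketch (block-size values assumed) of what
that definition would yield: whatever BLCKSZ is, the segment stays just under
2 billion bytes and well clear of the 2^31 boundary.

    #include <stdio.h>

    int
    main(void)
    {
        long long blcksz[] = {4096, 8192, 16384, 32768};  /* candidate BLCKSZ values */
        int i;

        for (i = 0; i < 4; i++)
        {
            long long relseg = 2000000000LL / blcksz[i];  /* suggested RELSEG_SIZE */
            printf("BLCKSZ=%5lld  RELSEG_SIZE=%6lld blocks  segment=%lld bytes\n",
                   blcksz[i], relseg, relseg * blcksz[i]);
        }
        return 0;
    }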


Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: D'Arcy J.M. Cain
Date:
Thus spake Tom Lane
> I'd suggest setting the limit a good deal less than 2Gb to avoid any
> risk of arithmetic overflow.  Maybe 200000 8K blocks, instead of 262144.

Why not make it substantially lower by default?  Makes it easier to split
a database across spindles.  Even better, how about putting extra extents
into different directories like data/base.1, data/base.2, etc?  Then as
the database grows you can add drives, move the extents into them and
mount the new drives.  The software doesn't even notice the change.

Just a thought.

-- 
D'Arcy J.M. Cain <darcy@{druid|vex}.net>   |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 424 2871     (DoD#0082)    (eNTP)   |  what's for dinner.


Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Hannu Krosing
Date:
"D'Arcy J.M. Cain" wrote:
> 
> Thus spake Tom Lane
> > I'd suggest setting the limit a good deal less than 2Gb to avoid any
> > risk of arithmetic overflow.  Maybe 200000 8K blocks, instead of 262144.
> 
> Why not make it substantially lower by default?  Makes it easier to split
> a database across spindles.  Even better, how about putting extra extents
> into different directories like data/base.1, data/base.2, etc?  Then as
> the database grows you can add drives, move the extents into them and
> mount the new drives.  The software doesn't even notice the change.

It would also be a great way to help optimization if indexes were in
a separate directory from the tables.

And of course our current way of keeping all the large object files in
one directory (even _the same one_ as other data) sucks.

It has kept me away from using large objects at all, as I've heard that
Linux (or rather ext2fs) is not very good at dealing with huge
directories. And I have no use for only a few large objects ;)

There have been suggestions about splitting up the large object storage
by the hex representation of the oid value (= part of the current filename),
but a good start would be to put them in a separate directory under
pg_data. The temp files are also good candidates for a separate
directory.

The next step would of course be dataspaces, probably most easily
implemented as directories:

CREATE DATASPACE PG_DATA1 STORAGE='/mnt/scsi.105.7/data1'; 
SET DEFAULT_DATASPACE TO PG_DATA1;
CREATE TABLE ... IN DATASPACE PG_DATA; 
CREATE INDEX ... ;

Then we wouldn't have to move and symlink the tables and indexes
manually.

--------------
Hannu


Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Tom Lane
Date:
"D'Arcy" "J.M." Cain <darcy@druid.net> writes:
> Thus spake Tom Lane
>> I'd suggest setting the limit a good deal less than 2Gb to avoid any
>> risk of arithmetic overflow.  Maybe 200000 8K blocks, instead of 262144.

> Why not make it substantially lower by default?

Configure-time option, anyone ;-) ?

> Makes it easier to split
> a database across spindles.  Even better, how about putting extra extents
> into different directories like data/base.1, data/base.2, etc?

This could be a pretty good idea.  Right now, if you need to split a
database across multiple filesystems, you have to do a bunch of tedious
hand manipulation of symlinks.  With an option like this, you could
automatically distribute your larger tables across filesystems...
set up the subdirectories as symlinks once, and forget it...
        regards, tom lane


Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Bruce Momjian
Date:
> > mcrl3_1_partnumber_index
> > 
> > And it works fine.. I did some selects on data that should have ended up
> > in the .1 file, and it works great.  The best thing about it, is that it
> > seems at least as fast as MSSQL on the same data, if not faster..
> 
> This is what I got when I tested it using a reduced file size. It's what
> made me decide to reduce the size by 1 in the patch I posted earlier.
> 
> However, I'm using John's suggestion of reducing the file size a lot more,
> to ensure we don't hit any math errors, etc. So the max file size is about
> 1.6Gb.

I can imagine people finding that strange.  Is it really needed?  Is
there some math that could overflow with a larger value?


--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Bruce Momjian
Date:
> On Sun, 7 Feb 1999, Peter T Mount wrote:
> > Anyhow, I'm about to start the test, using RELSEG_SIZE set to 243968 which
> > works out to be 1.6Gb. That should stay well away from the overflow
> > problem.
> 
> Hi,
> I just did a checkout of the cvs code, hardcoded RELSEG_SIZE to 243968,
> and it works beautifully now!
> 
> I imported about 2.2GB of data(table file size) and it looks like this:
> -rw-------   1 postgres postgres 1998585856 Feb  7 16:22 mcrl3_1
> -rw-------   1 postgres postgres 219611136 Feb  7 16:49 mcrl3_1.1
> -rw-------   1 postgres postgres 399368192 Feb  7 16:49
> mcrl3_1_partnumber_index

Great.  This has been on the TODO list for quite some time.  Glad it is
fixed.

> 
> And it works fine.. I did some selects on data that should have ended up
> in the .1 file, and it works great.  The best thing about it, is that it
> seems at least as fast as MSSQL on the same data, if not faster..
> 
> It did take like 45 minutes to create that index.. Isn't that a bit
> long(AMD K6-2 350MHz)?  :)
> 
> Suggestion:  How hard would it be to make copy tablename FROM 'somefile'
> give some feedback?  Either some kind of percentage or just print out
> something after each 10k row chunks or something like that.

We could, but it would then make the output file larger.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Peter T Mount
Date:
On Sun, 7 Feb 1999, Bruce Momjian wrote:

> > > mcrl3_1_partnumber_index
> > > 
> > > And it works fine.. I did some selects on data that should have ended up
> > > in the .1 file, and it works great.  The best thing about it, is that it
> > > seems at least as fast as MSSQL on the same data, if not faster..
> > 
> > This is what I got when I tested it using a reduced file size. It's what
> > made me decide to reduce the size by 1 in the patch I posted earlier.
> > 
> > However, I'm using John's suggestion of reducing the file size a lot more,
> > to ensure we don't hit any math errors, etc. So the max file size is about
> > 1.6Gb.
>
> I can imagine people finding that strange.  Is it really needed?  Is
> there some math that could overflow with a larger value?

Not sure. My original choice was to subtract 1 from the calculated
maximum, which meant it would split just before the 2Gb limit.

However, running with the value set at the lower value:
 1998585856 Feb  8 02:25 /opt/db/base/test/smallcat
  599007232 Feb  8 03:21 /opt/db/base/test/smallcat.1

Total 26653000 rows loaded

Would anyone really notice the lower value?

Perhaps we could make this another compile time setting, like the block
size?

Peter

--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf



Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Bruce Momjian
Date:
> Not sure. My original choice was to subtract 1 from the calculated
> maximum, which meant it would split just before the 2Gb limit.
> 
> However, running with the value set at the lower value:
> 
>  1998585856 Feb  8 02:25 /opt/db/base/test/smallcat
>   599007232 Feb  8 03:21 /opt/db/base/test/smallcat.1
> 
> Total 26653000 rows loaded
> 
> Would anyone really notice the lower value?
> 
> Perhaps we could make this another compile time setting, like the block
> size?

I guess all I am saying is I prefer the max-1 value.  Seems more
logical.  Could be set in config.h.in, though.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Tom Lane
Date:
Bruce Momjian <maillist@candle.pha.pa.us> writes:
>> However, I'm using John's suggestion of reducing the file size a lot more,
>> to ensure we don't hit any math errors, etc. So the max file size is about
>> 1.6Gb.

> I can imagine people finding that strange.  Is it really needed?  Is
> there some math that could overflow with a larger value?

Well, that's the question all right --- are you sure that there's not?
I think "max - 1 blocks" is pushing it, since code that computes
something like "the byte offset of the block after next" would fail.
Even if there isn't any such code today, it seems possible that there
might be someday.

I'd be comfortable with 2 billion (2000000000) bytes as the filesize
limit, or Andreas' proposal of 1Gb.

I also like the proposals to allow the filesize limit to be configured
even lower to ease splitting huge tables across filesystems.

To make that work easily, we really should adopt a layout where the data
files don't all go in the same directory.  Perhaps the simplest is:

* First or only segment of a table goes in top-level data directory, same as now.

* First extension segment is .../data/1/tablename.1, second is
  .../data/2/tablename.2, etc.  (Using numbers for the subdirectory
  names prevents name conflict with ordinary tables.)

Then, just configuring the filesize limit small (a few tens/hundreds
of MB) and setting up symlinks for the subdirectories data/1, data/2,
etc gets the job done.

Starting to feel old --- I remember when a "few tens of MB" was a
monstrous hard disk, never mind a single file ...
        regards, tom lane


Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Peter T Mount
Date:
On Mon, 8 Feb 1999, Bruce Momjian wrote:

> > Not sure. My original choice was to subtract 1 from the calculated
> > maximum, which meant it would split just before the 2Gb limit.
> > 
> > However, running with the value set at the lower value:
> > 
> >  1998585856 Feb  8 02:25 /opt/db/base/test/smallcat
> >   599007232 Feb  8 03:21 /opt/db/base/test/smallcat.1
> > 
> > Total 26653000 rows loaded
> > 
> > Would anyone really notice the lower value?
> > 
> > Perhaps we could make this another compile time setting, like the block
> > size?
> 
> I guess all I am saying is I prefer the max-1 value.  Seems more
> logical.  Could be set in config.h.in, though.

That's what I thought when I posted the small patch. However, there now
seems to be a consensus for a smaller segment size. Toms (for some reason
I called him John yesterday?) idea of 200000 (1.6Gb) works, and I know it
works ok on smaller segment sizes (I used 2Mb segments to see that it
worked past the second segment).

Peter

--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf



Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Peter T Mount
Date:
On Mon, 8 Feb 1999, Tom Lane wrote:

> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> >> However, I'm using John's suggestion of reducing the file size a lot more,
> >> to ensure we don't hit any math errors, etc. So the max file size is about
> >> 1.6Gb.
> 
> > I can imagine people finding that strange.  It it really needed.  Is
> > there some math that could overflow with a larger value?
> 
> Well, that's the question all right --- are you sure that there's not?
> I think "max - 1 blocks" is pushing it, since code that computes
> something like "the byte offset of the block after next" would fail.
> Even if there isn't any such code today, it seems possible that there
> might be someday.
> 
> I'd be comfortable with 2 billion (2000000000) bytes as the filesize
> limit, or Andreas' proposal of 1Gb.

I'm starting to like Andreas' proposal as the new default.

> I also like the proposals to allow the filesize limit to be configured
> even lower to ease splitting huge tables across filesystems.
> 
> To make that work easily, we really should adopt a layout where the data
> files don't all go in the same directory.  Perhaps the simplest is:
> 
> * First or only segment of a table goes in top-level data directory,
>   same as now.
> 
> * First extension segment is .../data/1/tablename.1, second is
>   .../data/2/tablename.2, etc.  (Using numbers for the subdirectory
>   names prevents name conflict with ordinary tables.)

How about dropping the suffix, so you would have:
.../data/2/tablename

Doing that doesn't mean having to increase the filename buffer size, just
the format and arg order (from %s.%d to %d/%s).

I'd think we could add a test when the new segment is created for the
symlink/directory. If it doesn't exist, then create it. Otherwise a poor
unsuspecting user would have their database fall over, not realising where
the error is.

Peter

--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf



Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Thomas Reinke
Date:
Peter T Mount wrote:

> How about dropping the suffix, so you would have:
> 
>         .../data/2/tablename
> 
> Doing that doesn't mean having to increase the filename buffer size, just
> the format and arg order (from %s.%d to %d/%s).
> 
> I'd think we could add a test when the new segment is created for the
> symlink/directory. If it doesn't exist, then create it. Otherwise a poor
> unsuspecting user would have their database fall over, not realising where
> the error is.

This sounds like attempting to solve the "2 Gig limit" along with data
distribution across multiple drives at the same time. I like the
solution to the first part, but I'm not entirely sure that this is an
effective solution to data distribution.

Specifically, distribution of data should be administerable in some
fashion. So, that means you would want to control how much data
is placed on each drive.  Further, you would not want to fill up
one drive before floating over to the next.

Consider 10 1 Gig tables split across two 6 gig drives. Doesn't work
very well if the overflow only happens after 1 Gig. Then, consider
10 1 Gig tables split across a 3 Gig and an 8 Gig drive. Again, doesn't
work very well with ANY sort of fixed size splitting scheme.

I'd suggest making the max file size 1 Gig default, configurable
someplace, and solving the data distribution as a separate effort.

Thomas

------------------------------------------------------------
Thomas Reinke                            Tel: (416) 460-7021
Director of Technology                   Fax: (416) 598-2319
E-Soft Inc.                         http://www.e-softinc.com


Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Tom Lane
Date:
Peter T Mount wrote:
>> How about dropping the suffix, so you would have:
>> .../data/2/tablename
>> Doing that doesn't mean having to increase the filename buffer size, just
>> the format and arg order (from %s.%d to %d/%s).

I thought of that also, but concluded it was a bad idea, because it
means you cannot symlink several of the /n subdirectories to the same
place.  It also seems just plain risky/errorprone to have different
files named the same thing...

>> I'd think we could add a test when the new segment is created for the
>> symlink/directory. If it doesn't exist, then create it.

Absolutely, the system would need to auto-create a /n subdirectory if
one didn't already exist.

Thomas Reinke <reinke@e-softinc.com> writes:
> ... I'm not entirely sure that this is an effective
> solution to data distribution.

Well, I'm certain we could do better if we wanted to put some direct
effort into that issue, but we can get a usable scheme this way with
practically no effort except writing a little how-to documentation.

Assume you have N big tables where you know what N is.  (You probably
have a lot of little tables as well, which we assume can be ignored for
the purposes of space allocation.)  If you configure the max file size
as M megabytes, the toplevel data directory will have M * N megabytes
of stuff (plus little files).  If all the big tables are about the same
size, say K * M meg apiece, then you wind up with K-1 subdirectories
each also containing M * N meg, which you can readily scatter across
different filesystems by setting up the subdirectories as symlinks.
In practice the later subdirectories are probably less full because
the big tables aren't all equally big, but you can put more of them
on a single filesystem to make up for that.
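
A worked example with assumed numbers may help: with the segment size M set to
500MB and five hypothetical big tables of 1800, 1200, 900, 2300 and 400MB, the
sketch below shows how much of each directory they occupy (later subdirectories
hold less, as noted above).

    #include <stdio.h>

    int
    main(void)
    {
        int M = 500;                                   /* segment size in MB (assumed) */
        int tables[] = {1800, 1200, 900, 2300, 400};   /* hypothetical big tables, MB */
        int ntables = 5, maxsegs = 5, dir, t;

        for (dir = 0; dir < maxsegs; dir++)            /* dir 0 is the data directory */
        {
            int total = 0;
            for (t = 0; t < ntables; t++)
            {
                int left = tables[t] - dir * M;        /* MB of this table not yet placed */
                if (left > 0)
                    total += (left > M) ? M : left;
            }
            if (total > 0)
            {
                if (dir == 0)
                    printf("top-level data directory: %d MB of big-table segments\n", total);
                else
                    printf("data/%d: %d MB\n", dir, total);
            }
        }
        return 0;
    }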

If N varies considerably over time then this scheme doesn't work so
well, but I don't see any scheme that would cope with a very variable
database without physically moving files around every so often.

When we get to the point where people are routinely complaining what
a pain in the neck it is to manage big databases this way, it'll be
time enough to improve the design and write some scripts to help
rearrange files on the fly.  Right now, I would just like to see a
scheme that doesn't require the dbadmin to symlink each individual
table file in order to split a big database.  (It could probably be
argued that even doing that much is ahead of the demand, but since
it's so cheap to provide this little bit of functionality we might
as well do it.)

> I'd suggest making the max file size 1 Gig default, configurable
> someplace, and solving the data distribution as a separate effort.

We might actually be saying the same thing, if by that remark you
mean that we can come back later and write "real" data distribution
management tools.  I'm just pointing out that given a configurable
max file size we can have a primitive facility almost for free.
        regards, tom lane


Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Peter T Mount
Date:
On Mon, 8 Feb 1999, Tom Lane wrote:

> Peter T Mount wrote:
> >> How about dropping the suffix, so you would have:
> >> .../data/2/tablename
> >> Doing that doesn't mean having to increase the filename buffer size, just
> >> the format and arg order (from %s.%d to %d/%s).
> 
> I thought of that also, but concluded it was a bad idea, because it
> means you cannot symlink several of the /n subdirectories to the same
> place.  It also seems just plain risky/errorprone to have different
> files named the same thing...

That's true.

[snip]

> >> I'd think we could add a test when the new segment is created for the
> >> symlink/directory. If it doesn't exist, then create it.
> 
> Absolutely, the system would need to auto-create a /n subdirectory if
> one didn't already exist.
> 
> > I'd suggest making the max file size 1 Gig default, configurable
> > someplace, and solving the data distribution as a separate effort.
> 
> We might actually be saying the same thing, if by that remark you
> mean that we can come back later and write "real" data distribution
> management tools.  I'm just pointing out that given a configurable
> max file size we can have a primitive facility almost for free.

We are saying the same thing. To implement having the %d/%s.%d format we'd
need to just add 11 bytes to the temporary buffer (keeping the same
capacity as the current code).

Peter

--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf
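
As a side note, a stand-alone sketch (buffer size and names assumed, not the
backend's constants) of the current versus proposed segment-file naming being
discussed:

    #include <stdio.h>

    #define PATHBUFSZ 1024                    /* assumed; not the backend's constant */

    int
    main(void)
    {
        char        path[PATHBUFSZ];
        const char *relname = "tablename";    /* hypothetical relation name */
        int         segno = 2;

        /* current scheme: segment files sit beside the base file */
        snprintf(path, sizeof(path), "%s.%d", relname, segno);
        printf("current:  %s\n", path);       /* tablename.2 */

        /* proposed scheme: numbered subdirectory, suffix kept */
        snprintf(path, sizeof(path), "%d/%s.%d", segno, relname, segno);
        printf("proposed: %s\n", path);       /* 2/tablename.2 */
        return 0;
    }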



Re: [HACKERS] Problems with >2GB tables on Linux 2.0

From: Tom Lane
Date:
Say guys,

I just noticed that RELSEG_SIZE still hasn't been reduced per the
discussion from early February.  Let's make sure that doesn't slip
through the cracks, OK?

I think Peter Mount was supposed to be off testing this issue.
Peter, did you learn anything further?

We should probably apply the patch to REL6_4 as well...
        regards, tom lane