Обсуждение: Storage Location / Tablespaces (try 3)
Me again, I have some more details on my storage location patch This patch would allow the system admin (DBA) to specify the location of databases, tables/indexes and temporary objects (temp tables and temp sort space) independent of the database/system default location. This patch would replace the current "LOCATION" code. Please let me know if you have any questions/comments. I would like to see this feature make 7.3. I believe it will take about 1 month of coding and testing after I get started. Thanks Jim ============================================================================== Storage Location Patch (Try 3) (If people like TABLESPACE instead of LOCATION then s/LOCATION/TABLESPACE/g below) This patch would add the following NEW commands ---------------------------------------------------- CREATE LOCATION name PATH 'dbpath'; DROP LOCATION name; where dbpath is any directory that the postgresql backend can write to. (I know this is how Oracle works, don't know about the other major db systems) The following NEW GLOBAL system table would be added. ----------------------------------------------------- PG_LOCATION ( LOC_NAME name, LOC_PATH text -- This should be able to take any path name. ); (initdb would add (PGDATA,'/usr/local/pgsql/data') The following system tables would need to be modified ----------------------------------------------------- PG_DATABASE drop datpath add DATA_LOC_NAME name or DATA_LOC_OID OID add INDEX_LOC_NAME name orINDEX_LOC_OID OID add TEMP_LOC_NAME name or TEMP_LOC_OID OID PG_CLASS to add LOC_NAME name or LOC_OID OID DATA_LOC_* and INDEX_LOC_* would default to PGDATA if not specified. (I like *LOC_NAME better but I believe the rest of the systems tables use OID) The following command syntax would be modified ------------------------------------------------------ CREATE DATABASE WITH DATA_LOCATION=XXX INDEX_LOCATION=YYY TEMP_LOCATION=ZZZ CREATE TABLE aaa (...) WITH LOCATION=XXX; CREATE TABLE bbb (c1 text primary key location CCC) WITH LOCATION=XXX; CREATE TABLE ccc (c2 text unique location CCC) WITH LOCATION=XXX; CREATE INDEX XXX on SAMPLE (C2) WITH LOCATION BBB; Now for an example ------------------------------------------------------ First: postgresql is installed at /usr/local/pgsql userid postgres the postgres user also is the owner of /pg01/pg02 /pg03 the dba executes the following script CREATE LOCATION pg01 PATH '/pg01'; CREATE LOCATION pg02 PATH '/pg02'; CREATE LOCATION pg03 PATH '/pg03'; CREATE LOCATION bigdata PATH '/bigdata'; CREATE LOCATION bigidx PATH '/bigidx'; \q PG_LOCATION now has pg01 | /pg01 pg02 | /pg02 pg03 | /pg03 bigdata | /bigdata bigidx | /bigidx Now the following command is run CREATE DATABASE jim1 WITH DATA_LOCATION='pg01' INDEX_LOCATION='pg02' TEMP_LOCATION='pg03' -- OID of 'jim1' tuple is 1786146 on disk the directories look like this /pg01/1786146 <<-- Default DATA Location /pg02/1786146 <<-- Default INDEX Location /pg03/1786146 <<-- Default Temp Location All files from the above directories will have symbolic links to /usr/local/pgsql/data/base/1786146/ Now the system will have 1 BIG table that will get its own disk for data and its own disk for index create table big (a text,b text ..., primary key (a,b) location 'bigidx'); oid of big table is 1786150 oid of big table primary key index is 1786151 on disk directories look like this /bigdata/1786146/1786150 /bigidx/1786146/1786151 /usr/local/pgsql/data/base/1786146/1786150 symbolic link to /bigdata/1786146/1786150 /usr/local/pgsql/data/base/1786146/1786151 symbolic link to /bigdata/1786146/1786151 The symbolic links will enable the rest of the software to be location independent.
> (If people like TABLESPACE instead of LOCATION then > s/LOCATION/TABLESPACE/g > below) I like "tablespace" :-) > This patch would add the following NEW commands > ---------------------------------------------------- > CREATE LOCATION name PATH 'dbpath'; > DROP LOCATION name; > The following command syntax would be modified > ------------------------------------------------------ > CREATE DATABASE WITH DATA_LOCATION=XXX INDEX_LOCATION=YYY > TEMP_LOCATION=ZZZ > CREATE TABLE aaa (...) WITH LOCATION=XXX; > CREATE TABLE bbb (c1 text primary key location CCC) WITH LOCATION=XXX; > CREATE TABLE ccc (c2 text unique location CCC) WITH LOCATION=XXX; > CREATE INDEX XXX on SAMPLE (C2) WITH LOCATION BBB; Sounds great, but shouldn't we use syntax that is already around, like Oracle's or DB2's or ... > The symbolic links will enable the rest of the software to be location > independent. I see, that this is the least intrusive way, but I am not sure this is the best way to do it. It would probably be better to pass the Tablespace oid around (or look it up). That would also leave the door open for other "Tablespace types" (currently "Filesystem directory" an OS managed tablespace :-). Andreas
Andreas, My first try passed the tablespace OID arround but someone pointed out the the WAL code doesn't know what the tablespace OID is or what it's location is. This is why I would like to use the symbolic links. Tom do you have any ideas on this? Jim > > (If people like TABLESPACE instead of LOCATION then > > s/LOCATION/TABLESPACE/g > > below) > > I like "tablespace" :-) > > > This patch would add the following NEW commands > > ---------------------------------------------------- > > CREATE LOCATION name PATH 'dbpath'; > > DROP LOCATION name; > > > The following command syntax would be modified > > ------------------------------------------------------ > > CREATE DATABASE WITH DATA_LOCATION=XXX INDEX_LOCATION=YYY > > TEMP_LOCATION=ZZZ > > CREATE TABLE aaa (...) WITH LOCATION=XXX; > > CREATE TABLE bbb (c1 text primary key location CCC) WITH LOCATION=XXX; > > CREATE TABLE ccc (c2 text unique location CCC) WITH LOCATION=XXX; > > CREATE INDEX XXX on SAMPLE (C2) WITH LOCATION BBB; > > Sounds great, but shouldn't we use syntax that is already around, > like Oracle's or DB2's or ... > > > The symbolic links will enable the rest of the software to be location > > independent. > > I see, that this is the least intrusive way, but I am not sure this > is the best way to do it. It would probably be better to pass the > Tablespace oid around (or look it up). > > That would also leave the door open for other "Tablespace types" (currently > "Filesystem directory" an OS managed tablespace :-). > > Andreas
"Jim Buttafuoco" <jim@buttafuoco.net> writes:
> My first try passed the tablespace OID arround but someone pointed out the the
> WAL code doesn't know what the tablespace OID is or what it's location is.
The low-level file access code (including WAL references) names tables
by two OIDs, which currently are database OID and relfilenode (the
latter is NOT to be considered equivalent to table OID, even though it
presently always is equal).
I believe that the correct implementation approach is to revise things
so that the low-level name of a table is tablespace OID + relfilenode;
this physical table name would in concept be completely distinct from
the logical table identification (database OID + table OID). The file
reference path would become something like
"$PGDATA/base/tablespaceoid/relfilenode", where tablespaceoid might
reference a symlink to a directory instead of a plain directory.
Tablespace management then consists of setting up those symlinks
correctly, and there is essentially zero impact on the low-level access
code.
The hard part of this is that we are probably being sloppy in some
places about the difference between physical and logical table
identifications. Those places will need to be found and fixed.
This needs to happen anyway, of course, since the point of introducing
relfilenode was to allow table versioning, which we still want.
Vadim suggested long ago that bufmgr, smgr, and below should have
nothing to do with referencing files by relcache entries; they should
only deal in physical file identifiers. That requires some tedious but
(in principle) straightforward API changes.
BTW, if tablespaces can be shared by databases then DROP DATABASE
becomes rather tricky: how do you zap the correct files out of a shared
tablespace, keeping in mind that you are not logged into the doomed
database and can't look at its catalogs? The best idea I've seen for
this so far is:
1. Access path for tables is really$PGDATA/base/databaseoid/tablespaceoid/relfilenode.
(BTW, we could save some work if we chdir'd into
$PGDATA/base/databaseoid at backend start and then used only relative
tablespaceoid/relfilenode paths. Right now we tend to use absolute
paths because the bootstrap code doesn't do that chdir; which seems
like a stupid solution...)
2. A shared tablespace directory contains a subdirectory for each database
that has files in the tablespace. Thus, the actual filesystem location
of a table is something like<tablespace>/databaseoid/relfilenode
The symlink from a database's $PGDATA/base/databaseoid/ directory to
the tablespace points at <tablespace>/databaseoid. The first attempt to
create a table in a tablespace from a particular database will create
the hard subdirectory and set up the symlink; or perhaps that should be
done by an explicit tablespace management operation to "connect" the
database to the tablespace.
3. To drop a database, we examine the symlinks in its
$PGDATA/base/databaseoid/ and rm -rf each referenced tablespace
subdirectory before rm -rf'ing $PGDATA/base/databaseoid.
regards, tom lane