Discussion: One database vs. hundreds?
I'm hoping to get some advice on a design question I'm grappling with.

I have a database now that in many respects may be regarded as a collection of a few hundred much smaller "parallel databases", all having the same schema. What I mean by this is that, as far as the intended use of this particular system goes, there are no meaningful queries whose results would include information from more than one of these parallel component databases. Furthermore, one could delete all the records of any one of these parallel components without affecting the referential integrity of the rest of the database.

Therefore, both for performance and maintenance reasons, the idea of splitting this database into its components looks very attractive. This would result in a system with hundreds of small databases (and in the future possibly reaching into the low thousands). I don't have experience with such a situation, and I'm wondering if there are issues I should be concerned about.

Alternatively, maybe there are techniques to achieve the benefits of this split without actually carrying it out. The two benefits I see are in the areas of performance and maintenance. As for performance, I assume (naively, I'm sure) that searches will be faster in the individual component databases, simply because it's a search among fewer pieces of information. And for maintenance, I think the split would make the system more robust during database updates, because only a small component would be updated at a time, and the rest of the system would be completely insulated from this.

I'd very much appreciate your thoughts on these issues.

I imagine I'm not the first person to confront this kind of design choice. Does it have a standard name that I could use in a Google search?

TIA!

kj
Kynn Jones wrote:
> I'm hoping to get some advice on a design question I'm grappling with.
> [...]
> I imagine I'm not the first person to confront this kind of design
> choice. Does it have a standard name that I could use in a Google
> search?

Hi Kynn.
I don't have experience in this type of application, but we use pgsql partitioning for other reasons, and it has some of the features you want (data separation, query performance, ...). It can be worth reading: http://www.postgresql.org/docs/8.2/interactive/ddl-partitioning.html

Regards,
--
Kamil
On Tue, 28 Aug 2007 at 14:23:00 +0200, Kamil Srot wrote:
> Kynn Jones wrote:
> > I'm hoping to get some advice on a design question I'm grappling with.
> > [...]
>
> I don't have experience in this type of application, but we use pgsql
> partitioning for other reasons and it has some of the features you want
> (data separation, query performance, ...). It can be worth reading:
> http://www.postgresql.org/docs/8.2/interactive/ddl-partitioning.html

He doesn't need table partitioning; that is a different thing.

Andreas
--
Andreas Kretschmer
Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net
On Tue, 28 Aug 2007 at 08:08:36 -0400, Kynn Jones wrote:
> I'm hoping to get some advice on a design question I'm grappling with.
> [...]

Maybe different schemas, one schema for every "parallel database", can help you. And different rights for the users.

Why one database with many schemas? I suppose you have objects to share with all users, for instance:

- programming languages
- stored procedures
- maybe shared data

Andreas
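A minimal sketch of the schema-per-component layout with per-user rights that Andreas suggests (role, schema, and password names here are hypothetical):

```sql
-- One schema per "parallel database", owned by its own role.
CREATE ROLE tenant_a LOGIN PASSWORD 'secret';
CREATE SCHEMA tenant_a AUTHORIZATION tenant_a;

-- Shared objects (stored procedures, shared data) live in a common
-- schema that every component role can read.
CREATE SCHEMA shared;
GRANT USAGE ON SCHEMA shared TO tenant_a;

-- With this search_path, an unqualified table name resolves to the
-- connecting user's own schema first, then to the shared schema.
ALTER ROLE tenant_a SET search_path = "$user", shared;
```

Each component's users then see only their own tables by default, while shared lookup data is defined once.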
On Tuesday 28 August 2007 06:32:32 A. Kretschmer wrote:
> [...]
> He doesn't need table partitioning, this is a different thing.

I find that creating multiple schemas is often better than creating multiple DBs, since in the remote chance you might want to query across the schemas it's a no-brainer, where queries across DBs are a bit more difficult.

That said, AFAIK, DBs vs. schemas, outside of the query issue I mentioned above, are merely an organizational mechanism. Unless you're talking about multiple clusters it's kind of the old six vs. half a dozen argument, meaning it doesn't really matter for performance or functionality from a strictly postgres perspective. The decision then becomes an application architecture question: how important is it to isolate these DBs/schemas from each other, and is there a foreseeable need to query across them in the future, etc.

Hope this helps. My vote would be for schemas unless there's a requirement on the table for isolation; it leaves your options open and more flexible for the future.

/Kevin
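To illustrate the cross-schema point concretely (schema, table, and column names below are hypothetical), a query spanning two components within one database is just a matter of qualified names:

```sql
-- Within one database, a query can span component schemas freely:
SELECT a.term, b.term
FROM   tenant_a.name AS a
JOIN   tenant_b.name AS b ON a.term = b.term;
```

Across separate databases there is no such plain join; you would need a second connection (or an add-on like dblink), which is why schemas leave more options open.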
Thank you very much for your replies. Given the differences in the opinions expressed, I thought I would describe the database briefly.

The purpose of the database is basically translation of terms. Imagine a collection of disjoint sets A, B, C, ... Now imagine that for each element of a set multiple *names* exist, although even within each set, the number of names varies from one element to the next. To complicate matters further, imagine that there are several *authorities* X, Y, Z, ... on the subject of naming these elements, and they don't always agree. So currently, the relevant portion of the schema is something like:

SET ... one record per available set
AUTHORITY ... one record per naming authority
ELEMENT ... refers to SET; one record per element in any of the sets
NAME ... refers to ELEMENT and to AUTHORITY; its records represent the various names assigned to each element by the various authorities

The largest tables, by far, are ELEMENT and NAME.

It is clear from this description that ELEMENT can be partitioned according to the set each element belongs to, since these sets are disjoint (for the moment I'm using the word "partitioned" in its general sense, not in the sense of Pg's table partitioning that Kamil proposed). Similarly, NAME can be partitioned according to the authorities associated with each naming (even when two authorities agree on a name for a given element, NAME includes separate records to represent this fact). Furthermore, each one of these authority-based partitions can in turn be partitioned according to the set that the referred-to element belongs to.

Therefore, instead of having a single database we could have many databases, one for each combination of set and authority, with the simplified schema:

ELEMENT
NAME ... refers to ELEMENT

From a semantic point of view, all the kinds of queries that the system is designed to support would be satisfied by this design.
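A minimal DDL sketch of the four tables as described (any columns beyond the keys, such as label and term, are hypothetical):

```sql
-- SET is a reserved word in SQL, so the table name must be quoted.
CREATE TABLE "set"     (set_id       serial PRIMARY KEY, label text);
CREATE TABLE authority (authority_id serial PRIMARY KEY, label text);

-- One record per element in any of the sets.
CREATE TABLE element (
    element_id serial  PRIMARY KEY,
    set_id     integer NOT NULL REFERENCES "set"
);

-- One record per (element, authority) naming, even when authorities agree.
CREATE TABLE name (
    name_id      serial  PRIMARY KEY,
    element_id   integer NOT NULL REFERENCES element,
    authority_id integer NOT NULL REFERENCES authority,
    term         text    NOT NULL
);
```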
Given this, it seems to me that Pg's table partitioning does indeed fit the bill after all. I like the idea of being able to bulk-update entire chunks of the database by creating a temporary partition, testing it, deleting the old one, and renaming the temporary partition to replace it. Furthermore, it seems like this arrangement, contrary to my original multiple-DB idea, does not preclude the occasional global query spanning multiple authorities or sets (though these queries would be useful mostly for administrative purposes).

I join btober in my amazement and admiration at the overall coolness of PostgreSQL. I'm sending massive props to the developers right now...

Many thanks!!!

kj
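With 8.2-era inheritance partitioning, the bulk-update-and-swap idea might look roughly like this (the chunk name name_auth_x and the parent name are hypothetical):

```sql
-- Build and test the replacement chunk outside the partition tree.
CREATE TABLE name_auth_x_new (LIKE name_auth_x INCLUDING CONSTRAINTS);
-- ... bulk-load and verify name_auth_x_new here ...

-- Swap it in atomically.
BEGIN;
ALTER TABLE name_auth_x     NO INHERIT name;         -- detach old chunk
ALTER TABLE name_auth_x     RENAME TO name_auth_x_old;
ALTER TABLE name_auth_x_new RENAME TO name_auth_x;
ALTER TABLE name_auth_x     INHERIT name;            -- attach new chunk
COMMIT;

DROP TABLE name_auth_x_old;
```

Readers of the parent table never see a half-loaded chunk; they see the old one until the swap commits, then the new one.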
Kynn Jones wrote:
> I'm hoping to get some advice on a design question I'm grappling with.
> [...]
> I imagine I'm not the first person to confront this kind of design
> choice. Does it have a standard name that I could use in a Google
> search?
I see four ways you can go with this. The choice will be the one that makes more sense for the way you see yourself accessing the data, and only one of them affects the performance the way you mention.

1. Have each set of tables and data in a separate database.
2. Have each set of tables duplicated in a different schema.
3. Have it all in one set of tables, with a column to identify which set each row belongs to.
4. Use the above with ddl-partitioning to achieve data separation, removing the performance costs of having it all in one table.

The first leaves you with complete data separation; you log in differently to access each data set. If you are using psql then you would use a \connect command to look at different data. Other clients may mean you need to disconnect, then connect.

The second allows you to use the same login, and you can either use schema qualifications with the table names to access the data set you want, or change the default schema to change between data sets. The difference will depend on your client. You can use schema-qualified names, with your client inserting the selected schema when generating the SQL. Or you may use SET search_path TO myschema; and then SELECT without schema-qualified names, leaving your SQL as it is.

The last two are similar, with the fourth using the ddl setup to separate the data into separate tables for you, so you will still have the same (or very close) performance as if each is separated by schema or db. Here you would have to include datasetidcolumn=x within each SQL statement. Again, this may be inserted by your client as it generates the SQL.

Whether the tables are separated by db name, schema name or ddl-partitioning, each will be a separate table using its own indexes etc.
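Option 4, in the 8.2 inheritance style described in the docs Kamil linked, might be sketched like this (table and column names are hypothetical):

```sql
-- Parent table holds no rows itself; one child per data set.
CREATE TABLE name (
    element_id integer NOT NULL,
    dataset_id integer NOT NULL,
    term       text    NOT NULL
);

CREATE TABLE name_ds1 (CHECK (dataset_id = 1)) INHERITS (name);
CREATE TABLE name_ds2 (CHECK (dataset_id = 2)) INHERITS (name);

-- With constraint exclusion enabled, the planner skips children whose
-- CHECK constraint contradicts the WHERE clause, so this reads only
-- name_ds1:
SET constraint_exclusion = on;
SELECT term FROM name WHERE dataset_id = 1;
```

So the application keeps issuing `datasetidcolumn = x` queries against one logical table, while the storage and indexes stay per-data-set.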
As far as backup/restore goes, you won't find much difference with either option. Option 3 will mean the data sets are all together in one backup file; the others mean adding a line to your backup script to dump each data set to its own file. If you just want to dumpall to one file then it won't matter either way.

Apart from the first, the other options allow you to share data between data sets: a postcode list, state names, country names...

The choice may also be affected by your knowledge of SQL. The ddl-partitioning may make your table structure more complex and harder for you to think with (mainly when defining a new data set). Yet adding a column will mean adding it in one place (the parent table), not repeating it for each db or schema.

As you mention increasing to a few thousand dbs: what sort of server load are you expecting? Would moving a few hundred of the dbs to another server be easier than setting up load balancing/replication when things get too busy for your current server?

Of course, if you ever want to report on stats comparing all dbs then the last two will be the easiest. You may change your mind there later. Even if each data set is a client's financial records, what if you want to ask which clients spend too much money on office supplies? Which clients will be affected by the new tax changes? Which clients have the largest profit margins? List the top ten clients with the most turnover. If they all relate to different websites, then which is the least productive and not worth continuing?

I guess I rattled on a bit there, but that should cover most of the points you will need to consider to make the choice. Even though you say you don't need the data all together, think forward to what may happen in the future, and how much work it will be if you do change your mind.

--
Shane Ambler
pgSQL@Sheeky.Biz

Get Sheeky @ http://Sheeky.Biz
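The per-data-set dump line Shane mentions could be as simple as pg_dump's -n switch, which restricts the dump to one schema (the database name termdb and the schema names here are hypothetical):

```shell
# Dump a single component schema to its own file.
pg_dump -n tenant_a -f tenant_a.dump termdb

# Or loop over all component schemas in the backup script.
for s in tenant_a tenant_b tenant_c; do
    pg_dump -n "$s" -f "$s.dump" termdb
done
```

With the separate-database option the equivalent loop would run pg_dump once per database name instead of once per schema.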