Discussion: Architecting a database
I am in the process of moving a system that has been built around FoxPro tables for the last 18 years into a PostgreSQL based system.

Over time I came up with decent strategies for making the FoxPro tables work well with the workload that was placed on them, but we are getting to the point that the locking mechanisms are causing problems when some of the more used tables are being written to.

With the FoxPro tables I had one directory that contained the tables that had global data that was common to all clients. Things like documents that had been received and logged, checks that had been cut, etc. Then each client had his own directory which housed tables that had information relating to that specific client. Setting things up like this kept me from having any tables that were too terribly large, so record addition and index creation were not very time consuming.

I am wondering how I should architect this in PostgreSQL. Should I follow a similar strategy and have a separate database for each client and one database that contains the global data? With the dBase and ISAM tables I have a good idea of how to handle them since I have been working with them since dBASE originally came out. With the PostgreSQL type tables I am not so certain how the data is arranged within the one file. Does having the data all in one database allow PostgreSQL to better utilize indexes and caches, or does having a number of smaller databases provide performance increases? In case it is important, there are 2000 clients involved, so that would be 2000 databases if I followed my current FoxPro related structure. Of course, I suppose it is always possible to combine a number of groups into a database if the number of databases is an issue.
Tables within the client specific databases are generally name and address information, as well as tables for 10 different types of accounts which require different structures; those tables hold anywhere from 10,000 transactions apiece for some smaller groups to 1 million for larger groups. I believe we have a read-to-write ratio of about 1 to 15. Thanks for any input.
<tony@exquisiteimages.com> wrote:

> With the dBase and ISAM tables I have a good idea of how to handle
> them since I have been working with them since dBASE originally
> came out.

Ah, someone with whom I can reminisce about CP/M and WordStar? :-)

> With the PostgreSQL type tables I am not so certain how the data
> is arranged within the one file. Does having the data all in one
> database allow PostgreSQL to better utilize indexes and caches or
> does having a number of smaller databases provide performance
> increases? In case it is important, there are 2000 clients
> involved, so that would be 2000 databases if I followed my current
> FoxPro related structure.

Well, there are many options here. You could have:

- one PostgreSQL cluster for each client,
- one database for each client (all in one cluster),
- one schema for each client (all in one database), or
- a client_id column in each table to segregate data.

The first would probably be a maintenance nightmare; it's just listed for completeness. The cluster is the level at which you start and stop the database engine, do real-time backups through the database transaction logging, etc. You probably don't want to do that individually for each of 2,000 clients, I'm assuming. Besides that, each cluster has its own memory cache, which would probably be a problem for you. (The caching issues go away for all the following options.)

The database is the level at which you can get a connection. You can see some cluster-level resources within all databases, like the list of databases and the list of users, but for the most part each database is independent, even though they're running in the same executable engine. It would be relatively easy to keep the whole cluster (all databases) backed up (especially after 9.0 is released this summer), and you could have a cluster on another machine for standby, if desired.
You are able to do dumps of individual databases, but only as snapshots of a moment in time or through external tools. It's hard to efficiently join data from a table in one database to a table in another.

A schema is a logical separation within a database. Table client1.account is a different table from client2.account. While a user can be limited to tables within a single schema, a user with rights to all the tables can join between them as needed. You could put common reference data in a public schema which all users could access in addition to their private schemas.

The implications of putting multiple clients in a table, with a client's rows identified by a client_id column, are probably fairly obvious. If many of those 2,000 clients have tables with millions of rows, performance could suffer without very careful indexing, managing tables with billions of rows can become challenging, and there could be concerns about how to ensure that data from one client isn't accidentally shown to another.

Hopefully that's enough to allow you to make a good choice. If any of that wasn't clear, please ask.

-Kevin
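To make the per-client-schema option concrete, here is a minimal sketch; the schema, table, and role names are hypothetical, not from the thread:

```sql
-- One schema per client, each with its own copy of the table structure.
CREATE SCHEMA client1;
CREATE SCHEMA client2;

CREATE TABLE client1.account (account_id integer PRIMARY KEY, balance numeric);
CREATE TABLE client2.account (account_id integer PRIMARY KEY, balance numeric);

-- Shared reference data lives in the public schema, visible to everyone.
CREATE TABLE public.document_log (doc_id integer PRIMARY KEY, received date);

-- A client's role can be restricted to its own schema...
CREATE ROLE client1_user LOGIN;
GRANT USAGE ON SCHEMA client1 TO client1_user;
GRANT SELECT, INSERT, UPDATE ON client1.account TO client1_user;

-- ...while a privileged user can still join across schemas as needed.
SELECT a.account_id, a.balance, b.balance
  FROM client1.account a
  JOIN client2.account b USING (account_id);
```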
On 6/25/10 3:28 PM, Kevin Grittner wrote:

> <tony@exquisiteimages.com> wrote:
>> With the PostgreSQL type tables I am not so certain how the data
>> is arranged within the one file. Does having the data all in one
>> database allow PostgreSQL to better utilize indexes and caches or
>> does having a number of smaller databases provide performance
>> increases? In case it is important, there are 2000 clients
>> involved, so that would be 2000 databases if I followed my current
>> FoxPro related structure.
>
> The implications of putting multiple clients in a table, with a
> client's rows identified by a client_id column, are probably fairly
> obvious. If many of those 2,000 clients have tables with millions of
> rows, performance could suffer without very careful indexing,
> managing tables with billions of rows can become challenging, and
> there could be concerns about how to ensure that data from one
> client isn't accidentally shown to another.

You should also ask whether there are social (that is, nontechnical) reasons to avoid multiple clients per table.

When a customer asks about security and you tell them, "You get your own database, nobody else can log in," they tend to like that. If you tell them that their data is mixed with everyone else's, but "we've done a really good job with our app software and we're pretty sure there are no bugs that would let anyone see your data," that may not fly.

People will trust Postgres security (assuming you actually do it right) because it's an open source, trusted product used by some really big companies. But your own app? Do you even trust it?

Even if your application IS secure, it may not matter. It's what the customer believes or worries about that can sell your product.

We've also found another really good reason for separate databases. It lets you experiment without any impact on anything else. We have scripts that can create a database, load it up, and have it ready to demo in just a few minutes.
If we don't end up using it, we just blow it away and it's gone. No other database is impacted at all.

Craig
Kevin Grittner wrote:

> A schema is a logical separation within a database. Table
> client1.account is a different table from client2.account. While a
> user can be limited to tables within a single schema, a user with
> rights to all the tables can join between them as needed. You could
> put common reference data in a public schema which all users could
> access in addition to their private schemas.

My guess would be that this app will end up being best split by schema. I wonder whether it *also* needs to be split by database. 2000 clusters is clearly a nightmare, and putting all the client data into one big table has both performance and security issues; that leaves database and schema as possible splits. However, having 2000 databases in a cluster is probably too many, and having 2000 schemas in a database might also be too many. There are downsides to expanding either of those to such a high quantity.

In order to keep both of those in the domain where they perform well and are manageable, it may be that what's needed is, say, 50 databases with 40 schemas each, rather than 2000 of either. Hard to say the ideal ratio. However, I think that at the application design level, it would be wise to consider each client as having a database+schema pair unique to them, with the assumption that some shared data may need to be replicated to all the databases in the cluster. Then it's possible to shift the trade-off around as needed once the app is built. Building that level of flexibility in shouldn't be too hard if it's in the design from day one, but it would be a painful bit of refactoring to do later. Once there's a prototype, some benchmark work running that app could be done to figure out the correct ratio between the two. It might even make sense to consider full scalability from day one and make the unique client connection info host:port:database:schema.

P.S. Very refreshing to get asked about this before rather than after a giant app that doesn't perform well is deployed.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.us
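One way to keep that host:port:database:schema mapping flexible is a small routing table the application consults at connection time. This is only a sketch of the idea, with made-up table and column names:

```sql
-- Hypothetical routing table: the app looks up where a client lives
-- before opening its connection and setting its search_path.
CREATE TABLE client_location (
    client_id   integer PRIMARY KEY,
    host        text    NOT NULL,
    port        integer NOT NULL DEFAULT 5432,
    dbname      text    NOT NULL,
    schema_name text    NOT NULL
);

-- Moving a client to another database later is then just an UPDATE
-- here (plus a dump/restore of that one schema), not an app change.
INSERT INTO client_location
VALUES (42, 'db03.example.com', 5432, 'group_07', 'client_42');
```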
Interesting point you made about the read to write ratio of 1 to 15.
How frequently will you be adding new entities? Or, in the case of storing the customers in one database table, how frequently will you be adding new objects of a given entity type? How many entity types do you foresee existing, e.g. "Customer"? Will Customer have subtypes, or is Customer the single entity in the database?
How frequent and how long are write operations, and are they heavily transaction based? What is the max number of customers? And how much data (approximately) will a single customer record consume, in bytes? At what rate does it grow (in bytes)?
Will your system need to support any type of complex reporting in the future (despite it being write intensive)?
I'd take a look at memcached, plproxy, pgpool, and some of the other cool stuff in the postgresql community.
At a minimum, it might help you architect the system in such a manner that you don't box yourself in.
Lastly, KV stores for heavy write-intensive operations in distributed environments are certainly interesting; a hybrid solution could work.
Sounds like a fun project!
Bryan
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
On 26/06/2010 3:36 AM, tony@exquisiteimages.com wrote:

> I am in the process of moving a system that has been built around FoxPro
> tables for the last 18 years into a PostgreSQL based system.
>
> Over time I came up with decent strategies for making the FoxPro tables
> work well with the workload that was placed on them, but we are getting to
> the point that the locking mechanisms are causing problems when some of
> the more used tables are being written to.
>
> With the FoxPro tables I had one directory that contained the tables that
> had global data that was common to all clients. Things like documents that
> had been received and logged, checks that had been cut, etc. Then each
> client had his own directory which housed tables that had information
> relating to that specific client.

> I am wondering how I should architect this in PostgreSQL. Should I follow
> a similar strategy and have a separate database for each client and one
> database that contains the global data?

No - use separate schema within a single database. You can't do inter-database queries in PostgreSQL, and most things you're used to using different "databases" for are best done with separate schema (namespaces) within one database. A schema is almost a logical directory, really.

> With the dBase and ISAM tables I
> have a good idea of how to handle them since I have been working with them
> since dBASE originally came out. With the PostgreSQL type tables I am not
> so certain how the data is arranged within the one file. Does having the
> data all in one database allow PostgreSQL to better utilize indexes and
> caches or does having a number of smaller databases provide performance
> increases?

It doesn't really make much difference, and for easier management a single database for a single app is very much the way to go.

> In case it is important, there are 2000 clients involved, so
> that would be 2000 databases if I followed my current FoxPro related
> structure.

Nonono!
Definitely use different schema if you need to separate things this way. -- Craig Ringer
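For the per-client-schema approach, the usual trick is to set the connection's search_path so application code doesn't need schema-qualified names. A sketch, with hypothetical schema and role names:

```sql
-- At connection time, point this session at one client's schema,
-- falling back to public for the shared reference tables.
SET search_path TO client_42, public;

-- Now unqualified names resolve per client:
SELECT * FROM account;   -- reads client_42.account

-- It can also be made sticky for a dedicated per-client role:
ALTER ROLE client_42_user SET search_path TO client_42, public;
```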
On Fri, Jun 25, 2010 at 3:36 PM, <tony@exquisiteimages.com> wrote:

> I am in the process of moving a system that has been built around FoxPro
> tables for the last 18 years into a PostgreSQL based system.
>
> Over time I came up with decent strategies for making the FoxPro tables
> work well with the workload that was placed on them, but we are getting to
> the point that the locking mechanisms are causing problems when some of
> the more used tables are being written to.
>
> With the FoxPro tables I had one directory that contained the tables that
> had global data that was common to all clients. Things like documents that
> had been received and logged, checks that had been cut, etc. Then each
> client had his own directory which housed tables that had information
> relating to that specific client. Setting things up like this kept me from
> having any tables that were too terribly large so record addition and
> index creation were not very time consuming.
>
> I am wondering how I should architect this in PostgreSQL. Should I follow
> a similar strategy and have a separate database for each client and one
> database that contains the global data? With the dBase and ISAM tables I
> have a good idea of how to handle them since I have been working with them
> since dBASE originally came out. With the PostgreSQL type tables I am not
> so certain how the data is arranged within the one file. Does having the
> data all in one database allow PostgreSQL to better utilize indexes and
> caches or does having a number of smaller databases provide performance
> increases? In case it is important, there are 2000 clients involved, so
> that would be 2000 databases if I followed my current FoxPro related
> structure. Of course, I suppose it is always possible to combine a number
> of groups into a database if the number of databases is an issue.
> Tables within the client specific databases are generally name and address
> information as well as tables for 10 different types of accounts which
> require different structures and those tables hold anywhere from 10,000
> transactions apiece for some smaller groups to 1 million for larger
> groups. I believe we have a read-to-write ratio of about 1 to 15.
>
> Thanks for any input.

congratulations. I developed on foxpro for years and I can tell you you've come to the right place: your porting process should be relatively pain free. foxpro had a couple of nice features that aren't found in too many other places: expression indexes (which we have) and first class queries (which we have, if you count pl/pgsql). foxpro was also an enormous headache on so many levels, which I assume is why you are here. I've long harbored a suspicion that Microsoft enjoyed adding to those headaches rather than subtracting from them.

Others have answered the data organization question. You definitely want to use schemas to logically separate private application data inside your database...this is basically the purpose of schemas.

Data in SQL tables is considered unordered (we have no concept of recno) unless an explicit ordering criterion is given. Direct access to the tables (BROWSE) has no analog in SQL. A query is sent to the database; results are gathered, buffered, and sent back. This is the #1 thing you will have to get used to coming from dbase style coding.

The locking model in postgres is completely different (better). Records are implicitly locked by writing to them, and the locks are released at transaction end (optimistic locking plus). As a bonus, data doesn't get corrupted when you break the rules =).

For backend data processing tasks I advise you to use pl/pgsql. Coming from foxpro you should have no problems. You are going to have to replace your GUI and report generator. The first question is whether or not to go web...and following that, which technologies to use.
You may have already figured all this out, but perhaps you haven't. Foxpro does have odbc connectivity, so you may have entertained ideas of simply moving your application w/o porting the code. This may or may not work (just a heads up) -- the performance of the foxpro odbc translation is not so great and some of your code won't translate well. If you didn't use foxpro for the front end, it's going to depend on what you're using.

Once you get used to postgres and how it reads and writes data, don't worry so much about performance. As long as you avoid certain paradigms postgres doesn't like, the performance of your new database should absolutely nuke what you're used to, especially in the multi user case. You will have no problems on the backend -- it's the front end where your main concerns should be.

good luck.

merlin
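Since expression indexes came up as the FoxPro feature PostgreSQL shares, here is what one looks like on the Postgres side; the table and column names are made up for illustration:

```sql
-- An index on an expression rather than a bare column, much like a
-- FoxPro index on an expression such as UPPER(lastname).
CREATE TABLE client_name (
    id       serial PRIMARY KEY,
    lastname text NOT NULL
);

CREATE INDEX client_name_upper_idx ON client_name (upper(lastname));

-- A query using the same expression can be satisfied by the index:
SELECT * FROM client_name WHERE upper(lastname) = 'SMITH';
```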
tony@exquisiteimages.com writes:

> I am wondering how I should architect this in PostgreSQL. Should I follow
> a similar strategy and have a separate database for each client and one
> database that contains the global data?

As others said already, there are more problems to foresee in doing so than there are advantages. If you must separate data for security concerns, your situation would be much more comfortable using schema.

If it's all about performance, see about partitioning the data, and maybe not even on the client id but monthly, e.g., depending on the queries you run in your application.

Regards,
--
dim
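At the time of this thread, partitioning in PostgreSQL meant table inheritance plus CHECK constraints (declarative partitioning arrived much later). A minimal monthly sketch, with hypothetical table and column names:

```sql
-- The parent table holds no rows itself; each child carries a CHECK
-- constraint so constraint exclusion can skip irrelevant months.
CREATE TABLE transaction (
    txn_id   bigint,
    txn_date date NOT NULL,
    amount   numeric
);

CREATE TABLE transaction_2010_06 (
    CHECK (txn_date >= '2010-06-01' AND txn_date < '2010-07-01')
) INHERITS (transaction);

CREATE TABLE transaction_2010_07 (
    CHECK (txn_date >= '2010-07-01' AND txn_date < '2010-08-01')
) INHERITS (transaction);

-- With constraint_exclusion enabled, this scans only the June child:
SET constraint_exclusion = on;
SELECT sum(amount) FROM transaction
 WHERE txn_date >= '2010-06-01' AND txn_date < '2010-06-16';
```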
Thanks for all of the input, everyone. I believe I am going to put together a test case using schemas and partitioning, and then double the amount of data currently in the system to give me an idea of how things will be performing a couple of years down the road.

I was looking at a server using the new Opteron 6100 series; it would have 32 cores, but the speed is 2GHz. I read a post earlier today that mentioned in passing that it was better to have a faster processor than more cores. I was wondering whether this would be a good selection, since there are CPUs in the Intel line that are quad core at up to 3.3GHz.
On Jun 30, 2010, at 11:12 AM, tony@exquisiteimages.com wrote:

> I read a post
> earlier today that mentioned in passing that it was better to have a
> faster processor than more cores.

This really depends on your workload and how much you value latency vs. throughput. If you tend to have a lot of very simple queries, more cores => more throughput, and it may not matter much if your queries take 20ms or 30ms if you can be doing a dozen or two more of them concurrently on an AMD system than on an Intel one. On the other hand, if you have fewer clients, or more latency-sensitive clients, then fewer-but-faster cores is usually a win.

Either way, the amount of power you can get for your money is pretty impressive.