Discussion: Architecting a database
I am in the process of moving a system that has been built around FoxPro tables for the last 18 years into a PostgreSQL based system.

Over time I came up with decent strategies for making the FoxPro tables work well with the workload that was placed on them, but we are getting to the point that the locking mechanisms are causing problems when some of the more used tables are being written to.

With the FoxPro tables I had one directory that contained the tables that had global data that was common to all clients. Things like documents that had been received and logged, checks that had been cut, etc. Then each client had his own directory which housed tables that had information relating to that specific client. Setting things up like this kept me from having any tables that were too terribly large, so record addition and index creation were not very time consuming.

I am wondering how I should architect this in PostgreSQL. Should I follow a similar strategy and have a separate database for each client and one database that contains the global data? With the dBase and ISAM tables I have a good idea of how to handle them since I have been working with them since dBASE originally came out. With the PostgreSQL type tables I am not so certain how the data is arranged within the one file. Does having the data all in one database allow PostgreSQL to better utilize indexes and caches, or does having a number of smaller databases provide performance increases? In case it is important, there are 2000 clients involved, so that would be 2000 databases if I followed my current FoxPro related structure. Of course, I suppose it is always possible to combine a number of groups into a database if the number of databases is an issue.
Tables within the client specific databases are generally name and address information, as well as tables for 10 different types of accounts which require different structures; those tables hold anywhere from 10,000 transactions apiece for some smaller groups to 1 million for larger groups. I believe we have a read-to-write ratio of about 1 to 15. Thanks for any input.
<tony@exquisiteimages.com> wrote:

> With the dBase and ISAM tables I have a good idea of how to handle
> them since I have been working with them since dBASE originally
> came out.

Ah, someone with whom I can reminisce about CP/M and WordStar? :-)

> With the PostgreSQL type tables I am not so certain how the data
> is arranged within the one file. Does having the data all in one
> database allow PostgreSQL to better utilize indexes and caches or
> does having a number of smaller databases provide performance
> increases? In case it is important, there are 2000 clients
> involved, so that would be 2000 databases if I followed my current
> FoxPro related structure.

Well, there are many options here. You could have:

- one PostgreSQL cluster for each client,
- one database for each client (all in one cluster),
- one schema for each client (all in one database), or
- a client_id column in each table to segregate data.

The first would probably be a maintenance nightmare; it's just listed for completeness. The cluster is the level at which you start and stop the database engine, do real-time backups through the database transaction logging, etc. You probably don't want to do that individually for each of 2,000 clients, I'm assuming. Besides that, each cluster has its own memory cache, which would probably be a problem for you. (The caching issues go away for all the following options.)

The database is the level at which you can get a connection. You can see some cluster-level resources within all databases, like the list of databases and the list of users, but for the most part each database is independent, even though they're running in the same executable engine. It would be relatively easy to keep the whole cluster (all databases) backed up (especially after 9.0 is released this summer), and you could have a cluster on another machine for standby, if desired.
You are able to do dumps of individual databases, but only as snapshots of a moment in time or through external tools. It's hard to efficiently join data from a table in one database to a table in another.

A schema is a logical separation within a database. Table client1.account is a different table from client2.account. While a user can be limited to tables within a single schema, a user with rights to all the tables can join between them as needed. You could put common reference data in a public schema which all users could access in addition to their private schemas.

The implications of putting multiple clients in a table, with a client's rows identified by a client_id column, are probably fairly obvious. If many of those 2,000 clients have tables with millions of rows, performance could suffer without very careful indexing, managing tables with billions of rows can become challenging, and there could be concerns about how to ensure that data from one client isn't accidentally shown to another.

Hopefully that's enough to allow you to make a good choice. If any of that wasn't clear, please ask.

-Kevin
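To make the per-client-schema option concrete, here is a minimal sketch; the schema, table, and role names are hypothetical, not from the thread:

```sql
-- One schema per client, each with its own copy of the table structure.
CREATE SCHEMA client1;
CREATE SCHEMA client2;

CREATE TABLE client1.account (account_id integer PRIMARY KEY, balance numeric);
CREATE TABLE client2.account (account_id integer PRIMARY KEY, balance numeric);

-- Shared reference data lives in the public schema, visible to everyone.
CREATE TABLE public.document_log (doc_id integer PRIMARY KEY, received date);

-- A client's role can be restricted to its own schema...
CREATE ROLE client1_user LOGIN;
GRANT USAGE ON SCHEMA client1 TO client1_user;
GRANT SELECT, INSERT, UPDATE ON client1.account TO client1_user;

-- ...while a privileged user can still join across schemas as needed.
SELECT a.account_id, a.balance, b.balance
  FROM client1.account a
  JOIN client2.account b USING (account_id);
```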
On 6/25/10 3:28 PM, Kevin Grittner wrote:

> <tony@exquisiteimages.com> wrote:
>> With the PostgreSQL type tables I am not so certain how the data
>> is arranged within the one file. Does having the data all in one
>> database allow PostgreSQL to better utilize indexes and caches or
>> does having a number of smaller databases provide performance
>> increases? In case it is important, there are 2000 clients
>> involved, so that would be 2000 databases if I followed my current
>> FoxPro related structure.
>
> The implications of putting multiple clients in a table, with a
> client's rows identified by a client_id column, are probably fairly
> obvious. If many of those 2,000 clients have tables with millions of
> rows, performance could suffer without very careful indexing,
> managing tables with billions of rows can become challenging, and
> there could be concerns about how to ensure that data from one
> client isn't accidentally shown to another.

You should also ask whether there are social (that is, nontechnical) reasons to avoid multiple clients per table.

When a customer asks about security and you tell them, "You get your own database, nobody else can log in," they tend to like that. If you tell them that their data is mixed with everyone else's, but "we've done a really good job with our app software and we're pretty sure there are no bugs that would let anyone see your data," that may not fly.

People will trust Postgres security (assuming you actually do it right) because it's an open source, trusted product used by some really big companies. But your own app? Do you even trust it?

Even if your application IS secure, it may not matter. It's what the customer believes or worries about that can sell your product.

We've also found another really good reason for separate databases. It lets you experiment without any impact on anything else. We have scripts that can create a database, load it up, and have it ready to demo in just a few minutes.
If we don't end up using it, we just blow it away and it's gone. No other database is impacted at all.

Craig
Kevin Grittner wrote:

> A schema is a logical separation within a database. Table
> client1.account is a different table from client2.account. While a
> user can be limited to tables within a single schema, a user with
> rights to all the tables can join between them as needed. You could
> put common reference data in a public schema which all users could
> access in addition to their private schemas.

My guess would be that this app will end up being best split by schema. I wonder whether it *also* needs to be split by database. 2000 clusters is clearly a nightmare, and putting all the client data into one big table has both performance and security issues; that leaves database and schema as possible splits. However, having 2000 databases in a cluster is probably too many, and having 2000 schemas in a database might also be too many. There are downsides to expanding either of those to such a high quantity.

In order to keep both of those in the domain where they perform well and are manageable, it may be that what's needed is, say, 50 databases with 40 schemas each, rather than 2000 of either. Hard to say the ideal ratio. However, I think that at the application design level, it would be wise to consider each client as having a database+schema pair unique to them, with the assumption that some shared data may need to be replicated to all the databases in the cluster. Then it's possible to shift the trade-off around as needed once the app is built. Building that level of flexibility in shouldn't be too hard if it's in the design from day one, but it would be a painful bit of refactoring to do later. Once there's a prototype, some benchmark work running that app could be done to figure out the correct ratio between the two. It might even make sense to consider full scalability from day one and make the unique client connection info host:port:database:schema.

P.S. Very refreshing to get asked about this before rather than after a giant app that doesn't perform well is deployed.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.us
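One way to keep that host:port:database:schema mapping flexible is a small routing table the application consults at connection time. This is only a sketch of the idea, with made-up table and column names:

```sql
-- Hypothetical routing table: the app looks up where a client lives
-- before opening its connection and setting its search_path.
CREATE TABLE client_location (
    client_id   integer PRIMARY KEY,
    host        text    NOT NULL,
    port        integer NOT NULL DEFAULT 5432,
    dbname      text    NOT NULL,
    schema_name text    NOT NULL
);

-- Moving a client to another database later is then just an UPDATE
-- here (plus a dump/restore of that one schema), not an app change.
INSERT INTO client_location
VALUES (42, 'db03.example.com', 5432, 'group_07', 'client_42');
```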
Interesting point you made about the read to write ratio of 1 to 15.
How frequently will you be adding new entities? Or, in the case of storing the customers in one database table, how frequently will you be adding new objects of a given entity type? How many entity types do you foresee existing, e.g. "Customer"? Will Customer have subtypes, or is Customer the single entity in the database?
How frequent and how long are write operations, and are they heavily transaction based? What is the max number of customers? And how much data (approximately) will a single customer record consume, in bytes? At what rate does it grow (in bytes)?
Will your system need to support any type of complex reporting in the future (despite it being write intensive)?
I'd take a look at memcached, plproxy, pgpool, and some of the other cool stuff in the postgresql community.
At a minimum, it might help you architect the system in such a manner that you don't box yourself in.
Lastly, KV stores for heavy write-intensive operations in distributed environments are certainly interesting; a hybrid solution could work.
Sounds like a fun project!
Bryan
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
On 26/06/2010 3:36 AM, tony@exquisiteimages.com wrote:

> I am in the process of moving a system that has been built around FoxPro
> tables for the last 18 years into a PostgreSQL based system.
>
> Over time I came up with decent strategies for making the FoxPro tables
> work well with the workload that was placed on them, but we are getting to
> the point that the locking mechanisms are causing problems when some of
> the more used tables are being written to.
>
> With the FoxPro tables I had one directory that contained the tables that
> had global data that was common to all clients. Things like documents that
> had been received and logged, checks that had been cut, etc. Then each
> client had his own directory which housed tables that had information
> relating to that specific client.

> I am wondering how I should architect this in PostgreSQL. Should I follow
> a similar strategy and have a separate database for each client and one
> database that contains the global data?

No - use separate schema within a single database. You can't do inter-database queries in PostgreSQL, and most things you're used to using different "databases" for are best done with separate schema (namespaces) within one database. A schema is almost a logical directory, really.

> With the dBase and ISAM tables I
> have a good idea of how to handle them since I have been working with them
> since dBASE originally came out. With the PostgreSQL type tables I am not
> so certain how the data is arranged within the one file. Does having the
> data all in one database allow PostgreSQL to better utilize indexes and
> caches or does having a number of smaller databases provide performance
> increases?

It doesn't really make much difference, and for easier management a single database for a single app is very much the way to go.

> In case it is important, there are 2000 clients involved, so
> that would be 2000 databases if I followed my current FoxPro related
> structure.

Nonono!
Definitely use different schema if you need to separate things this way. -- Craig Ringer
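For the per-client-schema approach, the usual trick is to set the connection's search_path so application code doesn't need schema-qualified names. A sketch, with hypothetical schema and role names:

```sql
-- At connection time, point this session at one client's schema,
-- falling back to public for the shared reference tables.
SET search_path TO client_42, public;

-- Now unqualified names resolve per client:
SELECT * FROM account;   -- reads client_42.account

-- It can also be made sticky for a dedicated per-client role:
ALTER ROLE client_42_user SET search_path TO client_42, public;
```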
On Fri, Jun 25, 2010 at 3:36 PM, <tony@exquisiteimages.com> wrote:

> I am in the process of moving a system that has been built around FoxPro
> tables for the last 18 years into a PostgreSQL based system.
>
> Over time I came up with decent strategies for making the FoxPro tables
> work well with the workload that was placed on them, but we are getting to
> the point that the locking mechanisms are causing problems when some of
> the more used tables are being written to.
>
> With the FoxPro tables I had one directory that contained the tables that
> had global data that was common to all clients. Things like documents that
> had been received and logged, checks that had been cut, etc. Then each
> client had his own directory which housed tables that had information
> relating to that specific client. Setting things up like this kept me from
> having any tables that were too terribly large so record addition and
> index creation were not very time consuming.
>
> I am wondering how I should architect this in PostgreSQL. Should I follow
> a similar strategy and have a separate database for each client and one
> database that contains the global data? With the dBase and ISAM tables I
> have a good idea of how to handle them since I have been working with them
> since dBASE originally came out. With the PostgreSQL type tables I am not
> so certain how the data is arranged within the one file. Does having the
> data all in one database allow PostgreSQL to better utilize indexes and
> caches or does having a number of smaller databases provide performance
> increases? In case it is important, there are 2000 clients involved, so
> that would be 2000 databases if I followed my current FoxPro related
> structure. Of course, I suppose it is always possible to combine a number
> of groups into a database if the number of databases is an issue.
> Tables within the client specific databases are generally name and address
> information as well as tables for 10 different types of accounts which
> require different structures and those tables hold anywhere from 10,000
> transactions apiece for some smaller groups to 1 million for larger
> groups. I believe we have a read-to-write ratio of about 1 to 15.
>
> Thanks for any input.

congratulations. I developed on foxpro for years and I can tell you you've come to the right place: your porting process should be relatively pain free. foxpro had a couple of nice features that aren't found in too many other places: expression indexes (which we have) and first class queries (which we have, if you count pl/pgsql). foxpro was also an enormous headache on so many levels, which I assume is why you are here. I've long harbored a suspicion that Microsoft enjoyed adding to those headaches rather than subtracting from them.

Others have answered the data organization question. You definitely want to use schemas to logically separate private application data inside your database...this is basically the purpose of schemas.

Data in SQL tables is considered unordered (we have no concept of recno) unless an explicit ordering criterion is given. Direct access to the tables (BROWSE) has no analog in SQL. A query is sent to the database; results are gathered, buffered, and sent back. This is the #1 thing you will have to get used to coming from dbase style coding.

The locking model in postgres is completely different (better). Records are implicitly locked by writing to them, and the locks are released at transaction end (optimistic locking plus). As a bonus, data doesn't get corrupted when you break the rules =).

For backend data processing tasks I advise you to use pl/pgsql. Coming from foxpro you should have no problems. You are going to have to replace your GUI and report generator. The first question is whether or not to go web...and following that, which technologies to use.
You may have already figured all this out, but perhaps you haven't. Foxpro does have odbc connectivity, so you may have entertained ideas of simply moving your application w/o porting the code. This may or may not work (just a heads up) -- the performance of the foxpro odbc translation is not so great and some of your code won't translate well. If you didn't use foxpro for the front end, it's going to depend on what you're using.

Once you get used to postgres and how it reads and writes data, don't worry so much about performance. As long as you avoid certain paradigms postgres doesn't like, the performance of your new database should absolutely nuke what you're used to, especially in the multi user case. You will have no problems on the backend -- it's the front end where your main concerns should be.

good luck.

merlin
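Since expression indexes came up as the FoxPro feature PostgreSQL shares, here is what one looks like on the Postgres side; the table and column names are made up for illustration:

```sql
-- An index on an expression rather than a bare column, much like a
-- FoxPro index on an expression such as UPPER(lastname).
CREATE TABLE client_name (
    id       serial PRIMARY KEY,
    lastname text NOT NULL
);

CREATE INDEX client_name_upper_idx ON client_name (upper(lastname));

-- A query using the same expression can be satisfied by the index:
SELECT * FROM client_name WHERE upper(lastname) = 'SMITH';
```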
tony@exquisiteimages.com writes:

> I am wondering how I should architect this in PostgreSQL. Should I follow
> a similar strategy and have a separate database for each client and one
> database that contains the global data?

As others said already, there are more problems to foresee in doing so than there are advantages. If you must separate data for security concerns, your situation would be much more comfortable using schema.

If it's all about performance, see about partitioning the data, and maybe not even on the client id but monthly, e.g., depending on the queries you run in your application.

Regards,
--
dim
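At the time of this thread, partitioning in PostgreSQL meant table inheritance plus CHECK constraints (declarative partitioning arrived much later). A minimal monthly sketch, with hypothetical table and column names:

```sql
-- The parent table holds no rows itself; each child carries a CHECK
-- constraint so constraint exclusion can skip irrelevant months.
CREATE TABLE transaction (
    txn_id   bigint,
    txn_date date NOT NULL,
    amount   numeric
);

CREATE TABLE transaction_2010_06 (
    CHECK (txn_date >= '2010-06-01' AND txn_date < '2010-07-01')
) INHERITS (transaction);

CREATE TABLE transaction_2010_07 (
    CHECK (txn_date >= '2010-07-01' AND txn_date < '2010-08-01')
) INHERITS (transaction);

-- With constraint_exclusion enabled, this scans only the June child:
SET constraint_exclusion = on;
SELECT sum(amount) FROM transaction
 WHERE txn_date >= '2010-06-01' AND txn_date < '2010-06-16';
```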
Thanks for all of the input, everyone. I believe I am going to put together a test case using schemas and partitioning, and then double the amount of data currently in the system to give me an idea of how things will be performing a couple of years down the road.

I was looking at a server using the new Opteron 6100 series; it would have 32 cores, but the speed is 2GHz. I read a post earlier today that mentioned in passing that it was better to have a faster processor than more cores. I was wondering whether this would be a good selection, since there are CPUs in the Intel line that are quad core at up to 3.3GHz.
On Jun 30, 2010, at 11:12 AM, tony@exquisiteimages.com wrote:

> I read a post
> earlier today that mentioned in passing that it was better to have a
> faster processor than more cores.

This really depends on your workload and how much you value latency vs. throughput. If you tend to have a lot of very simple queries, more cores => more throughput, and it may not matter much if your queries take 20ms or 30ms if you can be doing a dozen or two more of them concurrently on an AMD system than on an Intel one. On the other hand, if you have fewer clients, or more latency-sensitive clients, then fewer-but-faster cores is usually a win.

Either way, the amount of power you can get for your money is pretty impressive.