Discussion: Scale, Normalization, and Table Count

Scale, Normalization, and Table Count

From:
Tom Cross
Date:
Much to my horror I got an error message while attempting to back up my
database tonight:

pg_dump: NOTICE:  ShmemAlloc: out of memory
pg_dump: Attempt to lock table "thread_204" failed.  ERROR:  LockAcquire:
lock table 1 is out of memory

Obviously this is related to my shared memory allocations. I'm not sure
how much shared memory my kernel is allowing; I'm running a fairly default
Red Hat 7.3 installation. I plan on looking into this.

However, in investigating this problem I've found that it also seems to
imply that I have too many tables in my database. I found a post on the
postgres discussion boards from someone who encountered this and had
200,000 tables. He was told that this was too many, and that postgres
was not designed to handle databases that were so poorly designed.

While I can't imagine that I have more than 10,000 tables at this time, I
think that as this system scales I will easily surpass 200,000. I don't
think that this is due to poor database design, but rather to the
requirements of my application. However, the standard disclaimer applies
here: I'm an engineer, but I'm not a DB specialist. This is why I'm asking.

Essentially, I have a discussion system. There could be millions of
messages in the system. Let's pretend that there are 10 million for the
sake of argument. Of course, these messages are all on different topics.
There are, let's say, 1000 messages on each topic. Users will only access
one topic at a time. It seems intuitive that it would be less expensive
to create an individual table for each topic than it would be to put all
of the messages in a single table. If I can directly access the 1000
messages that I need by loading the table that they are in, I save myself
from having to search a table which contains 10 million messages.

In general, disk space is less expensive than memory or processor power,
and so this seems to be a reasonable approach.
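
To make this concrete, here is a rough sketch of the two layouts I'm
weighing against each other (the table and column names here are just
made up for illustration):

-- Layout A: one table per topic, generated dynamically
-- ("thread_204" from the error above is one such table)
CREATE TABLE thread_204 (
    message_id  integer PRIMARY KEY,
    author      text,
    posted_at   timestamp,
    body        text
);

-- Layout B: every message in a single table, with the topic as a column
CREATE TABLE messages (
    message_id  integer PRIMARY KEY,
    topic_id    integer NOT NULL,
    author      text,
    posted_at   timestamp,
    body        text
);

-- Under layout B, reading one topic means picking 1000 rows out of
-- 10 million:
SELECT * FROM messages WHERE topic_id = 204;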

It seems clear to me that there are some things I need to be careful about
in pursuing this.

1. The table names, as you'll note from my error message, are
automatically generated. There are limits on table name length, and I need
to make sure that my generated names stay well within that limit. This is
tunable in postgres, and a 128 character table name length, for example,
could handle an astronomical number of table names...

2. I've learned this evening that there is a problem with shared memory
allocation. There also appears to be a concern about the number of files
in a UNIX directory. Right now the directory for this database has some
20,000 files in it. This seems to be easy to solve. I'll just create
dynamically generated databases which include my dynamically generated
tables. Each database will only be allowed, say, 5000 tables. (Is this
too many?)

I wonder what the limits on database names are... I haven't researched it.

3. At some point I'm clearly going to have to contend with the number of
inodes available on my file system. This is also easy to solve, as I can
put each database on its own partition. I think that it will eventually
make sense, in fact, to put different databases on different database
servers.

But I have to wonder...

Am I barking up the wrong tree here with all these dynamically generated
tables? Is it really more reasonable to search a table with 10 million
entries for the 1000 I want every time a user wants to read something? Are
there other pitfalls that I need to be aware of? Is there a key book that
discusses this sort of thing that I could have on my desk?

Thanks,
Tom Cross


Re: Scale, Normalization, and Table Count

From:
Martijn van Oosterhout
Date:
I can't speak about most of this, but as to the normalisation...

On Sun, Oct 06, 2002 at 02:39:57AM -0700, Tom Cross wrote:
> Essentially, I have a discussion system. There could be millions of
> messages in the system. Let's pretend that there are 10 million for the
> sake of argument. Of course, these messages are all on different topics.
> There are, let's say, 1000 messages on each topic. Users will only access
> one topic at a time. It seems intuitive that it would be less expensive
> to create an individual table for each topic than it would be to put all
> of the messages in a single table. If I can directly access the 1000
> messages that I need by loading the table that they are in, I save myself
> from having to search a table which contains 10 million messages.

A database system does not "load" tables; it pulls in only the parts it
needs. Databases are designed to solve exactly this problem. They will, as
a rule, do a much better job of storing data than people do.

> 2. I've learned this evening that there is a problem with shared memory
> allocation. There also appears to be a concern about the number of files
> in a UNIX directory. Right now the directory for this database has some
> 20,000 files in it. This seems to be easy to solve. I'll just create
> dynamically generated databases which include my dynamically generated
> tables. Each database will only be allowed, say, 5000 tables. (Is this
> too many?)

As you can see, you're just digging deeper and deeper holes for yourself,
all on the (incorrect) assumption that accessing large tables is slow.
Perhaps you should try the everything-in-one-table approach. If it is
slower, it's a bug in the database, not in your code. Just imagine how
much it would simplify the code.

> 3. At some point I'm clearly going to have to contend with the number of
> inodes available on my file system. This is also easy to solve, as I can
> put each database on its own partition. I think that it will eventually
> make sense, in fact, to put different databases on different database
> servers.

Digging deeper...

> Am I barking up the wrong tree here with all these dynamically generated
> tables? Is it really more reasonable to search a table with 10 million
> entries for the 1000 I want every time a user wants to read something? Are
> there other pitfalls that I need to be aware of? Is there a key book that
> discusses this sort of thing that I could have on my desk?

This is why indexes were invented. We have tables of phone calls totalling
millions of rows. Looking up the calls for a single customer takes only
milliseconds...
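
Schematically (the table and column names are made up), it's nothing more
than an index on the column you look things up by:

CREATE INDEX calls_customer_idx ON calls (customer_id);

-- The planner then reads only the rows for the one customer instead of
-- scanning the whole table:
SELECT * FROM calls WHERE customer_id = 12345;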

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

Re: Scale, Normalization, and Table Count

From:
Richard Huxton
Date:
On Sunday 06 Oct 2002 10:39 am, Tom Cross wrote:
> Much to my horror I got an error message while attempting to back up my
> database tonight:

[snipped discussion of a design with many thousands of tables versus one large
table with many millions of rows]

> Am I barking up the wrong tree here with all these dynamically generated
> tables? Is it really more reasonable to search a table with 10 million
> entries for the 1000 I want every time a user wants to read something? Are
> there other pitfalls that I need to be aware of? Is there a key book that
> discusses this sort of thing that I could have on my desk?

Given the number of hoops you're trying to jump through (to keep the dog
metaphor), I'd say you are barking up the wrong tree.

Databases are designed to handle large tables and to retrieve information
efficiently when indexed correctly. Try adding a "topic_id" column and
indexing it, then generate a million test messages and see how it goes. A
useful tuning step might be to gradually increase the "sort_mem" value in
your config file.
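
As a rough sketch, assuming a single "messages" table with a "topic_id"
column (names are only illustrative):

CREATE INDEX messages_topic_idx ON messages (topic_id);
ANALYZE messages;

-- With the index in place, pulling one topic out of 10 million rows can
-- use an index scan instead of a full table scan:
EXPLAIN SELECT * FROM messages WHERE topic_id = 204;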

In general, one table for each type of thing in your database is the rule.
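
For a discussion system, that would typically look something like this
(again, names are only illustrative):

CREATE TABLE topics (
    topic_id  serial PRIMARY KEY,
    title     text NOT NULL
);

CREATE TABLE messages (
    message_id  serial PRIMARY KEY,
    topic_id    integer NOT NULL REFERENCES topics,
    author      text,
    posted_at   timestamp,
    body        text
);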

- Richard Huxton

Re: Scale, Normalization, and Table Count

From:
Tom Lane
Date:
Tom Cross <decius@whack.org> writes:
> Much to my horror I got an error message while attempting to back up my
> database tonight:

> pg_dump: NOTICE:  ShmemAlloc: out of memory
> pg_dump: Attempt to lock table "thread_204" failed.  ERROR:  LockAcquire:
> lock table 1 is out of memory

> Obviously this is related to my shared memory allocations.

The easiest response to this is to increase your
max_locks_per_transaction parameter in postgresql.conf.
The default (64) seems to be plenty for most people, as we hear reports
of such problems only once in a blue moon.  But you're right that a
system with many thousands of tables might need more.
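
For example, in postgresql.conf (the exact value is only illustrative, and
the server needs a restart to pick up the change):

# postgresql.conf
# The lock table is sized at roughly this value times max_connections.
max_locks_per_transaction = 256        # default is 64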

> There are, let's say, 1000 messages on each topic. Users will only access
> one topic at a time. It seems intuitive that it would be less expensive
> to create an individual table for each topic than it would be to put all
> of the messages in a single table.

I concur with the nearby responses that say this is indeed a bad design.

            regards, tom lane