Discussion: Scale, Normalization, and Table Count
Much to my horror I got an error message while attempting to back up my database tonight:

    pg_dump: NOTICE: ShmemAlloc: out of memory
    pg_dump: Attempt to lock table "thread_204" failed.
    ERROR: LockAcquire: lock table 1 is out of memory

Obviously this is related to my shared memory allocations. I'm not sure how much shared memory my kernel is allowing; I'm running a rather default Red Hat 7.3 installation, and I plan on looking into this.

However, in investigating this problem I've found that it also seems to imply that I have too many tables in my database. I found a post on the Postgres discussion boards from someone who encountered this with 200,000 tables. He was told that this was too many, and that Postgres was not designed to handle databases which were so poorly designed. While I can't imagine that I have more than 10,000 tables at this time, I think that as this system scales I will easily surpass 200,000. I don't think this is due to poor database design, but rather to the requirements of my application. The standard disclaimer applies here, though: I'm an engineer, but I'm not a DB specialist. This is why I'm asking.

Essentially, I have a discussion system. There could be millions of messages in the system; let's pretend there are 10 million for the sake of argument. Of course, these messages are all on different topics. There are, let's say, 1000 messages on each topic, and users will only access one topic at a time. It seems intuitive that it would be less expensive to create an individual table for each topic than it would be to put all of the messages in a single table. If I can directly access the 1000 messages that I need by loading the table that they are in, I save myself from having to search a table which contains 10 million messages. In general, disk space is less expensive than memory or processor power, and so this seems to be a reasonable approach.

It seems clear to me that there are some things I need to be careful about in pursuing this:

1. The table names, as you'll note from my error message, are automatically generated. There are limits on table name length, and I need to make sure that my names stay well below that limit. This is tunable in Postgres, and a 128-character table name length, for example, could handle an astronomical number of table names...

2. I've learned this evening that there is a problem with shared memory allocation. There also appears to be a concern about the number of files in a UNIX directory; right now the directory for this database has some 20,000 files in it. This seems to be easy to solve: I'll just create dynamically generated databases which contain my dynamically generated tables. Each database will only be allowed, say, 5000 tables. (Is this too many?) I wonder what the limits on database names are... I haven't researched it.

3. At some point I'm clearly going to have to contend with the number of inodes available on my file system. This is also easy to solve, as I can put each database on its own partition. I think that it will eventually make sense, in fact, to put different databases on different database servers.

But I have to wonder... Am I barking up the wrong tree here with all these dynamically generated tables? Is it really more reasonable to search a table with 10 million entries for the 1000 I want every time a user wants to read something? Are there other pitfalls that I need to be aware of? Is there a key book that discusses this sort of thing that I could have on my desk?

Thanks,
Tom Cross
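[Editor's note: on a stock Linux 2.4 kernel such as Red Hat 7.3 ships, the kernel shared memory ceiling the poster mentions is controlled by the SHMMAX/SHMALL settings. A minimal sketch of raising it via /etc/sysctl.conf — the 128 MB figure is purely illustrative, not a recommendation from the thread:]

```
# /etc/sysctl.conf -- raise the maximum shared memory segment size
# (illustrative value; size it to the server's shared_buffers needs)
kernel.shmmax = 134217728

# apply without a reboot with:  sysctl -p
```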
I can't speak about most of this, but as to the normalisation...

On Sun, Oct 06, 2002 at 02:39:57AM -0700, Tom Cross wrote:
> Essentially, I have a discussion system. There could be millions of
> messages in the system. Let's pretend that there are 10 million for the
> sake of argument. Of course, these messages are all on different topics.
> There are, let's say, 1000 messages on each topic. Users will only access
> one topic at a time. It seems intuitive that it would be less expensive
> to create an individual table for each topic than it would be to put all
> of the messages in a single table. If I can directly access the 1000
> messages that I need by loading the table that they are in, I save myself
> from having to search a table which contains 10 million messages.

A database system does not "load" tables; it pulls in only those parts it needs. Databases are designed to solve exactly this problem. They will, as a rule, do a much better job of storing data than people will.

> 2. I've learned this evening that there is a problem with shared memory
> allocation. There also appears to be a concern about the number of files
> in a UNIX directory. Right now the directory for this database has some
> 20,000 files in it. This seems to be easy to solve. I'll just create
> dynamically generated databases which include my dynamically generated
> tables. Each database will only be allowed, say, 5000 tables. (Is this
> too many?)

As you can see, you're just digging deeper and deeper holes for yourself, all on the (incorrect) assumption that accessing large tables is slow. Perhaps you should try the everything-in-one-table approach. If it is slower, that's a bug in the database, not in your code. Just imagine how much it would simplify the code.

> 3. At some point I'm clearly going to have to contend with the number of
> inodes available on my file system. This is also easy to solve, as I can
> put each database on its own partition. I think that it will eventually
> make sense, in fact, to put different databases on different database
> servers.

Digging deeper...

> Am I barking up the wrong tree here with all these dynamically generated
> tables? Is it really more reasonable to search a table with 10 million
> entries for the 1000 I want every time a user wants to read something? Are
> there other pitfalls that I need to be aware of? Is there a key book that
> discusses this sort of thing that I could have on my desk?

This is why indexes were invented. We have tables of phone calls totalling millions of rows; looking up the calls for a single customer only takes milliseconds...

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.
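[Editor's note: Martijn's phone-calls example can be made concrete. A minimal SQL sketch — the table and column names are hypothetical, not from his actual system:]

```sql
-- Hypothetical call-record table holding millions of rows
CREATE TABLE calls (
    call_id     serial PRIMARY KEY,
    customer_id integer NOT NULL,
    placed_at   timestamp NOT NULL,
    duration    interval
);

-- The index is what turns "search millions of rows" into "milliseconds"
CREATE INDEX calls_customer_idx ON calls (customer_id);

-- Fetch one customer's calls; EXPLAIN should report an index scan
-- rather than a sequential scan over the whole table
EXPLAIN SELECT * FROM calls WHERE customer_id = 42;
```

Note that the planner only chooses the index when its statistics make that look worthwhile, so run ANALYZE after bulk-loading test data before judging the plan.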
On Sunday 06 Oct 2002 10:39 am, Tom Cross wrote:
> Much to my horror I got an error message while attempting to back up my
> database tonight:

[snipped discussion of a design with many thousands of tables versus one large table with many millions of rows]

> Am I barking up the wrong tree here with all these dynamically generated
> tables? Is it really more reasonable to search a table with 10 million
> entries for the 1000 I want every time a user wants to read something? Are
> there other pitfalls that I need to be aware of? Is there a key book that
> discusses this sort of thing that I could have on my desk?

Given the number of hoops you're trying to jump through (to keep the dog metaphor), I'd say you are barking up the wrong tree. Databases are designed to handle large tables and retrieve information efficiently if it is indexed correctly. Try adding a "topic_id" column and indexing it, then generate a million test messages and see how it goes. A useful tuning step might be to gradually increase the "sort_mem" value in your config file.

In general, one table for each type of thing in your database is the rule.

- Richard Huxton
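[Editor's note: Richard's suggestion — one messages table with an indexed "topic_id" column — might look like the following sketch; the table and column names are illustrative, not taken from the original system:]

```sql
-- One table for all messages, regardless of topic
CREATE TABLE messages (
    message_id serial PRIMARY KEY,
    topic_id   integer NOT NULL,
    posted_at  timestamp NOT NULL DEFAULT now(),
    author     text,
    body       text
);

-- Index so that reading one topic never scans the other ~10 million rows
CREATE INDEX messages_topic_idx ON messages (topic_id);

-- "Loading the topic's table" becomes a single indexed query
SELECT * FROM messages
 WHERE topic_id = 204
 ORDER BY posted_at;
```

This replaces thousands of dynamically generated tables (thread_204 and friends) with one schema object, which also makes the pg_dump locking problem, the directory file count, and the inode concerns disappear.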
Tom Cross <decius@whack.org> writes:
> Much to my horror I got an error message while attempting to back up my
> database tonight:
>
> pg_dump: NOTICE: ShmemAlloc: out of memory
> pg_dump: Attempt to lock table "thread_204" failed.
> ERROR: LockAcquire: lock table 1 is out of memory
>
> Obviously this is related to my shared memory allocations.

The easiest response to this is to increase your max_locks_per_transaction parameter in postgresql.conf. The default (64) seems to be plenty for most people, as we hear reports of such problems only once in a blue moon. But you're right that a system with many thousands of tables might need more.

> There are, let's say, 1000 messages on each topic. Users will only access
> one topic at a time. It seems intuitive that it would be less expensive
> to create an individual table for each topic than it would be to put all
> of the messages in a single table.

I concur with the nearby responses that say this is indeed a bad design.

regards, tom lane
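[Editor's note: the parameter Tom Lane names lives in postgresql.conf. A fragment — the raised value is illustrative; the right number depends on how many tables a single transaction (such as pg_dump's) must lock:]

```
# postgresql.conf
# Default is 64. pg_dump locks every table it dumps inside one
# transaction, so a database with thousands of tables needs more
# lock slots than the default provides.
max_locks_per_transaction = 256
```

Changing this parameter requires a server restart, since it sizes a shared memory structure.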