Re: Millions of tables

Поиск
Список
Период
Сортировка
От Craig James
Тема Re: Millions of tables
Дата
Msg-id CAFwQ8rdx6Uz4zE7dQ4jvr-oYbx+aamkyBnuURca-voyNUd85=g@mail.gmail.com
обсуждение исходный текст
Ответ на Millions of tables  (Greg Spiegelberg <gspiegelberg@gmail.com>)
Ответы Re: Millions of tables
Список pgsql-performance
On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
Hey all,

Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has said not to have millions of tables.  I too have long believed it until recently.

AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those tables.  Table creation initially took 0.018031 secs, average 0.027467 and after tossing out outliers (qty 5) the maximum creation time found was 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.  Tables were created by a single process.  Do note that table creation is done via plpgsql function as there are other housekeeping tasks necessary though minimal.

No system tuning but here is a list of PostgreSQL knobs and switches:
shared_buffers = 2GB
work_mem = 48 MB
max_stack_depth = 4 MB
synchronous_commit = off
effective_cache_size = 200 GB
pg_xlog is on it's own file system

There are some still obvious problems.  General DBA functions such as VACUUM and ANALYZE should not be done.  Each will run forever and cause much grief.  Backups are problematic in the traditional pg_dump and PITR space.  Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test case) are no-no's.  A system or database crash could take potentially hours to days to recover.  There are likely other issues ahead.

You may wonder, "why is Greg attempting such a thing?"  I looked at DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop".  I looked at many others and ultimately the recommended use of each vendor was to have one table for all data.  That overcomes the millions of tables problem, right?

Problem with the "one big table" solution is I anticipate 1,200 trillion records.  Random access is expected and the customer expects <30ms reads for a single record fetch.

You don't give enough details to fully explain the problem you're trying to solve.
  • Will records ever be updated or deleted? If so, what percentage and at what frequency?
  • What specifically are you storing (e.g. list of integers, strings, people's sex habits, ...)? Or more importantly, are these fixed- or variable-sized records?
  • Once the 1,200 trillion records are loaded, is that it? Or do more data arrive, and if so, at what rate?
  • Do your queries change, or is there a fixed set of queries?
  • How complex are the joins?
The reason I ask these specific questions is because, as others have pointed out, this might be a perfect case for a custom (non-relational) database. Relational databases are general-purpose tools, sort of like a Swiss-Army knife. A Swiss-Army knife does most things passably, but if you want to carve wood, or butcher meat, or slice vegetables, you get a knife meant for that specific task.

I've written several custom database-storage systems for very specific high-performance systems. It's generally a couple weeks of work, and you have a tailored performance and storage that's hard for a general-purpose relational system to match.

The difficulty of building such a system depends a lot on the answers to the questions above.

Craig


No data is loaded... yet  Table and index creation only.  I am interested in the opinions of all including tests I may perform.  If you had this setup, what would you capture / analyze?  I have a job running preparing data.  I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second.  The schema has been simplified since and last test reach just over 20,000 records/second with 300k tables.

I'm not looking for alternatives yet but input to my test.  Takers?

I can't promise immediate feedback but will do my best to respond with results.

TIA,
-Greg



--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------

В списке pgsql-performance по дате отправления:

Предыдущее
От: Jim Nasby
Дата:
Сообщение: Re: Storing large documents - one table or partition by doc?
Следующее
От: "Mike Sofen"
Дата:
Сообщение: Re: Millions of tables