Discussion: Millions of tables


Millions of tables

From:
Greg Spiegelberg
Date:
Hey all,

Obviously, everyone who has been around PostgreSQL, or almost any other RDBMS, for a time has said not to have millions of tables.  I too had long believed it, until recently.

AWS d2.8xlarge instance with 9.5 is my test rig, using XFS on EBS (io1) for PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those tables.  Table creation initially took 0.018031 secs, averaged 0.027467 secs, and after tossing out outliers (qty 5) the maximum creation time observed was 0.66139 seconds.  Total time: 30 hours, 31 minutes and 8.435049 seconds.  Tables were created by a single process.  Do note that table creation is done via a plpgsql function, as there are other housekeeping tasks necessary, though minimal.
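For context, the creation loop is conceptually no more than the sketch below (table and index names here are illustrative only; the real function also does the housekeeping I mentioned):

CREATE OR REPLACE FUNCTION create_shards(n int) RETURNS void AS $$
BEGIN
  FOR i IN 1..n LOOP
    -- one table plus a second index per table: 8M tables -> 16M indexes
    EXECUTE format('CREATE TABLE data_%s (id int8 PRIMARY KEY, ts timestamptz, flag boolean)', i);
    EXECUTE format('CREATE INDEX ON data_%s (ts)', i);
  END LOOP;
END;
$$ LANGUAGE plpgsql;
-- e.g.  SELECT create_shards(8000000);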

No system tuning but here is a list of PostgreSQL knobs and switches:
shared_buffers = 2GB
work_mem = 48 MB
max_stack_depth = 4 MB
synchronous_commit = off
effective_cache_size = 200 GB
pg_xlog is on its own file system

There are still some obvious problems.  General DBA functions such as VACUUM and ANALYZE should not be done.  Each will run forever and cause much grief.  Backups are problematic in the traditional pg_dump and PITR space.  Large JOINs via VIEW, SELECT or table inheritance (which I am abusing in my test case) are no-nos.  A system or database crash could take potentially hours to days to recover.  There are likely other issues ahead.

You may wonder, "why is Greg attempting such a thing?"  I looked at DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it, it's antiquated, and don't get me started on "Hadoop".  I looked at many others, and ultimately each vendor's recommendation was to have one table for all data.  That overcomes the millions-of-tables problem, right?

The problem with the "one big table" solution is that I anticipate 1,200 trillion records.  Random access is expected, and the customer expects <30ms reads for a single record fetch.

No data is loaded... yet.  Table and index creation only.  I am interested in the opinions of all, including tests I may perform.  If you had this setup, what would you capture / analyze?  I have a job running preparing data.  I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second.  The schema has been simplified since, and the last test reached just over 20,000 records/second with 300k tables.

I'm not looking for alternatives yet, but for input to my test.  Takers?

I can't promise immediate feedback but will do my best to respond with results.

TIA,
-Greg

Re: Millions of tables

From:
julyanto SUTANDANG
Date:
Dear Greg,

Have you checked PostgresXL?
With millions of tables, how does the app choose which table is appropriate?
In my opinion, at that scale it should go with parallel query and
data sharding, which is what PostgresXL does.

Thanks,


Julyanto SUTANDANG

Equnix Business Solutions, PT
(An Open Source and Open Mind Company)
www.equnix.co.id
Pusat Niaga ITC Roxy Mas Blok C2/42.  Jl. KH Hasyim Ashari 125, Jakarta Pusat
T: +6221 22866662 F: +62216315281 M: +628164858028


On Mon, Sep 26, 2016 at 9:50 AM, Greg Spiegelberg
<gspiegelberg@gmail.com> wrote:
> Hey all,
>
> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time
> has said not to have millions of tables.  I too have long believed it until
> recently.
>
> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for
> PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those
> tables.  Table creation initially took 0.018031 secs, average 0.027467 and
> after tossing out outliers (qty 5) the maximum creation time found was
> 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.
> Tables were created by a single process.  Do note that table creation is
> done via plpgsql function as there are other housekeeping tasks necessary
> though minimal.
>
> No system tuning but here is a list of PostgreSQL knobs and switches:
> shared_buffers = 2GB
> work_mem = 48 MB
> max_stack_depth = 4 MB
> synchronous_commit = off
> effective_cache_size = 200 GB
> pg_xlog is on it's own file system
>
> There are some still obvious problems.  General DBA functions such as VACUUM
> and ANALYZE should not be done.  Each will run forever and cause much grief.
> Backups are problematic in the traditional pg_dump and PITR space.  Large
> JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test
> case) are no-no's.  A system or database crash could take potentially hours
> to days to recover.  There are likely other issues ahead.
>
> You may wonder, "why is Greg attempting such a thing?"  I looked at
> DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it,
> it's antiquated and don't get me started on "Hadoop".  I looked at many
> others and ultimately the recommended use of each vendor was to have one
> table for all data.  That overcomes the millions of tables problem, right?
>
> Problem with the "one big table" solution is I anticipate 1,200 trillion
> records.  Random access is expected and the customer expects <30ms reads for
> a single record fetch.
>
> No data is loaded... yet  Table and index creation only.  I am interested in
> the opinions of all including tests I may perform.  If you had this setup,
> what would you capture / analyze?  I have a job running preparing data.  I
> did this on a much smaller scale (50k tables) and data load via function
> allowed close to 6,000 records/second.  The schema has been simplified since
> and last test reach just over 20,000 records/second with 300k tables.
>
> I'm not looking for alternatives yet but input to my test.  Takers?
>
> I can't promise immediate feedback but will do my best to respond with
> results.
>
> TIA,
> -Greg


Re: Millions of tables

From:
"Mike Sofen"
Date:

From: Greg Spiegelberg  Sent: Sunday, September 25, 2016 7:50 PM
… Over the weekend, I created 8M tables with 16M indexes on those tables. 

… A system or database crash could take potentially hours to days to recover.  There are likely other issues ahead.

 

You may wonder, "why is Greg attempting such a thing?"  I looked at DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop".  Problem with the "one big table" solution is I anticipate 1,200 trillion records.  Random access is expected and the customer expects <30ms reads for a single record fetch.

 

I'm not looking for alternatives yet but input to my test.

_________

 

Holy guacamole, batman!  Ok, here’s my take:  you’ve traded the risks/limitations of the known for the risks of the unknown.  The unknown being, in the numerous places where postgres historical development may have cut corners, you may be the first to exercise those corners and flame out like the recent SpaceX rocket.

 

Put it another way – you’re going to bet your career (perhaps) or a client’s future on an architectural model that just doesn’t seem feasible.  I think you’ve got a remarkable design problem to solve, and am glad you’ve chosen to share that problem with us.

 

And I do think it will boil down to this: it’s not that you CAN do it on Postgres (which you clearly can), but once in production, assuming things are actually stable, how will you handle the data management aspects like inevitable breakage, data integrity issues, backups, restores, user contention for resources, fault tolerance and disaster recovery?  Just listing the tables will take forever.  Add a column?  Never.  I do think the amount of testing you’ll need to do to prove that every normal data management function still works at that table count…that in itself is not going to be a lot of fun.

 

This one hurts my head.  Ironically, the most logical destination for this type of data may actually be Hadoop – auto-scale, auto-shard, fault tolerant, etc…and I’m not a Hadoopie.

 

I am looking forward to hearing how this all plays out, it will be quite an adventure!  All the best,

 

Mike Sofen (Synthetic Genomics…on Postgres 9.5x)

Re: Millions of tables

From:
Greg Spiegelberg
Date:
Precisely why I shared with the group.  I must understand the risks involved.  I need to explore whether it can be stable at this size and, if not, at what point it becomes unstable.  Aside from locking down user access to superuser, is there a way to prohibit database-wide VACUUM & ANALYZE?  I am certainly putting my trust in autovacuum :) which is something I have not yet fully explored how best to tune.
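For the autovacuum tuning, I assume the per-table storage parameters are the main knobs to experiment with, something like the sketch below (table name and values are illustrative only, not settled choices):

ALTER TABLE data_1 SET (autovacuum_vacuum_scale_factor = 0.0,
                        autovacuum_vacuum_threshold = 10000,
                        autovacuum_analyze_threshold = 10000);
-- or, per table, switch it off entirely:
ALTER TABLE data_1 SET (autovacuum_enabled = false);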

Couple more numbers... ~231 GB is the size of PGDATA with 8M empty tables and 16M empty indexes.  ~5% of inodes on the file system have been used.  Sar data during the 8M table creation shows a very stable and regular I/O pattern.  Not a blip worth mentioning.

Another point worth mentioning, the tables contain a boolean, int8's and timestamptz's only.  Nothing of variable size like bytea, text, json or xml.  Each of the 8M tables will contain on the very high side between 140k and 200k records.  The application also has a heads up as to which table contains which record.  The searches come in saying "give me record X from partition key Y" where Y identifies the table and X is used in the filter on the table.
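Roughly, the read path looks like the sketch below (names are made up here; the real function resolves the partition key to the physical table):

CREATE OR REPLACE FUNCTION fetch_record(part_key bigint, rec_id int8)
RETURNS TABLE (id int8, ts timestamptz, flag boolean) AS $$
BEGIN
  -- Y (part_key) picks the table, X (rec_id) is the filter inside it
  RETURN QUERY EXECUTE format(
    'SELECT id, ts, flag FROM data_%s WHERE id = $1', part_key)
    USING rec_id;
END;
$$ LANGUAGE plpgsql;
-- e.g.  SELECT * FROM fetch_record(123456, 42);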

Last point, add column will never be done.  I can hear eyes rolling :) but the schema and its intended use are complete.  You'll have to trust me on that one.

-Greg

On Sun, Sep 25, 2016 at 9:23 PM, Mike Sofen <msofen@runbox.com> wrote:

From: Greg Spiegelberg  Sent: Sunday, September 25, 2016 7:50 PM
… Over the weekend, I created 8M tables with 16M indexes on those tables. 

… A system or database crash could take potentially hours to days to recover.  There are likely other issues ahead.

 

You may wonder, "why is Greg attempting such a thing?"  I looked at DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop".  Problem with the "one big table" solution is I anticipate 1,200 trillion records.  Random access is expected and the customer expects <30ms reads for a single record fetch.

 

I'm not looking for alternatives yet but input to my test.

_________

 

Holy guacamole, batman!  Ok, here’s my take:  you’ve traded the risks/limitations of the known for the risks of the unknown.  The unknown being, in the numerous places where postgres historical development may have cut corners, you may be the first to exercise those corners and flame out like the recent SpaceX rocket.

 

Put it another way – you’re going to bet your career (perhaps) or a client’s future on an architectural model that just doesn’t seem feasible.  I think you’ve got a remarkable design problem to solve, and am glad you’ve chosen to share that problem with us.

 

And I do think it will boil down to this: it’s not that you CAN do it on Postgres (which you clearly can), but once in production, assuming things are actually stable, how will you handle the data management aspects like inevitable breakage, data integrity issues, backups, restores, user contention for resources, fault tolerance and disaster recovery.  Just listing the tables will take forever.  Add a column?  Never.  I do think the amount of testing you’ll need to do prove that every normal data management function still works at that table count…that in itself is going to be not a lot of fun.

 

This one hurts my head.  Ironically, the most logical destination for this type of data may actually be Hadoop – auto-scale, auto-shard, fault tolerant, etc…and I’m not a Hadoopie.

 

I am looking forward to hearing how this all plays out, it will be quite an adventure!  All the best,

 

Mike Sofen (Synthetic Genomics…on Postgres 9.5x)


Re: Millions of tables

From:
Greg Spiegelberg
Date:
I did look at PostgresXL and CitusDB.  Both are admirable; however, neither could support the need to read a random record consistently in under 30ms.  It's a similar problem to the one Cassandra and others have: network latency.  At this scale, to provide the ability to access any given record amongst trillions, it is imperative to know precisely where it is stored (system & database) and to read a relatively small index.  I have other requirements that prohibit use of any technology that is eventually consistent.

I liken the problem to fishing.  To find a particular fish of length, size, color &c in a data lake you must accept the possibility of scanning the entire lake.  However, if all fish were in barrels where each barrel had a particular kind of fish of specific length, size, color &c then the problem is far simpler.

-Greg

On Sun, Sep 25, 2016 at 9:04 PM, julyanto SUTANDANG <julyanto@equnix.co.id> wrote:
Dear Greg,

Have you checked PostgresXL ?
with millions of table, how the apps choose which table is approriate?
in my opinion, with that scale it should go with parallel query with
data sharing like what PostgresXL is done.

Thanks,


Julyanto SUTANDANG

Equnix Business Solutions, PT
(An Open Source and Open Mind Company)
www.equnix.co.id
Pusat Niaga ITC Roxy Mas Blok C2/42.  Jl. KH Hasyim Ashari 125, Jakarta Pusat
T: +6221 22866662 F: +62216315281 M: +628164858028


On Mon, Sep 26, 2016 at 9:50 AM, Greg Spiegelberg
<gspiegelberg@gmail.com> wrote:
> Hey all,
>
> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time
> has said not to have millions of tables.  I too have long believed it until
> recently.
>
> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for
> PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those
> tables.  Table creation initially took 0.018031 secs, average 0.027467 and
> after tossing out outliers (qty 5) the maximum creation time found was
> 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.
> Tables were created by a single process.  Do note that table creation is
> done via plpgsql function as there are other housekeeping tasks necessary
> though minimal.
>
> No system tuning but here is a list of PostgreSQL knobs and switches:
> shared_buffers = 2GB
> work_mem = 48 MB
> max_stack_depth = 4 MB
> synchronous_commit = off
> effective_cache_size = 200 GB
> pg_xlog is on it's own file system
>
> There are some still obvious problems.  General DBA functions such as VACUUM
> and ANALYZE should not be done.  Each will run forever and cause much grief.
> Backups are problematic in the traditional pg_dump and PITR space.  Large
> JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test
> case) are no-no's.  A system or database crash could take potentially hours
> to days to recover.  There are likely other issues ahead.
>
> You may wonder, "why is Greg attempting such a thing?"  I looked at
> DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it,
> it's antiquated and don't get me started on "Hadoop".  I looked at many
> others and ultimately the recommended use of each vendor was to have one
> table for all data.  That overcomes the millions of tables problem, right?
>
> Problem with the "one big table" solution is I anticipate 1,200 trillion
> records.  Random access is expected and the customer expects <30ms reads for
> a single record fetch.
>
> No data is loaded... yet  Table and index creation only.  I am interested in
> the opinions of all including tests I may perform.  If you had this setup,
> what would you capture / analyze?  I have a job running preparing data.  I
> did this on a much smaller scale (50k tables) and data load via function
> allowed close to 6,000 records/second.  The schema has been simplified since
> and last test reach just over 20,000 records/second with 300k tables.
>
> I'm not looking for alternatives yet but input to my test.  Takers?
>
> I can't promise immediate feedback but will do my best to respond with
> results.
>
> TIA,
> -Greg

Re: Millions of tables

From:
Gavin Flower
Date:
Hi Greg,

Please follow the conventions of this mailing list, to avoid confusion
- see the bottom of this posting for further comments.


On 26/09/16 17:05, Greg Spiegelberg wrote:
> Precisely why I shared with the group.  I must understand the risks
> involved.  I need to explore if it can be stable at this size when
> does it become unstable?  Aside from locking down user access to
> superuser, is there a way to prohibit database-wide VACUUM & ANALYZE?
> Certainly putting my trust in autovacuum :) which is something I have
> not yet fully explored how to best tune.
>
> Couple more numbers... ~231 GB is the size of PGDATA with 8M empty
> tables and 16M empty indexes.  ~5% of inodes on the file system have
> been used.  Sar data during the 8M table creation shows a very stable
> and regular I/O pattern.  Not a blip worth mentioning.
>
> Another point worth mentioning, the tables contain a boolean, int8's
> and timestamptz's only.  Nothing of variable size like bytea, text,
> json or xml.  Each of the 8M tables will contain on the very high side
> between 140k and 200k records.  The application also has a heads up as
> to which table contains which record.  The searches come in saying
> "give me record X from partition key Y" where Y identifies the table
> and X is used in the filter on the table.
>
> Last point, add column will never be done.  I can hear eyes rolling :)
> but the schema and it's intended use is complete.  You'll have to
> trust me on that one.
>
> -Greg
>
> On Sun, Sep 25, 2016 at 9:23 PM, Mike Sofen <msofen@runbox.com
> <mailto:msofen@runbox.com>> wrote:
>
>     *From:*Greg Spiegelberg *Sent:* Sunday, September 25, 2016 7:50 PM
>     … Over the weekend, I created 8M tables with 16M indexes on those
>     tables.
>
>     … A system or database crash could take potentially hours to days
>     to recover.  There are likely other issues ahead.
>
>     You may wonder, "why is Greg attempting such a thing?"  I looked
>     at DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's
>     face it, it's antiquated and don't get me started on "Hadoop".
>     Problem with the "one big table" solution is I anticipate 1,200
>     trillion records.  Random access is expected and the customer
>     expects <30ms reads for a single record fetch.
>
>     I'm not looking for alternatives yet but input to my test.
>
>     _________
>
>     Holy guacamole, batman!  Ok, here’s my take:  you’ve traded the
>     risks/limitations of the known for the risks of the unknown.  The
>     unknown being, in the numerous places where postgres historical
>     development may have cut corners, you may be the first to exercise
>     those corners and flame out like the recent SpaceX rocket.
>
>     Put it another way – you’re going to bet your career (perhaps) or
>     a client’s future on an architectural model that just doesn’t seem
>     feasible.  I think you’ve got a remarkable design problem to
>     solve, and am glad you’ve chosen to share that problem with us.
>
>     And I do think it will boil down to this: it’s not that you CAN do
>     it on Postgres (which you clearly can), but once in production,
>     assuming things are actually stable, how will you handle the data
>     management aspects like inevitable breakage, data integrity
>     issues, backups, restores, user contention for resources, fault
>     tolerance and disaster recovery.  Just listing the tables will
>     take forever.  Add a column? Never.  I do think the amount of
>     testing you’ll need to do prove that every normal data management
>     function still works at that table count…that in itself is going
>     to be not a lot of fun.
>
>     This one hurts my head.  Ironically, the most logical destination
>     for this type of data may actually be Hadoop – auto-scale,
>     auto-shard, fault tolerant, etc…and I’m not a Hadoopie.
>
>     I am looking forward to hearing how this all plays out, it will be
>     quite an adventure!  All the best,
>
>     Mike Sofen (Synthetic Genomics…on Postgres 9.5x)
>
>
In this list, the convention is to post replies at the end (with some rare exceptions), or interspersed when
appropriate, and to omit parts no longer relevant.

The motivation for bottom posting like this is that people get to see the context before the reply, AND emails
don't end up getting longer & longer as people reply at the beginning, forgetting to trim the now irrelevant
stuff at the end.


Cheers,
Gavin



Re: Millions of tables

From:
Jeff Janes
Date:

On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
Hey all,

Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has said not to have millions of tables.  I too have long believed it until recently.

AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those tables.  Table creation initially took 0.018031 secs, average 0.027467 and after tossing out outliers (qty 5) the maximum creation time found was 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.  Tables were created by a single process.  Do note that table creation is done via plpgsql function as there are other housekeeping tasks necessary though minimal.

No system tuning but here is a list of PostgreSQL knobs and switches:
shared_buffers = 2GB
work_mem = 48 MB
max_stack_depth = 4 MB
synchronous_commit = off
effective_cache_size = 200 GB
pg_xlog is on it's own file system

There are some still obvious problems.  General DBA functions such as VACUUM and ANALYZE should not be done.  Each will run forever and cause much grief.

Why would the auto versions of those cause less grief than the manual versions?
 
  Backups are problematic in the traditional pg_dump and PITR space. 

Is there a third option to those two spaces?  File-system snapshots?
 
Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test case) are no-no's.  A system or database crash could take potentially hours to days to recover. 

Isn't that a show-stopper?
 
There are likely other issues ahead.

You may wonder, "why is Greg attempting such a thing?"  I looked at DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop".  I looked at many others and ultimately the recommended use of each vendor was to have one table for all data.  That overcomes the millions of tables problem, right?

Problem with the "one big table" solution is I anticipate 1,200 trillion records.  Random access is expected and the customer expects <30ms reads for a single record fetch.

Sorry, I don't really follow.  Whether you have 1 table or millions, eventually someone has to go get the data off the disk. Why would the number of tables make much of a difference to that fundamental?

Also, how many tablespaces do you anticipate having?  Can you get 120 petabytes of storage all mounted to one machine?
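(Back of the envelope, with an assumed ~100 bytes per row: 1,200 trillion rows is roughly 1.2e17 bytes, i.e. on the order of 120 petabytes before indexes.)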


No data is loaded... yet  Table and index creation only.  I am interested in the opinions of all including tests I may perform.  If you had this setup, what would you capture / analyze?  I have a job running preparing data.  I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second.  The schema has been simplified since and last test reach just over 20,000 records/second with 300k tables.

I'm not looking for alternatives yet but input to my test.  Takers?

Go through and put one row (or 8kB worth of rows) into each of the 8 million tables.  The stats collector and the autovacuum process will start going nuts.  Now, maybe you can deal with it.  But maybe not.  That is the first non-obvious thing I'd look at.
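A rough sketch of what I mean, assuming a data_N naming scheme and the column types you described (both assumptions on my part, adjust to the real schema):

DO $$
BEGIN
  FOR i IN 1..8000000 LOOP
    EXECUTE format(
      'INSERT INTO data_%s (id, ts, flag) VALUES (1, now(), true)', i);
  END LOOP;
END;
$$;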

Cheers,

Jeff

Re: Millions of tables

From:
julyanto SUTANDANG
Date:
(Sorry for my last email, which was also not bottom-posted.)

Hi Greg, 
On Mon, Sep 26, 2016 at 11:19 AM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
I did look at PostgresXL and CitusDB.  Both are admirable however neither could support the need to read a random record consistently under 30ms.  It's a similar problem Cassandra and others have: network latency.  At this scale, to provide the ability to access any given record amongst trillions it is imperative to know precisely where it is stored (system & database) and read a relatively small index.  I have other requirements that prohibit use of any technology that is eventually consistent.
Then you can get below 30ms, but how many processes might you have to run concurrently?
This is something that you should consider: a single machine has fewer than 50 hardware threads for Intel, 192 HT for Power8, which is still far below millions compared with the number of tables (8 million).
If you use indexes correctly, you would not need a sequential scan, since the scanning runs in memory (the index is loaded into memory).
Do you plan to query through the master table of the partitions? That is quite slow, actually, considering the millions of rules to check for every query.

With 8 million tables' worth of data, you would require very big data storage for sure, and it would not fit mounted on a single machine unless you plan to use IBM z machines.


I liken the problem to fishing.  To find a particular fish of length, size, color &c in a data lake you must accept the possibility of scanning the entire lake.  However, if all fish were in barrels where each barrel had a particular kind of fish of specific length, size, color &c then the problem is far simpler.

Re: Millions of tables

From:
Álvaro Hernández Tortosa
Date:

On 26/09/16 05:50, Greg Spiegelberg wrote:
> Hey all,
>
> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a
> time has said not to have millions of tables.  I too have long
> believed it until recently.
>
> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1)
> for PGDATA.  Over the weekend, I created 8M tables with 16M indexes on
> those tables.  Table creation initially took 0.018031 secs, average
> 0.027467 and after tossing out outliers (qty 5) the maximum creation
> time found was 0.66139 seconds.  Total time 30 hours, 31 minutes and
> 8.435049 seconds.  Tables were created by a single process. Do note
> that table creation is done via plpgsql function as there are other
> housekeeping tasks necessary though minimal.
>
> No system tuning but here is a list of PostgreSQL knobs and switches:
> shared_buffers = 2GB
> work_mem = 48 MB
> max_stack_depth = 4 MB
> synchronous_commit = off
> effective_cache_size = 200 GB
> pg_xlog is on it's own file system
>
> There are some still obvious problems.  General DBA functions such as
> VACUUM and ANALYZE should not be done.  Each will run forever and
> cause much grief.  Backups are problematic in the traditional pg_dump
> and PITR space.  Large JOIN's by VIEW, SELECT or via table inheritance
> (I am abusing it in my test case) are no-no's.  A system or database
> crash could take potentially hours to days to recover.  There are
> likely other issues ahead.
>
> You may wonder, "why is Greg attempting such a thing?"  I looked at
> DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face
> it, it's antiquated and don't get me started on "Hadoop".  I looked at
> many others and ultimately the recommended use of each vendor was to
> have one table for all data.  That overcomes the millions of tables
> problem, right?
>
> Problem with the "one big table" solution is I anticipate 1,200
> trillion records.  Random access is expected and the customer expects
> <30ms reads for a single record fetch.
>
> No data is loaded... yet  Table and index creation only. I am
> interested in the opinions of all including tests I may perform.  If
> you had this setup, what would you capture / analyze?  I have a job
> running preparing data.  I did this on a much smaller scale (50k
> tables) and data load via function allowed close to 6,000
> records/second.  The schema has been simplified since and last test
> reach just over 20,000 records/second with 300k tables.
>
> I'm not looking for alternatives yet but input to my test. Takers?
>
> I can't promise immediate feedback but will do my best to respond with
> results.
>
> TIA,
> -Greg

     Hi Greg.

     This is a problem (creating a large number of tables; really large
indeed) that we researched in my company a while ago. You might want to
read about it: https://www.pgcon.org/2013/schedule/events/595.en.html

     Cheers,

     Álvaro


--

Álvaro Hernández Tortosa


-----------
8Kdata



Re: Millions of tables

From:
Stuart Bishop
Date:
On 26 September 2016 at 11:19, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
I did look at PostgresXL and CitusDB.  Both are admirable however neither could support the need to read a random record consistently under 30ms.  It's a similar problem Cassandra and others have: network latency.  At this scale, to provide the ability to access any given record amongst trillions it is imperative to know precisely where it is stored (system & database) and read a relatively small index.  I have other requirements that prohibit use of any technology that is eventually consistent.

I liken the problem to fishing.  To find a particular fish of length, size, color &c in a data lake you must accept the possibility of scanning the entire lake.  However, if all fish were in barrels where each barrel had a particular kind of fish of specific length, size, color &c then the problem is far simpler.

-Greg

My gut tells me that if you do solve the problem and get PostgreSQL (or anything) reading consistently at under 30ms with that many tables you will have solved one problem by creating another.

You discounted Cassandra due to network latency, but are now trying a monolithic PostgreSQL setup. It might be worth trying a single node ScyllaDB or Cassandra deploy (no need for QUORUM or network overhead), perhaps using layered compaction so all your data gets broken out into 160MB chunks. And certainly wander over to the ScyllaDB mailing list, as they are very focused on performance problems like yours and should offer some insight even if a Cassandra style architecture cannot meet your requirements.

An alternative, if you exhaust or don't trust the other options: use a foreign data wrapper to access your own custom storage. With a single table at the PG level, you can shard the data yourself into 8 bazillion separate stores, in whatever structure suits your read and write operations (maybe reusing an embedded db engine, ordered flat file+log+index, whatever).

Re: Millions of tables

From:
Rick Otten
Date:
Are the tables constantly being written to, or is this a mostly read scenario?  One architecture possibility, if the writes are not so frequent, is to create just a handful of very big tables for writing, and then make smaller tables as materialized views for reading.  The vacuum and bloat management could be done back at the big tables.  The materialized views could be refreshed or replaced during non-peak hours.  The materialized views could be on a different tablespace than the root tables.  They could also be structured to reflect real-world query patterns, which are sometimes different than the raw data storage engineering problem.
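As a sketch of the pattern (all names and columns hypothetical):

-- hypothetical write-side table: big_writes(shard_key int8, id int8, ts timestamptz, flag boolean)
CREATE MATERIALIZED VIEW read_shard_42 AS
  SELECT id, ts, flag FROM big_writes WHERE shard_key = 42;
CREATE UNIQUE INDEX ON read_shard_42 (id);

-- refreshed off-peak; CONCURRENTLY avoids blocking readers (requires the unique index)
REFRESH MATERIALIZED VIEW CONCURRENTLY read_shard_42;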

With some logging you may be able to see that the data is not truly randomly accessed, but rather clustered around just some of the millions of tables.  Then the engineering problem becomes "How do I service 90% of the queries on these tables in 30ms?" rather than "How do I service 100% of the queries 100% of the time in 30ms?"  Knowing that 90% of the queries hit just a few hundred tables makes the first question easier to answer.

Similarly, if most of the columns are static and only a few columns are actually changing, you could consider pulling the static stuff out of the same table with the dynamic stuff and then look at joins in your queries.  The end goal is to be able to get solid indexes and tables that don't change a lot so they can be tightly packed and cached.  (less bloat, less fragmentation, fewer disk accesses).

With regards to consistent query performance, I think you need to get out of AWS.  That environment is terrible if you are going for consistency unless you buy dedicated hardware, and then you are paying so much money it is ridiculous.

Also I think having 10M rows in a table is not a problem for the query times you are referring to.  So instead of millions of tables, unless I'm doing my math wrong, you probably only need thousands of tables.



On Mon, Sep 26, 2016 at 5:43 AM, Stuart Bishop <stuart@stuartbishop.net> wrote:
On 26 September 2016 at 11:19, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
I did look at PostgresXL and CitusDB.  Both are admirable however neither could support the need to read a random record consistently under 30ms.  It's a similar problem Cassandra and others have: network latency.  At this scale, to provide the ability to access any given record amongst trillions it is imperative to know precisely where it is stored (system & database) and read a relatively small index.  I have other requirements that prohibit use of any technology that is eventually consistent.

I liken the problem to fishing.  To find a particular fish of length, size, color &c in a data lake you must accept the possibility of scanning the entire lake.  However, if all fish were in barrels where each barrel had a particular kind of fish of specific length, size, color &c then the problem is far simpler.

-Greg

My gut tells me that if you do solve the problem and get PostgreSQL (or anything) reading consistently at under 30ms with that many tables you will have solved one problem by creating another.

You discounted Cassandra due to network latency, but are now trying a monolithic PostgreSQL setup. It might be worth trying a single node ScyllaDB or Cassandra deploy (no need for QUORUM or network overhead), perhaps using layered compaction so all your data gets broken out into 160MB chunks. And certainly wander over to the ScyllaDB mailing list, as they are very focused on performance problems like yours and should offer some insight even if a Cassandra style architecture cannot meet your requirements.

An alternative if you exhaust or don't trust other options, use a foreign data wrapper to access your own custom storage. A single table at the PG level, you can shard the data yourself into 8 bazillion separate stores, in whatever structure suites your read and write operations (maybe reusing an embedded db engine, ordered flat file+log+index, whatever).


Re: Millions of tables

From:
Greg Spiegelberg
Date:
Following list etiquette, responses inline ;)

On Mon, Sep 26, 2016 at 2:28 AM, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:


On 26/09/16 05:50, Greg Spiegelberg wrote:
Hey all,

Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has said not to have millions of tables.  I too have long believed it until recently.

AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those tables.  Table creation initially took 0.018031 secs, average 0.027467 and after tossing out outliers (qty 5) the maximum creation time found was 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.  Tables were created by a single process. Do note that table creation is done via plpgsql function as there are other housekeeping tasks necessary though minimal.

No system tuning but here is a list of PostgreSQL knobs and switches:
shared_buffers = 2GB
work_mem = 48 MB
max_stack_depth = 4 MB
synchronous_commit = off
effective_cache_size = 200 GB
pg_xlog is on it's own file system

There are some still obvious problems.  General DBA functions such as VACUUM and ANALYZE should not be done.  Each will run forever and cause much grief.  Backups are problematic in the traditional pg_dump and PITR space.  Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test case) are no-no's.  A system or database crash could take potentially hours to days to recover.  There are likely other issues ahead.

You may wonder, "why is Greg attempting such a thing?"  I looked at DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop".  I looked at many others and ultimately the recommended use of each vendor was to have one table for all data.  That overcomes the millions of tables problem, right?

Problem with the "one big table" solution is I anticipate 1,200 trillion records.  Random access is expected and the customer expects <30ms reads for a single record fetch.

No data is loaded... yet  Table and index creation only. I am interested in the opinions of all including tests I may perform.  If you had this setup, what would you capture / analyze?  I have a job running preparing data.  I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second.  The schema has been simplified since and last test reach just over 20,000 records/second with 300k tables.

I'm not looking for alternatives yet but input to my test. Takers?

I can't promise immediate feedback but will do my best to respond with results.

TIA,
-Greg

    Hi Greg.

    This is a problem (creating a large number of tables; really large indeed) that we researched in my company a while ago. You might want to read about it: https://www.pgcon.org/2013/schedule/events/595.en.html


updatedb, funny.  Thank you for the pointer.  I had no intention of going to 1B tables.

I may need to understand autovacuum better.  My impression was that it consulted statistics and performed vacuums one table at a time, based on the vacuum threshold formula at https://www.postgresql.org/docs/9.5/static/routine-vacuuming.html.
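For reference, the per-table trigger from that page is:

    vacuum threshold = autovacuum_vacuum_threshold
                       + autovacuum_vacuum_scale_factor * number of tuples

and a table is autovacuumed once its number of dead tuples exceeds that threshold.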


 -Greg

Re: Millions of tables

From:
Yves Dorfsman
Date:
Something that is not talked about at all in this thread is caching. A bunch
of memcache servers in front of the DB should be able to help with the 30ms
constraint (it doesn't have to be memcache, just some caching technology).

--
http://yves.zioup.com
gpg: 4096R/32B0F416



Re: Millions of tables

From:
"Mike Sofen"
Date:

From: Rick Otten   Sent: Monday, September 26, 2016 3:24 AM
Are the tables constantly being written to, or is this a mostly read scenario?

 

With regards to consistent query performance, I think you need to get out of AWS.  That environment is terrible if you are going for consistency unless you buy dedicated hardware, and then you are paying so much money it is ridiculous.

 

Also I think having 10M rows in a table is not a problem for the query times you are referring to.  So instead of millions of tables, unless I'm doing my math wrong, you probably only need thousands of tables.

----------

Excellent thoughts:  the read/write behavior will/should drive a lot of the design;  AWS does not guarantee consistency or latency;  and 10m rows is nothing to PG.

 

Re AWS:  we’re on it, at least for now.  In my profiling of our performance there, I consistently get low latencies…I just know that there will be random higher latencies, but the statistical average will be low.  I just ran a quick test against a modest sized table on a modest sized EC2 instance (m4.xlarge – 4 core/16gb ram, 3 tb ssd):  the table has 15m rows but is huge (it represents nearly 500m rows compressed in jsonb documents), with 5 indexed key columns and a total of 12 columns.  I queried for a single, non-PK, indexed value using “select *” (so it included the json) and it took 22ms, without the json it took 11ms.  Especially with the db/memory-optimized EC2 instances now available (with guaranteed IOPS), performance against even 100m row tables should still stay within your requirements.

 

So Rick’s point about not needing millions of tables is right on.  If there’s a way to create table “clumps”, at least you’ll have a more modest table count.

 

Mike Sofen (Synthetic Genomics)

Re: Millions of tables

From:
Greg Spiegelberg
Date:
On Mon, Sep 26, 2016 at 3:43 AM, Stuart Bishop <stuart@stuartbishop.net> wrote:
On 26 September 2016 at 11:19, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
I did look at PostgresXL and CitusDB.  Both are admirable however neither could support the need to read a random record consistently under 30ms.  It's a similar problem Cassandra and others have: network latency.  At this scale, to provide the ability to access any given record amongst trillions it is imperative to know precisely where it is stored (system & database) and read a relatively small index.  I have other requirements that prohibit use of any technology that is eventually consistent.

I liken the problem to fishing.  To find a particular fish of length, size, color &c in a data lake you must accept the possibility of scanning the entire lake.  However, if all fish were in barrels where each barrel had a particular kind of fish of specific length, size, color &c then the problem is far simpler.

-Greg

My gut tells me that if you do solve the problem and get PostgreSQL (or anything) reading consistently at under 30ms with that many tables you will have solved one problem by creating another.


Exactly why I am exploring.  What are the trade-offs?

 
You discounted Cassandra due to network latency, but are now trying a monolithic PostgreSQL setup. It might be worth trying a single node ScyllaDB or Cassandra deploy (no need for QUORUM or network overhead), perhaps using layered compaction so all your data gets broken out into 160MB chunks. And certainly wander over to the ScyllaDB mailing list, as they are very focused on performance problems like yours and should offer some insight even if a Cassandra style architecture cannot meet your requirements.


Cassandra performance, according to the experts I consulted, starts to fall off once the stored dataset exceeds ~3 TB.  Much too small for my use case.  Again, I do have other reasons for not using Cassandra and the others, namely deduplication of information referenced by my millions of tables.  There are no such guarantees in many of the options outside the RDBMS realm.

 
An alternative if you exhaust or don't trust other options, use a foreign data wrapper to access your own custom storage. A single table at the PG level, you can shard the data yourself into 8 bazillion separate stores, in whatever structure suites your read and write operations (maybe reusing an embedded db engine, ordered flat file+log+index, whatever).


However, even 8 bazillion FDWs may cause an "overflow" of relationships, at the loss of having an efficient storage engine acting more like a traffic cop.  In such a case, I would opt to put that logic in the app and access the true storage directly rather than go through FDWs.

-Greg
 

Re: Millions of tables

From:
Greg Spiegelberg
Date:
Consider the problem though.  Random access to trillions of records, with no guarantee any one will be fetched twice in a short time frame, nullifies the effectiveness of a cache unless the cache is enormous.  If such a cache were that big, hundreds of TBs, I wouldn't be looking at on-disk storage options.  :)

-Greg

On Mon, Sep 26, 2016 at 6:54 AM, Yves Dorfsman <yves@zioup.com> wrote:
Something that is not talked about at all in this thread is caching. A bunch
of memcache servers in front of the DB should be able to help with the 30ms
constraint (doesn't have to be memcache, some caching technology).

--
http://yves.zioup.com
gpg: 4096R/32B0F416




Re: Millions of tables

From:
Greg Spiegelberg
Date:
On Mon, Sep 26, 2016 at 7:05 AM, Mike Sofen <msofen@runbox.com> wrote:

From: Rick Otten   Sent: Monday, September 26, 2016 3:24 AM
Are the tables constantly being written to, or is this a mostly read scenario?

 

With regards to consistent query performance, I think you need to get out of AWS.  That environment is terrible if you are going for consistency unless you buy dedicated hardware, and then you are paying so much money it is ridiculous.

 

Also I think having 10M rows in a table is not a problem for the query times you are referring to.  So instead of millions of tables, unless I'm doing my math wrong, you probably only need thousands of tables.

----------

Excellent thoughts:  the read/write behavior will/should drive a lot of the design;  AWS does not guarantee consistency or latency;  and 10m rows is nothing to PG.

 

Re AWS:  we’re on it, at least for now.  In my profiling of our performance there, I consistently get low latencies…I just know that there will be random higher latencies, but the statistical average will be low.  I just ran a quick test against a modest sized table on a modest sized EC2 instance (m4.xlarge – 4 core/16gb ram, 3 tb ssd):  the table has 15m rows but is huge (it represents nearly 500m rows compressed in jsonb documents), with 5 indexed key columns and a total of 12 columns.  I queried for a single, non-PK, indexed value using “select *” (so it included the json) and it took 22ms, without the json it took 11ms.  Especially with the db/memory-optimized EC2 instances now available (with guaranteed IOPS), performance against even 100m row tables should still stay within your requirements.

 

So Rick’s point about not needing millions of tables is right on.  If there’s a way to create table “clumps”, at least you’ll have a more modest table count.

 


Absolutely!  The 8M tables do "belong" to a larger group, and reducing the 8M tables to ~4,000 is an option; however, the problem then becomes going from an anticipated 140k records/table to 140M-500M records/table.  I'm concerned read access times will go out the window.  It is on the docket to test.

-Greg

Re: Millions of tables

From:
Greg Spiegelberg
Date:
On Mon, Sep 26, 2016 at 4:23 AM, Rick Otten <rottenwindfish@gmail.com> wrote:
Are the tables constantly being written to, or is this a mostly read scenario?   One architecture possibility, if the writes are not so frequent,  is to create just a handful of very big tables for writing, and then make smaller tables as materialized views for reading.  The vacuum and bloat management could be done back a the big tables. The materialized views could be refreshed or replaced during non-peak hours.  The materialized views could be on a different tablespace than the root tables.  They could also be structured to reflect real-world query patterns which are sometimes different than the raw data storage engineering problem.


Like any data warehouse, I expect 90%+ of the activity to be writes, but low-latency reads are an absolute must, else the design doesn't get off the ground.

Materialized views are neat for many cases but not this one.  Current versions of the data must be available the moment after they are written.

I am considering tablespaces; however, I have no way to size them properly.  One group of tables may contain 1M records and another 100M.  They could be on the same file system, but I'd have the same problems internal to PostgreSQL, and the only thing overcome would be millions of files in a single directory, which is a file system selection problem.

 
With some logging you may be able to see that the data is not truly randomly accessed, but rather clustered around just some of the millions of tables.   Then the engineering problem becomes "How do I service 90% of the queries on these tables in 30ms ?" Rather than "How do I service 100% of the queries 100% of the time in 30ms?"  Knowing 90% of the queries hit just a few hundred tables, makes the first question easier to answer.

Similarly, if most of the columns are static and only a few columns are actually changing, you could consider pulling the static stuff out of the same table with the dynamic stuff and then look at joins in your queries.  The end goal is to be able to get solid indexes and tables that don't change a lot so they can be tightly packed and cached.  (less bloat, less fragmentation, fewer disk accesses).

With regards to consistent query performance, I think you need to get out of AWS.  That environment is terrible if you are going for consistency unless you buy dedicated hardware, and then you are paying so much money it is ridiculous.


True about AWS, and though it is possible hardware may be purchased, AWS is the right place to start: 1) AWS is not IT and won't take months to approve budget for gear+deployment, and more importantly 2) we are still in the design phase, and if this is deployed there is no way to predict true adoption, meaning it'll start small.  AWS is the right place for now.

 
Also I think having 10M rows in a table is not a problem for the query times you are referring to.  So instead of millions of tables, unless I'm doing my math wrong, you probably only need thousands of tables.



On Mon, Sep 26, 2016 at 5:43 AM, Stuart Bishop <stuart@stuartbishop.net> wrote:
On 26 September 2016 at 11:19, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
I did look at PostgresXL and CitusDB.  Both are admirable however neither could support the need to read a random record consistently under 30ms.  It's a similar problem Cassandra and others have: network latency.  At this scale, to provide the ability to access any given record amongst trillions it is imperative to know precisely where it is stored (system & database) and read a relatively small index.  I have other requirements that prohibit use of any technology that is eventually consistent.

I liken the problem to fishing.  To find a particular fish of length, size, color &c in a data lake you must accept the possibility of scanning the entire lake.  However, if all fish were in barrels where each barrel had a particular kind of fish of specific length, size, color &c then the problem is far simpler.

-Greg

My gut tells me that if you do solve the problem and get PostgreSQL (or anything) reading consistently at under 30ms with that many tables you will have solved one problem by creating another.

You discounted Cassandra due to network latency, but are now trying a monolithic PostgreSQL setup. It might be worth trying a single node ScyllaDB or Cassandra deploy (no need for QUORUM or network overhead), perhaps using layered compaction so all your data gets broken out into 160MB chunks. And certainly wander over to the ScyllaDB mailing list, as they are very focused on performance problems like yours and should offer some insight even if a Cassandra style architecture cannot meet your requirements.

An alternative if you exhaust or don't trust other options, use a foreign data wrapper to access your own custom storage. A single table at the PG level, you can shard the data yourself into 8 bazillion separate stores, in whatever structure suites your read and write operations (maybe reusing an embedded db engine, ordered flat file+log+index, whatever).



Re: Millions of tables

From:
Stuart Bishop
Date:


On 26 September 2016 at 20:51, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:

An alternative if you exhaust or don't trust other options, use a foreign data wrapper to access your own custom storage. A single table at the PG level, you can shard the data yourself into 8 bazillion separate stores, in whatever structure suites your read and write operations (maybe reusing an embedded db engine, ordered flat file+log+index, whatever).


However even 8 bazillion FDW's may cause an "overflow" of relationships at the loss of having an efficient storage engine acting more like a traffic cop.  In such a case, I would opt to put such logic in the app to directly access the true storage over using FDW's.

I mean one FDW table, which shards internally to 8 bazillion stores on disk. It has the sharding key, can calculate exactly which store(s) need to be hit, and returns the rows; to PostgreSQL it looks like one big table with 1.3 trillion rows. And if it doesn't do that in 30ms you get to blame yourself :)


--

Re: Millions of tables

From:
Greg Spiegelberg
Date:
On Sun, Sep 25, 2016 at 8:50 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
Hey all,

Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has said not to have millions of tables.  I too have long believed it until recently.

AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those tables.  Table creation initially took 0.018031 secs, average 0.027467 and after tossing out outliers (qty 5) the maximum creation time found was 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.  Tables were created by a single process.  Do note that table creation is done via plpgsql function as there are other housekeeping tasks necessary though minimal.

No system tuning but here is a list of PostgreSQL knobs and switches:
shared_buffers = 2GB
work_mem = 48 MB
max_stack_depth = 4 MB
synchronous_commit = off
effective_cache_size = 200 GB
pg_xlog is on it's own file system

There are some still obvious problems.  General DBA functions such as VACUUM and ANALYZE should not be done.  Each will run forever and cause much grief.  Backups are problematic in the traditional pg_dump and PITR space.  Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test case) are no-no's.  A system or database crash could take potentially hours to days to recover.  There are likely other issues ahead.

You may wonder, "why is Greg attempting such a thing?"  I looked at DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop".  I looked at many others and ultimately the recommended use of each vendor was to have one table for all data.  That overcomes the millions of tables problem, right?

Problem with the "one big table" solution is I anticipate 1,200 trillion records.  Random access is expected and the customer expects <30ms reads for a single record fetch.

No data is loaded... yet  Table and index creation only.  I am interested in the opinions of all including tests I may perform.  If you had this setup, what would you capture / analyze?  I have a job running preparing data.  I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second.  The schema has been simplified since and last test reach just over 20,000 records/second with 300k tables.

I'm not looking for alternatives yet but input to my test.  Takers?

I can't promise immediate feedback but will do my best to respond with results.

TIA,
-Greg

I've gotten more responses than anticipated, have answered some questions and gotten some insight, but my challenge again is: what should I capture along the way to prove or disprove this storage pattern?  Alternatives to the storage pattern aside, I need ideas for the test rig, metrics to capture, and suggestions for tuning it.

In the next 24 hours, I will be sending ~1 trillion records to the test database.  Because of the setup time involved, I'd rather have things set up properly on the first go.

Thanks!
-Greg 

Re: Millions of tables

От
Jeff Janes
Дата:
On Mon, Sep 26, 2016 at 5:53 AM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:


I may need to understand autovacuum better.  My impression was it consulted statistics and performed vacuums one table at a time based on the vacuum threshold formula on  https://www.postgresql.org/docs/9.5/static/routine-vacuuming.html.  

A problem is that those statistics are stored in one file (per database; it used to be one file per cluster).  With 8 million tables, that is going to be a pretty big file.  But the code pretty much assumes the file is going to be pretty small, and so it has no compunction about commanding that it be read and written, in its entirety, quite often.
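
For a rough sense of the size involved, the per-database stats file can be checked from SQL (superuser only; this assumes the default stats_temp_directory of 'pg_stat_tmp' and the db_<oid>.stat naming convention, so adjust if your install differs):

SELECT d.datname,
       (pg_stat_file('pg_stat_tmp/db_' || d.oid || '.stat')).size AS stats_file_bytes
FROM pg_database d
WHERE d.datname = current_database();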

Cheers,

Jeff

Re: Millions of tables

От
Tom Lane
Дата:
Jeff Janes <jeff.janes@gmail.com> writes:
> A problem is that those statistics are stored in one file (per database; it
> used to be one file per cluster).  With 8 million tables, that is going to
> be a pretty big file.  But the code pretty much assumes the file is going
> to be pretty small, and so it has no compunction about commanding that it
> be read and written, in its entirety, quite often.

I don't know that anyone ever believed it would be small.  But at the
time the pgstats code was written, there was no good alternative to
passing the data through files.  (And I'm not sure we envisioned
applications that would be demanding fresh data constantly, anyway.)

Now that the DSM stuff exists and has been more or less shaken out,
I wonder how practical it'd be to use a DSM segment to make the stats
collector's data available to backends.  You'd need a workaround for
the fact that not all the DSM implementations support resize (although
given the lack of callers of dsm_resize, one could be forgiven for
wondering whether any of that code has been tested at all).  But you
could imagine abandoning one DSM segment and creating a new one of
double the size anytime the hash tables got too big.

            regards, tom lane


Re: Millions of tables

От
Craig James
Дата:
On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
Hey all,

Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has said not to have millions of tables.  I too have long believed it until recently.

AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those tables.  Table creation initially took 0.018031 secs, average 0.027467 and after tossing out outliers (qty 5) the maximum creation time found was 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.  Tables were created by a single process.  Do note that table creation is done via plpgsql function as there are other housekeeping tasks necessary though minimal.

No system tuning but here is a list of PostgreSQL knobs and switches:
shared_buffers = 2GB
work_mem = 48 MB
max_stack_depth = 4 MB
synchronous_commit = off
effective_cache_size = 200 GB
pg_xlog is on it's own file system

There are some still obvious problems.  General DBA functions such as VACUUM and ANALYZE should not be done.  Each will run forever and cause much grief.  Backups are problematic in the traditional pg_dump and PITR space.  Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test case) are no-no's.  A system or database crash could take potentially hours to days to recover.  There are likely other issues ahead.

You may wonder, "why is Greg attempting such a thing?"  I looked at DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop".  I looked at many others and ultimately the recommended use of each vendor was to have one table for all data.  That overcomes the millions of tables problem, right?

Problem with the "one big table" solution is I anticipate 1,200 trillion records.  Random access is expected and the customer expects <30ms reads for a single record fetch.

You don't give enough details to fully explain the problem you're trying to solve.
  • Will records ever be updated or deleted? If so, what percentage and at what frequency?
  • What specifically are you storing (e.g. list of integers, strings, people's sex habits, ...)? Or more importantly, are these fixed- or variable-sized records?
  • Once the 1,200 trillion records are loaded, is that it? Or do more data arrive, and if so, at what rate?
  • Do your queries change, or is there a fixed set of queries?
  • How complex are the joins?
The reason I ask these specific questions is because, as others have pointed out, this might be a perfect case for a custom (non-relational) database. Relational databases are general-purpose tools, sort of like a Swiss-Army knife. A Swiss-Army knife does most things passably, but if you want to carve wood, or butcher meat, or slice vegetables, you get a knife meant for that specific task.

I've written several custom database-storage systems for very specific high-performance applications. It's generally a couple weeks of work, and you end up with tailored performance and storage that's hard for a general-purpose relational system to match.

The difficulty of building such a system depends a lot on the answers to the questions above.

Craig


No data is loaded... yet  Table and index creation only.  I am interested in the opinions of all including tests I may perform.  If you had this setup, what would you capture / analyze?  I have a job running preparing data.  I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second.  The schema has been simplified since and last test reach just over 20,000 records/second with 300k tables.

I'm not looking for alternatives yet but input to my test.  Takers?

I can't promise immediate feedback but will do my best to respond with results.

TIA,
-Greg



--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------

Re: Millions of tables

От
"Mike Sofen"
Дата:

 

 

From: Greg Spiegelberg   Sent: Monday, September 26, 2016 7:25 AM
I've gotten more responses than anticipated and have answered some questions and gotten some insight but my challenge again is what should I capture along the way to prove or disprove this storage pattern?  Alternatives to the storage pattern aside, I need ideas to test rig, capture metrics and suggestions to tune it.

 

In the next 24 hours, I will be sending ~1 trillion records to the test database.  Because of time to set up, I'd rather have things set up properly the first go.

 

Thanks!

-Greg 

---------------------

Greg, I ran another quick test on a wider table than you’ve described, but this time with 80 million rows, with core counts, ram and ssd storage similar to what you’d have on that AWS EC2 instance.  This table had 7 columns (3 integers, 3 text, 1 timestamptz) with an average width of 157 chars, one btree index on the pk int column.  Using explain analyze, I picked one id value out of the 80m and ran a select * where id = x.  It did an index scan, had a planning time of 0.077ms, and an execution time of 0.254 seconds.  I ran the query for a variety of widely spaced values (so the data was uncached) and the timing never changed. This has been mirroring my general experience with PG – very fast reads on indexed queries.
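
For reference, the test query was essentially of this shape (table name and id value are placeholders, not the actual test objects):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM test_wide WHERE id = 31415926;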

 

Summary:  I think your buckets can be WAY bigger than you are envisioning for the simple table design you’ve described.  I’m betting you can easily do 500 million rows per bucket before approaching anything close to the 30ms max query time.

 

Mike Sofen (Synthetic Genomics)

Re: Millions of tables

От
"Mike Sofen"
Дата:

From: Mike Sofen   Sent: Tuesday, September 27, 2016 8:10 AM

From: Greg Spiegelberg   Sent: Monday, September 26, 2016 7:25 AM
I've gotten more responses than anticipated and have answered some questions and gotten some insight but my challenge again is what should I capture along the way to prove or disprove this storage pattern?  Alternatives to the storage pattern aside, I need ideas to test rig, capture metrics and suggestions to tune it.

 

In the next 24 hours, I will be sending ~1 trillion records to the test database.  Because of time to set up, I'd rather have things set up properly the first go.

 

Thanks!

-Greg 

---------------------

Greg, I ran another quick test on a wider table than you’ve described, but this time with 80 million rows, with core counts, ram and ssd storage similar to what you’d have on that AWS EC2 instance.  This table had 7 columns (3 integers, 3 text, 1 timestamptz) with an average width of 157 chars, one btree index on the pk int column.  Using explain analyze, I picked one id value out of the 80m and ran a select * where id = x.  It did an index scan, had a planning time of 0.077ms, and an execution time of 0.254 seconds.  I ran the query for a variety of widely spaced values (so the data was uncached) and the timing never changed. This has been mirroring my general experience with PG – very fast reads on indexed queries.

 

Summary:  I think your buckets can be WAY bigger than you are envisioning for the simple table design you’ve described.  I’m betting you can easily do 500 million rows per bucket before approaching anything close to the 30ms max query time.

 

Mike Sofen (Synthetic Genomics)

 

Totally typo’d the execution time:  it was 0.254 MILLISECONDS, not SECONDS.  Hence my comment about being able to go up 10x in bucket size rather than appearing to be right at the limit.  Sorry!

 

Mike

Re: Millions of tables

От
Greg Spiegelberg
Дата:
On Tue, Sep 27, 2016 at 8:30 AM, Craig James <cjames@emolecules.com> wrote:
On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
Hey all,

Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has said not to have millions of tables.  I too have long believed it until recently.

AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those tables.  Table creation initially took 0.018031 secs, average 0.027467 and after tossing out outliers (qty 5) the maximum creation time found was 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.  Tables were created by a single process.  Do note that table creation is done via plpgsql function as there are other housekeeping tasks necessary though minimal.

No system tuning but here is a list of PostgreSQL knobs and switches:
shared_buffers = 2GB
work_mem = 48 MB
max_stack_depth = 4 MB
synchronous_commit = off
effective_cache_size = 200 GB
pg_xlog is on it's own file system

There are some still obvious problems.  General DBA functions such as VACUUM and ANALYZE should not be done.  Each will run forever and cause much grief.  Backups are problematic in the traditional pg_dump and PITR space.  Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test case) are no-no's.  A system or database crash could take potentially hours to days to recover.  There are likely other issues ahead.

You may wonder, "why is Greg attempting such a thing?"  I looked at DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop".  I looked at many others and ultimately the recommended use of each vendor was to have one table for all data.  That overcomes the millions of tables problem, right?

Problem with the "one big table" solution is I anticipate 1,200 trillion records.  Random access is expected and the customer expects <30ms reads for a single record fetch.

You don't give enough details to fully explain the problem you're trying to solve.
  • Will records ever be updated or deleted? If so, what percentage and at what frequency?
  • What specifically are you storing (e.g. list of integers, strings, people's sex habits, ...)? Or more importantly, are these fixed- or variable-sized records?
  • Once the 1,200 trillion records are loaded, is that it? Or do more data arrive, and if so, at what rate?
  • Do your queries change, or is there a fixed set of queries?
  • How complex are the joins?
Excellent questions.

1a. Half of the 4M tables will contain ~140k records and UPDATE's will occur on roughly 100 records/day/table.  No DELETE's on this first half.
1b. The second half of the 4M tables will contain ~200k records.  Zero UPDATE's; however, DELETE's will occur on ~100 records/day/table.

2. All 4M tables contain 7 columns: (4) bigints, (2) timestamptz and (1) boolean.  2M of the tables will have a PKEY on (1) bigint column only.  The second 2M tables have a PKEY on (bigint, timestamptz) plus two additional indexes on different (bigint, timestamptz) column pairs.  (A rough sketch of both shapes follows after #5 below.)

3. The trillions-of-records load is just to push the system and find its maximum load capability.  In reality, 200M records/day, or ~2,300/second on average, is the expectation once in production.

4. Queries are fixed and match the indexes laid down on the tables.  Goal is <30ms/query.  I have attempted queries with and without indexes.  Without indexes the average query response varied between 20ms and 40ms whereas indexes respond within a much tighter range of 5ms to 9ms.  Both query performance tests were done during data-ingest.

5. Zero JOIN's, and I won't ever let them happen.  However, the 4M tables INHERIT from a data-grouping table; the test rig limits child tables to 1,000 per parent.  This was done to explore some other possible access patterns, but they are secondary.  If it doesn't work, then either a) the requirement will be dropped, or b) I may look at storing the data from the 1,000 child tables directly in the parent table, in which case I'll need to re-run the load & read tests.
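
For reference, a minimal sketch of the two table shapes described in #2 above (names and exact column roles are placeholders, and the INHERITS clause mentioned in #5 is omitted):

-- bucket type 1: PK on a single bigint
CREATE TABLE bucket_a_000001 (
    id1        bigint PRIMARY KEY,
    id2        bigint,
    id3        bigint,
    id4        bigint,
    created_at timestamptz,
    updated_at timestamptz,
    active     boolean
);

-- bucket type 2: PK on (bigint, timestamptz) plus two secondary indexes
CREATE TABLE bucket_b_000001 (
    id1        bigint,
    id2        bigint,
    id3        bigint,
    id4        bigint,
    created_at timestamptz,
    updated_at timestamptz,
    active     boolean,
    PRIMARY KEY (id1, created_at)
);
CREATE INDEX ON bucket_b_000001 (id2, created_at);
CREATE INDEX ON bucket_b_000001 (id3, updated_at);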

 
The reason I ask these specific questions is because, as others have pointed out, this might be a perfect case for a custom (non-relational) database. Relational databases are general-purpose tools, sort of like a Swiss-Army knife. A Swiss-Army knife does most things passably, but if you want to carve wood, or butcher meat, or slice vegetables, you get a knife meant for that specific task.


I'm at a sizing phase.  If 4M tables works I'll attempt 16M tables.  If it points to only 2M or 1M then that's fine.  The 4M-table database is only a single cog in the storage service design.  I anticipate ~40 of these databases, but that depends upon how many tables work in a single instance.

The 4M tables are strict relationship tables referencing two other tables containing a JSONB column in each.  The PostgreSQL JSONB function and operator set facilitates current needs beautifully and leaves much room for future use cases.  Using other technologies really limits the query search and storage capabilities.

I've explored many other technologies, and the possibility of using PostgreSQL for just the 2 tables with JSONB while keeping the relationships elsewhere, but I foresee too many complexities and possible problems.  I am confident in PostgreSQL and the implementation but, as I said, I need to understand the size limits.


I mentioned the 2 tables with JSONB so I'll elaborate a little more on query patterns.  Every query performs 3 SELECT's (a sketch follows below):
1. SELECT on JSONB table #1 (~140k records total) searching for records matching a JSONB literal (most common use) or pattern.  Returns id1.
2. SELECT on the known table (one of the 4M) using id1 from step 1.  Returns id2.
3. SELECT on JSONB table #2 (~500k to 90M records) searching for the record matching id2 returned in step 2.
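
A minimal sketch of that pattern, with all table, column and literal values as placeholders:

-- step 1: find id1 by JSONB containment in the first JSONB table
SELECT id1 FROM jsonb_table_1 WHERE doc @> '{"some_key": "some_value"}';

-- step 2: look up id2 in the relationship bucket chosen from id1
SELECT id2 FROM bucket_a_000001 WHERE id1 = 42;

-- step 3: fetch the matching record from the second JSONB table
SELECT doc FROM jsonb_table_2 WHERE id2 = 42;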


 
I've written several custom database-storage systems for very specific high-performance systems. It's generally a couple weeks of work, and you have a tailored performance and storage that's hard for a general-purpose relational system to match.


Yeah, we're kinda beyond the write-it-yourself because of the need to maintain-it-yourself.  :)


Hope some of these answers helped.

-Greg

 
The difficulty of building such a system depends a lot on the answers to the questions above.

Craig


No data is loaded... yet  Table and index creation only.  I am interested in the opinions of all including tests I may perform.  If you had this setup, what would you capture / analyze?  I have a job running preparing data.  I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second.  The schema has been simplified since and last test reach just over 20,000 records/second with 300k tables.

I'm not looking for alternatives yet but input to my test.  Takers?

I can't promise immediate feedback but will do my best to respond with results.

TIA,
-Greg



--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------

Re: Millions of tables

От
Greg Spiegelberg
Дата:
On Tue, Sep 27, 2016 at 9:42 AM, Mike Sofen <msofen@runbox.com> wrote:

From: Mike Sofen   Sent: Tuesday, September 27, 2016 8:10 AM

From: Greg Spiegelberg   Sent: Monday, September 26, 2016 7:25 AM
I've gotten more responses than anticipated and have answered some questions and gotten some insight but my challenge again is what should I capture along the way to prove or disprove this storage pattern?  Alternatives to the storage pattern aside, I need ideas to test rig, capture metrics and suggestions to tune it.

 

In the next 24 hours, I will be sending ~1 trillion records to the test database.  Because of time to set up, I'd rather have things set up properly the first go.

 

Thanks!

-Greg 

---------------------

Greg, I ran another quick test on a wider table than you’ve described, but this time with 80 million rows, with core counts, ram and ssd storage similar to what you’d have on that AWS EC2 instance.  This table had 7 columns (3 integers, 3 text, 1 timestamptz) with an average width of 157 chars, one btree index on the pk int column.  Using explain analyze, I picked one id value out of the 80m and ran a select * where id = x.  It did an index scan, had a planning time of 0.077ms, and an execution time of 0.254 seconds.  I ran the query for a variety of widely spaced values (so the data was uncached) and the timing never changed. This has been mirroring my general experience with PG – very fast reads on indexed queries.

 

Summary:  I think your buckets can be WAY bigger than you are envisioning for the simple table design you’ve described.  I’m betting you can easily do 500 million rows per bucket before approaching anything close to the 30ms max query time.

 

Mike Sofen (Synthetic Genomics)

 

Totally typo’d the execution time:  it was 0.254 MILLISECONDS, not SECONDS.  Thus my comment about going up 10x in bucket size instead of appearing to be right at the limit.  Sorry!



I figured.  :)

Haven't ruled it out, but the expectation for this implementation is to perform at worst 3X slower than memcache or Redis.

Bigger buckets mean a wider possibility of response times.  Some buckets may contain 140k records and some 100X more.

-Greg
 

Re: Millions of tables

От
Terry Schmitt
Дата:


On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
Hey all,

Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has said not to have millions of tables.  I too have long believed it until recently.

AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those tables.  Table creation initially took 0.018031 secs, average 0.027467 and after tossing out outliers (qty 5) the maximum creation time found was 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.  Tables were created by a single process.  Do note that table creation is done via plpgsql function as there are other housekeeping tasks necessary though minimal.

No system tuning but here is a list of PostgreSQL knobs and switches:
shared_buffers = 2GB
work_mem = 48 MB
max_stack_depth = 4 MB
synchronous_commit = off
effective_cache_size = 200 GB
pg_xlog is on it's own file system

There are some still obvious problems.  General DBA functions such as VACUUM and ANALYZE should not be done.  Each will run forever and cause much grief.  Backups are problematic in the traditional pg_dump and PITR space.  Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test case) are no-no's.  A system or database crash could take potentially hours to days to recover.  There are likely other issues ahead.

You may wonder, "why is Greg attempting such a thing?"  I looked at DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop".  I looked at many others and ultimately the recommended use of each vendor was to have one table for all data.  That overcomes the millions of tables problem, right?

Problem with the "one big table" solution is I anticipate 1,200 trillion records.  Random access is expected and the customer expects <30ms reads for a single record fetch.

No data is loaded... yet  Table and index creation only.  I am interested in the opinions of all including tests I may perform.  If you had this setup, what would you capture / analyze?  I have a job running preparing data.  I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second.  The schema has been simplified since and last test reach just over 20,000 records/second with 300k tables.

I'm not looking for alternatives yet but input to my test.  Takers?

I can't promise immediate feedback but will do my best to respond with results.

TIA,
-Greg

I have not seen transaction ID wraparound mentioned in this thread yet. With the numbers that you are looking at, I could see this being a major issue.

T

Re: Millions of tables

От
Greg Spiegelberg
Дата:
On Tue, Sep 27, 2016 at 10:15 AM, Terry Schmitt <tschmitt@schmittworks.com> wrote:


On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
Hey all,

Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has said not to have millions of tables.  I too have long believed it until recently.

AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those tables.  Table creation initially took 0.018031 secs, average 0.027467 and after tossing out outliers (qty 5) the maximum creation time found was 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.  Tables were created by a single process.  Do note that table creation is done via plpgsql function as there are other housekeeping tasks necessary though minimal.

No system tuning but here is a list of PostgreSQL knobs and switches:
shared_buffers = 2GB
work_mem = 48 MB
max_stack_depth = 4 MB
synchronous_commit = off
effective_cache_size = 200 GB
pg_xlog is on it's own file system

There are some still obvious problems.  General DBA functions such as VACUUM and ANALYZE should not be done.  Each will run forever and cause much grief.  Backups are problematic in the traditional pg_dump and PITR space.  Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test case) are no-no's.  A system or database crash could take potentially hours to days to recover.  There are likely other issues ahead.

You may wonder, "why is Greg attempting such a thing?"  I looked at DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop".  I looked at many others and ultimately the recommended use of each vendor was to have one table for all data.  That overcomes the millions of tables problem, right?

Problem with the "one big table" solution is I anticipate 1,200 trillion records.  Random access is expected and the customer expects <30ms reads for a single record fetch.

No data is loaded... yet  Table and index creation only.  I am interested in the opinions of all including tests I may perform.  If you had this setup, what would you capture / analyze?  I have a job running preparing data.  I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second.  The schema has been simplified since and last test reach just over 20,000 records/second with 300k tables.

I'm not looking for alternatives yet but input to my test.  Takers?

I can't promise immediate feedback but will do my best to respond with results.

TIA,
-Greg

I have not seen any mention of transaction ID wraparound mentioned in this thread yet. With the numbers that you are looking at, I could see this as a major issue.

T

Thank you Terry.  You get the gold star.  :)   I was waiting for that to come up.

Success means handling this condition.  A whole database vacuum and dump-restore is out of the question.  Can a properly tuned autovacuum prevent the situation?

-Greg

Re: Millions of tables

От
Vitalii Tymchyshyn
Дата:
Have you considered having many databases (e.g. 100) and possibly many PostgreSQL servers (e.g. 10) started on different ports?
This would give you 1,000x fewer tables per db.

Re: Millions of tables

От
Richard Albright
Дата:
If going that route, why not just use plproxy?

On Wed, Sep 28, 2016 at 11:39 AM, Vitalii Tymchyshyn <vit@tym.im> wrote:
Have you considered having many databases (e.g. 100) and possibly many postgresql servers (e.g. 10) started on different ports?
This would give you 1000x less tables per db.


Re: Millions of tables

От
Greg Spiegelberg
Дата:
On Wed, Sep 28, 2016 at 9:39 AM, Vitalii Tymchyshyn <vit@tym.im> wrote:
Have you considered having many databases (e.g. 100) and possibly many postgresql servers (e.g. 10) started on different ports?
This would give you 1000x less tables per db.

The system design already allows for many database servers.  40 is okay, 100 isn't terrible but if it's thousands then operations might lynch me.

-Greg

Re: Millions of tables

От
Stephen Frost
Дата:
Greg,

* Greg Spiegelberg (gspiegelberg@gmail.com) wrote:
> Bigger buckets mean a wider possibility of response times.  Some buckets
> may contain 140k records and some 100X more.

Have you analyzed the depth of the btree indexes to see how many more
pages need to be read to handle finding a row in 140k records vs. 14M
records vs. 140M records?

I suspect you'd find that the change in actual depth (meaning how many
pages have to actually be read to find the row you're looking for) isn't
very much and that your concern over the "wider possibility of response
times" isn't well founded.

Since you have a hard-set 30ms maximum for query response time, I would
suggest you work out how long it takes to read a cold page from your I/O
subsystem and then you can work through exactly how many page reads
could be done in that 30ms (or perhaps 20ms, to allow for whatever
overhead there will be in the rest of the system and as a buffer) and
then work that back to how deep the index can be based on that many page
reads and then how many records are required to create an index of that
depth.  Of course, the page from the heap will also need to be read and
there's a bit of additional work to be done, but the disk i/o for cold
pages is almost certainly where most time will be spent.
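
As a back-of-envelope illustration, with every number below an assumption
rather than a measurement of this rig:

  ~0.5-1 ms per cold random read on io1 EBS
  30 ms budget - ~10 ms overhead      =>  roughly 20-40 cold page reads
  3-4 btree levels + 1 heap page      =>  ~4-5 reads per single-record fetch
  a 3-4 level btree on bigint keys    =>  tens of millions to billions of rows

In other words, the latency budget leaves a lot of headroom before index
depth becomes the limiting factor.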

I suspect you'll discover that millions of tables is a couple orders of
magnitude off of how many you'd need to keep the number of page reads
below the threshold you work out based on your I/O.

Of course, you would need a consistent I/O subsystem, or at least one
where you know the maximum possible latency to pull a cold page.

Lastly, you'll want to figure out how to handle system crash/restart if
this system requires a high uptime.  I expect you'd want to have at
least one replica and a setup which allows you to flip traffic to it
very quickly to maintain the 30ms response times.

Thanks!

Stephen


Re: Millions of tables

От
Greg Spiegelberg
Дата:
On Wed, Sep 28, 2016 at 11:27 AM, Stephen Frost <sfrost@snowman.net> wrote:
Greg,

* Greg Spiegelberg (gspiegelberg@gmail.com) wrote:
> Bigger buckets mean a wider possibility of response times.  Some buckets
> may contain 140k records and some 100X more.

Have you analyzed the depth of the btree indexes to see how many more
pages need to be read to handle finding a row in 140k records vs. 14M
records vs. 140M records?

I suspect you'd find that the change in actual depth (meaning how many
pages have to actually be read to find the row you're looking for) isn't
very much and that your concern over the "wider possibility of response
times" isn't well founded.


Stephen,
Excellent feedback!   Um, how does one look at btree depth in PostgreSQL?  In Oracle I know how, but I have not done the same in PG.  Pointers?

 
Since you have a hard-set 30ms maximum for query response time, I would
suggest you work out how long it takes to read a cold page from your I/O
subsystem and then you can work through exactly how many page reads
could be done in that 30ms (or perhaps 20ms, to allow for whatever
overhead there will be in the rest of the system and as a buffer) and
then work that back to how deep the index can be based on that many page
reads and then how many records are required to create an index of that
depth.  Of course, the page from the heap will also need to be read and
there's a bit of additional work to be done, but the disk i/o for cold
pages is almost certainly where most time will be spent.

I suspect you'll discover that millions of tables is a couple orders of
magnitude off of how many you'd need to keep the number of page reads
below the threshold you work out based on your I/O.

Of course, you would need a consistent I/O subsystem, or at least one
where you know the maximum possible latency to pull a cold page.

Lastly, you'll want to figure out how to handle system crash/restart if
this system requires a high uptime.  I expect you'd want to have at
least one replica and a setup which allows you to flip traffic to it
very quickly to maintain the 30ms response times.

I'm replicating via messaging.  PG replication is fine for smaller db's but I don't trust networks and PG upgrade intricacies complicate matters.

-Greg
 

Re: Millions of tables

От
Stephen Frost
Дата:
Greg,

* Greg Spiegelberg (gspiegelberg@gmail.com) wrote:
> On Wed, Sep 28, 2016 at 11:27 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > * Greg Spiegelberg (gspiegelberg@gmail.com) wrote:
> > > Bigger buckets mean a wider possibility of response times.  Some buckets
> > > may contain 140k records and some 100X more.
> >
> > Have you analyzed the depth of the btree indexes to see how many more
> > pages need to be read to handle finding a row in 140k records vs. 14M
> > records vs. 140M records?
> >
> > I suspect you'd find that the change in actual depth (meaning how many
> > pages have to actually be read to find the row you're looking for) isn't
> > very much and that your concern over the "wider possibility of response
> > times" isn't well founded.

> Excellent feedback!   Um, how does one look at tree depth in PostgreSQL?
> Oracle I know but have not done the same in PG.  Pointers?

CREATE EXTENSION pageinspect;

SELECT * FROM bt_metap('indexname');

https://www.postgresql.org/docs/9.5/static/pageinspect.html
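
For example (the index name is a placeholder; "level" is the height of
the tree above the leaf pages, so level = 2 means root -> internal ->
leaf, i.e. roughly 3 index page reads plus 1 heap page per fetch):

SELECT level FROM bt_metap('bucket_b_000001_pkey');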

Thanks!

Stephen


Re: Millions of tables

От
Simon Riggs
Дата:
On 26 September 2016 at 05:19, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
> I did look at PostgresXL and CitusDB.  Both are admirable however neither
> could support the need to read a random record consistently under 30ms.
> It's a similar problem Cassandra and others have: network latency.  At this
> scale, to provide the ability to access any given record amongst trillions
> it is imperative to know precisely where it is stored (system & database)
> and read a relatively small index.  I have other requirements that prohibit
> use of any technology that is eventually consistent.

Then XL is exactly what you need, since it does allow you to calculate
exactly where the record is via hash and then access it, which makes
the request just a single datanode task.

XL is not the same as CitusDB.

> I liken the problem to fishing.  To find a particular fish of length, size,
> color &c in a data lake you must accept the possibility of scanning the
> entire lake.  However, if all fish were in barrels where each barrel had a
> particular kind of fish of specific length, size, color &c then the problem
> is far simpler.

The task of putting the fish in the appropriate barrel is quite hard.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Millions of tables

От
"Alex Ignatov \(postgrespro\)"
Дата:

 

 

 

From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Greg Spiegelberg
Sent: Tuesday, September 27, 2016 7:28 PM
To: Terry Schmitt <tschmitt@schmittworks.com>
Cc: pgsql-performa. <pgsql-performance@postgresql.org>
Subject: Re: [PERFORM] Millions of tables

 

On Tue, Sep 27, 2016 at 10:15 AM, Terry Schmitt <tschmitt@schmittworks.com> wrote:

 

 

On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:

Hey all,

 

Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has said not to have millions of tables.  I too have long believed it until recently.

 

AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those tables.  Table creation initially took 0.018031 secs, average 0.027467 and after tossing out outliers (qty 5) the maximum creation time found was 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.  Tables were created by a single process.  Do note that table creation is done via plpgsql function as there are other housekeeping tasks necessary though minimal.

 

No system tuning but here is a list of PostgreSQL knobs and switches:

shared_buffers = 2GB

work_mem = 48 MB

max_stack_depth = 4 MB

synchronous_commit = off

effective_cache_size = 200 GB

pg_xlog is on it's own file system

 

There are some still obvious problems.  General DBA functions such as VACUUM and ANALYZE should not be done.  Each will run forever and cause much grief.  Backups are problematic in the traditional pg_dump and PITR space.  Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test case) are no-no's.  A system or database crash could take potentially hours to days to recover.  There are likely other issues ahead.

 

You may wonder, "why is Greg attempting such a thing?"  I looked at DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop".  I looked at many others and ultimately the recommended use of each vendor was to have one table for all data.  That overcomes the millions of tables problem, right?

 

Problem with the "one big table" solution is I anticipate 1,200 trillion records.  Random access is expected and the customer expects <30ms reads for a single record fetch.

 

No data is loaded... yet  Table and index creation only.  I am interested in the opinions of all including tests I may perform.  If you had this setup, what would you capture / analyze?  I have a job running preparing data.  I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second.  The schema has been simplified since and last test reach just over 20,000 records/second with 300k tables.

 

I'm not looking for alternatives yet but input to my test.  Takers?

 

I can't promise immediate feedback but will do my best to respond with results.

 

TIA,

-Greg

 

I have not seen any mention of transaction ID wraparound mentioned in this thread yet. With the numbers that you are looking at, I could see this as a major issue.

 

T

 

Thank you Terry.  You get the gold star.  :)   I was waiting for that to come up.

 

Success means handling this condition.  A whole database vacuum and dump-restore is out of the question.  Can a properly tuned autovacuum prevent the situation?

 

-Greg

 

Hi!

With millions of tables you have to set autovacuum_max_workers sky-high =).  We have seen situations where, with thousands of tables, autovacuum can't vacuum all the tables that need it: it simply vacuums some of the most-modified tables and never reaches the others.  Only a manual vacuum helps in that situation.  Combined with the wraparound issue it can be a nightmare.
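
For reference, these are the sort of knobs involved; the values below are illustrative only, not a recommendation for this workload:

autovacuum_max_workers = 10            # default is 3
autovacuum_naptime = 10s               # wake workers more often with this many tables
autovacuum_vacuum_cost_limit = 2000    # let each worker do more work per round
autovacuum_freeze_max_age = 200000000  # default; lowering it spreads anti-wraparound work out
maintenance_work_mem = 1GB             # used by vacuum unless autovacuum_work_mem is set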

 

--

Alex Ignatov
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company

 

Re: Millions of tables

От
Jim Nasby
Дата:
On 9/29/16 6:11 AM, Alex Ignatov (postgrespro) wrote:
> With millions of tables you have to set    autovacuum_max_workers
>  sky-high =). We have some situation when at thousands of tables
> autovacuum can’t vacuum all tables that need it. Simply it vacuums some
> of most modified table and never reach others. Only manual vacuum can
> help with this situation. With wraparound issue it can be a nightmare

Specifically, autovac isn't going to start worrying about anti-wrap
vacuums until tables start hitting autovacuum_freeze_max_age (or
autovacuum_multixact_freeze_max_age). Any tables that hit that threshold
go to the front of the line for being vacuumed. (But keep in mind that
there is no universal line, just what each worker computes on its own
when it's started).

Where things will completely fall apart for you is if a lot of tables
all have roughly the same relfrozenxid (or relminmxid), like they would
immediately after a large load. In that scenario you'll suddenly have
loads of work for autovac to do, all at the same time. That will make
the database, DBAs and you Very Unhappy (tm).

Somehow, some way, you *must* do a vacuum of the entire database.
Luckily the freeze map in 9.6 means you'd only have to do that one time
(assuming the data really is static). In any older version, (auto)vacuum
will need to eventually *read everything in every table* at least once
every ~2B transactions.
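
A query along these lines against the standard catalogs (nothing here is
specific to your schema) shows which tables are closest to that threshold:

SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind IN ('r', 't')   -- plain tables and their toast tables
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;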

The only impact the number of tables is going to have on this is
granularity. If you have a small number of large tables, you'll have
(auto)vacuum processes that will need to run *uninterrupted* for a long
time to move the freeze threshold on each table. If you have tons of
small tables, you'll need tons of separate (auto)vacuums, but each one
will run for a shorter interval, and if one gets interrupted it won't be
as big a deal.

There is one potentially significant difference between autovac and
manual vacuums here; autovac treats toast tables as just another table,
with their own stats and their own freeze needs. If you're generating a
lot of toast records that might make a difference.

When it comes to vacuum, you might find
https://www.pgcon.org/2015/schedule/events/829.en.html useful.

On a different topic... I didn't see anything in the thread about what
you're storing, but with the row counts you're talking about I'm
guessing it's something that's time-series.
https://github.com/ElephantStack/ElephantStack is a project exploring
the idea of using Postgres array types as a far more efficient way to
store that kind of data; instead of an overhead of 24 bytes per row
(plus indexes) arrays give you essentially zero overhead per row.
There's no code yet, but a few of us have done testing on some real
world data (see the google group referenced from the README).
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


Re: Millions of tables

От
Jeff Janes
Дата:
On Thu, Sep 29, 2016 at 4:11 AM, Alex Ignatov (postgrespro) <a.ignatov@postgrespro.ru> wrote:

 

From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of  

Thank you Terry.  You get the gold star.  :)   I was waiting for that to come up.

 

Success means handling this condition.  A whole database vacuum and dump-restore is out of the question.  Can a properly tuned autovacuum prevent the situation?

 

-Greg

 

Hi!

With millions of tables you have to set    autovacuum_max_workers  sky-high =). We have some situation when at thousands of tables autovacuum can’t vacuum all tables that need it. Simply it vacuums some of most modified table and never reach others.


Any autovacuum worker should vacuum all tables in its assigned database which it perceives need vacuuming, as long as it can get the lock.  Unless the worker is interrupted, for example by frequent database shutdowns, it should reach all tables in that database before it exits.  Unless there is a bug, or you are constantly restarting the database before autovacuum can finish or doing something else to kill them off, what you describe should not happen.

If it is a bug, we should fix it.  Can you give more details?

There is a known bug when you have multiple active databases in the same cluster.  Once one database reaches the age where anti-wraparound vacuums kick in, all future autovacuum workers are directed to that one database, starving all other databases of autovacuuming.  But that doesn't sound like what you are describing.

Cheers,

Jeff

Re: Millions of tables

От
Greg Spiegelberg
Дата:
On Fri, Sep 30, 2016 at 4:49 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 9/29/16 6:11 AM, Alex Ignatov (postgrespro) wrote:
With millions of tables you have to set    autovacuum_max_workers
 sky-high =). We have some situation when at thousands of tables
autovacuum can’t vacuum all tables that need it. Simply it vacuums some
of most modified table and never reach others. Only manual vacuum can
help with this situation. With wraparound issue it can be a nightmare

Specifically, autovac isn't going to start worrying about anti-wrap vacuums until tables start hitting autovacuum_freeze_max_age (or autovacuum_multixact_freeze_max_age). Any tables that hit that threshold go to the front of the line for being vacuumed. (But keep in mind that there is no universal line, just what each worker computes on it's own when it's started).

Where things will completely fall apart for you is if a lot of tables all have roughly the same relfrozenxid (or relminmxid), like they would immediately after a large load. In that scenario you'll suddenly have loads of work for autovac to do, all at the same time. That will make the database, DBAs and you Very Unhappy (tm).

Somehow, some way, you *must* do a vacuum of the entire database. Luckily the freeze map in 9.6 means you'd only have to do that one time (assuming the data really is static). In any older version, (auto)vacuum will need to eventually *read everything in every table* at least once every ~2B transactions.

Data is not static.  The 4M tables fall into one of two groups.

Group A contains 2M tables.  INSERT will occur ~100 times/day and the maximum number of records anticipated is 200k.  Periodic DELETE's will occur, removing "old" records.  Age is something the client sets, and I have no way of saying whether 1 or 10k records will be removed.

Group B contains the other 2M tables.  Maximum records ~140k, and UPSERT will be the only mechanism used to populate and maintain them.  Periodic DELETE's may run on these tables as well, removing "old" records.

Will a set of tables require vacuum'ing at the same time?  Quite possibly, but I have no way to say whether 2 or 200k tables will need it.


When you say "must do a vacuum of the entire database", are you saying the entire database must be vacuum'd as a whole per 2B transactions or all tables must be vacuum'd eventually at least once?  I want to be absolutely clear on what you're saying.

 
There is one potentially significant difference between autovac and manual vacuums here; autovac treats toast tables as just another table, with their own stats and their own freeze needs. If you're generating a lot of toast records that might make a difference.

I do not anticipate TOAST entering the picture.  No single column or record > 8KB or even approaching it. We have a few databases that (ab)use pg_toast and I want to avoid those complications.

 
-Greg

Re: Millions of tables

От
Jim Nasby
Дата:
On 10/5/16 7:34 AM, Greg Spiegelberg wrote:
> When you say "must do a vacuum of the entire database", are you saying
> the entire database must be vacuum'd as a whole per 2B transactions or
> all tables must be vacuum'd eventually at least once?

All tables at least once. Prior to 9.6, that had to happen every 2B
transactions. With 9.6 there's a freeze map, so if a page never gets
dirtied between vacuum freeze runs then it doesn't need to be frozen.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


Re: Millions of tables

От
Robert Klemme
Дата:
Greg, sorry for the resend: I had forgotten to include the list.

On Wed, Oct 5, 2016 at 2:34 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:

> Data is not static.  The 4M tables fall into one of two groups.
>
> Group A contains 2M tables.  INSERT will occur ~100 times/day and maximum
> number of records anticipated will be 200k.  Periodic DELETE's will occur
> removing "old" records.  Age is something the client sets and I have no way
> of saying 1 or 10k records will be removed.

The ~100 times / day are per table, I assume. Also, I assume the DELETEs
will probably delete in batches (because the time criterion catches
several records).

> Group B contains the other 2M tables.  Maximum records ~140k and UPSERT will
> be the only mechanism used to populate and maintain.  Periodic DELETE's may
> run on these tables as well removing "old" records.

So there will be inserts and updates.

Either I missed it or you did not mention the criteria for placing a
record in one of the 4M buckets. Can you shed light on what the
criteria are? That would obviously suggest what indexing could be
done.

Also it would be interesting to see the results of your tests with a
btree on really large tables, as Stephen had suggested. I know it is not
the primary test you want to do, but I would rather first explore a
"traditional" schema before venturing into the unknown of the
multi-million-dollar, pardon, multi-million-table schema.

Kind regards


--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can
- without end}
http://blog.rubybestpractices.com/