Discussion: Millions of tables
From: Greg Spiegelberg Sent: Sunday, September 25, 2016 7:50 PM
… Over the weekend, I created 8M tables with 16M indexes on those tables.
… A system or database crash could take potentially hours to days to recover. There are likely other issues ahead.
You may wonder, "why is Greg attempting such a thing?" I looked at DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop". Problem with the "one big table" solution is I anticipate 1,200 trillion records. Random access is expected and the customer expects <30ms reads for a single record fetch.
I'm not looking for alternatives yet but input to my test.
_________
Holy guacamole, batman! Ok, here’s my take: you’ve traded the risks/limitations of the known for the risks of the unknown. The unknown being that, in the numerous places where Postgres’s historical development may have cut corners, you may be the first to exercise those corners and flame out like the recent SpaceX rocket.
Put it another way – you’re going to bet your career (perhaps) or a client’s future on an architectural model that just doesn’t seem feasible. I think you’ve got a remarkable design problem to solve, and am glad you’ve chosen to share that problem with us.
And I do think it will boil down to this: it’s not that you CAN do it on Postgres (which you clearly can), but once in production, assuming things are actually stable, how will you handle the data management aspects like inevitable breakage, data integrity issues, backups, restores, user contention for resources, fault tolerance and disaster recovery? Just listing the tables will take forever. Add a column? Never. I do think the amount of testing you’ll need to do to prove that every normal data management function still works at that table count…that in itself is not going to be a lot of fun.
This one hurts my head. Ironically, the most logical destination for this type of data may actually be Hadoop – auto-scale, auto-shard, fault tolerant, etc…and I’m not a Hadoopie.
I am looking forward to hearing how this all plays out, it will be quite an adventure! All the best,
Mike Sofen (Synthetic Genomics…on Postgres 9.5x)
Dear Greg,
Have you checked PostgresXL?
With millions of tables, how do the apps choose which table is appropriate?
In my opinion, at that scale it should be handled with parallel query and
data sharding, which is what PostgresXL does.
Thanks,
Julyanto SUTANDANG
Equnix Business Solutions, PT
(An Open Source and Open Mind Company)
www.equnix.co.id
Pusat Niaga ITC Roxy Mas Blok C2/42. Jl. KH Hasyim Ashari 125, Jakarta Pusat
T: +6221 22866662 F: +62216315281 M: +628164858028
On Mon, Sep 26, 2016 at 9:50 AM, Greg Spiegelberg
<gspiegelberg@gmail.com> wrote:
> Hey all,
>
> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time
> has said not to have millions of tables. I too have long believed it until
> recently.
>
> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for
> PGDATA. Over the weekend, I created 8M tables with 16M indexes on those
> tables. Table creation initially took 0.018031 secs, average 0.027467 and
> after tossing out outliers (qty 5) the maximum creation time found was
> 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds.
> Tables were created by a single process. Do note that table creation is
> done via plpgsql function as there are other housekeeping tasks necessary
> though minimal.
>
> No system tuning but here is a list of PostgreSQL knobs and switches:
> shared_buffers = 2GB
> work_mem = 48 MB
> max_stack_depth = 4 MB
> synchronous_commit = off
> effective_cache_size = 200 GB
> pg_xlog is on its own file system
>
> There are some still obvious problems. General DBA functions such as VACUUM
> and ANALYZE should not be done. Each will run forever and cause much grief.
> Backups are problematic in the traditional pg_dump and PITR space. Large
> JOINs by VIEW, SELECT or via table inheritance (I am abusing it in my test
> case) are no-no's. A system or database crash could take potentially hours
> to days to recover. There are likely other issues ahead.
>
> You may wonder, "why is Greg attempting such a thing?" I looked at
> DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it,
> it's antiquated and don't get me started on "Hadoop". I looked at many
> others and ultimately the recommended use of each vendor was to have one
> table for all data. That overcomes the millions of tables problem, right?
>
> Problem with the "one big table" solution is I anticipate 1,200 trillion
> records. Random access is expected and the customer expects <30ms reads for
> a single record fetch.
>
> No data is loaded... yet. Table and index creation only. I am interested in
> the opinions of all including tests I may perform. If you had this setup,
> what would you capture / analyze? I have a job running preparing data. I
> did this on a much smaller scale (50k tables) and data load via function
> allowed close to 6,000 records/second. The schema has been simplified since
> and the last test reached just over 20,000 records/second with 300k tables.
>
> I'm not looking for alternatives yet but input to my test. Takers?
>
> I can't promise immediate feedback but will do my best to respond with
> results.
>
> TIA,
> -Greg
Hi Greg,

Please follow the conventions of this mailing list, to avoid confusion - see the bottom of this posting for further comments.

On 26/09/16 17:05, Greg Spiegelberg wrote:
> Precisely why I shared with the group. I must understand the risks involved. I need to explore if it can be stable at this size; when does it become unstable? Aside from locking down user access to superuser, is there a way to prohibit database-wide VACUUM & ANALYZE? Certainly putting my trust in autovacuum :) which is something I have not yet fully explored how to best tune.
>
> Couple more numbers... ~231 GB is the size of PGDATA with 8M empty tables and 16M empty indexes. ~5% of inodes on the file system have been used. Sar data during the 8M table creation shows a very stable and regular I/O pattern. Not a blip worth mentioning.
>
> Another point worth mentioning: the tables contain a boolean, int8's and timestamptz's only. Nothing of variable size like bytea, text, json or xml. Each of the 8M tables will contain on the very high side between 140k and 200k records. The application also has a heads up as to which table contains which record. The searches come in saying "give me record X from partition key Y" where Y identifies the table and X is used in the filter on the table.
>
> Last point, add column will never be done. I can hear eyes rolling :) but the schema and its intended use is complete. You'll have to trust me on that one.
>
> -Greg
>
> [earlier quoted text trimmed]

In this list, the convention is to post replies at the end (with some rare exceptions), or interspersed when appropriate, and to omit parts no longer relevant. The motivation for bottom posting like this is that people get to see the context before the reply, AND emails don't end up getting longer & longer as people reply at the beginning, forgetting to trim the now-irrelevant stuff at the end.

Cheers,
Gavin
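[Editor's note: Greg's numbers above imply a fixed on-disk cost per empty relation worth making explicit: ~231 GB of PGDATA spread over 8M tables, each with two empty indexes, works out to roughly 30 KB apiece, consistent with a few 8 KB-page relation files per relation.]

```python
# Back-of-envelope from the figures quoted above.
pgdata_bytes = 231 * 1024**3     # ~231 GB reported for PGDATA
tables = 8_000_000               # each with 2 empty indexes

per_table_kb = pgdata_bytes / tables / 1024
print(round(per_table_kb))       # -> 30 (KB per empty table + its indexes)
```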
I did look at PostgresXL and CitusDB. Both are admirable however neither could support the need to read a random record consistently under 30ms. It's a similar problem Cassandra and others have: network latency. At this scale, to provide the ability to access any given record amongst trillions it is imperative to know precisely where it is stored (system & database) and read a relatively small index. I have other requirements that prohibit use of any technology that is eventually consistent.
Do you plan to query through the master table of the partitions? That is actually quite slow, considering the millions of constraints to check for every query.
I liken the problem to fishing. To find a particular fish of length, size, color &c in a data lake you must accept the possibility of scanning the entire lake. However, if all fish were in barrels where each barrel had a particular kind of fish of specific length, size, color &c then the problem is far simpler.
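[Editor's note: the "barrels" approach amounts to deterministic client-side routing: the application computes the table from the partition key before touching the database, so only one small table and index are ever read. A minimal sketch of such routing; the hash scheme, bucket count, and `data_p<n>` naming are assumptions for illustration, not Greg's actual design.]

```python
import hashlib

N_TABLES = 8_000_000  # matches the 8M tables in the test rig

def table_for(partition_key: str) -> str:
    # Stable hash -> bucket -> table name. Any deterministic mapping
    # works, as long as every client agrees on it.
    h = int(hashlib.sha256(partition_key.encode()).hexdigest(), 16)
    return f"data_p{h % N_TABLES}"

# "Give me record X from partition key Y" then becomes roughly:
#   SELECT * FROM <table_for(Y)> WHERE record_id = X;
# i.e., one small index probe instead of scanning the whole lake.
print(table_for("device-123"))
```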
Hi Greg.
On 26/09/16 05:50, Greg Spiegelberg wrote:
> [original post quoted in full; trimmed here]
This is a problem (creating a large number of tables; really large indeed) that we researched in my company a while ago. You might want to read about it: https://www.pgcon.org/2013/schedule/events/595.en.html

Cheers,
Álvaro
--
Álvaro Hernández Tortosa
8Kdata
Something that is not talked about at all in this thread is caching. A bunch of memcache servers in front of the DB should be able to help with the 30ms constraint (it doesn't have to be memcache; any caching technology would do).
--
http://yves.zioup.com
gpg: 4096R/32B0F416
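[Editor's note: a read-through cache in front of the database, as suggested above, could look like this sketch. A plain dict stands in for memcached, and `fetch_from_db` is a placeholder for the real single-record query; both are invented for illustration.]

```python
# Read-through cache sketch: a dict stands in for memcached;
# fetch_from_db is a placeholder for the real "<30ms" single-record query.
cache: dict[tuple[str, int], dict] = {}

def fetch_from_db(partition_key: str, record_id: int) -> dict:
    # Placeholder for: SELECT ... FROM <table for partition_key>
    #                  WHERE record_id = <record_id>
    return {"partition": partition_key, "id": record_id}

def get_record(partition_key: str, record_id: int) -> dict:
    key = (partition_key, record_id)
    if key not in cache:              # miss: hit the database once
        cache[key] = fetch_from_db(partition_key, record_id)
    return cache[key]                 # hit: served from memory

get_record("Y", 1)
get_record("Y", 1)                    # second call never touches the DB
print(len(cache))                     # -> 1
```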
From: Rick Otten Sent: Monday, September 26, 2016 3:24 AM
Are the tables constantly being written to, or is this a mostly read scenario?
With regards to consistent query performance, I think you need to get out of AWS. That environment is terrible if you are going for consistency unless you buy dedicated hardware, and then you are paying so much money it is ridiculous.
Also I think having 10M rows in a table is not a problem for the query times you are referring to. So instead of millions of tables, unless I'm doing my math wrong, you probably only need thousands of tables.
----------
Excellent thoughts: the read/write behavior will/should drive a lot of the design; AWS does not guarantee consistency or latency; and 10m rows is nothing to PG.
Re AWS: we’re on it, at least for now. In my profiling of our performance there, I consistently get low latencies…I just know that there will be random higher latencies, but the statistical average will be low. I just ran a quick test against a modest sized table on a modest sized EC2 instance (m4.xlarge – 4 core/16gb ram, 3 tb ssd): the table has 15m rows but is huge (it represents nearly 500m rows compressed in jsonb documents), with 5 indexed key columns and a total of 12 columns. I queried for a single, non-PK, indexed value using “select *” (so it included the json) and it took 22ms, without the json it took 11ms. Especially with the db/memory-optimized EC2 instances now available (with guaranteed IOPS), performance against even 100m row tables should still stay within your requirements.
So Rick’s point about not needing millions of tables is right on. If there’s a way to create table “clumps”, at least you’ll have a more modest table count.
Mike Sofen (Synthetic Genomics)
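[Editor's note: it is worth running the rows-per-table arithmetic against the 1,200 trillion figure from the original post. At 10M rows per table the count lands in the hundreds of millions of tables, not thousands, so the achievable rows-per-table target drives the whole design.]

```python
total_records = 1_200 * 10**12          # 1,200 trillion, from the original post

tables_at_10m = total_records // 10**7  # 10M rows per table
tables_at_1b = total_records // 10**9   # 1B rows per table

# At 10M rows/table you still need 120 million tables;
# even at 1B rows/table, 1.2 million tables remain.
print(tables_at_10m, tables_at_1b)
```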
On 26 September 2016 at 11:19, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
> I did look at PostgresXL and CitusDB. Both are admirable however neither could support the need to read a random record consistently under 30ms. It's a similar problem Cassandra and others have: network latency. At this scale, to provide the ability to access any given record amongst trillions it is imperative to know precisely where it is stored (system & database) and read a relatively small index. I have other requirements that prohibit use of any technology that is eventually consistent.
>
> I liken the problem to fishing. To find a particular fish of length, size, color &c in a data lake you must accept the possibility of scanning the entire lake. However, if all fish were in barrels where each barrel had a particular kind of fish of specific length, size, color &c then the problem is far simpler.
>
> -Greg

My gut tells me that if you do solve the problem and get PostgreSQL (or anything) reading consistently at under 30ms with that many tables, you will have solved one problem by creating another.
You discounted Cassandra due to network latency, but are now trying a monolithic PostgreSQL setup. It might be worth trying a single node ScyllaDB or Cassandra deploy (no need for QUORUM or network overhead), perhaps using layered compaction so all your data gets broken out into 160MB chunks. And certainly wander over to the ScyllaDB mailing list, as they are very focused on performance problems like yours and should offer some insight even if a Cassandra style architecture cannot meet your requirements.
An alternative if you exhaust or don't trust other options: use a foreign data wrapper to access your own custom storage. With a single table at the PG level, you can shard the data yourself into 8 bazillion separate stores, in whatever structure suits your read and write operations (maybe reusing an embedded db engine, ordered flat file+log+index, whatever).
Something that is not talked about at all in this thread is caching. A bunch of memcache servers in front of the DB should be able to help with the 30ms constraint (it doesn't have to be memcache, any caching technology would do).
--
http://yves.zioup.com
gpg: 4096R/32B0F416
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
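As a rough sketch of the look-aside caching pattern being suggested: the dict here stands in for the memcache tier, and `fetch` is a hypothetical stand-in for a primary-key SELECT against PostgreSQL. The point is that repeated reads of a hot record never touch the database at all.

```python
import time

class LookAsideCache:
    """Look-aside cache: check the cache first, fall back to the DB on a miss."""
    def __init__(self, fetch_from_db, ttl_seconds=60):
        self.fetch_from_db = fetch_from_db  # callable: key -> record
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, record); stands in for memcached

    def get(self, key):
        hit = self.store.get(key)
        now = time.monotonic()
        if hit is not None and hit[0] > now:
            return hit[1]                     # cache hit: no DB round trip
        record = self.fetch_from_db(key)      # cache miss: pay the DB latency once
        self.store[key] = (now + self.ttl, record)
        return record

# Hypothetical backing fetch; in practice this would be the <30ms point read.
calls = []
def fetch(key):
    calls.append(key)
    return {"id": key, "payload": "..."}

cache = LookAsideCache(fetch)
cache.get(42)
cache.get(42)
assert calls == [42]  # the second read was served from the cache
```

How much this helps depends entirely on whether the access pattern has any locality; a truly uniform random read over 1,200 trillion records would see almost no hits.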
From: Rick Otten Sent: Monday, September 26, 2016 3:24 AM
Are the tables constantly being written to, or is this a mostly read scenario?
With regards to consistent query performance, I think you need to get out of AWS. That environment is terrible if you are going for consistency unless you buy dedicated hardware, and then you are paying so much money it is ridiculous.
Also I think having 10M rows in a table is not a problem for the query times you are referring to. So instead of millions of tables, unless I'm doing my math wrong, you probably only need thousands of tables.
----------
Excellent thoughts: the read/write behavior will/should drive a lot of the design; AWS does not guarantee consistency or latency; and 10m rows is nothing to PG.
Are the tables constantly being written to, or is this a mostly read scenario? One architecture possibility, if the writes are not so frequent, is to create just a handful of very big tables for writing, and then make smaller tables as materialized views for reading. The vacuum and bloat management could be done back at the big tables. The materialized views could be refreshed or replaced during non-peak hours. The materialized views could be on a different tablespace than the root tables. They could also be structured to reflect real-world query patterns, which are sometimes different than the raw data storage engineering problem.
With some logging you may be able to see that the data is not truly randomly accessed, but rather clustered around just some of the millions of tables. Then the engineering problem becomes "How do I service 90% of the queries on these tables in 30ms?" rather than "How do I service 100% of the queries 100% of the time in 30ms?" Knowing 90% of the queries hit just a few hundred tables makes the first question easier to answer.

Similarly, if most of the columns are static and only a few columns are actually changing, you could consider pulling the static stuff out of the same table as the dynamic stuff and then look at joins in your queries. The end goal is to be able to get solid indexes and tables that don't change a lot so they can be tightly packed and cached (less bloat, less fragmentation, fewer disk accesses).

With regards to consistent query performance, I think you need to get out of AWS. That environment is terrible if you are going for consistency unless you buy dedicated hardware, and then you are paying so much money it is ridiculous.
Also I think having 10M rows in a table is not a problem for the query times you are referring to. So instead of millions of tables, unless I'm doing my math wrong, you probably only need thousands of tables.
However even 8 bazillion FDWs may cause an "overflow" of relationships, at the loss of having an efficient storage engine acting more like a traffic cop. In such a case, I would opt to put such logic in the app to directly access the true storage rather than going through FDWs.
--
Hey all,

Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has said not to have millions of tables. I too have long believed it until recently.

AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for PGDATA. Over the weekend, I created 8M tables with 16M indexes on those tables. Table creation initially took 0.018031 secs, average 0.027467, and after tossing out outliers (qty 5) the maximum creation time found was 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. Tables were created by a single process. Do note that table creation is done via plpgsql function as there are other housekeeping tasks necessary though minimal.

No system tuning but here is a list of PostgreSQL knobs and switches:

shared_buffers = 2GB
work_mem = 48MB
max_stack_depth = 4MB
synchronous_commit = off
effective_cache_size = 200GB
pg_xlog is on its own file system

There are still some obvious problems. General DBA functions such as VACUUM and ANALYZE should not be done. Each will run forever and cause much grief. Backups are problematic in the traditional pg_dump and PITR space. Large JOINs by VIEW, SELECT or via table inheritance (I am abusing it in my test case) are no-nos. A system or database crash could take potentially hours to days to recover. There are likely other issues ahead.

You may wonder, "why is Greg attempting such a thing?" I looked at DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop". I looked at many others and ultimately the recommended use of each vendor was to have one table for all data. That overcomes the millions of tables problem, right?

Problem with the "one big table" solution is I anticipate 1,200 trillion records. Random access is expected and the customer expects <30ms reads for a single record fetch.

No data is loaded... yet. Table and index creation only. I am interested in the opinions of all including tests I may perform. If you had this setup, what would you capture / analyze? I have a job running preparing data. I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second. The schema has been simplified since and the last test reached just over 20,000 records/second with 300k tables.

I'm not looking for alternatives yet but input to my test. Takers? I can't promise immediate feedback but will do my best to respond with results.

TIA,
-Greg
I may need to understand autovacuum better. My impression was it consulted statistics and performed vacuums one table at a time based on the vacuum threshold formula on https://www.postgresql.org/docs/9.5/static/routine-vacuuming.html.
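For reference, the trigger condition on that page is "vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples", evaluated per table against the dead-tuple count. A quick sketch with the 9.5 defaults shows how the per-table trigger point scales with table size:

```python
def autovacuum_vacuum_point(reltuples,
                            base_threshold=50,   # autovacuum_vacuum_threshold default
                            scale_factor=0.2):   # autovacuum_vacuum_scale_factor default
    """Dead tuples required before autovacuum will vacuum a given table,
    per the formula on the routine-vacuuming page."""
    return base_threshold + scale_factor * reltuples

# Small bucket tables reach the threshold with few dead tuples, so each
# individual vacuum stays cheap even when there are very many tables:
assert autovacuum_vacuum_point(140_000) == 28_050.0
assert autovacuum_vacuum_point(14_000_000) == 2_800_050.0
```

The catch at 8M tables is not the formula but the bookkeeping around it, which is exactly the stats-file problem discussed below.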
Jeff Janes <jeff.janes@gmail.com> writes:
> A problem is that those statistics are stored in one file (per database; it
> used to be one file per cluster). With 8 million tables, that is going to
> be a pretty big file. But the code pretty much assumes the file is going
> to be pretty small, and so it has no compunction about commanding that it
> be read and written, in its entirety, quite often.

I don't know that anyone ever believed it would be small. But at the time the pgstats code was written, there was no good alternative to passing the data through files. (And I'm not sure we envisioned applications that would be demanding fresh data constantly, anyway.)

Now that the DSM stuff exists and has been more or less shaken out, I wonder how practical it'd be to use a DSM segment to make the stats collector's data available to backends. You'd need a workaround for the fact that not all the DSM implementations support resize (although given the lack of callers of dsm_resize, one could be forgiven for wondering whether any of that code has been tested at all). But you could imagine abandoning one DSM segment and creating a new one of double the size anytime the hash tables got too big.

regards, tom lane
From: Greg Spiegelberg Sent: Monday, September 26, 2016 7:25 AM
I've gotten more responses than anticipated and have answered some questions and gotten some insight but my challenge again is what should I capture along the way to prove or disprove this storage pattern? Alternatives to the storage pattern aside, I need ideas to test rig, capture metrics and suggestions to tune it.
In the next 24 hours, I will be sending ~1 trillion records to the test database. Because of time to set up, I'd rather have things set up properly the first go.
Thanks!
-Greg
---------------------
Greg, I ran another quick test on a wider table than you’ve described, but this time with 80 million rows, with core counts, ram and ssd storage similar to what you’d have on that AWS EC2 instance. This table had 7 columns (3 integers, 3 text, 1 timestamptz) with an average width of 157 chars, one btree index on the pk int column. Using explain analyze, I picked one id value out of the 80m and ran a select * where id = x. It did an index scan, had a planning time of 0.077ms, and an execution time of 0.254 seconds. I ran the query for a variety of widely spaced values (so the data was uncached) and the timing never changed. This has been mirroring my general experience with PG – very fast reads on indexed queries.
Summary: I think your buckets can be WAY bigger than you are envisioning for the simple table design you’ve described. I’m betting you can easily do 500 million rows per bucket before approaching anything close to the 30ms max query time.
Mike Sofen (Synthetic Genomics)
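If the bucket approach pans out, routing a key to its bucket table is a one-liner. This sketch is illustrative only: the table naming scheme is made up, and the bucket count is simply 1,200 trillion rows divided by roughly 500M rows per bucket. The point is that the lookup is deterministic, so a point read touches exactly one modest-sized table and one small index, with no routing metadata to consult.

```python
import hashlib

NUM_BUCKETS = 2_400_000  # assumption: ~1,200 trillion rows at ~500M rows/bucket

def bucket_table(record_key: str) -> str:
    """Deterministically map a record key to one bucket table, so a point
    read knows exactly which table (the 'barrel') holds the record."""
    h = int(hashlib.sha256(record_key.encode("utf-8")).hexdigest(), 16)
    return f"data_bucket_{h % NUM_BUCKETS:07d}"

# The same key always routes to the same table:
assert bucket_table("fish-42") == bucket_table("fish-42")
assert bucket_table("fish-42").startswith("data_bucket_")
```

Note that even at 500M rows per bucket this still leaves 2.4 million tables, i.e. bigger buckets shrink the table count but do not by themselves get it down to thousands.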
From: Mike Sofen Sent: Tuesday, September 27, 2016 8:10 AM
Totally typo’d the execution time: it was 0.254 MILLISECONDS, not SECONDS. Thus my comment about going up 10x in bucket size instead of appearing to be right at the limit. Sorry!
Mike
On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:

> Problem with the "one big table" solution is I anticipate 1,200 trillion records. Random access is expected and the customer expects <30ms reads for a single record fetch.

You don't give enough details to fully explain the problem you're trying to solve.
- Will records ever be updated or deleted? If so, what percentage and at what frequency?
- What specifically are you storing (e.g. list of integers, strings, people's sex habits, ...)? Or more importantly, are these fixed- or variable-sized records?
- Once the 1,200 trillion records are loaded, is that it? Or do more data arrive, and if so, at what rate?
- Do your queries change, or is there a fixed set of queries?
- How complex are the joins?
The reason I ask these specific questions is because, as others have pointed out, this might be a perfect case for a custom (non-relational) database. Relational databases are general-purpose tools, sort of like a Swiss-Army knife. A Swiss-Army knife does most things passably, but if you want to carve wood, or butcher meat, or slice vegetables, you get a knife meant for that specific task.
I've written several custom database-storage systems for very specific high-performance systems. It's generally a couple weeks of work, and you have a tailored performance and storage that's hard for a general-purpose relational system to match.
The difficulty of building such a system depends a lot on the answers to the questions above.

Craig

--
Craig A. James
Chief Technology Officer, eMolecules, Inc.
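As a sketch of how small such a bespoke store can be when the records are fixed-size (the 40-byte layout here is hypothetical), a record id can map straight to a file offset, so a point read is one seek plus one read with no index at all:

```python
import struct
import tempfile

# Hypothetical fixed-width layout: int64 key + 32-byte payload = 40 bytes/record.
RECORD = struct.Struct("<q32s")

class FixedRecordStore:
    """Fixed-size records addressed by record id; offset = id * record size."""
    def __init__(self, path):
        self.f = open(path, "w+b")

    def put(self, record_id, key, payload):
        self.f.seek(record_id * RECORD.size)
        self.f.write(RECORD.pack(key, payload.ljust(32, b"\0")))

    def get(self, record_id):
        self.f.seek(record_id * RECORD.size)
        key, payload = RECORD.unpack(self.f.read(RECORD.size))
        return key, payload.rstrip(b"\0")

store = FixedRecordStore(tempfile.mkstemp()[1])
store.put(7, 1234, b"hello")       # lands at byte offset 7 * 40 = 280
assert store.get(7) == (1234, b"hello")
```

This is the sense in which Craig's questions matter: the whole design collapses the moment records are variable-sized or need in-place updates of variable-length fields.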
I have not seen any mention of transaction ID wraparound in this thread yet. With the numbers that you are looking at, I could see this as a major issue.

T
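The arithmetic behind that concern is sobering. Assuming (pessimistically) one transaction per inserted record at the ~20,000 records/second load rate Greg measured:

```python
XID_SPACE = 2**31                # ~2.1 billion XIDs before anti-wraparound vacuum is forced
TOTAL_RECORDS = 1_200 * 10**12   # 1,200 trillion records
RATE = 20_000                    # records/second, from the 300k-table load test

seconds_per_cycle = XID_SPACE / RATE          # ~107,000 s: the XID space lasts about a day
cycles_to_load = TOTAL_RECORDS / XID_SPACE    # hundreds of thousands of wraparound cycles

assert seconds_per_cycle / 86_400 < 1.5
assert cycles_to_load > 500_000
```

In other words, single-row transactions are untenable at this scale; rows must be batched many-per-transaction, and even then anti-wraparound vacuuming of millions of tables remains a scheduling problem.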
Have you considered having many databases (e.g. 100) and possibly many postgresql servers (e.g. 10) started on different ports? This would give you 1000x fewer tables per db.
Greg,

* Greg Spiegelberg (gspiegelberg@gmail.com) wrote:
> Bigger buckets mean a wider possibility of response times. Some buckets
> may contain 140k records and some 100X more.

Have you analyzed the depth of the btree indexes to see how many more pages need to be read to handle finding a row in 140k records vs. 14M records vs. 140M records? I suspect you'd find that the change in actual depth (meaning how many pages have to actually be read to find the row you're looking for) isn't very much and that your concern over the "wider possibility of response times" isn't well founded.

Since you have a hard-set 30ms maximum for query response time, I would suggest you work out how long it takes to read a cold page from your I/O subsystem, and then you can work through exactly how many page reads could be done in that 30ms (or perhaps 20ms, to allow for whatever overhead there will be in the rest of the system and as a buffer), and then work that back to how deep the index can be based on that many page reads and then how many records are required to create an index of that depth. Of course, the page from the heap will also need to be read and there's a bit of additional work to be done, but the disk i/o for cold pages is almost certainly where most time will be spent.

I suspect you'll discover that millions of tables is a couple orders of magnitude off of how many you'd need to keep the number of page reads below the threshold you work out based on your I/O. Of course, you would need a consistent I/O subsystem, or at least one where you know the maximum possible latency to pull a cold page.

Lastly, you'll want to figure out how to handle system crash/restart if this system requires a high uptime. I expect you'd want to have at least one replica and a setup which allows you to flip traffic to it very quickly to maintain the 30ms response times.

Thanks!

Stephen
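That budget can be sketched numerically. The 0.5 ms cold-read latency and ~300 keys per index page used here are assumptions to illustrate the method, not measurements:

```python
def index_page_budget(budget_ms=20.0, cold_read_ms=0.5):
    """Page reads available for the index within the latency budget,
    reserving one read for the heap page itself."""
    return int(budget_ms // cold_read_ms) - 1

def keys_at_depth(depth, fanout=300):
    """Rough btree key capacity at a given depth, assuming ~300 keys/page."""
    return fanout ** depth

# Even a conservative 20 ms budget allows dozens of index page reads,
# while a depth-4 index already spans billions of keys:
assert index_page_budget() == 39
assert keys_at_depth(4) == 8_100_000_000
```

Under these assumptions a table could hold billions of rows before its index depth came anywhere near the page-read budget, which is the "couple orders of magnitude" Stephen refers to.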
Greg,
* Greg Spiegelberg (gspiegelberg@gmail.com) wrote:
> Bigger buckets mean a wider possibility of response times. Some buckets
> may contain 140k records and some 100X more.
Have you analyzed the depth of the btree indexes to see how many more
pages need to be read to handle finding a row in 140k records vs. 14M
records vs. 140M records?
I suspect you'd find that the change in actual depth (meaning how many
pages have to actually be read to find the row you're looking for) isn't
very much and that your concern over the "wider possibility of response
times" isn't well founded.
Since you have a hard-set 30ms maximum for query response time, I would
suggest you work out how long it takes to read a cold page from your I/O
subsystem and then you can work through exactly how many page reads
could be done in that 30ms (or perhaps 20ms, to allow for whatever
overhead there will be in the rest of the system and as a buffer) and
then work that back to how deep the index can be based on that many page
reads and then how many records are required to create an index of that
depth. Of course, the page from the heap will also need to be read and
there's a bit of additional work to be done, but the disk i/o for cold
pages is almost certainly where most time will be spent.
I suspect you'll discover that millions of tables is a couple orders of
magnitude off of how many you'd need to keep the number of page reads
below the threshold you work out based on your I/O.
Of course, you would need a consistent I/O subsystem, or at least one
where you know the maximum possible latency to pull a cold page.
Lastly, you'll want to figure out how to handle system crash/restart if
this system requires a high uptime. I expect you'd want to have at
least one replica and a setup which allows you to flip traffic to it
very quickly to maintain the 30ms response times.

Thanks!

Stephen
Greg,

* Greg Spiegelberg (gspiegelberg@gmail.com) wrote:
> On Wed, Sep 28, 2016 at 11:27 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > * Greg Spiegelberg (gspiegelberg@gmail.com) wrote:
> > > Bigger buckets mean a wider possibility of response times. Some buckets
> > > may contain 140k records and some 100X more.
> >
> > Have you analyzed the depth of the btree indexes to see how many more
> > pages need to be read to handle finding a row in 140k records vs. 14M
> > records vs. 140M records?
> >
> > I suspect you'd find that the change in actual depth (meaning how many
> > pages have to actually be read to find the row you're looking for) isn't
> > very much and that your concern over the "wider possibility of response
> > times" isn't well founded.
>
> Excellent feedback! Um, how does one look at tree depth in PostgreSQL?
> Oracle I know but have not done the same in PG. Pointers?

CREATE EXTENSION pageinspect;

SELECT * FROM bt_metap('indexname');

https://www.postgresql.org/docs/9.5/static/pageinspect.html

Thanks!

Stephen
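For context, `bt_metap` reads the index meta page, and its `level` column is the height of the btree, i.e. the number of page descents from root to leaf. A hedged sketch (the index name here is invented):

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- 'level' is the btree height: pages read per index descent
SELECT level FROM bt_metap('idx_bucket_pkey');
```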
On 26 September 2016 at 05:19, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
> I did look at PostgresXL and CitusDB. Both are admirable however neither
> could support the need to read a random record consistently under 30ms.
> It's a similar problem Cassandra and others have: network latency. At this
> scale, to provide the ability to access any given record amongst trillions
> it is imperative to know precisely where it is stored (system & database)
> and read a relatively small index. I have other requirements that prohibit
> use of any technology that is eventually consistent.

Then XL is exactly what you need, since it does allow you to calculate exactly where the record is via hash and then access it, which makes the request just a single datanode task.

XL is not the same as CitusDB.

> I liken the problem to fishing. To find a particular fish of length, size,
> color &c in a data lake you must accept the possibility of scanning the
> entire lake. However, if all fish were in barrels where each barrel had a
> particular kind of fish of specific length, size, color &c then the problem
> is far simpler.

The task of putting the fish in the appropriate barrel is quite hard.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
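As a hedged illustration of the hash distribution Simon describes: in Postgres-XL, `DISTRIBUTE BY HASH` (XL syntax, not stock PostgreSQL) pins each row to one datanode, so a lookup by the distribution key is routed to a single node. Table and column names below are invented to match the fishing analogy:

```sql
-- Each row hashes to exactly one datanode, so a single-record fetch
-- touches one node instead of fanning out across the cluster.
CREATE TABLE fish (
    fish_id   bigint PRIMARY KEY,
    length_cm numeric,
    color     text
) DISTRIBUTE BY HASH (fish_id);

-- Routed to exactly one datanode:
SELECT * FROM fish WHERE fish_id = 12345;
```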
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Greg Spiegelberg
Sent: Tuesday, September 27, 2016 7:28 PM
To: Terry Schmitt <tschmitt@schmittworks.com>
Cc: pgsql-performa. <pgsql-performance@postgresql.org>
Subject: Re: [PERFORM] Millions of tables
On Tue, Sep 27, 2016 at 10:15 AM, Terry Schmitt <tschmitt@schmittworks.com> wrote:
On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
Hey all,
Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has said not to have millions of tables. I too have long believed it until recently.
AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for PGDATA. Over the weekend, I created 8M tables with 16M indexes on those tables. Table creation initially took 0.018031 secs, average 0.027467 and after tossing out outliers (qty 5) the maximum creation time found was 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. Tables were created by a single process. Do note that table creation is done via plpgsql function as there are other housekeeping tasks necessary though minimal.
No system tuning but here is a list of PostgreSQL knobs and switches:
shared_buffers = 2GB
work_mem = 48 MB
max_stack_depth = 4 MB
synchronous_commit = off
effective_cache_size = 200 GB
pg_xlog is on its own file system
There are still some obvious problems. General DBA functions such as VACUUM and ANALYZE should not be done. Each will run forever and cause much grief. Backups are problematic in the traditional pg_dump and PITR space. Large JOINs via VIEW, SELECT or table inheritance (I am abusing it in my test case) are no-nos. A system or database crash could take potentially hours to days to recover. There are likely other issues ahead.
You may wonder, "why is Greg attempting such a thing?" I looked at DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop". I looked at many others and ultimately the recommended use of each vendor was to have one table for all data. That overcomes the millions of tables problem, right?
Problem with the "one big table" solution is I anticipate 1,200 trillion records. Random access is expected and the customer expects <30ms reads for a single record fetch.
No data is loaded yet... table and index creation only. I am interested in the opinions of all, including tests I may perform. If you had this setup, what would you capture/analyze? I have a job running preparing data. I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second. The schema has been simplified since, and the last test reached just over 20,000 records/second with 300k tables.
I'm not looking for alternatives yet but input to my test. Takers?
I can't promise immediate feedback but will do my best to respond with results.
TIA,
-Greg
I have not seen transaction ID wraparound mentioned in this thread yet. With the numbers that you are looking at, I could see this being a major issue.
T
Thank you Terry. You get the gold star. :) I was waiting for that to come up.
Success means handling this condition. A whole database vacuum and dump-restore is out of the question. Can a properly tuned autovacuum prevent the situation?
-Greg
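For scale, a quick sanity check on the numbers quoted above. The inputs all come from the thread; only the arithmetic is added here, and the single-loader extrapolation is purely illustrative since any real load would be parallelized:

```python
# Inputs are from the thread; only the arithmetic is added here.
total_rows = 1_200 * 10**12   # 1,200 trillion anticipated records
tables = 8 * 10**6            # 8M tables in the test rig
rows_per_table = total_rows // tables

load_rate = 20_000            # records/second from the 300k-table load test
years_single_loader = total_rows / load_rate / (365 * 24 * 3600)

print(f"{rows_per_table:,} rows/table on average; "
      f"~{years_single_loader:,.0f} years with a single loader")
```

That works out to 150M rows per table on average, which is close to the 140M-record bucket size discussed elsewhere in the thread.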
Hi!
With millions of tables you have to set autovacuum_max_workers sky-high =). We have seen situations where, with thousands of tables, autovacuum can't vacuum all the tables that need it: it simply vacuums some of the most-modified tables and never reaches the others. Only a manual vacuum helps in that situation. Combined with the wraparound issue it can be a nightmare.
--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
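A hedged postgresql.conf starting point in the spirit of that advice. Every value below is an assumption to be tuned against the actual table count, churn rate, and I/O headroom, not a recommendation:

```
autovacuum_max_workers = 16         # default is 3; "sky-high" per the advice above
autovacuum_naptime = 10s            # launch workers more often to cycle through tables
autovacuum_vacuum_cost_delay = 5ms  # default 20ms; let each worker do more per cycle
maintenance_work_mem = 1GB          # more memory per (auto)vacuum worker
```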
On 9/29/16 6:11 AM, Alex Ignatov (postgrespro) wrote:
> With millions of tables you have to set autovacuum_max_workers
> sky-high =). We have some situation when at thousands of tables
> autovacuum can’t vacuum all tables that need it. Simply it vacuums some
> of most modified table and never reach others. Only manual vacuum can
> help with this situation. With wraparound issue it can be a nightmare

Specifically, autovac isn't going to start worrying about anti-wrap vacuums until tables start hitting autovacuum_freeze_max_age (or autovacuum_multixact_freeze_max_age). Any tables that hit that threshold go to the front of the line for being vacuumed. (But keep in mind that there is no universal line, just what each worker computes on its own when it's started.)

Where things will completely fall apart for you is if a lot of tables all have roughly the same relfrozenxid (or relminmxid), like they would immediately after a large load. In that scenario you'll suddenly have loads of work for autovac to do, all at the same time. That will make the database, DBAs and you Very Unhappy (tm).

Somehow, some way, you *must* do a vacuum of the entire database. Luckily the freeze map in 9.6 means you'd only have to do that one time (assuming the data really is static). In any older version, (auto)vacuum will need to eventually *read everything in every table* at least once every ~2B transactions.

The only impact the number of tables is going to have on this is granularity. If you have a small number of large tables, you'll have (auto)vacuum processes that will need to run *uninterrupted* for a long time to move the freeze threshold on each table. If you have tons of small tables, you'll need tons of separate (auto)vacuums, but each one will run for a shorter interval, and if one gets interrupted it won't be as big a deal.

There is one potentially significant difference between autovac and manual vacuums here: autovac treats toast tables as just another table, with their own stats and their own freeze needs. If you're generating a lot of toast records that might make a difference.

When it comes to vacuum, you might find https://www.pgcon.org/2015/schedule/events/829.en.html useful.

On a different topic... I didn't see anything in the thread about what you're storing, but with the row counts you're talking about I'm guessing it's something that's time-series. https://github.com/ElephantStack/ElephantStack is a project exploring the idea of using Postgres array types as a far more efficient way to store that kind of data; instead of an overhead of 24 bytes per row (plus indexes), arrays give you essentially zero overhead per row. There's no code yet, but a few of us have done testing on some real world data (see the google group referenced from the README).

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)  mobile: 512-569-9461
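One way to watch for the pile-up Jim describes is to track how close each table's relfrozenxid is to autovacuum_freeze_max_age. A sketch using standard catalog columns (adapt the LIMIT and filters as needed):

```sql
-- Tables nearest to a forced anti-wraparound vacuum, TOAST included
SELECT c.oid::regclass                                   AS table_name,
       age(c.relfrozenxid)                               AS xid_age,
       current_setting('autovacuum_freeze_max_age')::int
         - age(c.relfrozenxid)                           AS headroom
FROM pg_class c
WHERE c.relkind IN ('r', 't')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;
```

A large cluster of tables with nearly identical xid_age is exactly the "all at the same time" scenario above.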
On 10/5/16 7:34 AM, Greg Spiegelberg wrote:
> When you say "must do a vacuum of the entire database", are you saying
> the entire database must be vacuum'd as a whole per 2B transactions or
> all tables must be vacuum'd eventually at least once?

All tables at least once. Prior to 9.6, that had to happen every 2B transactions. With 9.6 there's a freeze map, so if a page never gets dirtied between vacuum freeze runs then it doesn't need to be frozen.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)  mobile: 512-569-9461
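On 9.6, the pg_visibility extension can show how much of a table the freeze map already covers. A hedged sketch (`some_table` is a placeholder):

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- Counts of all-visible and all-frozen pages; all-frozen pages can be
-- skipped by future anti-wraparound vacuums.
SELECT all_visible, all_frozen
FROM pg_visibility_map_summary('some_table');
```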
Greg,

sorry for the resend: I had forgotten to include the list.

On Wed, Oct 5, 2016 at 2:34 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
> Data is not static. The 4M tables fall into one of two groups.
>
> Group A contains 2M tables. INSERT will occur ~100 times/day and maximum
> number of records anticipated will be 200k. Periodic DELETE's will occur
> removing "old" records. Age is something the client sets and I have no way
> of saying 1 or 10k records will be removed.

The ~100 times/day are per table, I assume. Also, I assume DELETEs will probably remove batches (because the time criteria catch several records).

> Group B contains the other 2M tables. Maximum records ~140k and UPSERT will
> be the only mechanism used to populate and maintain. Periodic DELETE's may
> run on these tables as well removing "old" records.

So there will be inserts and updates. Either I missed it or you did not mention the criteria for placing a record in one of the 4M buckets. Can you shed light on what the criteria are? That would obviously suggest what indexing could be done.

Also, it would be interesting to see the results of your tests with btree on really large tables, as Stephen had suggested. I know it is not the primary test you want to do, but I would rather first explore the "traditional" schema before venturing into the unknown of the multi-million-dollar, pardon, multi-million-table schema.

Kind regards

--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can - without end}
http://blog.rubybestpractices.com/