Discussion: Thousands databases or schemas

Thousands databases or schemas

From
Denis
Date:
We have a web application where we create a schema or a database with a
number of tables in it for each customer. We now have about 2600 clients.
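
Roughly, the per-client provisioning looks like this (the schema, table and
column names below are made up, just to show the shape of it):

  psql -d app_db -c "CREATE SCHEMA client_1234"
  psql -d app_db -c "CREATE TABLE client_1234.orders (
      id         bigserial PRIMARY KEY,
      created_at timestamp NOT NULL DEFAULT now(),
      details    text
  )"
  # ...repeated for the remaining ~40 tables in the schema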

The problem we ran into using a separate DB for each client is that creating
a new DB can take up to 2 minutes, which is absolutely unacceptable. Using
schemas instead (one DB with a number of schemas containing similar tables)
solved this problem (schemas are created in a couple of seconds), but
created two other blocking points:
1. Sometimes creating a new table in a schema takes up to 5 seconds. When
we have to create up to 40 tables in a schema, this takes way too much
time.
2. "pg_dump -n schema_name db_name" takes from 30 to 60 seconds, no matter
how much data is in the schema. Even a dump of the table structure alone
takes at least 30 seconds (timed roughly as shown below). Based on this thread,
http://postgresql.1045698.n5.nabble.com/pg-dump-and-thousands-of-schemas-td5709766i60.html,
pg_dump always analyses ALL the tables in the DB, i.e. in my case more than
100 000 tables.
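
For reference, the timings above were taken roughly like this (schema_name
and db_name are placeholders):

  time pg_dump -n schema_name db_name > /dev/null
  time pg_dump -s -n schema_name db_name > /dev/null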

I know you guys will ask me about the choice of this particular application
architecture.
This architecture was chosen to ease the process of backing up and restoring
data and to isolate clients' data from each other. Sometimes clients ask us to
restore data for the last month or roll back to last week's state. This task
is easy to accomplish when the client's data is isolated in a schema/DB. If
we put all the clients' data in one table, operations of this kind will be
much harder to perform. We would have to restore a huge DB with enormously
large tables in it to find the requested data. Sometimes a client doesn't even
remember the exact date; he or she just says "I lost my data somewhere
between Tuesday and Friday last week" and I have to restore backups for
several days. If I have one huge table instead of small tables it will be a
nightmare!

Different clients have different activity rates, and we can select different
backup strategies accordingly. This would be impossible if we kept all the
clients' data in one table.
Besides all of the above, the probability of massive data corruption
(if an error occurs in our web application) is much higher.


P.S.
Not to start a holy war, but FYI: in a similar project where we used MySQL
we now have about 6000 DBs and everything works like a charm.





Re: Thousands databases or schemas

From
Samuel Gendler
Date:


On Thu, Nov 8, 2012 at 1:36 AM, Denis <socsam@gmail.com> wrote:

P.S.
Not to start a holywar, but FYI: in a similar project where we used MySQL
now we have about 6000 DBs and everything works like a charm.

You seem to have answered your own question here.  If my recollection of a previous discussion about many schemas and pg_dump performance is accurate, I suspect you are going to be told that you've got a data architecture that is fairly incompatible with postgresql's architecture and you've specifically ruled out a solution that would play to postgresql's strengths.


Re: Thousands databases or schemas

From
Denis
Date:
Samuel Gendler wrote
> On Thu, Nov 8, 2012 at 1:36 AM, Denis <socsam@gmail.com> wrote:
>
>>
>> P.S.
>> Not to start a holywar, but FYI: in a similar project where we used MySQL
>> now we have about 6000 DBs and everything works like a charm.
>>
>
> You seem to have answered your own question here.  If my recollection of a
> previous discussion about many schemas and pg_dump performance is
> accurate,
> I suspect you are going to be told that you've got a data architecture
> that
> is fairly incompatible with postgresql's architecture and you've
> specifically ruled out a solution that would play to postgresql's
> strengths.

Ok guys, it was not my intention to hurt anyone's feelings by mentioning
MySQL. Sorry about that. There simply was a project with a similar
architecture built using MySQL. When we started the current project, I
made the decision to give PostgreSQL a try. Now I see that the same
architecture is not applicable when PostgreSQL is used.

I would recommend refreshing the info at
http://wiki.postgresql.org/wiki/FAQ. There is a question "What is the
maximum size for a row, a table, and a database?". Please add information
there about the maximum number of DBs and the maximum number of tables one
DB can contain while PostgreSQL continues to work properly.

PS: the easiest solution in my case is to initially create 500 DBs (like
app_template_[0-500]) and create up to 500 schemas in each of them. This
allows for 250000 possible clients in total, which should be enough. The
question is: can you see any possible pitfalls in this solution?
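
A rough sketch of what I mean (the naming scheme and the routing arithmetic
are just an illustration):

  # pre-create the databases once (0..499 gives 500 of them)
  for i in $(seq 0 499); do
      createdb "app_template_$i"
  done

  # routing a client to its database/schema, e.g. client 123456:
  #   database: app_template_$((123456 / 500))  ->  app_template_246
  #   schema:   client_$((123456 % 500))        ->  client_456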





Re: Thousands databases or schemas

From
Pavel Stehule
Date:
Hello

2012/11/8 Denis <socsam@gmail.com>:
> Samuel Gendler wrote
>> On Thu, Nov 8, 2012 at 1:36 AM, Denis <socsam@gmail.com> wrote:
>>
>>>
>>> P.S.
>>> Not to start a holywar, but FYI: in a similar project where we used MySQL
>>> now we have about 6000 DBs and everything works like a charm.
>>>
>>
>> You seem to have answered your own question here.  If my recollection of a
>> previous discussion about many schemas and pg_dump performance is
>> accurate,
>> I suspect you are going to be told that you've got a data architecture
>> that
>> is fairly incompatible with postgresql's architecture and you've
>> specifically ruled out a solution that would play to postgresql's
>> strengths.
>
> Ok guys, it was not my intention to hurt anyone's feelings by mentioning
> MySQL. Sorry about that. There simply was a project with a similar
> architecture built using MySQL. When we started the current project, I have
> made a decision to give PostgreSQL a try. Now I see that the same
> architecture is not applicable if PostgreSQL is used.
>
> I would recommend you to refresh the info here
> http://wiki.postgresql.org/wiki/FAQ. There is a question "What is the
> maximum size for a row, a table, and a database?". Please add there info on
> maximum DBs number and tables number one DB can contain while PostgreSQL
> continues to work properly.
>
> PS: the easiest solution in my case is to create initially 500 DBs (like
> app_template_[0-500]) and create up to 500 schemas in each of it. This will
> make  250000 possible clients in total. This should be enough.  The question
> is: can you see the possible pitfalls of this solution?
>

we use about 2000 databases per warehouse - and it is working well, but
pg_dumpall doesn't work well in this environment. So we use different
backup methods.
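
Just to illustrate the idea (not our exact scripts, and the paths are
placeholders) - instead of pg_dumpall, loop over the databases and dump
each one separately:

  for db in $(psql -At -c "select datname from pg_database where not datistemplate"); do
      pg_dump -Fc "$db" > "/backup/$db.dump"
  done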

Regards

Pavel



Re: Thousands databases or schemas

From
Mark Thornton
Date:
On 08/11/12 09:36, Denis wrote:
> We have a web application where we create a schema or a database with a
> number of tables in it for each customer. Now we have about 2600 clients.
>
> The problem we met using a separate DB for each client is that the creation
> of new DB can take up to 2 minutes, that is absolutely unacceptable. Using
> schemes instead (one DB with a number of schemes containing similar tables
> in it) solved this problem (schemes are created in a couple of seconds), but
> created two other blocking points:
> 1. sometimes creation of a new table in a schema takes up to 5 seconds. In
> case when we have create up to 40 tables in a schema this takes way too much
> time.
> 2. "pg_dump -n schema_name db_name" takes from 30 to 60 seconds, no matter
> how big is the amount of data in the schema. Also, the dump of the tables
> structure only takes at least 30 seconds. Basing on this topic
> http://postgresql.1045698.n5.nabble.com/pg-dump-and-thousands-of-schemas-td5709766i60.html,
> pg_dump always analyses ALL the tables in the DB, i.e. in my case more than
> 100 000 tables.
The obvious solution would be to write your own version of pg_dump which
only examines the tables within a schema. You can even start with the
source of the standard pg_dump! Having your own pg_dump also opens another
option: eliminate the per-customer schemas/tables, add an extra 'customer'
key column to each table, and modify your pg_dump to dump only the parts of
each table matching a given customer id.
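
For the data side of such a per-customer dump you wouldn't even need to
patch pg_dump; a plain COPY with a query extracts one customer's rows
(the table and column names here are hypothetical):

  psql -d app_db -c "COPY (SELECT * FROM orders WHERE customer_id = 42) TO STDOUT" \
      > orders_customer_42.copy

The table definitions themselves would still come from pg_dump -s or from
your own DDL scripts.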

Mark Thornton



Re: Thousands databases or schemas

From
Craig Ringer
Date:
On 11/08/2012 09:29 PM, Denis wrote:
> Ok guys, it was not my intention to hurt anyone's feelings by mentioning
> MySQL. Sorry about that.
It's pretty silly to be upset by someone mentioning another DB product.
I wouldn't worry.
> There simply was a project with a similar
> architecture built using MySQL. When we started the current project, I have
> made a decision to give PostgreSQL a try.
It's certainly interesting that MySQL currently scales to much larger
table counts than PostgreSQL appears to.

I'd like to see if this can be improved down the track. Various people
are doing work on PostgreSQL scaling and performance, so with luck huge
table counts will come into play there. If nothing else, supporting
large table counts is important when dealing with very large amounts of
data in partitioned tables.

I think I saw mention of better performance with higher table counts in
9.3 in -hackers, too.

> I would recommend you to refresh the info here
> http://wiki.postgresql.org/wiki/FAQ. There is a question "What is the
> maximum size for a row, a table, and a database?". Please add there info on
> maximum DBs number and tables number one DB can contain while PostgreSQL
> continues to work properly.
Yeah, a number of people have been thrown by that. Technical limitations
aren't the same as practical limitations, and in some cases the
practical limitations are lower.

The trouble is: how do you put a number on it when the problem is a slow
and gradual drop in performance? And when one person's "performs
adequately" is another's "way too slow"?

--
Craig Ringer


Re: Thousands databases or schemas

From
Magnus Hagander
Date:
On Fri, Nov 9, 2012 at 7:15 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 11/08/2012 09:29 PM, Denis wrote:
>> Ok guys, it was not my intention to hurt anyone's feelings by mentioning
>> MySQL. Sorry about that.
> It's pretty silly to be upset by someone mentioning another DB product.
> I wouldn't worry.
>> There simply was a project with a similar
>> architecture built using MySQL. When we started the current project, I have
>> made a decision to give PostgreSQL a try.
> It's certainly interesting that MySQL currently scales to much larger
> table counts better than PostgreSQL appears to.
>
> I'd like to see if this can be improved down the track. Various people
> are doing work on PostgreSQL scaling and performance, so with luck huge
> table counts will come into play there. If nothing else, supporting
> large table counts is important when dealing with very large amounts of
> data in partitioned tables.
>
> I think I saw mention of better performance with higher table counts in
> 9.3 in -hackers, too.
>
>> I would recommend you to refresh the info here
>> http://wiki.postgresql.org/wiki/FAQ. There is a question "What is the
>> maximum size for a row, a table, and a database?". Please add there info on
>> maximum DBs number and tables number one DB can contain while PostgreSQL
>> continues to work properly.
> Yeah, a number of people have been thrown by that. Technical limitations
> aren't the same as practical limitations, and in some cases the
> practical limitations are lower.

Yes. And the fact is that PostgreSQL doesn't actually have a big
problem with this scenario; pg_dump does. It's mainly a tool issue,
not a database engine one. For example, pgAdmin used to have problems
even with a much smaller number of databases than that - I think that
is better in current releases, but I'm not 100% sure. And you can
imagine what a tool looks like that tries to graph per-database
values, for example, into a single graph when you have thousands of
databases.

PostgreSQL isn't perfect in these cases - as noted, just creating a new
schema can take slightly more than the usual millisecond. But it
works, and I've personally never come across a scenario where it's not
"good enough" (that doesn't mean such scenarios don't exist, of course).

pg_dump and pg_dumpall, however, do have issues, as noted elsewhere as well.

But as Pavel mentioned, and others certainly have before as well,
there are other ways to deal with backups that solve this problem. It
may not be perfect, but it gets you pretty far.

It may not be a match for the requirements in this case - but it's not
a general limitation of the db.

FWIW, have you looked into using proper backups with PITR and just
rolling forward through the transaction log with
pause_at_recovery_target set? That's an excellent way to handle the
"we lost data sometime between <x> and <y> but don't know when"
scenario when tracking it down. I've yet to find a case where that's not
easier than repeatedly restoring even a fairly small pg_dump based backup
to find it. Plus, it gives you much better granularity, which leads to
better results for your customer. *And* it takes away the dependency on
pg_dump's performance issues.
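
For illustration, the recovery.conf on a restored base backup would look
roughly like this (the paths and the timestamp are placeholders; the
parameter names are the 9.1/9.2 ones):

  # recovery.conf in the restored base backup's data directory
  restore_command = 'cp /mnt/backup/wal/%f %p'
  recovery_target_time = '2012-11-06 12:00:00'
  pause_at_recovery_target = true

Start the server, let it replay WAL up to that point and pause, inspect the
data, and resume with SELECT pg_xlog_replay_resume(); if you need to roll
further forward.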

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Thousands databases or schemas

From
Merlin Moncure
Date:
On Thu, Nov 8, 2012 at 3:36 AM, Denis <socsam@gmail.com> wrote:
> We have a web application where we create a schema or a database with a
> number of tables in it for each customer. Now we have about 2600 clients.
>
> The problem we met using a separate DB for each client is that the creation
> of new DB can take up to 2 minutes, that is absolutely unacceptable. Using
> schemes instead (one DB with a number of schemes containing similar tables
> in it) solved this problem (schemes are created in a couple of seconds), but
> created two other blocking points:

Sure: DB creation can be a bear, particularly on servers already under
load; it's i/o intensive.  I think you made the right choice: it's not
a good idea to create databases via user input (but if I *had* to do
that, I would probably pre-create them).

> 1. sometimes creation of a new table in a schema takes up to 5 seconds. In
> case when we have create up to 40 tables in a schema this takes way too much
> time.

How are you creating the tables?  What's the iowait on the server in
situations like this?  If the file system is binding you here, there's
not a lot you can do other than to try to pre-create the tables or to
improve i/o performance.
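
For example, watching I/O while the CREATE TABLE statements run usually
answers that (standard OS tools, nothing postgres specific):

  iostat -x 5    # look at %util and await on the data volume
  vmstat 5       # the 'wa' column is iowait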


> 2. "pg_dump -n schema_name db_name" takes from 30 to 60 seconds, no matter
> how big is the amount of data in the schema. Also, the dump of the tables
> structure only takes at least 30 seconds. Basing on this topic
> http://postgresql.1045698.n5.nabble.com/pg-dump-and-thousands-of-schemas-td5709766i60.html,
> pg_dump always analyses ALL the tables in the DB, i.e. in my case more than
> 100 000 tables.

That may be correct.  To prove it, try:

pg_dump -s -n schema_name db_name

where '-s' is the switch to dump only the schema.  If things are still
slow, try logging queries from pg_dump (maybe enable
log_min_duration_statement if you can) and maybe something turns up
that can be optimized.  Once you have the query, EXPLAIN ANALYZE it and
post it to the list. PostgreSQL has been continuously improving in
terms of bulk table handling over the years and there may be some
low-hanging fruit there.  There may even be some simple database tweaks
you can make to improve things without code changes.
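
Concretely, something along these lines (the logging setting goes in
postgresql.conf and needs a reload; this is just one way to do it):

  # in postgresql.conf (then reload): log_min_duration_statement = 0
  time pg_dump -s -n schema_name db_name > /dev/null
  # the server log now shows every query pg_dump ran, with timings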

Magnus already answered this pretty well, but I'd like to add: the
database engine scales pretty well to large numbers of tables, but in
particular cases the tools do not.  The reason for this is that the
engine deals with internal optimized structures while the tools have
to do everything over SQL.  That said, there may be some low-hanging
optimization fruit; it's a lot easier to hack on client-side tools
than on the backend.

> I know you guys will ask me about selecting this particular application
> architecture.

Sharding by schema is pretty common actually. 6000 schemas holding
tables is a lot -- so the question you should be getting is 'given the
current state of affairs, have you considered distributing your
clients across more than one server?'.  What if you suddenly sign
10,000 more clients?

> Different clients have different activity rate and we can select different
> backup strategies according to it. This would be impossible in case we keep
> all the clients data in one table.
> Besides all the above mentioned, the probability of massive data corruption
> (if an error in our web application occurs) is much higher.

Sure -- all large databases struggle with backups once the brute-force
dump starts to become impractical.  The way forward is to explore
various high availability options -- PITR, HS/SR, etc.  Distributing
the backup load across shards is also good, as long as you're rigorous
about not involving any shared structures.

> Not to start a holywar, but FYI: in a similar project where we used MySQL
> now we have about 6000 DBs and everything works like a charm.

No worries.  Postgres schemas are fancier than MySQL databases, and
this is one of those cases where extra features really do impact
performance :-).

merlin


Re: Thousands databases or schemas

From
Bruce Momjian
Date:
On Fri, Nov  9, 2012 at 02:15:45PM +0800, Craig Ringer wrote:
> On 11/08/2012 09:29 PM, Denis wrote:
> > Ok guys, it was not my intention to hurt anyone's feelings by mentioning
> > MySQL. Sorry about that.
> It's pretty silly to be upset by someone mentioning another DB product.
> I wouldn't worry.
> > There simply was a project with a similar
> > architecture built using MySQL. When we started the current project, I have
> > made a decision to give PostgreSQL a try.
> It's certainly interesting that MySQL currently scales to much larger
> table counts better than PostgreSQL appears to.
>
> I'd like to see if this can be improved down the track. Various people
> are doing work on PostgreSQL scaling and performance, so with luck huge
> table counts will come into play there. If nothing else, supporting
> large table counts is important when dealing with very large amounts of
> data in partitioned tables.
>
> I think I saw mention of better performance with higher table counts in
> 9.3 in -hackers, too.

Yes, 9.3 does much better at dumping/restoring databases with a large
number of tables.  I was testing this as part of pg_upgrade performance
improvements for large numbers of tables.  We have a few other things we
might try to improve for 9.3 related to caching, but that might not help
in this case.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +