Обсуждение: Partitioning / Clustering

От:
Alex Stapleton
Дата:

What is the status of Postgres support for any sort of multi-machine
scaling support? What are you meant to do once you've upgraded your
box and tuned the conf files as much as you can? But your query load
is just too high for a single machine?

Upgrading stock Dell boxes (I know we could be using better machines,
but I am trying to tackle the real issue) is not a hugely price
efficient way of getting extra performance, nor particularly scalable
in the long term.

So, when/is PG meant to be getting a decent partitioning system?
MySQL is getting one (eventually) which is apparently meant to be
similiar to Oracle's according to the docs. Clusgres does not appear
to be widely/or at all used, and info on it seems pretty thin on the
ground, so I am
not too keen on going with that. Is the real solution to multi-
machine partitioning (as in, not like MySQLs MERGE tables) on
PostgreSQL actually doing it in our application API? This seems like
a less than perfect solution once we want to add redundancy and
things into the mix.

От:
John A Meinel
Дата:

Alex Stapleton wrote:
> What is the status of Postgres support for any sort of multi-machine
> scaling support? What are you meant to do once you've upgraded your  box
> and tuned the conf files as much as you can? But your query load  is
> just too high for a single machine?
>
> Upgrading stock Dell boxes (I know we could be using better machines,
> but I am trying to tackle the real issue) is not a hugely price
> efficient way of getting extra performance, nor particularly scalable
> in the long term.

Switch from Dell Xeon boxes, and go to Opterons. :) Seriously, Dell is
far away from Big Iron. I don't know what performance you are looking
for, but you can easily get into inserting 10M rows/day with quality
hardware.

But actually is it your SELECT load that is too high, or your INSERT
load, or something inbetween.

Because Slony is around if it is a SELECT problem.
http://gborg.postgresql.org/project/slony1/projdisplay.php

Basically, Slony is a Master/Slave replication system. So if you have
INSERT going into the Master, you can have as many replicated slaves,
which can handle your SELECT load.
Slony is an asynchronous replicator, so there is a time delay from the
INSERT until it will show up on a slave, but that time could be pretty
small.

This would require some application level support, since an INSERT goes
to a different place than a SELECT. But there has been some discussion
about pg_pool being able to spread the query load, and having it be
aware of the difference between a SELECT and an INSERT and have it route
the query to the correct host. The biggest problem being that functions
could cause a SELECT func() to actually insert a row, which pg_pool
wouldn't know about. There are 2 possible solutions, a) don't do that
when you are using this system, b) add some sort of comment hint so that
pg_pool can understand that the select is actually an INSERT, and needs
to be done on the master.

>
> So, when/is PG meant to be getting a decent partitioning system?  MySQL
> is getting one (eventually) which is apparently meant to be  similiar to
> Oracle's according to the docs. Clusgres does not appear  to be
> widely/or at all used, and info on it seems pretty thin on the  ground,
> so I am
> not too keen on going with that. Is the real solution to multi- machine
> partitioning (as in, not like MySQLs MERGE tables) on  PostgreSQL
> actually doing it in our application API? This seems like  a less than
> perfect solution once we want to add redundancy and  things into the mix.

There is also PGCluster
http://pgfoundry.org/projects/pgcluster/

Which is trying to be more of a Synchronous multi-master system. I
haven't heard of Clusgres, so I'm guessing it is an older attempt, which
has been overtaken by pgcluster.

Just realize that clusters don't necessarily scale like you would want
them too. Because at some point you have to insert into the same table,
which means you need to hold a lock which prevents the other machine
from doing anything. And with synchronous replication, you have to wait
for all of the machines to get a copy of the data before you can say it
has been committed, which does *not* scale well with the number of machines.

If you can make it work, I think having a powerful master server, who
can finish an INSERT quickly, and then having a bunch of Slony slaves
with a middleman (like pg_pool) to do load balancing among them, is the
best way to scale up. There are still some requirements, like not having
to see the results of an INSERT instantly (though if you are using
hinting to pg_pool, you could hint that this query must be done on the
master, realizing that the more you do it, the more you slow everything
down).

John
=:->

PS> I don't know what functionality has been actually implemented in
pg_pool, just that it was discussed in the past. Slony-II is also in the
works.

От:
"Adam Haberlach"
Дата:

I think that perhaps he was trying to avoid having to buy "Big Iron" at all.

With all the Opteron v. Xeon around here, and talk of $30,000 machines,
perhaps it would be worth exploring the option of buying 10 cheapass
machines for $300 each.  At the moment, that $300 buys you, from Dell, a
2.5Ghz Pentium 4 w/ 256mb of RAM and a 40Gb hard drive and gigabit ethernet.
The aggregate CPU and bandwidth is pretty stupendous, but not as easy to
harness as a single machine.

For those of us looking at batch and data warehousing applications, it would
be really handy to be able to partition databases, tables, and processing
load across banks of cheap hardware.

Yes, clustering solutions can distribute the data, and can even do it on a
per-table basis in some cases.  This still leaves it up to the application's
logic to handle reunification of the data.

Ideas:
    1. Create a table/storage type that consists of a select statement
on another machine.  While I don't think the current executor is capable of
working on multiple nodes of an execution tree at the same time, it would be
great if it could offload a select of tuples from a remote table to an
entirely different server and merge the resulting data into the current
execution.  I believe MySQL has this, and Oracle may implement it in another
way.

    2. There is no #2 at this time, but I'm sure one can be
hypothesized.

...Google and other companies have definitely proved that one can harness
huge clusters of cheap hardware.  It can't be _that_ hard, can it.  :)


-----Original Message-----
From: 
[mailto:] On Behalf Of John A Meinel
Sent: Tuesday, May 10, 2005 7:41 AM
To: Alex Stapleton
Cc: 
Subject: Re: [PERFORM] Partitioning / Clustering

Alex Stapleton wrote:
> What is the status of Postgres support for any sort of multi-machine
> scaling support? What are you meant to do once you've upgraded your
> box and tuned the conf files as much as you can? But your query load
> is just too high for a single machine?
>
> Upgrading stock Dell boxes (I know we could be using better machines,
> but I am trying to tackle the real issue) is not a hugely price
> efficient way of getting extra performance, nor particularly scalable
> in the long term.

Switch from Dell Xeon boxes, and go to Opterons. :) Seriously, Dell is far
away from Big Iron. I don't know what performance you are looking for, but
you can easily get into inserting 10M rows/day with quality hardware.

But actually is it your SELECT load that is too high, or your INSERT load,
or something inbetween.

Because Slony is around if it is a SELECT problem.
http://gborg.postgresql.org/project/slony1/projdisplay.php

Basically, Slony is a Master/Slave replication system. So if you have INSERT
going into the Master, you can have as many replicated slaves, which can
handle your SELECT load.
Slony is an asynchronous replicator, so there is a time delay from the
INSERT until it will show up on a slave, but that time could be pretty
small.

This would require some application level support, since an INSERT goes to a
different place than a SELECT. But there has been some discussion about
pg_pool being able to spread the query load, and having it be aware of the
difference between a SELECT and an INSERT and have it route the query to the
correct host. The biggest problem being that functions could cause a SELECT
func() to actually insert a row, which pg_pool wouldn't know about. There
are 2 possible solutions, a) don't do that when you are using this system,
b) add some sort of comment hint so that pg_pool can understand that the
select is actually an INSERT, and needs to be done on the master.

>
> So, when/is PG meant to be getting a decent partitioning system?
> MySQL is getting one (eventually) which is apparently meant to be
> similiar to Oracle's according to the docs. Clusgres does not appear
> to be widely/or at all used, and info on it seems pretty thin on the
> ground, so I am not too keen on going with that. Is the real solution
> to multi- machine partitioning (as in, not like MySQLs MERGE tables)
> on  PostgreSQL actually doing it in our application API? This seems
> like  a less than perfect solution once we want to add redundancy and
> things into the mix.

There is also PGCluster
http://pgfoundry.org/projects/pgcluster/

Which is trying to be more of a Synchronous multi-master system. I haven't
heard of Clusgres, so I'm guessing it is an older attempt, which has been
overtaken by pgcluster.

Just realize that clusters don't necessarily scale like you would want them
too. Because at some point you have to insert into the same table, which
means you need to hold a lock which prevents the other machine from doing
anything. And with synchronous replication, you have to wait for all of the
machines to get a copy of the data before you can say it has been committed,
which does *not* scale well with the number of machines.

If you can make it work, I think having a powerful master server, who can
finish an INSERT quickly, and then having a bunch of Slony slaves with a
middleman (like pg_pool) to do load balancing among them, is the best way to
scale up. There are still some requirements, like not having to see the
results of an INSERT instantly (though if you are using hinting to pg_pool,
you could hint that this query must be done on the master, realizing that
the more you do it, the more you slow everything down).

John
=:->

PS> I don't know what functionality has been actually implemented in
pg_pool, just that it was discussed in the past. Slony-II is also in the
works.


От:
Alex Stapleton
Дата:

On 10 May 2005, at 15:41, John A Meinel wrote:

> Alex Stapleton wrote:
>
>> What is the status of Postgres support for any sort of multi-machine
>> scaling support? What are you meant to do once you've upgraded
>> your  box
>> and tuned the conf files as much as you can? But your query load  is
>> just too high for a single machine?
>>
>> Upgrading stock Dell boxes (I know we could be using better machines,
>> but I am trying to tackle the real issue) is not a hugely price
>> efficient way of getting extra performance, nor particularly scalable
>> in the long term.
>>
>
> Switch from Dell Xeon boxes, and go to Opterons. :) Seriously, Dell is
> far away from Big Iron. I don't know what performance you are looking
> for, but you can easily get into inserting 10M rows/day with quality
> hardware.

Better hardware = More Efficient != More Scalable

> But actually is it your SELECT load that is too high, or your INSERT
> load, or something inbetween.
>
> Because Slony is around if it is a SELECT problem.
> http://gborg.postgresql.org/project/slony1/projdisplay.php
>
> Basically, Slony is a Master/Slave replication system. So if you have
> INSERT going into the Master, you can have as many replicated slaves,
> which can handle your SELECT load.
> Slony is an asynchronous replicator, so there is a time delay from the
> INSERT until it will show up on a slave, but that time could be pretty
> small.

<snip>

>
>>
>> So, when/is PG meant to be getting a decent partitioning system?
>> MySQL
>> is getting one (eventually) which is apparently meant to be
>> similiar to
>> Oracle's according to the docs. Clusgres does not appear  to be
>> widely/or at all used, and info on it seems pretty thin on the
>> ground,
>> so I am
>> not too keen on going with that. Is the real solution to multi-
>> machine
>> partitioning (as in, not like MySQLs MERGE tables) on  PostgreSQL
>> actually doing it in our application API? This seems like  a less
>> than
>> perfect solution once we want to add redundancy and  things into
>> the mix.
>>
>
> There is also PGCluster
> http://pgfoundry.org/projects/pgcluster/
>
> Which is trying to be more of a Synchronous multi-master system. I
> haven't heard of Clusgres, so I'm guessing it is an older attempt,
> which
> has been overtaken by pgcluster.
>
> Just realize that clusters don't necessarily scale like you would want
> them too. Because at some point you have to insert into the same
> table,
> which means you need to hold a lock which prevents the other machine
> from doing anything. And with synchronous replication, you have to
> wait
> for all of the machines to get a copy of the data before you can
> say it
> has been committed, which does *not* scale well with the number of
> machines.

This is why I mention partitioning. It solves this issue by storing
different data sets on different machines under the same schema.
These seperate chunks of the table can then be replicated as well for
data redundancy and so on. MySQL are working on these things, but PG
just has a bunch of third party extensions, I wonder why these are
not being integrated into the main trunk :/ Thanks for pointing me to
PGCluster though. It looks like it should be better than Slony at least.


От:
Alex Stapleton
Дата:

On 10 May 2005, at 16:02, Adam Haberlach wrote:

>
> I think that perhaps he was trying to avoid having to buy "Big
> Iron" at all.

You would be right. Although we are not against paying a bit more
than $300 for a server ;)

> With all the Opteron v. Xeon around here, and talk of $30,000
> machines,
> perhaps it would be worth exploring the option of buying 10 cheapass
> machines for $300 each.  At the moment, that $300 buys you, from
> Dell, a
> 2.5Ghz Pentium 4 w/ 256mb of RAM and a 40Gb hard drive and gigabit
> ethernet.
> The aggregate CPU and bandwidth is pretty stupendous, but not as
> easy to
> harness as a single machine.

<snip>

> Yes, clustering solutions can distribute the data, and can even do
> it on a
> per-table basis in some cases.  This still leaves it up to the
> application's
> logic to handle reunification of the data.

If your going to be programming that sort of logic into your API in
the beginning, it's not too much more work to add basic replication,
load balancing and partitioning into it either. But the DB should be
able to do it for you, adding that stuff in later is often more
difficult and less likely to get done.

> Ideas:
>     1. Create a table/storage type that consists of a select statement
> on another machine.  While I don't think the current executor is
> capable of
> working on multiple nodes of an execution tree at the same time, it
> would be
> great if it could offload a select of tuples from a remote table to an
> entirely different server and merge the resulting data into the
> current
> execution.  I believe MySQL has this, and Oracle may implement it
> in another
> way.

MySQL sort of has this, it's not as good as Oracle's though.
Apparently there is a much better version of it in 5.1 though, that
should make it to stable sometime next year I imagine.

>     2. There is no #2 at this time, but I'm sure one can be
> hypothesized.

I would of thought a particularly smart version of pg_pool could do
it. It could partition data to different servers if it knew which
columns to key by on each table.

> ...Google and other companies have definitely proved that one can
> harness
> huge clusters of cheap hardware.  It can't be _that_ hard, can it.  :)

I shudder to think how much the "Big Iron" equivalent of a google
data-center would cost.

> -----Original Message-----
> From: 
> [mailto:] On Behalf Of John A
> Meinel
> Sent: Tuesday, May 10, 2005 7:41 AM
> To: Alex Stapleton
> Cc: 
> Subject: Re: [PERFORM] Partitioning / Clustering
>
> Alex Stapleton wrote:
>
>> What is the status of Postgres support for any sort of multi-machine
>> scaling support? What are you meant to do once you've upgraded your
>> box and tuned the conf files as much as you can? But your query load
>> is just too high for a single machine?
>>
>> Upgrading stock Dell boxes (I know we could be using better machines,
>> but I am trying to tackle the real issue) is not a hugely price
>> efficient way of getting extra performance, nor particularly scalable
>> in the long term.
>>
>
> Switch from Dell Xeon boxes, and go to Opterons. :) Seriously, Dell
> is far
> away from Big Iron. I don't know what performance you are looking
> for, but
> you can easily get into inserting 10M rows/day with quality hardware.
>
> But actually is it your SELECT load that is too high, or your
> INSERT load,
> or something inbetween.
>
> Because Slony is around if it is a SELECT problem.
> http://gborg.postgresql.org/project/slony1/projdisplay.php
>
> Basically, Slony is a Master/Slave replication system. So if you
> have INSERT
> going into the Master, you can have as many replicated slaves,
> which can
> handle your SELECT load.
> Slony is an asynchronous replicator, so there is a time delay from the
> INSERT until it will show up on a slave, but that time could be pretty
> small.
>
> This would require some application level support, since an INSERT
> goes to a
> different place than a SELECT. But there has been some discussion
> about
> pg_pool being able to spread the query load, and having it be aware
> of the
> difference between a SELECT and an INSERT and have it route the
> query to the
> correct host. The biggest problem being that functions could cause
> a SELECT
> func() to actually insert a row, which pg_pool wouldn't know about.
> There
> are 2 possible solutions, a) don't do that when you are using this
> system,
> b) add some sort of comment hint so that pg_pool can understand
> that the
> select is actually an INSERT, and needs to be done on the master.
>
>
>>
>> So, when/is PG meant to be getting a decent partitioning system?
>> MySQL is getting one (eventually) which is apparently meant to be
>> similiar to Oracle's according to the docs. Clusgres does not appear
>> to be widely/or at all used, and info on it seems pretty thin on the
>> ground, so I am not too keen on going with that. Is the real solution
>> to multi- machine partitioning (as in, not like MySQLs MERGE tables)
>> on  PostgreSQL actually doing it in our application API? This seems
>> like  a less than perfect solution once we want to add redundancy and
>> things into the mix.
>>
>
> There is also PGCluster
> http://pgfoundry.org/projects/pgcluster/
>
> Which is trying to be more of a Synchronous multi-master system. I
> haven't
> heard of Clusgres, so I'm guessing it is an older attempt, which
> has been
> overtaken by pgcluster.
>
> Just realize that clusters don't necessarily scale like you would
> want them
> too. Because at some point you have to insert into the same table,
> which
> means you need to hold a lock which prevents the other machine from
> doing
> anything. And with synchronous replication, you have to wait for
> all of the
> machines to get a copy of the data before you can say it has been
> committed,
> which does *not* scale well with the number of machines.
>
> If you can make it work, I think having a powerful master server,
> who can
> finish an INSERT quickly, and then having a bunch of Slony slaves
> with a
> middleman (like pg_pool) to do load balancing among them, is the
> best way to
> scale up. There are still some requirements, like not having to see
> the
> results of an INSERT instantly (though if you are using hinting to
> pg_pool,
> you could hint that this query must be done on the master,
> realizing that
> the more you do it, the more you slow everything down).
>
> John
> =:->
>
> PS> I don't know what functionality has been actually implemented in
> pg_pool, just that it was discussed in the past. Slony-II is also
> in the
> works.
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to
> )
>
>


От:
Richard_D_Levine@raytheon.com
Дата:

> exploring the option of buying 10 cheapass
> machines for $300 each.  At the moment, that $300 buys you, from Dell, a
> 2.5Ghz Pentium 4

Buy cheaper ass Dells with an AMD 64 3000+.  Beats the crap out of the 2.5
GHz Pentium, especially for PostgreSQL.

See the thread "Whence the Opterons" for more....

Rick

 wrote on 05/10/2005 10:02:50 AM:

>
> I think that perhaps he was trying to avoid having to buy "Big Iron" at
all.
>
> With all the Opteron v. Xeon around here, and talk of $30,000 machines,
> perhaps it would be worth exploring the option of buying 10 cheapass
> machines for $300 each.  At the moment, that $300 buys you, from Dell, a
> 2.5Ghz Pentium 4 w/ 256mb of RAM and a 40Gb hard drive and gigabit
ethernet.
> The aggregate CPU and bandwidth is pretty stupendous, but not as easy to
> harness as a single machine.
>
> For those of us looking at batch and data warehousing applications, it
would
> be really handy to be able to partition databases, tables, and processing
> load across banks of cheap hardware.
>
> Yes, clustering solutions can distribute the data, and can even do it on
a
> per-table basis in some cases.  This still leaves it up to the
application's
> logic to handle reunification of the data.
>
> Ideas:
>    1. Create a table/storage type that consists of a select statement
> on another machine.  While I don't think the current executor is capable
of
> working on multiple nodes of an execution tree at the same time, it would
be
> great if it could offload a select of tuples from a remote table to an
> entirely different server and merge the resulting data into the current
> execution.  I believe MySQL has this, and Oracle may implement it in
another
> way.
>
>    2. There is no #2 at this time, but I'm sure one can be
> hypothesized.
>
> ...Google and other companies have definitely proved that one can harness
> huge clusters of cheap hardware.  It can't be _that_ hard, can it.  :)
>
>
> -----Original Message-----
> From: 
> [mailto:] On Behalf Of John A
Meinel
> Sent: Tuesday, May 10, 2005 7:41 AM
> To: Alex Stapleton
> Cc: 
> Subject: Re: [PERFORM] Partitioning / Clustering
>
> Alex Stapleton wrote:
> > What is the status of Postgres support for any sort of multi-machine
> > scaling support? What are you meant to do once you've upgraded your
> > box and tuned the conf files as much as you can? But your query load
> > is just too high for a single machine?
> >
> > Upgrading stock Dell boxes (I know we could be using better machines,
> > but I am trying to tackle the real issue) is not a hugely price
> > efficient way of getting extra performance, nor particularly scalable
> > in the long term.
>
> Switch from Dell Xeon boxes, and go to Opterons. :) Seriously, Dell is
far
> away from Big Iron. I don't know what performance you are looking for,
but
> you can easily get into inserting 10M rows/day with quality hardware.
>
> But actually is it your SELECT load that is too high, or your INSERT
load,
> or something inbetween.
>
> Because Slony is around if it is a SELECT problem.
> http://gborg.postgresql.org/project/slony1/projdisplay.php
>
> Basically, Slony is a Master/Slave replication system. So if you have
INSERT
> going into the Master, you can have as many replicated slaves, which can
> handle your SELECT load.
> Slony is an asynchronous replicator, so there is a time delay from the
> INSERT until it will show up on a slave, but that time could be pretty
> small.
>
> This would require some application level support, since an INSERT goes
to a
> different place than a SELECT. But there has been some discussion about
> pg_pool being able to spread the query load, and having it be aware of
the
> difference between a SELECT and an INSERT and have it route the query to
the
> correct host. The biggest problem being that functions could cause a
SELECT
> func() to actually insert a row, which pg_pool wouldn't know about. There
> are 2 possible solutions, a) don't do that when you are using this
system,
> b) add some sort of comment hint so that pg_pool can understand that the
> select is actually an INSERT, and needs to be done on the master.
>
> >
> > So, when/is PG meant to be getting a decent partitioning system?
> > MySQL is getting one (eventually) which is apparently meant to be
> > similiar to Oracle's according to the docs. Clusgres does not appear
> > to be widely/or at all used, and info on it seems pretty thin on the
> > ground, so I am not too keen on going with that. Is the real solution
> > to multi- machine partitioning (as in, not like MySQLs MERGE tables)
> > on  PostgreSQL actually doing it in our application API? This seems
> > like  a less than perfect solution once we want to add redundancy and
> > things into the mix.
>
> There is also PGCluster
> http://pgfoundry.org/projects/pgcluster/
>
> Which is trying to be more of a Synchronous multi-master system. I
haven't
> heard of Clusgres, so I'm guessing it is an older attempt, which has been
> overtaken by pgcluster.
>
> Just realize that clusters don't necessarily scale like you would want
them
> too. Because at some point you have to insert into the same table, which
> means you need to hold a lock which prevents the other machine from doing
> anything. And with synchronous replication, you have to wait for all of
the
> machines to get a copy of the data before you can say it has been
committed,
> which does *not* scale well with the number of machines.
>
> If you can make it work, I think having a powerful master server, who can
> finish an INSERT quickly, and then having a bunch of Slony slaves with a
> middleman (like pg_pool) to do load balancing among them, is the best way
to
> scale up. There are still some requirements, like not having to see the
> results of an INSERT instantly (though if you are using hinting to
pg_pool,
> you could hint that this query must be done on the master, realizing that
> the more you do it, the more you slow everything down).
>
> John
> =:->
>
> PS> I don't know what functionality has been actually implemented in
> pg_pool, just that it was discussed in the past. Slony-II is also in the
> works.
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to )


От:
John A Meinel
Дата:

Adam Haberlach wrote:
> I think that perhaps he was trying to avoid having to buy "Big Iron" at all.
>
> With all the Opteron v. Xeon around here, and talk of $30,000 machines,
> perhaps it would be worth exploring the option of buying 10 cheapass
> machines for $300 each.  At the moment, that $300 buys you, from Dell, a
> 2.5Ghz Pentium 4 w/ 256mb of RAM and a 40Gb hard drive and gigabit ethernet.
> The aggregate CPU and bandwidth is pretty stupendous, but not as easy to
> harness as a single machine.
>
> For those of us looking at batch and data warehousing applications, it would
> be really handy to be able to partition databases, tables, and processing
> load across banks of cheap hardware.
>
> Yes, clustering solutions can distribute the data, and can even do it on a
> per-table basis in some cases.  This still leaves it up to the application's
> logic to handle reunification of the data.

Sure. A lot of this is application dependent, though. For instance
foreign key constraints. In a general cluster solution, you would allow
foreign keys across partitions. I have a feeling this would be extra
slow, and hard to do correctly. Multi-machine transactions are also a
difficulty, since WAL now has to take into account all machines, and you
have to wait for fsync on all of them.

I'm not sure how Oracle does it, but these things seem like they prevent
clustering from really scaling very well.

>
> Ideas:
>     1. Create a table/storage type that consists of a select statement
> on another machine.  While I don't think the current executor is capable of
> working on multiple nodes of an execution tree at the same time, it would be
> great if it could offload a select of tuples from a remote table to an
> entirely different server and merge the resulting data into the current
> execution.  I believe MySQL has this, and Oracle may implement it in another
> way.
>
>     2. There is no #2 at this time, but I'm sure one can be
> hypothesized.
>
> ...Google and other companies have definitely proved that one can harness
> huge clusters of cheap hardware.  It can't be _that_ hard, can it.  :)

Again, it depends on the application. A generic database with lots of
cross reference integrity checking does not work on a cluster very well.
A very distributed db where you don't worry about cross references does
scale. Google made a point of making their application work in a
distributed manner.

In the other post he mentions that pg_pool could naturally split out the
rows into different machines based on partitioning, etc. I would argue
that it is more of a custom pool daemon based on the overall
application. Because you have to start dealing with things like
cross-machine joins. Who handles that? the pool daemon has to, since it
is the only thing that talks to both tables. I think you could certainly
write a reasonably simple application specific daemon where all of the
clients send their queries to, and it figures out where they need to go,
and aggregates them as necessary. But a fully generic one is *not*
simple at all, and I think is far out of the scope of something like
pg_pool.

I'm guessing that PGCluster is looking at working on that, and it might
be true that pg_pool is thinking about it. But just thinking about the
very simple query:

SELECT row1, row2 FROM table1_on_machine_a NATURAL JOIN table2_on_machine_b
WHERE restrict_table_1 AND restrict_table_2
AND restrict_1_based_on_2;

This needs to be broken into something like:

SELECT row1 FROM table1_on_machine_a
    WHERE restrict_table_1
    ORDER BY join_column;
SELECT row2 FROM table2_on_machine_b
    WHERE restrict_table_2
    ORDER BY join_column;

Then these rows need to be merge_joined, and the restrict_1_based_on_2
needs to be applied.
This is in no way trivial, and I think it is outside the scope of
pg_pool. Now maybe if you restrict yourself so that each query stays
within one machine you can make it work. Or write your own app to handle
some of this transparently for the clients. But I would expect to make
the problem feasible, it would not be a generic solution.

Maybe I'm off base, I don't really keep track of pg_pool/PGCluster/etc.
But I can see that the problem is very difficult. Not at the very least,
this is a simple query. And it doesn't even do optimizations. You might
actually prefer the above to be done with a Nestloop style, where
table_1 is selected, and then for each row you do a single index select
on table_2. But how is your app going to know that? It has to have the
statistics from the backend databases. And if it has to place an extra
query to get those statistics, you just hurt your scalability even more.
Whereas big-iron already has all the statistics, and can optimize the
query plan.

Perhaps pg_cluster will handle this, by maintaining full statistics
across the cluster on each machine, so that more optimal queries can be
performed. I don't really know.

John
=:->

От:
PFC
Дата:


> SELECT row1, row2 FROM table1_on_machine_a NATURAL JOIN
> table2_on_machine_b
> WHERE restrict_table_1 AND restrict_table_2
> AND restrict_1_based_on_2;

    I don't think that's ever going to be efficient...
    What would be efficient would be, for instance, a Join of a part of a
table against another part of another table which both happen to be on the
same machine, because the partitioning was done with this in mind (ie. for
instance partitioning on client_id and keeping the information for each
client on the same machine).

    You could build your smart pool daemon in pl/pgsql and use dblink ! At
least you have the query parser built-in.

    I wonder how Oracle does it ;)

От:
Josh Berkus
Дата:

Alex,

> This is why I mention partitioning. It solves this issue by storing  
> different data sets on different machines under the same schema.  

That's clustering, actually.  Partitioning is simply dividing up a table into
chunks and using the chunks intelligently.   Putting those chunks on seperate
machines is another thing entirely.

We're working on partitioning through the Bizgres sub-project:
www.bizgres.org  / http://pgfoundry.org/projects/bizgres/
... and will be pushing it to the main PostgreSQL when we have something.

I invite you to join the mailing list.

> These seperate chunks of the table can then be replicated as well for  
> data redundancy and so on. MySQL are working on these things,

Don't hold your breath.   MySQL, to judge by their first "clustering"
implementation, has a *long* way to go before they have anything usable.  In
fact, at OSCON their engineers were asking Jan Wieck for advice.

If you have $$$ to shell out, my employer (GreenPlum) has a multi-machine
distributed version of PostgreSQL.  It's proprietary, though.
www.greenplum.com.

If you have more time than money, I understand that Stanford is working on
this problem:
http://www-db.stanford.edu/~bawa/

But, overall, some people on this list are very mistaken in thinking it's an
easy problem.   GP has devoted something like 5 engineers for 3 years to
develop their system.  Oracle spent over $100 million to develop RAC.

> but PG  
> just has a bunch of third party extensions, I wonder why these are  
> not being integrated into the main trunk :/

Because it represents a host of complex functionality which is not applicable
to most users?  Because there are 4 types of replication and 3 kinds of
clusering and not all users want the same kind?

--
Josh Berkus
Aglio Database Solutions
San Francisco

От:
Mischa Sandberg
Дата:

Quoting :

> > exploring the option of buying 10 cheapass
> > machines for $300 each.  At the moment, that $300 buys you, from
> Dell, a
> > 2.5Ghz Pentium 4
>
> Buy cheaper ass Dells with an AMD 64 3000+.  Beats the crap out of
> the 2.5
> GHz Pentium, especially for PostgreSQL.

Whence "Dells with an AMD 64" ?? Perhaps you skimmed:

  http://www.thestreet.com/tech/kcswanson/10150604.html
or
  http://www.eweek.com/article2/0,1759,1553822,00.asp





От:
"Jim C. Nasby"
Дата:

On Tue, May 10, 2005 at 07:29:59PM +0200, PFC wrote:
>     I wonder how Oracle does it ;)

Oracle *clustering* demands shared storage. So you've shifted your money
from big-iron CPUs to big-iron disk arrays.

Oracle replication works similar to Slony, though it supports a lot more
modes (ie: syncronous).
--
Jim C. Nasby, Database Consultant               
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

От:
Mischa Sandberg
Дата:

Quoting Alex Stapleton <>:

> This is why I mention partitioning. It solves this issue by storing
> different data sets on different machines under the same schema.
> These seperate chunks of the table can then be replicated as well for
> data redundancy and so on. MySQL are working on these things, but PG
> just has a bunch of third party extensions, I wonder why these are
> not being integrated into the main trunk :/ Thanks for pointing me to
> PGCluster though. It looks like it should be better than Slony at
> least.

Across a decade or two of projects, including creating a federated
database engine for Simba, I've become rather dubious of horizontal
partitions (across disks or servers), either to improve performance, or
just to scale up and not lose performance. [[The one exception is for
<emphasis> non-time-critical read-only</emphasis> systems, with
Slony-style replication.]]

The most successful high-volume systems I've seen have broken up
databases functionally, like a pipeline, where different applications
use different sections of the pipe.

The highest-volume system I've worked on is Acxiom's gigantic
data-cleansing system. This is the central clearinghouse for every scrap
of demographic that can be associated with some North American,
somewhere. Think of D&B for 300M people (some dead). The volumes are
just beyond belief, for both updates and queries. At Acxiom, the
datasets are so large, even after partitioning, that they just
constantly cycle them through memory, and commands are executes in
convoys --- sort of like riding a paternoster.
..........
Anybody been tracking on what Mr Stonebraker's been up to, lately?
Datastream management. Check it out. Like replication, everybody
hand-rolled their own datastream systems until finally somebody else
generalized it well enough that it didn't have to be built from scratch
every time.

Datastream systems require practically no locking, let alone distributed
transactions. They give you some really strong guarantees on transaction
elapsed-time and throughput.
.......
Where is this all leading? Well, for scaling data like this, the one
feature that you need is the ability of procedures/rules on one server
to perform queries/procedures on another. MSSQL has linked servers and
(blech) OpenQuery. This lets you do reasonably-efficient work when you
only deal with one table at a time. Do NOT try anything fancy with
multi-table joins; timeouts are unavoidable, and painful.

Postgres has a natural advantage in such a distributed server system:
all table/index stats are openly available through the SQL interface,
for one server to make rational query plans involving another server's
resources. God! I would have killed for that when I was writing a
federated SQL engine; the kluges you need to do this at arms-length from
that information are true pain.

So where should I go look, to see what's been done so far, on a Postgres
that can treat another PG server as a new table type?



От:
"Jim C. Nasby"
Дата:

On Tue, May 10, 2005 at 02:55:55PM -0700, Mischa Sandberg wrote:
> just beyond belief, for both updates and queries. At Acxiom, the
> datasets are so large, even after partitioning, that they just
> constantly cycle them through memory, and commands are executes in
> convoys --- sort of like riding a paternoster.

Speaking of which... what's the status of the patch that allows seqscans
to piggyback on already running seqscans on the same table?

> So where should I go look, to see what's been done so far, on a Postgres
> that can treat another PG server as a new table type?

To the best of my knowledge no such work has been done. There is a
project (who's name escapes me) that lets you run queries against a
remote postgresql server from a postgresql connection to a different
server, which could serve as the basis for what you're proposing.

BTW, given your experience, you might want to check out Bizgres.
(http://pgfoundry.org/projects/bizgres/) I'm sure your insights would be
most welcome.
--
Jim C. Nasby, Database Consultant               
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

От:
Christopher Kings-Lynne
Дата:

> This is why I mention partitioning. It solves this issue by storing
> different data sets on different machines under the same schema.  These
> seperate chunks of the table can then be replicated as well for  data
> redundancy and so on. MySQL are working on these things

*laff*

Yeah, like they've been working on views for the last 5 years, and still
haven't released them :D :D :D

Chris

От:
Mischa Sandberg
Дата:

Quoting Christopher Kings-Lynne <>:

> > This is why I mention partitioning. It solves this issue by storing
> > different data sets on different machines under the same schema.
> > These seperate chunks of the table can then be replicated as well for
> > data redundancy and so on. MySQL are working on these things
> *laff*
> Yeah, like they've been working on views for the last 5 years, and
> still haven't released them :D :D :D

?
http://dev.mysql.com/doc/mysql/en/create-view.html
...for MySQL 5.0.1+ ?


От:
"Joshua D. Drake"
Дата:

Mischa Sandberg wrote:
> Quoting Christopher Kings-Lynne <>:
>
>
>>>This is why I mention partitioning. It solves this issue by storing
>>>different data sets on different machines under the same schema.
>>>These seperate chunks of the table can then be replicated as well for
>>>data redundancy and so on. MySQL are working on these things
>>
>>*laff*
>>Yeah, like they've been working on views for the last 5 years, and
>>still haven't released them :D :D :D
>
>
> ?
> http://dev.mysql.com/doc/mysql/en/create-view.html
> ...for MySQL 5.0.1+ ?

Yes but MySQL 5 isn't out yet (considered stable).

Sincerely,

Joshua D. Drake



>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
>                http://archives.postgresql.org


От:
Christopher Kings-Lynne
Дата:

>>*laff*
>>Yeah, like they've been working on views for the last 5 years, and
>>still haven't released them :D :D :D
>
>
> ?
> http://dev.mysql.com/doc/mysql/en/create-view.html
> ...for MySQL 5.0.1+ ?

Give me a call when it's RELEASED.

Chris

От:
Mischa Sandberg
Дата:

Quoting Christopher Kings-Lynne <>:

>
> >>*laff*
> >>Yeah, like they've been working on views for the last 5 years, and
> >>still haven't released them :D :D :D
> >
> > ?
> > http://dev.mysql.com/doc/mysql/en/create-view.html
> > ...for MySQL 5.0.1+ ?
>
> Give me a call when it's RELEASED.


:-) Touche'



От:
Mischa Sandberg
Дата:

Quoting "Jim C. Nasby" <>:

> To the best of my knowledge no such work has been done. There is a
> project (who's name escapes me) that lets you run queries against a
> remote postgresql server from a postgresql connection to a different
> server, which could serve as the basis for what you're proposing.

Okay, if the following looks right to the powerthatbe, I'd like to start
a project. Here's the proposition:

"servername.dbname.schema.object" would change RangeVar, which would
affect much code. "dbname.schema.object" itself is not implemented in
8.0. So, simplicity dictates something like:

table pg_remote(schemaname text, connectby text, remoteschema text)

The pg_statistic info from a remote server cannot be cached in local
pg_statistic, without inventing pseudo reloids as well as a
pseudoschema. Probably cleaner to cache it somewhere else. I'm still
reading down the path that puts pg_statistic data where costsize can get
at it.

First step: find out whether one can link libpq.so to postmaster :-)


От:
Neil Conway
Дата:

Josh Berkus wrote:
> Don't hold your breath.   MySQL, to judge by their first "clustering"
> implementation, has a *long* way to go before they have anything usable.

Oh? What's wrong with MySQL's clustering implementation?

-Neil

От:
"Joshua D. Drake"
Дата:

Neil Conway wrote:
> Josh Berkus wrote:
>
>> Don't hold your breath.   MySQL, to judge by their first "clustering"
>> implementation, has a *long* way to go before they have anything usable.
>
>
> Oh? What's wrong with MySQL's clustering implementation?

Ram only tables :)

>
> -Neil
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
>      subscribe-nomail command to  so that your
>      message can get through to the mailing list cleanly


От:
Bruno Wolff III
Дата:

On Tue, May 10, 2005 at 08:02:50 -0700,
  Adam Haberlach <> wrote:
>
>
> With all the Opteron v. Xeon around here, and talk of $30,000 machines,
> perhaps it would be worth exploring the option of buying 10 cheapass
> machines for $300 each.  At the moment, that $300 buys you, from Dell, a
> 2.5Ghz Pentium 4 w/ 256mb of RAM and a 40Gb hard drive and gigabit ethernet.
> The aggregate CPU and bandwidth is pretty stupendous, but not as easy to
> harness as a single machine.

That isn't going to be ECC ram. I don't think you really want to use
non-ECC ram in a critical database.

От:
Neil Conway
Дата:

Joshua D. Drake wrote:
> Neil Conway wrote:
>> Oh? What's wrong with MySQL's clustering implementation?
>
> Ram only tables :)

Sure, but that hardly makes it not "usable". Considering the price of
RAM these days, having enough RAM to hold the database (distributed over
the entire cluster) is perfectly acceptable for quite a few people.

(Another deficiency is in 4.0, predicates in queries would not be pushed
down to storage nodes -- so you had to stream the *entire* table over
the network, and then apply the WHERE clause at the frontend query node.
That is fixed in 5.0, though.)

-Neil

От:
Josh Berkus
Дата:

Neil,

> Sure, but that hardly makes it not "usable". Considering the price of
> RAM these days, having enough RAM to hold the database (distributed over
> the entire cluster) is perfectly acceptable for quite a few people.

The other problem, as I was told it at OSCON, was that these were not
high-availability clusters; it's impossible to add a server to an existing
cluster, and a server going down is liable to take the whole cluster down.
Mind you, I've not tried that aspect of it myself; once I saw the ram-only
rule, we switched to something else.

--
Josh Berkus
Aglio Database Solutions
San Francisco

От:
Simon Riggs
Дата:

On Tue, 2005-05-10 at 11:03 +0100, Alex Stapleton wrote:
> So, when/is PG meant to be getting a decent partitioning system?

ISTM that your question seems to confuse where code comes from. Without
meaning to pick on you, or reply rudely, I'd like to explore that
question. Perhaps it should be a FAQ entry.

All code is written by someone, and those people need to eat. Some
people are fully or partly funded to perform their tasks on this project
(coding, patching, etc). Others contribute their time for a variety of
reasons where involvement has a positive benefit.

You should ask these questions:
- Is anyone currently working on (Feature X)?
- If not, Can I do it myself?
- If not, and I still want it, can I fund someone else to build it for
me?

Asking "when is Feature X going to happen" is almost certainly going to
get the answer "never" otherwise, if the initial development is large
and complex. There are many TODO items that have lain untouched for
years, even though adding the feature has been discussed and agreed.

Best Regards, Simon Riggs



От:
Neil Conway
Дата:

Josh Berkus wrote:
> The other problem, as I was told it at OSCON, was that these were not
> high-availability clusters; it's impossible to add a server to an existing
> cluster

Yeah, that's a pretty significant problem.

> a server going down is liable to take the whole cluster down.

That's news to me. Do you have more information on this?

-Neil

От:
"David Roussel"
Дата:

For an interesting look at scalability, clustering, caching, etc for a
large site have a look at how livejournal did it.
http://www.danga.com/words/2004_lisa/lisa04.pdf

They have 2.6 Million active users, posting 200 new blog entries per
minute, plus many comments and countless page views.

Although this system is of a different sort to the type I work on it's
interesting to see how they've made it scale.

They use mysql on dell hardware! And found single master replication did
not scale.  There's a section on multimaster replication, not sure if
they use it.  The main approach they use is to parition users into
spefic database clusters.  Caching is done using memcached at the
application level to avoid hitting the db for rendered pageviews.

It's interesting that the solution livejournal have arrived at is quite
similar in ways to the way google is set up.

David

От:
Alex Stapleton
Дата:

On 11 May 2005, at 08:16, Simon Riggs wrote:

> On Tue, 2005-05-10 at 11:03 +0100, Alex Stapleton wrote:
>
>> So, when/is PG meant to be getting a decent partitioning system?
>>
>
> ISTM that your question seems to confuse where code comes from.
> Without
> meaning to pick on you, or reply rudely, I'd like to explore that
> question. Perhaps it should be a FAQ entry.
>
> All code is written by someone, and those people need to eat. Some
> people are fully or partly funded to perform their tasks on this
> project
> (coding, patching, etc). Others contribute their time for a variety of
> reasons where involvement has a positive benefit.
>
> You should ask these questions:
> - Is anyone currently working on (Feature X)?
> - If not, Can I do it myself?
> - If not, and I still want it, can I fund someone else to build it for
> me?
>
> Asking "when is Feature X going to happen" is almost certainly
> going to
> get the answer "never" otherwise, if the initial development is large
> and complex. There are many TODO items that have lain untouched for
> years, even though adding the feature has been discussed and agreed.
>
> Best Regards, Simon Riggs
>

Acceptable Answers to 'So, when/is PG meant to be getting a decent
partitioning system?':

     1. Person X is working on it I believe.
     2. It's on the list, but nobody has done anything about it yet
     3. Your welcome to take a stab at it, I expect the community
would support your efforts as well.
     4. If you have a huge pile of money you could probably buy the
Moon. Thinking along those lines, you can probably pay someone to
write it for you.
     5. It's a stupid idea, and it's never going to work, and heres
why..............

Unacceptable Answers to the same question:

     1. Yours.

Be more helpful, and less arrogant please. Everyone else who has
contributed to this thread has been very helpful in clarifying the
state of affairs and pointing out what work is and isn't being done,
and alternatives to just waiting for PG do it for you.

>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 8: explain analyze is your friend
>
>


От:
Alex Stapleton
Дата:

On 11 May 2005, at 08:57, David Roussel wrote:

> For an interesting look at scalability, clustering, caching, etc for a
> large site have a look at how livejournal did it.
> http://www.danga.com/words/2004_lisa/lisa04.pdf

I have implemented similar systems in the past, it's a pretty good
technique, unfortunately it's not very "Plug-and-Play" as you have to
base most of your API on memcached (I imagine MySQLs NDB tables might
work as well actually) for it to work well.

> They have 2.6 Million active users, posting 200 new blog entries per
> minute, plus many comments and countless page views.
>
> Although this system is of a different sort to the type I work on it's
> interesting to see how they've made it scale.
>
> They use mysql on dell hardware! And found single master
> replication did
> not scale.  There's a section on multimaster replication, not sure if
> they use it.  The main approach they use is to parition users into
> spefic database clusters.  Caching is done using memcached at the
> application level to avoid hitting the db for rendered pageviews

I don't think they are storing pre-rendered pages (or bits of) in
memcached, but are principally storing the data for the pages in it.
Gluing pages together is not a hugely intensive process usually :)
The only problem with memcached is that the clients clustering/
partitioning system will probably break if a node dies, and probably
get confused if you add new nodes onto it as well. Easily extensible
clustering (no complete redistribution of data required when you add/
remove nodes) with the data distributed across nodes seems to be
nothing but a pipe dream right now.

> It's interesting that the solution livejournal have arrived at is
> quite
> similar in ways to the way google is set up.

Don't Google use indexing servers which keep track of where data is?
So that you only need to update them when you add or move data,
deletes don't even have to be propagated among indexes immediately
really because you'll find out if data isn't there when you visit
where it should be. Or am I talking crap?

> David
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 8: explain analyze is your friend
>
>


От:
Alex Stapleton
Дата:

On 11 May 2005, at 09:50, Alex Stapleton wrote:

>
> On 11 May 2005, at 08:57, David Roussel wrote:
>
>
>> For an interesting look at scalability, clustering, caching, etc
>> for a
>> large site have a look at how livejournal did it.
>> http://www.danga.com/words/2004_lisa/lisa04.pdf
>>
>
> I have implemented similar systems in the past, it's a pretty good
> technique, unfortunately it's not very "Plug-and-Play" as you have
> to base most of your API on memcached (I imagine MySQLs NDB tables
> might work as well actually) for it to work well.
>
>
>> They have 2.6 Million active users, posting 200 new blog entries per
>> minute, plus many comments and countless page views.
>>
>> Although this system is of a different sort to the type I work on
>> it's
>> interesting to see how they've made it scale.
>>
>> They use mysql on dell hardware! And found single master
>> replication did
>> not scale.  There's a section on multimaster replication, not sure if
>> they use it.  The main approach they use is to parition users into
>> spefic database clusters.  Caching is done using memcached at the
>> application level to avoid hitting the db for rendered pageviews
>>
>
> I don't think they are storing pre-rendered pages (or bits of) in
> memcached, but are principally storing the data for the pages in
> it. Gluing pages together is not a hugely intensive process usually :)
> The only problem with memcached is that the clients clustering/
> partitioning system will probably break if a node dies, and
> probably get confused if you add new nodes onto it as well. Easily
> extensible clustering (no complete redistribution of data required
> when you add/remove nodes) with the data distributed across nodes
> seems to be nothing but a pipe dream right now.
>
>
>> It's interesting that the solution livejournal have arrived at is
>> quite
>> similar in ways to the way google is set up.
>>
>
> Don't Google use indexing servers which keep track of where data
> is? So that you only need to update them when you add or move data,
> deletes don't even have to be propagated among indexes immediately
> really because you'll find out if data isn't there when you visit
> where it should be. Or am I talking crap?

That will teach me to RTFA first ;) Ok so LJ maintain an index of
which cluster each user is on, kinda of like google do :)

>
>> David
>>
>> ---------------------------(end of
>> broadcast)---------------------------
>> TIP 8: explain analyze is your friend
>>
>>
>>
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>
>


От:
Christopher Kings-Lynne
Дата:

> Acceptable Answers to 'So, when/is PG meant to be getting a decent
> partitioning system?':
>
>     1. Person X is working on it I believe.
>     2. It's on the list, but nobody has done anything about it yet
>     3. Your welcome to take a stab at it, I expect the community  would
> support your efforts as well.
>     4. If you have a huge pile of money you could probably buy the
> Moon. Thinking along those lines, you can probably pay someone to  write
> it for you.
>     5. It's a stupid idea, and it's never going to work, and heres
> why..............
>
> Unacceptable Answers to the same question:
>
>     1. Yours.
>
> Be more helpful, and less arrogant please. Everyone else who has
> contributed to this thread has been very helpful in clarifying the
> state of affairs and pointing out what work is and isn't being done,
> and alternatives to just waiting for PG do it for you.

Please YOU be more helpful and less arrogant.  I thought your inital
email was arrogant, demanding and insulting.  Your followup email has
done nothing to dispel my impression.  Simon (one of PostgreSQL's major
contributors AND one of the very few people working on partitioning in
PostgreSQL, as you requested) told you all the reasons clearly and politely.

Chris

От:
Tom Lane
Дата:

Mischa Sandberg <> writes:
> So, simplicity dictates something like:

> table pg_remote(schemaname text, connectby text, remoteschema text)

Previous discussion of this sort of thing concluded that we wanted to
follow the SQL-MED standard.

            regards, tom lane

От:
Josh Berkus
Дата:

David,

> It's interesting that the solution livejournal have arrived at is quite
> similar in ways to the way google is set up.

Yes, although again, they're using memcached as pseudo-clustering software,
and as a result are limited to what fits in RAM (RAM on 27 machines, but it's
still RAM).  And due to limitations on memcached, the whole thing blows
whenever a server goes out (the memcached project is working on this).  But
any LJ user could tell you that it's a low-availability system.

However, memcached (and for us, pg_memcached) is an excellent way to improve
horizontal scalability by taking disposable data (like session information)
out of the database and putting it in protected RAM.  On some websites,
adding memcached can result is as much as a 60% decrease in database traffic.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

От:
Greg Stark
Дата:

Alex Stapleton <> writes:

> Acceptable Answers to 'So, when/is PG meant to be getting a decent
> partitioning system?':
...
>      3. Your welcome to take a stab at it, I expect the community  would
> support your efforts as well.

As long as we're being curt all around, this one's not acceptable on the basis
that it's not grammatical.

--
greg

От:
Simon Riggs
Дата:

On Wed, 2005-05-11 at 17:13 +0800, Christopher Kings-Lynne wrote:
> > Alex Stapleton wrote
> > Be more helpful, and less arrogant please.
>
> Simon told you all the reasons clearly and politely.

Thanks Chris for your comments.

PostgreSQL can always do with one more developer and my sole intent was
to encourage Alex and other readers to act themselves. If my words seem
arrogant, then I apologise to any and all that think so.

Best Regards, Simon Riggs



От:
"Jim C. Nasby"
Дата:

On Wed, May 11, 2005 at 08:57:57AM +0100, David Roussel wrote:
> For an interesting look at scalability, clustering, caching, etc for a
> large site have a look at how livejournal did it.
> http://www.danga.com/words/2004_lisa/lisa04.pdf
>
> They have 2.6 Million active users, posting 200 new blog entries per
> minute, plus many comments and countless page views.

Neither of which is that horribly impressive. 200 TPM is less than 4TPS.
While I haven't run high transaction rate databases under PostgreSQL, I
suspect others who have will say that 4TPS isn't that big of a deal.

> Although this system is of a different sort to the type I work on it's
> interesting to see how they've made it scale.
>
> They use mysql on dell hardware! And found single master replication did
> not scale.  There's a section on multimaster replication, not sure if
Probably didn't scale because they used to use MyISAM.

> they use it.  The main approach they use is to parition users into
> spefic database clusters.  Caching is done using memcached at the
Which means they've got a huge amount of additional code complexity, not
to mention how many times you can't post something because 'that cluster
is down for maintenance'.

> application level to avoid hitting the db for rendered pageviews.
Memcached is about the only good thing I've seen come out of
livejournal.

> It's interesting that the solution livejournal have arrived at is quite
> similar in ways to the way google is set up.

Except that unlike LJ, google stays up and it's fast. Though granted, LJ
is quite a bit faster than it was 6 months ago.
--
Jim C. Nasby, Database Consultant               
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

От:
PFC
Дата:


> However, memcached (and for us, pg_memcached) is an excellent way to
> improve
> horizontal scalability by taking disposable data (like session
> information)
> out of the database and putting it in protected RAM.

    So, what is the advantage of such a system versus, say, a "sticky
sessions" system where each session is assigned to ONE application server
(not PHP then) which keeps it in RAM as native objects instead of
serializing and deserializing it on each request ?
    I'd say the sticky sessions should perform a lot better, and if one
machine dies, only the sessions on this one are lost.
    But of course you can't do it with PHP as you need an app server which
can manage sessions. Potentially the savings are huge, though.

    On Google, their distributed system spans a huge number of PCs and it has
redundancy, ie. individual PC failure is a normal thing and is a part of
the system, it is handled gracefully. I read a paper on this matter, it's
pretty impressive. The google filesystem has nothing to do with databases
though, it's more a massive data store / streaming storage.

От:
Alex Stapleton
Дата:

On 11 May 2005, at 23:35, PFC wrote:

>
>
>
>> However, memcached (and for us, pg_memcached) is an excellent way
>> to improve
>> horizontal scalability by taking disposable data (like session
>> information)
>> out of the database and putting it in protected RAM.
>>
>
>     So, what is the advantage of such a system versus, say, a
> "sticky sessions" system where each session is assigned to ONE
> application server (not PHP then) which keeps it in RAM as native
> objects instead of serializing and deserializing it on each request ?
>     I'd say the sticky sessions should perform a lot better, and if
> one machine dies, only the sessions on this one are lost.
>     But of course you can't do it with PHP as you need an app
> server which can manage sessions. Potentially the savings are huge,
> though.

Theres no reason it couldn't be done with PHP to be fair as long as
you could ensure that the client was always routed back to the same
machines. Which has it's own set of issues entirely. I am not
entirely sure that memcached actually does serialize data when it's
comitted into memcached either, although I could be wrong, I have not
looked at the source. Certainly if you can ensure that a client
always goes back to the same machine you can simplify the whole thing
hugely. It's generally not that easy though, you need a proxy server
of some description capable of understanding the HTTP traffic and
maintaining a central session lookup table to redirect with. Which
isn't really solving the problem so much as moving it somewhere else.
Instead of needing huge memcached pools, you need hardcore
loadbalancers. Load Balancers tend to cost $$$$$ in comparison.
Distributed sticky sessions are a rather nice idea, I would like to
hear a way of implementing them cheaply (and on PHP) as well. I may
have to give that some thought in fact. Oh yeah, and load balancers
software often sucks in annoying (if not always important) ways.

>     On Google, their distributed system spans a huge number of PCs
> and it has redundancy, ie. individual PC failure is a normal thing
> and is a part of the system, it is handled gracefully. I read a
> paper on this matter, it's pretty impressive. The google filesystem
> has nothing to do with databases though, it's more a massive data
> store / streaming storage.
>

Since when did Massive Data stores have nothing to do with DBs? Isn't
Oracle Cluster entirely based on forming an enormous scalable disk
array to store your DB on?


От:
PFC
Дата:

> machines. Which has it's own set of issues entirely. I am not entirely
> sure that memcached actually does serialize data when it's comitted into

    I think it does, ie. it's a simple mapping of [string key] => [string
value].

> memcached either, although I could be wrong, I have not looked at the
> source. Certainly if you can ensure that a client always goes back to
> the same machine you can simplify the whole thing hugely. It's generally
> not that easy though, you need a proxy server of some description
> capable of understanding the HTTP traffic and maintaining a central

    Yes...
    You could implement it by mapping servers to the hash of the user session
id.
    Statistically, the servers would get the same numbers of sessions on each
of them, but you have to trust statistics...
    It does eliminate the lookup table though.

> idea, I would like to hear a way of implementing them cheaply (and on
> PHP) as well. I may have to give that some thought in fact. Oh yeah, and
> load balancers software often sucks in annoying (if not always
> important) ways.

    You can use lighttpd as a load balancer, I believe it has a stick
sessions plugin (or you could code one in, it's open source after all). It
definitely support simple round-robin load balancing, acting as a proxy to
any number of independent servers.


>> matter, it's pretty impressive. The google filesystem has nothing to do
>> with databases though, it's more a massive data store / streaming
>> storage.
>
> Since when did Massive Data stores have nothing to do with DBs? Isn't
> Oracle Cluster entirely based on forming an enormous scalable disk array
> to store your DB on?

    Um, well, the Google Filesystem is (like its name implies) a filesystem
designed to store huge files in a distributed and redundant manner. Files
are structured as a stream of records (which are themselves big in size)
and it's designed to support appending records to these stream files
efficiently and without worrying about locking.

    It has no querying features however, that is why I said it was not a
database.

    I wish I could find the whitepaper, I think the URL was on this list some
day, maybe it's on Google's site ?

От:
Alex Turner
Дата:

Having local sessions is unnesesary, and here is my logic:

Generaly most people have less than 100Mb of bandwidth to the internet.

If you make the assertion that you are transferring equal or less
session data between your session server (lets say an RDBMS) and the
app server than you are between the app server and the client, an out
of band 100Mb network for session information is plenty of bandwidth.
This also represents OLTP style traffic, which postgresql is pretty
good at.  You should easily be able to get over 100Tps.  100 hits per
second is an awful lot of traffic, more than any website I've managed
will ever see.

Why solve the complicated clustered sessions problem, when you don't
really need to?

Alex Turner
netEconomist

On 5/11/05, PFC <> wrote:
>
>
> > However, memcached (and for us, pg_memcached) is an excellent way to
> > improve
> > horizontal scalability by taking disposable data (like session
> > information)
> > out of the database and putting it in protected RAM.
>
>         So, what is the advantage of such a system versus, say, a "sticky
> sessions" system where each session is assigned to ONE application server
> (not PHP then) which keeps it in RAM as native objects instead of
> serializing and deserializing it on each request ?
>         I'd say the sticky sessions should perform a lot better, and if one
> machine dies, only the sessions on this one are lost.
>         But of course you can't do it with PHP as you need an app server which
> can manage sessions. Potentially the savings are huge, though.
>
>         On Google, their distributed system spans a huge number of PCs and it has
> redundancy, ie. individual PC failure is a normal thing and is a part of
> the system, it is handled gracefully. I read a paper on this matter, it's
> pretty impressive. The google filesystem has nothing to do with databases
> though, it's more a massive data store / streaming storage.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to 
>

От:
Alex Stapleton
Дата:

On 12 May 2005, at 15:08, Alex Turner wrote:

> Having local sessions is unnesesary, and here is my logic:
>
> Generaly most people have less than 100Mb of bandwidth to the
> internet.
>
> If you make the assertion that you are transferring equal or less
> session data between your session server (lets say an RDBMS) and the
> app server than you are between the app server and the client, an out
> of band 100Mb network for session information is plenty of bandwidth.
> This also represents OLTP style traffic, which postgresql is pretty
> good at.  You should easily be able to get over 100Tps.  100 hits per
> second is an awful lot of traffic, more than any website I've managed
> will ever see.
>
> Why solve the complicated clustered sessions problem, when you don't
> really need to?

100 hits a second = 8,640,000 hits a day. I work on a site which does
 > 100 million dynamic pages a day. In comparison Yahoo probably does
 > 100,000,000,000 (100 billion) views a day
  if I am interpreting Alexa's charts correctly. Which is about
1,150,000 a second.

Now considering the site I work on is not even in the top 1000 on
Alexa, theres a lot of sites out there which need to solve this
problem I would assume.

There are also only so many hash table lookups a single machine can
do, even if its a Quad Opteron behemoth.


> Alex Turner
> netEconomist
>
> On 5/11/05, PFC <> wrote:
>
>>
>>
>>
>>> However, memcached (and for us, pg_memcached) is an excellent way to
>>> improve
>>> horizontal scalability by taking disposable data (like session
>>> information)
>>> out of the database and putting it in protected RAM.
>>>
>>
>>         So, what is the advantage of such a system versus, say, a
>> "sticky
>> sessions" system where each session is assigned to ONE application
>> server
>> (not PHP then) which keeps it in RAM as native objects instead of
>> serializing and deserializing it on each request ?
>>         I'd say the sticky sessions should perform a lot better,
>> and if one
>> machine dies, only the sessions on this one are lost.
>>         But of course you can't do it with PHP as you need an app
>> server which
>> can manage sessions. Potentially the savings are huge, though.
>>
>>         On Google, their distributed system spans a huge number of
>> PCs and it has
>> redundancy, ie. individual PC failure is a normal thing and is a
>> part of
>> the system, it is handled gracefully. I read a paper on this
>> matter, it's
>> pretty impressive. The google filesystem has nothing to do with
>> databases
>> though, it's more a massive data store / streaming storage.
>>
>> ---------------------------(end of
>> broadcast)---------------------------
>> TIP 1: subscribe and unsubscribe commands go to
>> 
>>
>>
>
>


От:
Alex Turner
Дата:

Ok - my common sense alarm is going off here...

There are only 6.446 billion people worldwide.  100 Billion page views
would require every person in the world to view 18 pages of yahoo
every day.  Not very likely.

http://www.internetworldstats.com/stats.htm
suggests that there are around 1 billion people actualy on the internet.

That means each and every person on the internet has to view 100 pages
per day of yahoo.

pretty unlikely IMHO.  I for one don't even use Yahoo ;)

100 million page views per day suggests that 1 in 100 people on the
internet each viewed 10 pages of a site.  Thats a pretty high
percentage if you ask me.

If I visit 20 web sites in a day, and see an average of 10 pages per
site. that means only about 2000 or so sites generate 100 million page
views in a day or better.

100 million pageviews averages to 1157/sec, which we'll double for
peak load to 2314.

I can easily see a system doing 2314 hash lookups per second.  Hell I
wrote a system that could do a thousand times that four years ago on a
single 1Ghz Athlon.  Heck - you can get 2314 lookups/sec on a 486 ;)

Given that session information doesn't _have_ to persist to storage,
and can be kept in RAM.  A single server could readily manage session
information for even very large sites (of course over a million
concurrent users could really start chewing into RAM, but if you are
Yahoo, you can probably afford a box with 100GB of RAM ;).

We get over 1000 tps on a dual opteron with a couple of mid size RAID
arrays on 10k discs with fsync on for small transactions.  I'm sure
that could easily be bettered with a few more dollars.

Maybe my number are off, but somehow it doesn't seem like that many
people need a highly complex session solution to me.

Alex Turner
netEconomist

On 5/12/05, Alex Stapleton <> wrote:
>
> On 12 May 2005, at 15:08, Alex Turner wrote:
>
> > Having local sessions is unnesesary, and here is my logic:
> >
> > Generaly most people have less than 100Mb of bandwidth to the
> > internet.
> >
> > If you make the assertion that you are transferring equal or less
> > session data between your session server (lets say an RDBMS) and the
> > app server than you are between the app server and the client, an out
> > of band 100Mb network for session information is plenty of bandwidth.
> > This also represents OLTP style traffic, which postgresql is pretty
> > good at.  You should easily be able to get over 100Tps.  100 hits per
> > second is an awful lot of traffic, more than any website I've managed
> > will ever see.
> >
> > Why solve the complicated clustered sessions problem, when you don't
> > really need to?
>
> 100 hits a second = 8,640,000 hits a day. I work on a site which does
>  > 100 million dynamic pages a day. In comparison Yahoo probably does
>  > 100,000,000,000 (100 billion) views a day
>   if I am interpreting Alexa's charts correctly. Which is about
> 1,150,000 a second.
>
> Now considering the site I work on is not even in the top 1000 on
> Alexa, theres a lot of sites out there which need to solve this
> problem I would assume.
>
> There are also only so many hash table lookups a single machine can
> do, even if its a Quad Opteron behemoth.
>
>
> > Alex Turner
> > netEconomist
> >
> > On 5/11/05, PFC <> wrote:
> >
> >>
> >>
> >>
> >>> However, memcached (and for us, pg_memcached) is an excellent way to
> >>> improve
> >>> horizontal scalability by taking disposable data (like session
> >>> information)
> >>> out of the database and putting it in protected RAM.
> >>>
> >>
> >>         So, what is the advantage of such a system versus, say, a
> >> "sticky
> >> sessions" system where each session is assigned to ONE application
> >> server
> >> (not PHP then) which keeps it in RAM as native objects instead of
> >> serializing and deserializing it on each request ?
> >>         I'd say the sticky sessions should perform a lot better,
> >> and if one
> >> machine dies, only the sessions on this one are lost.
> >>         But of course you can't do it with PHP as you need an app
> >> server which
> >> can manage sessions. Potentially the savings are huge, though.
> >>
> >>         On Google, their distributed system spans a huge number of
> >> PCs and it has
> >> redundancy, ie. individual PC failure is a normal thing and is a
> >> part of
> >> the system, it is handled gracefully. I read a paper on this
> >> matter, it's
> >> pretty impressive. The google filesystem has nothing to do with
> >> databases
> >> though, it's more a massive data store / streaming storage.
> >>
> >> ---------------------------(end of
> >> broadcast)---------------------------
> >> TIP 1: subscribe and unsubscribe commands go to
> >> 
> >>
> >>
> >
> >
>
>

От:
John A Meinel
Дата:

Alex Turner wrote:
> Ok - my common sense alarm is going off here...
>
> There are only 6.446 billion people worldwide.  100 Billion page views
> would require every person in the world to view 18 pages of yahoo
> every day.  Not very likely.
>
> http://www.internetworldstats.com/stats.htm
> suggests that there are around 1 billion people actualy on the internet.
>
> That means each and every person on the internet has to view 100 pages
> per day of yahoo.
>
> pretty unlikely IMHO.  I for one don't even use Yahoo ;)
>
> 100 million page views per day suggests that 1 in 100 people on the
> internet each viewed 10 pages of a site.  Thats a pretty high
> percentage if you ask me.

In general I think your point is valid. Just remember that it probably
also matters how you count page views. Because technically images are a
separate page (and this thread did discuss serving up images). So if
there are 20 graphics on a specific page, that is 20 server hits just
for that one page.

I could easily see an image heavy site getting 100 hits / page. Which
starts meaning that if 1M users hit 10 pages, then you get 1M*10*100 = 1G.

I still think 100G views on a single website is a lot, but 100M is
certainly possible.

John
=:->

От:
PFC
Дата:

> 100 hits a second = 8,640,000 hits a day. I work on a site which does  >
> 100 million dynamic pages a day. In comparison Yahoo probably does  >
> 100,000,000,000 (100 billion) views a day
>   if I am interpreting Alexa's charts correctly. Which is about
> 1,150,000 a second.


Read the help on Alexa's site... ;)

От:
Josh Berkus
Дата:

People,

> In general I think your point is valid. Just remember that it probably
> also matters how you count page views. Because technically images are a
> separate page (and this thread did discuss serving up images). So if
> there are 20 graphics on a specific page, that is 20 server hits just
> for that one page.

Also, there's bots and screen-scrapers and RSS, web e-mails, and web services
and many other things which create hits but are not "people".  I'm currently
working on clickstream for a site which is nowhere in the top 100, and is
getting 3 million real hits a day ... and we know for a fact that at least
1/4 of that is bots.

Regardless, the strategy you should be employing for a high traffic site is
that if your users hit the database for anything other than direct
interaction (like filling out a webform) then you're lost.    Use memcached,
squid, lighttpd caching, ASP.NET caching, pools, etc.   Keep the load off the
database except for the stuff that only the database can do.

--
Josh Berkus
Aglio Database Solutions
San Francisco

От:
Alex Stapleton
Дата:

On 12 May 2005, at 18:33, Josh Berkus wrote:

> People,
>
>
>> In general I think your point is valid. Just remember that it
>> probably
>> also matters how you count page views. Because technically images
>> are a
>> separate page (and this thread did discuss serving up images). So if
>> there are 20 graphics on a specific page, that is 20 server hits just
>> for that one page.
>>
>
> Also, there's bots and screen-scrapers and RSS, web e-mails, and
> web services
> and many other things which create hits but are not "people".  I'm
> currently
> working on clickstream for a site which is nowhere in the top 100,
> and is
> getting 3 million real hits a day ... and we know for a fact that
> at least
> 1/4 of that is bots.

I doubt bots are generally Alexa toolbar enabled.

> Regardless, the strategy you should be employing for a high traffic
> site is
> that if your users hit the database for anything other than direct
> interaction (like filling out a webform) then you're lost.    Use
> memcached,
> squid, lighttpd caching, ASP.NET caching, pools, etc.   Keep the
> load off the
> database except for the stuff that only the database can do.

This is the aproach I would take as well. There is no point storing
stuff in a DB, if your only doing direct lookups on it and it isn't
the sort of data that you care so much about the integrity of.


> --
> Josh Berkus
> Aglio Database Solutions
> San Francisco
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 8: explain analyze is your friend
>
>


От:
Josh Berkus
Дата:

Ross,

> Memcached is a PG memory store, I gather,

Nope.  It's a hyperfast resident-in-memory hash that allows you to stash stuff
like user session information and even materialized query set results.
Thanks to SeanC, we even have a plugin, pgmemcached.

> but...what is squid, lighttpd?
> anything directly PG-related?

No.   These are all related to making the web server do more.   The idea is
NOT to hit the database every time you have to serve up a web page, and
possibly not to hit the web server either.  For example, you can use squid 3
for "reverse" caching in front of your web server, and serve far more page
views than you could with Apache alone.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

От:
PFC
Дата:


> If you make the assertion that you are transferring equal or less
> session data between your session server (lets say an RDBMS) and the
> app server than you are between the app server and the client, an out
> of band 100Mb network for session information is plenty of bandwidth.

    So if you count on a mean page size of 6-8 kbytes gzipped, that will
prevent you from caching the N first results of the Big Slow Search Query
in a native object in the user session state (say, a list of integers
indicating which rows match), so you will have to redo the Big Slow Search
Query everytime the user clicks on Next Page instead of grabbing a set of
cached row id's and doing a fast SELECT WHERE id IN ...
    This is the worst case ... I'd gzip() the row id's and stuff them in the
session, that's always better than blowing up the database with the Big
Slow Search Query everytime someone does Next Page...

> This also represents OLTP style traffic, which postgresql is pretty
> good at.  You should easily be able to get over 100Tps.  100 hits per
> second is an awful lot of traffic, more than any website I've managed
> will ever see.

    On the latest anandtech benchmarks, 100 hits per second on a blog/forum
software is a big bi-opteron server running dotNET, at 99% load... it's a
lot if you count only dynamic page hits.