Discussion: Horizontal Write Scaling


Horizontal Write Scaling

From:
Eliot Gable
Date:
I know there has been a lot of talk about replication getting built into Postgres and I know of many projects that aim to fill the role. However, I have not seen much in the way of a serious attempt at multi-master write scaling. I understand the fundamental problem with write scaling across multiple nodes is Disk I/O and inter-node communication latency and that in the conventional synchronous, multi-master replication type setup you would be limited to the speed of the slowest node, plus the communication protocol overhead and latency. However, it occurs to me that if you had a shared disk system via either iSCSI, Fiber Channel, NFS, or whatever (which also had higher I/O capabilities than a single server could utilize), if you used a file system that supported locks on a particular section (extent) of a file, it should theoretically be possible for multiple Postgres instances on multiple systems sharing the database to read and write to the database without causing corruption. Obviously, it would take some carefully designed code in terms of file extent locking, but it seems like it would also eliminate the need entirely for any type of replication system. Lustre seems to support the type of file locking required. Of course, I am assuming the disk system would be RAID 1, RAID 10, RAID 5, or RAID 6 for reliability purposes and that it is sufficiently redundant that you don't have to worry about an outage of your storage system.
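The extent locking being proposed can be illustrated with POSIX advisory byte-range locks, which let multiple processes lock disjoint regions of one file. This is only a local-host sketch: a cluster filesystem such as Lustre would have to enforce equivalent range-lock semantics across every node sharing the storage. The helper names and the use of an 8 KB "block" as the lock granularity are illustrative assumptions, not anything PostgreSQL does today.

```python
import fcntl
import os

BLOCK_SIZE = 8192  # PostgreSQL's default page size


def lock_extent(fd, block_no, exclusive=True):
    """Take an advisory lock covering one 8 KB block of an open file.

    fcntl.lockf locks a byte range given a length, a start offset, and a
    whence; here the range is exactly one block. On a local filesystem
    this only coordinates processes on a single host.
    """
    cmd = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    fcntl.lockf(fd, cmd, BLOCK_SIZE, block_no * BLOCK_SIZE, os.SEEK_SET)


def unlock_extent(fd, block_no):
    """Release the advisory lock on one block."""
    fcntl.lockf(fd, fcntl.LOCK_UN, BLOCK_SIZE, block_no * BLOCK_SIZE, os.SEEK_SET)
```

Two cooperating writers would bracket every block write with `lock_extent` / `unlock_extent`; an exclusive lock on block N blocks any other locker of block N but leaves all other blocks writable in parallel.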

Has anyone put any thought into what it would take to do this in Postgres? Is it simply a matter of making the database file interaction code aware of extent locking, or is it considerably more involved than that? It also occurs to me that you probably need some form of transaction ordering mechanism across the nodes based on synchronized timestamps, but it seems Postgres-R has the required code to do that portion already written. That may not even be needed since all nodes would actually be working on the same database files and the locks would ensure a strict ordering of queries. The only place I see that possibly causing a problem is someone load balancing across the servers and sending a delete and insert really close to each other to different nodes such that if the insert executes first, it would be deleted by the delete even though the intent was to have the delete run first. Timestamp ordering of the queries or just some shared transaction ID across the nodes would eliminate that possibility. Other than that, is there anything else I am missing? Wouldn't this type of setup be far simpler to implement and provide better scalability than trying to do multi-master replication using log shipping or binary object shipping or any other techniques? Wouldn't it also be far more efficient since you don't need to have a copy of your data on each master node and therefore also don't have to ship your data to each node and have each node process it?

I am mostly asking for educational purposes, and I would appreciate technical (and hopefully specific) explanations as to what in Postgres would need to change to support this. 

Re: Horizontal Write Scaling

From:
Eliot Gable
Date:
On Tue, Nov 23, 2010 at 3:43 PM, Eliot Gable <egable+pgsql-hackers@gmail.com> wrote:
> <snip>
> Other than that, is there anything else I am missing? Wouldn't this type of setup be far simpler to implement and provide better scalability than trying to do multi-master replication using log shipping or binary object shipping or any other techniques? Wouldn't it also be far more efficient since you don't need to have a copy of your data on each master node and therefore also don't have to ship your data to each node and have each node process it?
>
> I am mostly asking for educational purposes, and I would appreciate technical (and hopefully specific) explanations as to what in Postgres would need to change to support this.


Now that I think about this more, it seems you would still need to ship the transactions to your other nodes and have some form of processing system on each that knew which node was supposed to be executing each transaction and whether that node is currently online. It would also have to have designated backup nodes to execute the transaction on. Otherwise, you could end up waiting forever for a transaction that was sent to a node right before that node lost power. However, if a transaction manager on each node is able to figure out the ordering of the transactions for itself based on some globally incrementing transaction ID, and able to figure out which node will be executing the transaction and which node is the backup if the first one fails, then if the backup sees the primary for that transaction go offline, it could execute the transaction instead.
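The assignment scheme described above can be sketched with a pure function over the global transaction ID: because every node evaluates the same function, they all agree on the primary and backup executors without exchanging any extra messages. The modular assignment policy and the function names here are hypothetical, chosen only to make the idea concrete.

```python
def executors(txid: int, nodes: list[str], replicas: int = 2) -> list[str]:
    """Deterministically pick the primary and backup node(s) for a transaction.

    Every node runs this same pure function over the globally ordered
    transaction ID, so all nodes independently agree on who should execute
    the transaction and who takes over if the primary fails.
    """
    start = txid % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


def responsible(txid: int, nodes: list[str], alive: set[str]) -> str:
    """The first assigned node still believed alive executes the transaction."""
    for node in executors(txid, nodes, replicas=len(nodes)):
        if node in alive:
            return node
    raise RuntimeError("no live node available for transaction %d" % txid)
```

If the primary for a transaction drops off, the backup observes the failure and sees from the same function that it is next in line, which avoids the wait-forever problem described above.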

Then, I suppose you also need some system in Postgres which can allow concurrent processing of transactions such that they don't process stuff in a transaction which is dependent on a transaction that has not yet been committed, but can process other stuff. So, evaluation of deterministic functions could take place, but anything volatile could not until all previous transactions finished. I assume Postgres already has something like this in order to scale across multiple cores in a single box. This setup would basically make all the master nodes for the database look like just extra memory and CPU cores. 

Re: Horizontal Write Scaling

From:
"Kevin Grittner"
Date:
Eliot Gable <egable+pgsql-hackers@gmail.com> wrote:
> the locks would ensure a strict ordering of queries.
PostgreSQL doesn't support S2PL.  I'm not sure what locks you mean.
-Kevin


Re: Horizontal Write Scaling

From:
Mark Kirkwood
Date:
On 24/11/10 09:43, Eliot Gable wrote:
> However, it occurs to me that if you had a shared disk system via either
> iSCSI, Fiber Channel, NFS, or whatever (which also had higher I/O
> capabilities than a single server could utilize)

Yeah, current Postgres multi-master projects seem to be focusing on a
shared-nothing architecture as opposed to a shared-disk one. I guess the
advantage of the former is that specialized (i.e. expensive) hardware is not
required to attempt to overcome the point of failure with shared-disk
systems: the disk they share.

Cheers

Mark

Re: Horizontal Write Scaling

From:
Greg Smith
Date:
Eliot Gable wrote:
> However, I have not seen much in the way of a serious attempt at 
> multi-master write scaling. 

Scaling writes across nodes using PL/Proxy works.
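PL/Proxy itself works at the SQL level (a proxy function declares a `CLUSTER` and a `RUN ON <hash>` clause, and each call is forwarded to exactly one partition). As a rough illustration of the same routing idea, here is a client-side sketch in Python; the class and connection-string names are made up for the example.

```python
class ShardRouter:
    """Sketch of PL/Proxy-style routing: each write is directed to exactly
    one partition chosen by hashing a key, so partitions never need to
    coordinate with each other on the write path."""

    def __init__(self, dsns):
        # PL/Proxy requires the partition count to be a power of two,
        # which lets the hash be reduced with a cheap bitmask.
        n = len(dsns)
        assert n > 0 and n & (n - 1) == 0, "partition count must be 2^k"
        self.dsns = dsns

    def partition_for(self, key: int) -> str:
        """Return the connection string of the partition owning this key."""
        return self.dsns[hash(key) & (len(self.dsns) - 1)]
```

The essential property is that the same key always lands on the same partition, so each partition owns a disjoint slice of the write load.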

> Of course, I am assuming the disk system would be RAID 1, RAID 10, 
> RAID 5, or RAID 6 for reliability purposes and that it is sufficiently 
> redundant that you don't have to worry about an outage of your storage 
> system.

The idea that you'll have a system that needs better write scalability 
that isn't limited by the storage system is an unusual one, not the 
expected case.  And the trend everywhere in the industry is away from 
giant redundant systems, and toward having multiple cheaper redundant 
copies of all the data instead.  It's impossible to protect against 
things like environmental failure at any single location.  Once you've 
accepted that you have to be able to replicate this beast too if you 
want high availability, you're back at having a multi-node problem 
again.  This is why the most active work is on distributed designs that 
start on that basis, rather than projects trying to build more scalable 
monoliths. 

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Horizontal Write Scaling

From:
Markus Wanner
Date:
Eliot,

On 11/23/2010 09:43 PM, Eliot Gable wrote:
> I know there has been a lot of talk about replication getting built into
> Postgres and I know of many projects that aim to fill the role. However,
> I have not seen much in the way of a serious attempt at multi-master
> write scaling.

Postgres-XC and Postgres-R are two pretty serious projects, IMO.

> I understand the fundamental problem with write scaling
> across multiple nodes is Disk I/O and inter-node communication latency
> and that in the conventional synchronous, multi-master replication type
> setup you would be limited to the speed of the slowest node,

That's not necessarily true for Postgres-R, which is why I call it an
'eager' solution (as opposed to fully synchronous). While it guarantees
that all transactions that got committed *will* be committable on all
nodes at some time in the future, nodes may still lag behind others.

Thus, even a slower / busy node doesn't hold back the others, but may
serve stale data. Ideally, your load balancer accounts for that and
gives that node a break or at least reduces the amount of transactions
going to that node, so it can catch up again.

Anyway, that's pretty Postgres-R specific.
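The lag-aware balancing policy described above might look like the following sketch. The policy (skip any node more than a few seconds behind, otherwise prefer the least-lagged node) is a hypothetical example; Postgres-R itself does not dictate a balancing strategy.

```python
def pick_node(lags: dict[str, float], max_lag: float = 5.0) -> str:
    """Route a request to the node with the smallest replication lag
    (in seconds), skipping any node lagging more than max_lag so it
    gets a break and can catch up again."""
    eligible = {node: lag for node, lag in lags.items() if lag <= max_lag}
    if not eligible:
        # Every node is behind; degrade gracefully rather than refuse.
        eligible = lags
    return min(eligible, key=eligible.get)
```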

> plus the
> communication protocol overhead and latency. However, it occurs to me
> that if you had a shared disk system via either iSCSI, Fiber Channel,
> NFS, or whatever (which also had higher I/O capabilities than a single
> server could utilize), if you used a file system that supported locks on
> a particular section (extent) of a file, it should theoretically be
> possible for multiple Postgres instances on multiple systems sharing the
> database to read and write to the database without causing corruption.

Possible, yes. Worthwhile to do, probably not.

> Has anyone put any thought into what it would take to do this in
> Postgres? Is it simply a matter of making the database file interaction
> code aware of extent locking, or is it considerably more involved than
> that? It also occurs to me that you probably need some form of
> transaction ordering mechanism across the nodes based on synchronized
> timestamps, but it seems Postgres-R has the required code to do that
> portion already written.

If you rely on such an ordering, why use additional locks? That seems
like a waste of resources compared to Postgres-R. Not to mention the
introduction of a SPOF with the SAN.

> Wouldn't this type of setup be far
> simpler to implement

That's certainly debatable, yes. I obviously think that the benefit per
cost ratio for Postgres-R is better :-)

> and provide better scalability than trying to do
> multi-master replication using log shipping or binary object shipping or
> any other techniques?

It's more similar to replication using two phase commit, which provably
doesn't scale (see for example [1]). And using a SAN for locking
certainly doesn't beat 2PC via an equally modern/expensive interconnect.
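The scaling problem with two-phase commit is easy to see in a toy coordinator: the commit decision cannot be made until every participant has voted, so end-to-end latency is bounded below by the slowest node, and any single veto aborts the whole transaction. This is an illustrative model only, not how PostgreSQL's `PREPARE TRANSACTION` machinery is implemented.

```python
import time


def two_phase_commit(participants):
    """Toy 2PC coordinator. Each participant is a callable returning
    (vote, simulated_round_trip_seconds). Commit happens only if the
    vote is unanimous, so the coordinator must wait for everyone."""
    t0 = time.monotonic()
    votes = []
    for prepare in participants:          # phase 1: prepare / collect votes
        vote, latency = prepare()
        time.sleep(latency)               # simulate the network round trip
        votes.append(vote)
    decision = all(votes)                 # phase 2: commit only if unanimous
    return decision, time.monotonic() - t0
```

Adding nodes can only add round trips and veto opportunities, which is the intuition behind the non-scalability result in [1].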

> Wouldn't it also be far more efficient since you
> don't need to have a copy of your data on each master node and therefore
> also don't have to ship your data to each node and have each node
> process it?

You have to ship it from the SAN to the node, so I definitely don't
think so, but see this as an argument against it. Each having a local
copy and only exchange locking information and transactional changes
sounds like much less traffic overall.

Regards

Markus Wanner


[1]: The Dangers of Replication and a Solution, Gray et al, In Proc. of
the SIGMOD Conf., 1996,
http://research.microsoft.com/apps/pubs/default.aspx?id=68247


Re: Horizontal Write Scaling

From:
Koichi Suzuki
Date:
Hi,

2010/11/25 Markus Wanner <markus@bluegap.ch>:
> Eliot,
>
> On 11/23/2010 09:43 PM, Eliot Gable wrote:
>> I know there has been a lot of talk about replication getting built into
>> Postgres and I know of many projects that aim to fill the role. However,
>> I have not seen much in the way of a serious attempt at multi-master
>> write scaling.
>
> Postgres-XC and Postgres-R are two pretty serious projects, IMO.

Yes.  Please visit http://postgres-xc.sourceforge.net/ for details.

>> I understand the fundamental problem with write scaling
>> across multiple nodes is Disk I/O and inter-node communication latency
>> and that in the conventional synchronous, multi-master replication type
>> setup you would be limited to the speed of the slowest node,
>
> That's not necessarily true for Postgres-R, which is why I call it an
> 'eager' solution (as opposed to fully synchronous). While it guarantees
> that all transactions that got committed *will* be committable on all
> nodes at some time in the future, nodes may still lag behind others.
>
> Thus, even a slower / busy node doesn't hold back the others, but may
> serve stale data. Ideally, your load balancer accounts for that and
> gives that node a break or at least reduces the amount of transactions
> going to that node, so it can catch up again.
>
> Anyway, that's pretty Postgres-R specific.

Right. In the case of Postgres-XC, tables can be partitioned (we call
it "distributed") among cluster nodes so that writes can be done in
parallel.

>
>> plus the
>> communication protocol overhead and latency. However, it occurs to me
>> that if you had a shared disk system via either iSCSI, Fiber Channel,
>> NFS, or whatever (which also had higher I/O capabilities than a single
>> server could utilize), if you used a file system that supported locks on
>> a particular section (extent) of a file, it should theoretically be
>> possible for multiple Postgres instances on multiple systems sharing the
>> database to read and write to the database without causing corruption.
>
> Possible, yes. Worthwhile to do, probably not.

We may suffer from the cost of synchronizing the caches on each database.

>
>> Has anyone put any thought into what it would take to do this in
>> Postgres? Is it simply a matter of making the database file interaction
>> code aware of extent locking, or is it considerably more involved than
>> that? It also occurs to me that you probably need some form of
>> transaction ordering mechanism across the nodes based on synchronized
>> timestamps, but it seems Postgres-R has the required code to do that
>> portion already written.
>
> If you rely on such an ordering, why use additional locks? That seems
> like a waste of resources compared to Postgres-R. Not to mention the
> introduction of a SPOF with the SAN.
>
>> Wouldn't this type of setup be far
>> simpler to implement
>
> That's certainly debatable, yes. I obviously think that the benefit per
> cost ratio for Postgres-R is better :-)
>
>> and provide better scalability than trying to do
>> multi-master replication using log shipping or binary object shipping or
>> any other techniques?

Postgres-XC uses combination of replicated table and distributed
(partitioned) table, not just simple replication.

>
> It's more similar to replication using two phase commit, which provably
> doesn't scale (see for example [1]). And using a SAN for locking
> certainly doesn't beat 2PC via an equally modern/expensive interconnect.
>
>> Wouldn't it also be far more efficient since you
> don't need to have a copy of your data on each master node and therefore
>> also don't have to ship your data to each node and have each node
>> process it?
>
> You have to ship it from the SAN to the node, so I definitely don't
> think so, but see this as an argument against it. Each having a local
> copy and only exchange locking information and transactional changes
> sounds like much less traffic overall.
>
> Regards
>
> Markus Wanner
>
>
> [1]: The Dangers of Replication and a Solution, Gray et al, In Proc. of
> the SIGMOD Conf., 1996,
> http://research.microsoft.com/apps/pubs/default.aspx?id=68247
>

Cheers;
---
Koichi Suzuki


Re: Horizontal Write Scaling

From:
Robert Haas
Date:
On Thu, Nov 25, 2010 at 4:45 AM, Koichi Suzuki <koichi.szk@gmail.com> wrote:
>>> plus the
>>> communication protocol overhead and latency. However, it occurs to me
>>> that if you had a shared disk system via either iSCSI, Fiber Channel,
>>> NFS, or whatever (which also had higher I/O capabilities than a single
>>> server could utilize), if you used a file system that supported locks on
>>> a particular section (extent) of a file, it should theoretically be
>>> possible for multiple Postgres instances on multiple systems sharing the
>>> database to read and write to the database without causing corruption.
>>
>> Possible, yes. Worthwhile to do, probably not.
>
> We may suffer from the cost of synchronizing the caches on each database.

That's putting it mildly.  You have to worry about the database buffer
cache, the shared invalidation queue, the control file data, the
ProcArray, the lock manager, the LWLock tables, and probably some
other things I'm forgetting about.  Everything in shared memory, in
short.
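The buffer-cache half of that problem can be shown with a toy model: two nodes keeping private caches over the same shared storage silently diverge as soon as one of them writes, unless a cluster-wide invalidation protocol tells every other node to drop its stale copy. This is a deliberately simplified illustration, not PostgreSQL's actual shared-buffers code.

```python
class Node:
    """Toy model of one cluster node with a private page cache over
    storage that is shared by every node (the shared-disk setup)."""

    def __init__(self, storage):
        self.storage = storage   # dict shared by all nodes, standing in for the SAN
        self.cache = {}          # this node's private buffer cache

    def read(self, page):
        if page not in self.cache:               # miss: fetch from shared disk
            self.cache[page] = self.storage[page]
        return self.cache[page]                  # hit: served from cache

    def write(self, page, value):
        self.cache[page] = value
        self.storage[page] = value               # write through to shared disk
        # Without a cluster-wide invalidation protocol, no other node
        # learns that its cached copy of `page` is now stale.
```

And the same problem repeats for every other shared-memory structure on Robert's list: each one would need its own cross-node invalidation or coordination traffic.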

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Horizontal Write Scaling

From:
Eliot Gable
Date:
Thanks, everyone, for all the feedback! I am nowhere near a database expert yet, but you guys have been very helpful in clearing up some of my confusion. I have checked out Postgres-XC and it looks like the version 1.0 that will be released soon probably covers everything I have been looking for in terms of Postgres capabilities. The big ones are write scaling, read scaling, consistent view of data between each server, and HA capabilities. Last time I looked at Postgres-XC was probably a year ago, and it was nowhere close to what I was looking for at the time, and I forgot all about it. Now, it looks like a real contender. 

I was aware of Postgres-R and was actually thinking I might be able to get away with using that, but the project I am working on does a substantial amount of writing, as well as being CPU intensive. Each query executes a stored procedure which is about 2,500 lines long and pulls data from about 80 tables to compute a final result set. That final result set is returned to the requester, and is also written into 3 tables (while still inside the original transaction). One of those tables gets one row while the other two get 6 - 15 rows per query. I execute hundreds of these queries per second. So, I need to be able to spread the load across multiple boxes due to CPU usage, but still have a consistent view of the written data.

Using conventional drives, I would saturate the disk I/O pretty quickly on commodity hardware. With normal multi-master replication, the cost of making sure I have enough disk I/O on each server is way more than I have the budget for. With a write scaling solution, it suddenly looks affordable. I was looking at maybe getting a single shared RAID array with some enterprise-class SSDs that could guarantee writes even during a power failure. I was hoping I could find something that would let multiple Postgres instances share that disk array, as it would be more cost effective for getting both the CPU power and disk I/O I needed than sticking such a RAID array in each and every server I was going to spread load across.

Postgres-XC actually makes it look even more affordable, as I now probably no longer need to consider SSDs, or at least I don't need to consider a RAID 10 array of 4 or more SSDs per box. I can probably do RAID 1 with 2 drives per box and have plenty of disk I/O available for the amount of CPU power I would have in the boxes.

So, thanks again for the feedback.