Thread: proposal: multiple read-write masters in a cluster with wal-streaming synchronization

proposal: multiple read-write masters in a cluster with wal-streaming synchronization

From
Mark Dilger
Date:
This is not entirely "pie in the sky", but feel free to tell me why this is crazy.  I
have had this idea for several years, but have not seen anyone else suggest it,
nor any arguments why it would not work.

If we had 64-bit Oids, we could reserve the top 16 bits (for instance) to indicate a
server ID.  Each server in a cluster could be configured with a unique serverID in
[0..65535].  Every table, index, etc. that is created on a server could be assigned
an Oid congruent to this serverID modulo 65536.  Each server would still have a
total of 2^48 Oids that it could assign before reaching Oid exhaustion, which is
64k times more Oids than the current design.  (The 16-bit / 48-bit split is arbitrary,
and could be 8-bit / 56-bit or whatever.)

Any INSERT, UPDATE, DELETE, INDEX, REINDEX, DROP that was run on a server
could be rejected if the object has the serverId of a different server in the cluster.
The serverId could be kept in memory at all times, and Oids for database objects
are always looked up when handling these SQL operations anyway, so there
should be effectively zero overhead in rejecting invalid operations that attempt,
on server X, to change a table/index/whatever which belongs to server Y.

I don't see how to run such a setup in serializable mode, but with read committed
it should be ok for server X to change its own tables based on what it perceives
to be the current state of data on server X and Y (reading local copies of tables
from Y) at the same time that server Y changes its own tables based on what it
perceives to be the current state of data on server Y and X (reading local copies
of tables from X).  The inconsistencies possible from these actions seem to me
the same as the inconsistencies possible when the two operations are happening
on the same server simultaneously in read committed mode.

WAL records from each server should only ever change objects belonging to that
server, so with some effort it might be possible for every server in the cluster to
replay the WAL records from every other server without collisions.  (This is the point
where my idea might be crazy -- I'm not sure how difficult/impossible it would
be to merge WAL information from multiple sources).

The total amount of WAL data that needs to be replayed on any given server would
be proportional to the total amount of data changes cluster-wide, which is no different
than the current state of affairs; it just happens to all come from a single master in
the current design.

The performance problems that at first seem inevitable in this design owing to the
need for each server to wait for wal replay from other servers to avoid data inconsistencies
would only actually happen if updates were frequently based on reads from other
servers.  If the vast majority of data changes were not based on the contents of
remotely owned tables/indexes/sequences and such, then changes to local data
could proceed without stalling.

Therefore, this approach would probably be most appropriate for highly partitioned
data, with each partition being owned by one server in the cluster, and modifications
based on data from more than one partition being rare.  (SELECTs based on data
from multiple partitions should be just fine.)

If you think about it, that's just an extension of the current architecture.  Currently,
all data is partitioned into a single partition belonging to the master, with zero partitions
on the slaves, and all updates are performed based on a single partition, that being
the one on the master.  It just isn't obvious that this is the design, because it is a
degenerate case of the more general case that I am describing.

Application software that wants to run INSERTs, UPDATEs, etc. would have to connect
to the appropriate server for the table in question, but that might not be too bad
if the data is partitioned anyway -- just connect to the server with the partition you
want to change.

Thanks in advance for having even read this far....

Mark Dilger

Re: proposal: multiple read-write masters in a cluster with wal-streaming synchronization

From
Alvaro Herrera
Date:
Mark Dilger wrote:
> This is not entirely "pie in the sky", but feel free to tell me why this is crazy.

Have you seen http://wiki.postgresql.org/wiki/BDR ?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: proposal: multiple read-write masters in a cluster with wal-streaming synchronization

From
Mark Dilger
Date:
The BDR documentation http://wiki.postgresql.org/images/7/75/BDR_Presentation_PGCon2012.pdf
says,

    "Physical replication forces us to use just one
     node: multi-master required for write scalability"

    "Physical replication provides best read scalability"

I am inclined to agree with the second statement, but
I think my proposal invalidates the first statement, at
least for a particular rigorous partitioning over which
server owns which data.

In my own workflow, I load lots of data from different
sources.  The partition the data loads into depends on
which source it came from, and it is never mixed or
cross referenced in any operation that writes the data.
It is only "mixed" in the sense that applications query
data from multiple sources.

So for me, multi-master with physical replication seems
possible, and would presumably provide the best
read scalability.  I doubt that I am the only database
user who has this kind of workflow.

The alternatives are ugly.  I can load data from separate
sources into separate database servers *without* replication
between them, but then the application layer has to
emulate queries across the data.  (Yuck.)  Or I can use
logical replication such as BDR, but then the servers
are spending more effort than with physical replication,
so I get less bang for the buck when I purchase more
servers to add to the cluster.  Or I can use FDW to access
data from other servers, but that means the same data
may be pulled across the wire arbitrarily many times, with
corresponding impact on the bandwidth.

Am I missing something here?  Does BDR really provide
an equivalent solution?

Second, it seems that BDR leaves to the client the responsibility
for making schemas the same everywhere.  Perhaps this is just
a limitation of the implementation so far, which will be resolved
in the future?


On Tuesday, December 31, 2013 12:33 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Mark Dilger wrote:

> This is not entirely "pie in the sky", but feel free to tell me why this is crazy.


Have you seen http://wiki.postgresql.org/wiki/BDR ?



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



Re: proposal: multiple read-write masters in a cluster with wal-streaming synchronization

From
Andres Freund
Date:
On 2013-12-31 13:51:08 -0800, Mark Dilger wrote:
> The BDR documentation http://wiki.postgresql.org/images/7/75/BDR_Presentation_PGCon2012.pdf
> says,
>
>     "Physical replication forces us to use just one
>      node: multi-master required for write scalability"
>
>     "Physical replication provides best read scalability"
>
> I am inclined to agree with the second statement, but
> I think my proposal invalidates the first statement, at
> least for a particular rigorous partitioning over which
> server owns which data.

I think you *massively* underestimate the amount of work implementing
this would require.
For one, you'd need to have a catalog that is written to on only one
server; you cannot have several nodes writing to the same table, even if
it's to disparate oid ranges.  So you'd need to partition the whole
catalog by oid ranges - which would be a major efficiency loss for many,
many cases.

Not to speak of breaking pg_upgrade and noticeably increasing the size
of the catalog due to bigger oids and additional relations.

> So for me, multi-master with physical replication seems
> possible, and would presumably provide the best
> read scalability.

What you describe isn't really multi master though, as every row can
only be written to by a single node (the owner).

Also, why would this have a better read scalability? Whether a row is
written by streaming rep or not doesn't influence read speed.

> Or I can use logical replication such as BDR, but then the servers
> are spending more effort than with physical replication,
> so I get less bang for the buck when I purchase more
> servers to add to the cluster.

The efficiency difference really doesn't have to be big if done right. If
you're so write-heavy that the difference is becoming a problem you
wouldn't implement a shared-everything architecture anyway.

> Am I missing something here?  Does BDR really provide
> an equivalent solution?

Not yet, but the plan is to get there.

> Second, it seems that BDR leaves to the client the responsibility
> for making schemas the same everywhere.  Perhaps this is just
> a limitation of the implementation so far, which will be resolved
> in the future?

Hopefully something that's going to get lifted.

Greetings,

Andres Freund

--
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: proposal: multiple read-write masters in a cluster with wal-streaming synchronization

From
Mark Dilger
Date:
My original email was mostly a question about whether WAL data
could be merged from multiple servers, or whether I was overlooking
some unsolvable difficulty.  I'm still mostly curious about that
question.

I anticipated that my proposal would require partitioning the catalogs.
For instance, autovacuum could only run on locally owned tables, and
would need to store the analyze stats data in a catalog partition belonging
to the local server, but that doesn't seem like a fundamental barrier to
it working.  The partitioned catalog tables would get replicated like
everything else.  The code that needs to open catalogs and look things
up could open the specific catalog partition needed if it already knew the
Oid of the table/index/whatever that it was interested in, as the catalog
partition desired would have the same modulus as the Oid of the object
being researched. 

Your point about increasing the runtime of pg_upgrade is taken.  I will
need to think about that some more.

Your claim that what I describe is not multi-master is at least partially
correct, depending on how you think about the word "master".  Certainly
every server is the master of its own chunk.  I see that as a downside
for some people, who want to be able to insert/update/delete any data
on any server.  But the ability to modify *anything anywhere* brings
performance problems with it.  Either the servers have to wait for each
other before commits go through, in order to avoid incompatible data
changes being committed on both ends, or the servers have to reject
commits after they have already been reported to the client as successful.

I expect my proposal to have better read scalability in a write-heavy
environment, because the less work it takes to integrate data changes
from other servers, the more resources remain per server to answer
read queries.

Your claim that BDR doesn't have to be much slower than what I am
proposing is quite interesting, as if that is true I can ditch this idea and
use BDR instead.  It is hard to empirically test, though, as I don't have
the alternate implementation on hand.

I think the expectation that performance will be harmed if postgres
uses 8 byte Oids is not quite correct.

Several years ago I ported postgresql sources to use 64bit everything.
Oids, varlena headers, variables tracking offsets, etc.  It was a fair
amount of work, but all the doom and gloom predictions that I have
heard over the years about how 8-byte varlena headers would kill
performance, 8-byte Oids would kill performance, etc, turned out to
be quite inaccurate.  The performance impact was ok for me.  The
disk space impact wasn't much either, as with 8-byte varlena headers,
anything under 127 bytes had a 1-byte header, and anything under
16383 had a 2-byte header, with 8-bytes only used after that, which
pretty much meant that disk usage for representing varlena data
shrunk slightly rather than growing.  Tom Lane had mentioned in a
thread that he didn't want to make the #define for processing
varlena headers any more complicated than it was, because it gets
executed quite a lot.  So I tried the 1,2,8 byte vs 1,8 byte varlena
design both ways and found it made little difference to me which I
chose.  Of course, my analysis was based on my own usage patterns,
my own schemas, my own data, and might not apply to everyone
else.  I tend to conflate the 8-byte Oid change with all these other
changes from 4-byte to 8-byte, because that's what I did and what
I have experience with.

Having 8-byte everything with everything aligned allowed me to use
SSE functions on some stuff that postgres was (at least at the time)
doing less efficiently.  Since then, I have noticed that the hash function
for disk blocks is implemented with SSE in mind.  With 8-byte aligned
datums, SSE based hashing can be used without all the calls to
realign the data.  I was experimenting with forcing data to be 16-byte
aligned to take advantage of newer SSE functions, but this was years
ago and I didn't own any hardware with the newer SSE capabilities,
so I never got to benchmark that.

All this is to say that increasing to 8 bytes is not a pure performance
loss.  It is a trade-off, and one that I did not find particularly problematic.
On the up side, I didn't need to worry about Oid exhaustion anymore,
which allows removing the code that checks for it (though I left that
code in place.)  It allows using varlena objects instead of the large object
interface, so I could yank that interface and make my code size
smaller.  (I never much used the LO interface to begin with, so I might
not be the right person to ask about this.)  It allows not worrying about
accidentally bumping into the 1GB limit on varlenas, which means you
don't have to code for that error condition in applications.


mark



On Thursday, January 2, 2014 1:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:


Re: proposal: multiple read-write masters in a cluster with wal-streaming synchronization

From
Merlin Moncure
Date:
On Tue, Dec 31, 2013 at 3:51 PM, Mark Dilger <markdilger@yahoo.com> wrote:
> Am I missing something here?  Does BDR really provide
> an equivalent solution?

I think BDR is better: while it only supports schema-equivalent
replication, that is the typical case for distributed write systems
like this.  Also, there are a lot fewer assumptions about the network
architecture in the actual data itself (for example, what happens when
you want to change owner, or merge/split data?).  IMNSHO, it's better
that each node manages WAL for itself, not the other way around, except
in the very special case where you want an exact replica of the database
on each node at all times, as with the current HS/SR.

A **huge** amount of work has been and is being put into WAL-based
logical replication support (LLSR in the BDR docs) that should mostly
combine the flexibility of trigger-based logical replication with the
robustness of the WAL-based replication we have in core now.  LLSR is a
low-level data transmission framework that can be wrapped by
higher-level user-facing stuff like BDR.  LLSR, by the way, does not
come attached with the assumption that all databases have the same
schema.  If I were you, I'd be studying up on LLSR and seeing how it
could be molded into the use cases you're talking about.  From a
development point of view, the replication train hasn't just left the
station, it's a space shuttle that just broke out of earth's orbit.  By
my reckoning a new 'from the ground up' implementation of replication
requiring in-core changes has exactly zero percent chance of being
adopted.

merlin



Re: proposal: multiple read-write masters in a cluster with wal-streaming synchronization

From
Andres Freund
Date:
On 2014-01-02 10:18:52 -0800, Mark Dilger wrote:
> I anticipated that my proposal would require partitioning the catalogs.
> For instance, autovacuum could only run on locally owned tables, and
> would need to store the analyze stats data in a catalog partition belonging
> to the local server, but that doesn't seem like a fundamental barrier to
> it working.

It would make every catalog lookup noticeably more expensive.

>  The partitioned catalog tables would get replicated like
> everything else.  The code that needs to open catalogs and look things
> up could open the specific catalog partition needed if it already knew the
> Oid of the table/index/whatever that it was interested in, as the catalog
> partition desired would have the same modulus as the Oid of the object
> being researched. 

Far, far, far from every lookup is by oid. Most prominently the names of
database objects. Those will have to scan every catalog partition. Not
fun.

> Your point about increasing the runtime of pg_upgrade is taken.  I will
> need to think about that some more.

It's not about increasing the runtime, it's about simply breaking
it: pg_upgrade relies on binary compatibility of user relations' files,
and you break that if you change the width of datatypes.

> Your claim that what I describe is not multi-master is at least partially
> correct, depending on how you think about the word "master".  Certainly
> every server is the master of its own chunk.

Well, you're essentially just describing a sharded system - that's not
usually called multimaster.

> Your claim that BDR doesn't have to be much slower than what I am
> proposing is quite interesting, as if that is true I can ditch this idea and
> use BDR instead.  It is hard to empirically test, though, as I don't have
> the alternate implementation on hand.

Well, I can tell you that for the changeset extraction stuff (which is the
basis for BDR) the biggest bottleneck so far seems to be the CRC
computation when reading the WAL - and that's something plain WAL apply
has to do as well. And it is optimizable.
When actually testing decoding & apply, for workloads fitting into
memory I had to try very hard to construct situations where apply was a
big bottleneck.  It is easier for seek-bound workloads where the standby
is less powerful than the primary, since there are more random reads for
those, full-page writes removing the need for reads in many cases.

> I think the expectation that performance will be harmed if postgres
> uses 8 byte Oids is not quite correct.
>
> Several years ago I ported postgresql sources to use 64bit everything.
> Oids, varlena headers, variables tracking offsets, etc.  It was a fair
> amount of work, but all the doom and gloom predictions that I have
> heard over the years about how 8-byte varlena headers would kill
> performance, 8-byte Oids would kill performance, etc, turned out to
> be quite inaccurate.

Well, it can increase the size of the database, turning a system where
the hot set fits into memory into one where it doesn't anymore. But
really, the performance concerns were more about the catalog lookups.

Fundamentally, I think there's nothing I see preventing such a scheme
from being implemented - but I think there's about zap chance of it ever
getting integrated; it's just far too invasive, with very high costs in
scenarios where it's not used, for not all that much gain.  Not to speak
of the amount of engineering it would require to implement.

Greetings,

Andres Freund

--
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: proposal: multiple read-write masters in a cluster with wal-streaming synchronization

From
Mark Dilger
Date:
Thanks to both of you for all the feedback.  Your reasoning
about why it is not worth implementing, what the problems
with it would be, etc., are helpful.

Sorry about using the word multimaster where it might
have been better to say sharded.

BTW, since the space shuttle has already left orbit, as you
metaphorically put it, maybe there should be more
visibility to the wider world about this?  You can go to
postgresql.org and find diddly squat about it.  I grant you
that it is not a completed project yet, and so maybe you
want to wait before making major announcements, but
the sort of people who would use this feature are probably
the sort of people who would not mind hearing about it
early.

mark


On Thursday, January 2, 2014 11:18 AM, Andres Freund <andres@2ndquadrant.com> wrote:



Re: proposal: multiple read-write masters in a cluster with wal-streaming synchronization

From
Andres Freund
Date:
On 2014-01-02 11:35:57 -0800, Mark Dilger wrote:
> BTW, since the space shuttle has already left orbit, as you
> metaphorically put it, maybe there should be more
> visibility to the wider world about this?  You can go to
> postgresql.org and find diddly squat about it.  I grant you
> that it is not a completed project yet, and so maybe you
> want to wait before making major announcements, but
> the sort of people who would use this feature are probably
> the sort of people who would not mind hearing about it
> early.

Well, changeset extraction isn't committed yet, so it's not surprising
that you don't find anything there ;).  Parts of the patches (notably
wal_level=logical, enriching the WAL to allow decoding) have landed; the
others are being reworked in response to Robert's review comments.

I am pretty sure there will be more publicity once it's committed ;)

Greetings,

Andres Freund

--
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services