Re: proposal: multiple read-write masters in a cluster with wal-streaming synchronization

From: Mark Dilger
Subject: Re: proposal: multiple read-write masters in a cluster with wal-streaming synchronization
Date:
Msg-id: 1388686732.24641.YahooMailNeo@web125405.mail.ne1.yahoo.com
In reply to: Re: proposal: multiple read-write masters in a cluster with wal-streaming synchronization  (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: proposal: multiple read-write masters in a cluster with wal-streaming synchronization  (Andres Freund <andres@2ndquadrant.com>)
List: pgsql-hackers
My original email was mostly a question about whether WAL data
could be merged from multiple servers, or whether I was overlooking
some unsolvable difficulty.  I'm still mostly curious about that
question.

I anticipated that my proposal would require partitioning the catalogs.
For instance, autovacuum could only run on locally owned tables, and
would need to store its analyze stats in a catalog partition belonging
to the local server, but that doesn't seem like a fundamental barrier to
it working.  The partitioned catalog tables would get replicated like
everything else.  Code that needs to open a catalog and look something
up could open the specific catalog partition required, provided it
already knew the Oid of the table/index/whatever it was interested in,
since the desired catalog partition would have the same modulus as the
Oid of the object being looked up.
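To make that lookup rule concrete, here is a minimal sketch in Python.  All
of the names below are hypothetical; nothing here is existing postgres code.
The point is only that once a backend knows an object's Oid, the modulus
tells it which catalog partition to open.

```python
# Hypothetical sketch: routing a catalog lookup to the partition that
# holds an object's row, based on the object's Oid modulus.

NUM_NODES = 4  # servers in the cluster, one catalog partition per server

def catalog_partition_for(oid: int) -> int:
    """The partition holding an object's catalog row shares the Oid's
    modulus, so no extra lookup is needed to find it."""
    return oid % NUM_NODES

def partition_relname(catalog: str, oid: int) -> str:
    """Name of the catalog partition to open for a given object Oid."""
    return f"{catalog}_part{catalog_partition_for(oid)}"

# A lookup for Oid 16385 in pg_class would open 16385 % 4 == partition 1:
assert partition_relname("pg_class", 16385) == "pg_class_part1"
```

Each server would allocate new Oids only within its own modulus class, so
inserts into the partitioned catalogs never collide across nodes.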

Your point about increasing the runtime of pg_upgrade is taken.  I will
need to think about that some more.

Your claim that what I describe is not multi-master is at least partially
correct, depending on how you think about the word "master".  Certainly
every server is the master of its own chunk.  I see that as a downside
for some people, who want to be able to insert/update/delete any data
on any server.  But the ability to modify *anything anywhere* brings
performance problems with it.  Either the servers have to wait for each
other before commits go through, in order to avoid incompatible data
changes being committed on both ends, or the servers have to reject
commits after they have already been reported to the client as successful.

I expect my proposal to have better read scalability in a write-heavy
environment, because the less work it takes to integrate data changes
from other workers, the more resources remain per server to answer
read queries.

Your claim that BDR doesn't have to be much slower than what I am
proposing is quite interesting, because if that is true, I can ditch this
idea and use BDR instead.  It is hard to test empirically, though, as I
don't have the alternate implementation on hand.

I think the expectation that performance will be harmed if postgres
uses 8 byte Oids is not quite correct.

Several years ago I ported the postgresql sources to use 64-bit
everything: Oids, varlena headers, variables tracking offsets, etc.  It
was a fair amount of work, but all the doom-and-gloom predictions I had
heard over the years, that 8-byte varlena headers would kill
performance, that 8-byte Oids would kill performance, and so on, turned
out to be quite inaccurate.  The performance impact was acceptable for
me.  The disk space impact wasn't much either: with 8-byte varlena
headers, anything under 127 bytes had a 1-byte header and anything
under 16383 bytes had a 2-byte header, with the 8-byte form used only
beyond that, which pretty much meant that disk usage for representing
varlena data shrank slightly rather than growing.  Tom Lane had mentioned in a
thread that he didn't want to make the #define for processing
varlena headers any more complicated than it was, because it gets
executed quite a lot.  So I tried the 1,2,8 byte vs 1,8 byte varlena
design both ways and found it made little difference to me which I
chose.  Of course, my analysis was based on my own usage patterns,
my own schemas, my own data, and might not apply to everyone
else.  I tend to conflate the 8-byte Oid change with all these other
changes from 4-byte to 8-byte, because that's what I did and what
I have experience with.
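The 1,2,8-byte scheme I tried can be summarized as a small
length-to-header-size rule.  This is only a reconstruction of what I
described above, not actual postgres code; the thresholds follow roughly
from 7-bit and 14-bit length fields, and the real encoding would also
need flag bits, which I omit here.

```python
# Sketch of the 1,2,8-byte varlena header scheme described in the text.

def varlena_header_size(datalen: int) -> int:
    """Header bytes needed to store a datum of the given length."""
    if datalen < 127:        # short datum: 1-byte header
        return 1
    elif datalen < 16383:    # medium datum: 2-byte header
        return 2
    else:                    # large datum: full 8-byte header
        return 8

def on_disk_size(datalen: int) -> int:
    """Total bytes for header plus payload, ignoring padding."""
    return varlena_header_size(datalen) + datalen

# Most short values pay only one byte of overhead, which is why disk
# usage did not grow relative to the stock 4-byte-header design:
assert on_disk_size(10) == 11
assert varlena_header_size(1000) == 2
assert varlena_header_size(1 << 20) == 8  # past 16382 bytes: 8-byte header
```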

Having 8-byte everything with everything aligned allowed me to use
SSE functions on some stuff that postgres was (at least at the time)
doing less efficiently.  Since then, I have noticed that the hash function
for disk blocks is implemented with SSE in mind.  With 8-byte aligned
datums, SSE-based hashing can be used without all the calls to
realign the data.  I was experimenting with forcing data to be 16-byte
aligned to take advantage of newer SSE functions, but this was years
ago and I didn't own any hardware with the newer SSE capabilities,
so I never got to benchmark that.

All this is to say that increasing to 8 bytes is not a pure performance
loss.  It is a trade-off, and one that I did not find particularly problematic.
On the up side, I didn't need to worry about Oid exhaustion anymore,
which allows removing the code that checks for it (though I left that
code in place.)  It allows using varlena objects instead of the large object
interface, so I could yank that interface and make my code size
smaller.  (I never much used the LO interface to begin with, so I might
not be the right person to ask about this.)  It allows not worrying about
accidentally bumping into the 1GB limit on varlenas, which means you
don't have to code for that error condition in applications.
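The 1GB ceiling comes from the stock 4-byte header spending a couple of
bits on flags, leaving roughly 30 bits for the length; with an 8-byte
header the ceiling becomes irrelevant in practice.  A sketch of the
arithmetic (my own illustration, not postgres code):

```python
# Why 4-byte varlena headers cap a datum near 1GB: the stock format
# spends 2 bits of the 32-bit header on flags, leaving 30 bits of length.
FLAG_BITS = 2

def max_varlena_bytes(header_bytes: int) -> int:
    """Largest datum representable with the given header width."""
    return (1 << (header_bytes * 8 - FLAG_BITS)) - 1

assert max_varlena_bytes(4) == (1 << 30) - 1  # the familiar ~1GB limit
assert max_varlena_bytes(8) == (1 << 62) - 1  # effectively unbounded
```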


mark



On Thursday, January 2, 2014 1:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-12-31 13:51:08 -0800, Mark Dilger wrote:
> The BDR documentation http://wiki.postgresql.org/images/7/75/BDR_Presentation_PGCon2012.pdf
> says,
>
>     "Physical replication forces us to use just one
>      node: multi-master required for write scalability"
>
>     "Physical replication provides best read scalability"
>
> I am inclined to agree with the second statement, but
> I think my proposal invalidates the first statement, at
> least for a particular rigorous partitioning over which
> server owns which data.

I think you *massively* underestimate the amount of work implementing
this would require.
For one, you'd need to have a catalog that is written to on only one
server, you cannot have several nodes writing to the same table, even if
it's to disparate oid ranges. So you'd need to partition the whole
catalog by oid ranges - which would be a major efficiency loss for many,
many cases.

Not to speak of breaking pg_upgrade and noticeably increasing the size
of the catalog due to bigger oids and additional relations.

> So for me, multi-master with physical replication seems
> possible, and would presumably provide the best
> read scalability.

What you describe isn't really multi master though, as every row can
only be written to by a single node (the owner).

Also, why would this have a better read scalability? Whether a row is
written by streaming rep or not doesn't influence read speed.

> Or I can use logical replication such as BDR, but then the servers
> are spending more effort than with physical replication,
> so I get less bang for the buck when I purchase more
> servers to add to the cluster.

The efficiency difference really doesn't have to be big if done right. If
you're so write-heavy that the difference is becoming a problem, you
wouldn't implement a shared-everything architecture anyway.

> Am I missing something here?  Does BDR really provide
> an equivalent solution?

Not yet, but the plan is to get there.

> Second, it seems that BDR leaves to the client the responsibility
> for making schemas the same everywhere.  Perhaps this is just
> a limitation of the implementation so far, which will be resolved
> in the future?

Hopefully something that's going to get lifted.

Greetings,

Andres Freund

--
Andres Freund                      http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



