Обсуждение: [GENERAL] Logcal replication in large scale

Поиск
Список
Период
Сортировка

[GENERAL] Logcal replication in large scale

От
Nicklas Avén
Дата:
Hallo all

I am thrilled about logical replication in PostgreSQL 10. My head have
started spinning about use cases.

Would it be possible to use logical replication as a distribution
method of data?

I think about map data from national mapping authorities. The problem
is to get the updates of their data sets. Especially the open data sets
are now distributed as files (shape files) in Sweden and as pg_dump in
Norway.

I guess it is beyond what logical replication is designed for, so I ask
what problems that might arise for a scenario like:

The distributor has a publication database with logical replication
publications of the tables. All users, probably thousands or more,
would subscribe to that publication to get an updated copy of the data
set.

How would the publication server react? I guess the WAL-files will only
be written once anyway?

My guess is that it will be a lower payload than today anyway when the
whole data set have to be fetched to get updates.

In the future maybe it would be possible to make some sort of
distributed arrangement so the replication can be against any of the
client servers instead to reduce payload on the original server. But
that would of course include some way to check data integrity against
the original server with a md5 hash or something so result is identical
with the original. 

ATB

Nicklas Avén


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Logcal replication in large scale

От
Magnus Hagander
Дата:
On Wed, Sep 20, 2017 at 8:53 AM, Nicklas Avén <nicklas.aven@jordogskog.no> wrote:
Hallo all

I am thrilled about logical replication in PostgreSQL 10. My head have
started spinning about use cases.

Would it be possible to use logical replication as a distribution
method of data?

As an answer to the generic  question: yes :)

 
I think about map data from national mapping authorities. The problem
is to get the updates of their data sets. Especially the open data sets
are now distributed as files (shape files) in Sweden and as pg_dump in
Norway.

I guess it is beyond what logical replication is designed for, so I ask
what problems that might arise for a scenario like:

The distributor has a publication database with logical replication
publications of the tables. All users, probably thousands or more,
would subscribe to that publication to get an updated copy of the data
set.

How would the publication server react? I guess the WAL-files will only
be written once anyway?

Yes. But they will  need to be kept around until *all* subscribers have pulled down their changes. So even one subscriber falling behind will mean your WAL will never get cleaned up.

Of course, you can keep some sort of watcher process that kills old replication slots.

I am also not sure how well PostgreSQL will react to having thousands of replication slots. It's not what the system was designed for I believe :)

You might be better of using logical decoding (which goes back to 9.4) to stream the data out, but not having each subscriber be a postgresql subscriber. Either using it to generate some sort of "diff files" that can then be consumed downstream, or by distributing it via some kind of dedicated queuing system designed to handle that many downstreams.

 
My guess is that it will be a lower payload than today anyway when the
whole data set have to be fetched to get updates.

As always that's going to depend on the number of updates. If the same thing is updated 1000 times, then if you use logical replication it gets transferred 1000 times. So there are definitely cases when there will be *more* traffic with logical replication, but in cases like yours I would guess it will be less.

//Magnus

Re: [GENERAL] Logcal replication in large scale

От
Nicklas Avén
Дата:
On Wed, 2017-09-20 at 09:44 +0200, Magnus Hagander wrote:
On Wed, Sep 20, 2017 at 8:53 AM, Nicklas Avén <nicklas.aven@jordogskog.no> wrote:
Hallo all

I am thrilled about logical replication in PostgreSQL 10. My head have
started spinning about use cases.

Would it be possible to use logical replication as a distribution
method of data?

As an answer to the generic  question: yes :)

 
I think about map data from national mapping authorities. The problem
is to get the updates of their data sets. Especially the open data sets
are now distributed as files (shape files) in Sweden and as pg_dump in
Norway.

I guess it is beyond what logical replication is designed for, so I ask
what problems that might arise for a scenario like:

The distributor has a publication database with logical replication
publications of the tables. All users, probably thousands or more,
would subscribe to that publication to get an updated copy of the data
set.

How would the publication server react? I guess the WAL-files will only
be written once anyway?

Yes. But they will  need to be kept around until *all* subscribers have pulled down their changes. So even one subscriber falling behind will mean your WAL will never get cleaned up.

Of course, you can keep some sort of watcher process that kills old replication slots.

I am also not sure how well PostgreSQL will react to having thousands of replication slots. It's not what the system was designed for I believe :)


Ok, I have to read me up on how this works. I thought about it as a bucket of WAL-files that the subscribers just "grab".
But of course there is some bookkeeping to make things work.

I guess there is also mechanisms so a new subscriber can get a complete table after the publcation have been active for a long time? I mean the "cleaning up" of Wal-files will not leave new subscribers missing what is written to the table long ago?


You might be better of using logical decoding (which goes back to 9.4) to stream the data out, but not having each subscriber be a postgresql subscriber. Either using it to generate some sort of "diff files" that can then be consumed downstream, or by distributing it via some kind of dedicated queuing system designed to handle that many downstreams.

The thing is that I am not involved in any of those organisations, just a user that want to give them some good reasons to consider deistributing through PostgreSQL :-)


 
My guess is that it will be a lower payload than today anyway when the
whole data set have to be fetched to get updates.

As always that's going to depend on the number of updates. If the same thing is updated 1000 times, then if you use logical replication it gets transferred 1000 times. So there are definitely cases when there will be *more* traffic with logical replication, but in cases like yours I would guess it will be less.

//Magnus

Re: [GENERAL] Logcal replication in large scale

От
Magnus Hagander
Дата:
On Wed, Sep 20, 2017 at 10:02 AM, Nicklas Avén <nicklas.aven@jordogskog.no> wrote:
On Wed, 2017-09-20 at 09:44 +0200, Magnus Hagander wrote:
On Wed, Sep 20, 2017 at 8:53 AM, Nicklas Avén <nicklas.aven@jordogskog.no> wrote:
Hallo all

I am thrilled about logical replication in PostgreSQL 10. My head have
started spinning about use cases.

Would it be possible to use logical replication as a distribution
method of data?

As an answer to the generic  question: yes :)

 
I think about map data from national mapping authorities. The problem
is to get the updates of their data sets. Especially the open data sets
are now distributed as files (shape files) in Sweden and as pg_dump in
Norway.

I guess it is beyond what logical replication is designed for, so I ask
what problems that might arise for a scenario like:

The distributor has a publication database with logical replication
publications of the tables. All users, probably thousands or more,
would subscribe to that publication to get an updated copy of the data
set.

How would the publication server react? I guess the WAL-files will only
be written once anyway?

Yes. But they will  need to be kept around until *all* subscribers have pulled down their changes. So even one subscriber falling behind will mean your WAL will never get cleaned up.

Of course, you can keep some sort of watcher process that kills old replication slots.

I am also not sure how well PostgreSQL will react to having thousands of replication slots. It's not what the system was designed for I believe :)


Ok, I have to read me up on how this works. I thought about it as a bucket of WAL-files that the subscribers just "grab".
But of course there is some bookkeeping to make things work.

No, it's streaming replication.

And in the end, that bucket becomes infinitely large.
 

I guess there is also mechanisms so a new subscriber can get a complete table after the publcation have been active for a long time? I mean the "cleaning up" of Wal-files will not leave new subscribers missing what is written to the table long ago?

Yes, new subscriptions will get the current version of the data and only then start buffering changes.

--