Discussion: Very high latency, low bandwidth replication


Very high latency, low bandwidth replication

From
Bob Jolliffe
Date:
Hi

I have been grappling with a problem for some time and would appreciate some advice.  We have a web-based public health application with a postgresql backing store, designed for use by public sector ministries of health in a significant number of African, Asian and other countries (http://dhis2.org).  "Traditionally" it is hosted as a national data warehouse application, with users dispersed amongst district offices and sometimes health facilities around the country.

Particularly in many countries in Africa, the public sector typically has limited data centre infrastructure to reliably host the application in-house, and so a good number have opted to use a global cloud service (infrastructure as a service) to ensure maximum availability of the application.  Others have managed to make use of in-country resources such as national ISPs and mobile companies.  There are many cost-benefit and governance considerations behind these decisions which I don't need to go into here.

While ministries have been prepared to do this, there are important reasons to ensure that a backup of the database is also maintained within the ministry.  So we attempt to grab the nightly snapshot backups from the database.  In the past I have attempted this somewhat simplistically with rsync over ssh, but it is a very inefficient approach, particularly over weak internet connections.
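
For reference, what we do today looks roughly like the sketch below (host names and paths here are invented).  Part of the inefficiency is that a conventionally compressed dump defeats rsync's delta transfer, since a small change early in the dump reshuffles the whole compressed stream; gzip's --rsyncable flag, where the gzip build supports it, trades a little compression ratio for rsync-friendly block boundaries:

    # on the cloud host, from the nightly backup job (hypothetical paths)
    pg_dump dhis2 | gzip --rsyncable > /backup/dhis2.sql.gz

    # on the ministry server, pulling the snapshot each night
    rsync -avz --partial cloudhost:/backup/dhis2.sql.gz /backup/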

What are people's thoughts about a more optimal solution?  I would like to use a more incremental approach to replication.  This does not have to be "live" replication ... asynchronously triggering once every 24 hours is sufficient.  Also, only a subset of tables is required (the rest consist of generated data).

Appreciate any advice.

Regards
Bob  

Re: Very high latency, low bandwidth replication

From
Francisco Olarte
Date:
Hi Bob.

On Mon, Jun 30, 2014 at 10:05 AM, Bob Jolliffe <bobjolliffe@gmail.com> wrote:
> What are people's thoughts about a more optimal solution?  I would like to
> use a more incremental approach to replication.  This does not have to be
> "live" replication ... asynchronously triggering once every 24 hours is
> sufficient.  Also, only a subset of tables is required (the rest consist
> of generated data).


If you only need to replicate once every 24 hours, which means you can
tolerate lag, you could try log shipping. Instead of sending the WAL
records from master to standby directly, just spool them, compress them
as much as you can (I would try pglesslog plus XZ on its output), and
send them once a day. That covers the 'incremental' part. For the
subset-of-tables requirement, you could try to set up a local partial
mirror using any of the trigger-based replication products and then do
log shipping of that.
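
A minimal sketch of the spool-and-ship side, with invented directory
names (I have left pglesslog out of the pipeline, as its exact
invocation varies; plain xz alone already shrinks WAL a lot, and the
standby is seeded from a base backup first):

    # postgresql.conf on the master: compress each segment into a spool
    wal_level = archive
    archive_mode = on
    archive_command = 'test ! -f /var/spool/walout/%f.xz && xz -c %p > /var/spool/walout/%f.xz'

    # nightly cron on the master: ship the day's spool, drop what was sent
    rsync -av --partial --remove-source-files /var/spool/walout/ ministry:/var/spool/walin/

    # recovery.conf on the ministry standby: replay from the shipped spool
    standby_mode = 'on'
    restore_command = 'xz -dc /var/spool/walin/%f.xz > %p'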

Also, the logical replication slot stuff added to the latest version
seems really promising for this kind of thing, but I'm not familiar
enough with it to recommend anything.
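
From the docs it looks roughly like this (slot and database names are
invented; pg_recvlogical streams until stopped, so a daily batch job
would run it for a bounded window and then ship the output file):

    # once: create a logical slot with the test_decoding sample plugin (9.4+)
    pg_recvlogical -d dhis2 --slot nightly --create-slot -P test_decoding

    # nightly: drain the changes accumulated since the last run
    pg_recvlogical -d dhis2 --slot nightly --start -f changes-$(date +%F).txt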

Also, depending on your data updating patterns, database sizes and
other stuff, a trigger-based replication approach can save a lot of
traffic. I mean, if you have records which are heavily updated but
replicated only once a day, you can collapse a whole day's changes into
a single update. I once did a similar thing to transmit deltas over a
2400bps modem by making daily sorted dumps and sending the delta
against the previous day's dump ( it needed a bit of coding, about a
couple hundred lines, but produced ridiculously small deltas, and with
a bit of care their application was idempotent, which simplified
recovery on errors ).
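
The shell-tool version of that trick, with invented table and file names:

    # nightly on the master: dump one table, sort it so diffs are stable
    psql -d dhis2 -c '\copy datavalue to stdout' | sort > today.tsv
    # the compressed delta against yesterday's sorted dump is all that travels
    diff yesterday.tsv today.tsv | xz > delta.xz && mv today.tsv yesterday.tsv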

   Francisco Olarte.


Re: Very high latency, low bandwidth replication

From
Stuart Bishop
Date:
On 30 June 2014 15:05, Bob Jolliffe <bobjolliffe@gmail.com> wrote:

> What are people's thoughts about a more optimal solution?  I would like to
> use a more incremental approach to replication.  This does not have to be
> "live" replication ... asynchronously triggering once every 24 hours is
> sufficient.  Also, only a subset of tables is required (the rest consist
> of generated data).

WAL shipping is probably best here. Configure an archive_command on
the master to compress and push logs to cloud storage, and configure
a hot standby on site to pull and decompress the logs. The wal-e tool
may make pushing to cloud storage simpler, or just follow the
PostgreSQL documentation for archiving the WAL files to a filesystem.
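
A rough sketch of the wal-e variant (bucket and envdir contents are
invented; wal-e reads its cloud credentials and prefix from the envdir,
and the standby has to be seeded first, e.g. with wal-e backup-push /
backup-fetch):

    # postgresql.conf on the master: push each segment to cloud storage
    archive_mode = on
    archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

    # recovery.conf on the on-site hot standby: pull segments back down
    standby_mode = 'on'
    restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'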

If that isn't good enough, you can look at more esoteric approaches
(e.g. nightly plaintext dumps to a git repository, pushing changes to
disk on site).
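
A sketch of the git variant (repository layout invented; the point is
that a plaintext dump diffs well, so git's packs transfer only what
changed each night):

    # nightly on the master, inside a clone with the ministry as a remote
    cd /srv/dumprepo
    pg_dump dhis2 > dhis2.sql
    git add dhis2.sql
    git commit -m "nightly snapshot"
    git push ministry master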


--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/


Re: Very high latency, low bandwidth replication

From
Bob Jolliffe
Date:
Thanks Francisco for these inputs.  I hadn't considered log shipping, as I knew I didn't want to track changes to all tables (and databases).  Setting up a local partial mirror is an interesting thought which hadn't crossed my mind ... I'll give that some consideration.

Though currently I am thinking of addressing the problem of generating deltas at the application level rather than using postgresql features, which are largely optimized for a slightly different set of circumstances and requirements.

Impressive what can be done with a 2400 baud modem when you set your mind to it.  Fortunately those days are mostly behind us :-)


Re: Very high latency, low bandwidth replication

From
Bob Jolliffe
Date:
Thanks Stuart.  I'll do some measurements on plaintext dumps to git.


Re: Very high latency, low bandwidth replication

From
Francisco Olarte
Date:
Hi Bob:

On Sat, Jul 5, 2014 at 3:34 PM, Bob Jolliffe <bobjolliffe@gmail.com> wrote:
> Though currently I am thinking of addressing the problem of generating
> deltas at the application level rather than using postgresql features,
> which are largely optimized for a slightly different set of circumstances
> and requirements.

Yeah, a special-purpose solution can always be more efficient,
especially when you can tune for your particular circumstances, like
partial replication and replicating only once a day, and you can
probably minimize the changesets knowing the data dependencies.


> Impressive what can be done with a 2400 baud modem when you set your mind to
> it.  Fortunately those days are mostly behind us :-)

Not so far behind. Back then you did that kind of thing for 100Mb data
sets; now you can use the same algorithms for 1Tb data sets. Line
speeds have gone up more than data set sizes, but that advantage has
been eaten by customer delay expectations ( in my case, they now want
at most 30 minutes of delay, not 1440 ).

Not knowing your exact data structure or size I cannot recommend
anything, but I've been able to make really good deltas using the same
algorithms that were used on tapes: dump the table, sort it, diff it
against the previous day's. The throughput of current machines is
amazing; you can easily achieve full disk speed.
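
The receiving side of that scheme can be made idempotent by always
reconstructing the full sorted dump and reloading the table wholesale,
so applying the same delta twice changes nothing (same invented file
and table names as the earlier sketch):

    # ministry side: rebuild today's dump from yesterday's plus the delta
    xz -dc delta.xz | patch yesterday.tsv -o today.tsv
    # reload the whole table in one transaction
    psql -d dhis2 <<'SQL'
    begin;
    truncate datavalue;
    \copy datavalue from today.tsv
    commit;
    SQL
    mv today.tsv yesterday.tsv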


Regards.
   Francisco Olarte.