Обсуждение: data consolidation: logical replication design considerations

Поиск
Список
Период
Сортировка

data consolidation: logical replication design considerations

От
Rory Campbell-Lange
Дата:
I'd be grateful for some comments on the advisability of using a large
number of concurrent logical replication publications/subscriptions.
Below I've set out the current environment and a suggested design.
Apologies for the length of this email.

We presently have many hundreds of small databases in a cluster, in
separate databases due to historical issues, security considerations and
the ability to scale horizontally in future by using more than one
cluster. The cluster is presently less than half a terabyte in size and
runs very comfortably on a 96GB RAM/32 core Intel E5-2620 server on NVMe
disks in RAID10 configuration on Linux.

The individual databases cover a handful of discrete services (although
they have some common data structures) and are of different sizes
depending on client needs. The largest client database is currently
about 7.5GB in size.

We presently use streaming replication locally and remotely to replicate
the cluster and it has pg_bouncer in front of it. Some settings:

    max_connections: 500
    shared_buffers: 20GB
    work_mem: 15MB

Due to changing client and operational requirements we need to aggregate
some common data between client databases in the same organisation into
single read-only databases for reporting purposes. This new requirement
is in addition to keeping the client databases in operation as they are
now. The potential for using logical replication comes to mind,
specifically the use case of "Consolidating multiple databases into a
single one" mentioned at
https://www.postgresql.org/docs/current/logical-replication.html

Some tests suggest that we can meet the requirements for publication
table replica identity and safe aggregation of data.

At an overview level this consolidation might require the setup of
logical replication publications from say 500 client databases
aggregating in close to real time to 50 target or aggregation
subscribing databases, averaging roughly 10 client database per
aggregation database, but with some aggregation databases having roughly
50 clients.

I would be grateful for comments on the following design proposals:

* to avoid overloading the existing primary host with many replication
  slots, it would be wise to implement aggregation on another host 

* the new aggregation host should receive streaming replication data
  from the primary on a first postgresql instance which will also have
  logical replication publishers on each relevant client database. As
  noted above, there may be ~500 publications

* the new aggregation host would have a second postgresql instance
  serving the aggregation databases each acting as logical
  replication subscribers. As noted above, there would be ~50 target
  databases each with an average of ~10 subscriptions to the first
  postgresql instance.

* that only one subscription per client database is needed (after
  initial synchronisation) to synchronise all tables in a particular
  client database schema

* that publications and subscriptions are brought online on a per-client
  database basis, to reduce the number of replication slots required due
  to initial synchronisation (the docs aren't clear about how many
  temporary replication slots may be needed "for the initial data
  synchronisation of pre-existing table data"; see
  https://www.postgresql.org/docs/current/logical-replication-subscription.html)

* that a similar server to the one noted above can handle two postgresql
  instances as described together with ~250 concurrent client
  connections to the second instance to serve client reporting needs.

Thoughts gratefully received,
Rory



Re: data consolidation: logical replication design considerations

От
Rick Otten
Дата:


On Sat, Jul 16, 2022 at 12:07 PM Rory Campbell-Lange <rory@campbell-lange.net> wrote:
I'd be grateful for some comments on the advisability of using a large
number of concurrent logical replication publications/subscriptions.
Below I've set out the current environment and a suggested design.
Apologies for the length of this email.

Another possibility is to use SymmetricDS for this.  [ https://symmetricds.org ]  SymmetricDS was originally developed to keep databases on thousands of Point-of-Sale databases (in cash registers) in sync with pricing and inventory data for large international retailers.

There are lots of other use cases, but even 10-12 years ago it was scalable to the extent you are describing you need here.

The main drawback is that it is trigger based, so there is some slight latency introduced for insert/update/delete actions on the tables on the appropriate master, but it usually isn't significant.

 

Re: data consolidation: logical replication design considerations

От
Rory Campbell-Lange
Дата:
On 17/07/22, Rick Otten (rottenwindfish@gmail.com) wrote:
> On Sat, Jul 16, 2022 at 12:07 PM Rory Campbell-Lange <
> rory@campbell-lange.net> wrote:
> 
> > I'd be grateful for some comments on the advisability of using a large
> > number of concurrent logical replication publications/subscriptions.
> > Below I've set out the current environment and a suggested design.
> > Apologies for the length of this email.
> 
> Another possibility is to use SymmetricDS for this.  [
> https://symmetricds.org ]  SymmetricDS was originally developed to keep
> databases on thousands of Point-of-Sale databases (in cash registers) in
> sync with pricing and inventory data for large international retailers.
> 
> There are lots of other use cases, but even 10-12 years ago it was scalable
> to the extent you are describing you need here.
> 
> The main drawback is that it is trigger based, so there is some slight
> latency introduced for insert/update/delete actions on the tables on the
> appropriate master, but it usually isn't significant.

Thanks very much for the pointer to SymmetricDS. I haven't come across
it before. The architecture, configuration and use look very
straightforward, although using java would be new to our production
environment, and SymmetricDS doesn't seem to be in Debian.

I'd be grateful to know if 500 odd publishers/subscribers is "out of the
park" or reasonable for a reasonably powerful machine (as described in
my original email). I would have thought using the native logical
replication capabilities would be much more scalable and efficient than
stepping outside of postgresql.

Regards,
Rory



Re: data consolidation: logical replication design considerations

От
Rory Campbell-Lange
Дата:
On 16/07/22, Rory Campbell-Lange (rory@campbell-lange.net) wrote:
> I'd be grateful for some comments on the advisability of using a large
> number of concurrent logical replication publications/subscriptions.
> Below I've set out the current environment and a suggested design.
> Apologies for the length of this email.
...
> * to avoid overloading the existing primary host with many replication
>   slots, it would be wise to implement aggregation on another host 

Looking into this further it appears that a streaming replication secondary can
not act as a logical replication publisher. Is that correct?

Rory