Discussion: Archive to secondary Postgres instance


Archive to secondary Postgres instance

From:
David Rydzewski
Date: August 31, 2018, 21:57
I have a Postgres database that is doubling in size each year and contains many online orders that were delivered long ago and are now dormant.

My desire is to offload these dormant orders to a secondary Postgres instance, so that the information is still available for viewing purchase history, while keeping the active orders in the main (and eventually smaller) Postgres instance.

These orders live in a highly normalized model that spans over 30 tables.

I have thought of two approaches to handle this "archiving" use case.

One approach is pull-based. As orders naturally go dormant, execute custom SQL to move this information into the secondary Postgres instance. A separate process would then purge this order information from the main database after a period of time (e.g., custom deletes). The upside is that the process is fairly lightweight and straightforward. The downside is that you have to maintain two data models and update the SQL scripts whenever the data model changes.
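
To make that concrete, here is a stripped-down sketch of the kind of SQL I have in mind (purely for illustration: a single "orders" table with a delivered_at timestamp, and the secondary instance mapped locally as archive.orders through postgres_fdw; the real script would obviously have to walk all 30+ tables in dependency order):

  BEGIN;
  -- copy dormant orders to the archive instance via the foreign table
  INSERT INTO archive.orders
  SELECT * FROM orders
  WHERE delivered_at < now() - interval '2 years';
  -- purge the same rows from the primary
  DELETE FROM orders
  WHERE delivered_at < now() - interval '2 years';
  COMMIT;

(In practice I'd probably drive the DELETE off the set of copied primary keys rather than repeating the predicate, and since the write to the foreign table isn't guaranteed to be atomic with the local commit, the purge step would first verify the rows actually landed on the archive side.)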

The second approach is logical replication. The upside is that you don't have to maintain two data models separately. The downside is that it becomes very tricky to differentiate between a "normal" delete, which you would want to apply on the secondary instance, and an "archive" delete, which you wouldn't want to apply on the secondary. It could also get complex when replication fails and you have to re-initiate the stream.
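
(For what it's worth, the closest built-in knob I've found is limiting which operations a publication publishes, e.g.:

  -- hypothetical publication covering the order tables
  CREATE PUBLICATION orders_pub FOR TABLE orders, order_items
    WITH (publish = 'insert, update');

but that suppresses every delete on the subscriber, so it still doesn't distinguish a genuine business delete from an archive purge.)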

My question is whether anyone else has solved this and, if so, how? I'm leaning toward pull-based, because replication doesn't feel right given the complexity around "selective" deletes and the other replication failure scenarios.

Thanks!





Re: Archive to secondary Postgres instance

From:
Evan Bauer
Date:
David,

Your approach #1 is one that has been proven for years in both commerce and securities trading applications. The pitfalls in approach #2 have always made it too problematic for me, regardless of the DBMS involved.

A third approach is to use a change data capture (CDC) tool. These tools manage rules-based asymmetric replication between databases: you basically get your option #1 with a logical definition of the transfer, plus both seeding and catch-up capabilities, without having to manage a ton of replication scripts. I've used the approach for large-scale financial services applications where the "trade" database of active transactions has to be incredibly fast (and therefore small, with short-lived rows), and the "trade-history" database needs to maintain years of data for compliance and analysis. The asymmetric publisher-to-subscriber relationship allows for all kinds of interesting indexing and aggregation through materialized views in the history database that you could never afford in the "live" OLTP database.
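
To illustrate the kind of thing I mean, on the history side you can afford something like this (a sketch only; the table and column names are invented):

  -- monthly aggregate over the archived trade history
  CREATE MATERIALIZED VIEW trade_volume_by_month AS
  SELECT date_trunc('month', executed_at) AS month,
         symbol,
         count(*)      AS trades,
         sum(quantity) AS shares
  FROM trade_history
  GROUP BY 1, 2;

  CREATE UNIQUE INDEX ON trade_volume_by_month (symbol, month);

  -- refreshed on whatever schedule compliance and analysis need
  REFRESH MATERIALIZED VIEW CONCURRENTLY trade_volume_by_month;

None of that would be tolerable on the small, hot "trade" database, but it is cheap on the history side.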

Talend is the commercial open source leader in this area for database-to-database CDC replication; my limited experience with it (though not with PostgreSQL) has been good. Debezium is based on Apache Kafka; it looks promising, but requires both an extension for PostgreSQL publishers and a custom application on the subscriber side. There are a number of commercial products in this space as well. IBM's InfoSphere CDC product is the one I know best, but it is both proprietary and (when I last looked) had limited PostgreSQL support. There is a list of other alternatives on the PostgreSQL wiki at: https://wiki.postgresql.org/wiki/Ecosystem:Replication

Cheers,

- Evan

Evan Bauer
eb@evanbauer.com
+1 646 641 2973
Skype: evanbauer

