Обсуждение: Archive to secondary Postgres instance
I have a Postgres database that is doubling in size each year and contains many online orders that were delivered long ago and are now dormant in the database.
My desire is to offload these dormant orders to a secondary Postgres instance, so that the information is still available for viewing purchase history, while keeping the active orders in the main (and eventually smaller) Postgres instance.
These orders live in a highly normalized model that spans over 30 tables.
I have thought of two approaches to handle this "archiving" use case.
One approach is pull-based. As orders naturally go dormant, execute custom SQL to move this information into the secondary Postgres instance. A separate process would then purge this order information from the main database after a period of time (i.e., custom deletes). The upside is that the process is fairly lightweight and straightforward. The downside is that you have to maintain two data models and update the SQL scripts whenever the data model changes.
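As a rough illustration of the pull-based flow, here is a sketch using postgres_fdw, assuming the secondary instance is already set up as a foreign server whose tables are imported into a local `archive` schema. The table names (`orders`, `order_lines`), columns, and the two-year dormancy cutoff are all hypothetical stand-ins for your 30-plus-table model:

```sql
-- Step 1: copy dormant orders to the secondary via foreign tables
-- (archive.orders / archive.order_lines are foreign tables over postgres_fdw).
BEGIN;

INSERT INTO archive.orders
SELECT * FROM orders
WHERE status = 'delivered'
  AND delivered_at < now() - interval '2 years';

INSERT INTO archive.order_lines
SELECT l.* FROM order_lines l
JOIN orders o ON o.id = l.order_id
WHERE o.status = 'delivered'
  AND o.delivered_at < now() - interval '2 years';

COMMIT;

-- Step 2 (a later purge job): delete from the primary in reverse FK order,
-- children before parents, using the same dormancy predicate.
BEGIN;

DELETE FROM order_lines l
USING orders o
WHERE o.id = l.order_id
  AND o.status = 'delivered'
  AND o.delivered_at < now() - interval '2 years';

DELETE FROM orders
WHERE status = 'delivered'
  AND delivered_at < now() - interval '2 years';

COMMIT;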
The second approach is logical replication. The upside is that you don't have to maintain two data models separately. The downside is that it becomes very tricky to differentiate between a "normal" delete, which you would want to apply on the secondary instance, and an "archive" delete, which you wouldn't want to apply on the secondary. It can also get complex when replication fails and you have to re-initiate the stream.
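One way to sidestep the "which delete is which" problem, if the replication route is taken anyway, is to publish only inserts and updates and never replicate deletes at all (supported via the `publish` option of CREATE PUBLICATION in PostgreSQL 10+). Archive deletes on the primary then can't touch the secondary. A sketch with hypothetical table and connection names:

```sql
-- On the primary: replicate inserts and updates only, never deletes.
CREATE PUBLICATION order_history
    FOR TABLE orders, order_lines
    WITH (publish = 'insert, update');

-- On the secondary:
CREATE SUBSCRIPTION order_history_sub
    CONNECTION 'host=archive-db dbname=orders user=repl'
    PUBLICATION order_history;
```

The trade-off is that "normal" deletes (e.g., genuine cancellations you do want reflected in history) no longer propagate either, so they would need a separate cleanup path on the secondary.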
My question is whether anyone else has solved this, and if so, how? I'm leaning toward pull-based, because replication doesn't feel right given the complexity around "selective" deletes and other replication failure scenarios.
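If pull-based wins, one way to soften the "maintain the SQL scripts as the model changes" downside is to derive the purge order from the catalog instead of hand-coding it: rank each table by its depth in the foreign-key graph, and delete lowest rank first (children before parents). A sketch against `pg_constraint`, assuming the tables live in `public` and the FK graph has no cycles or self-references:

```sql
-- Rank tables by FK depth: depth 0 = nothing references the table,
-- so its rows can be deleted first; parents get higher ranks.
WITH RECURSIVE fk_depth AS (
    SELECT c.oid AS tbl, 0 AS depth
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'public'
      AND c.relkind = 'r'
      AND NOT EXISTS (SELECT 1 FROM pg_constraint fk
                      WHERE fk.contype = 'f' AND fk.confrelid = c.oid)
    UNION ALL
    -- Each table referenced by a ranked table sits one level deeper.
    SELECT fk.confrelid, d.depth + 1
    FROM fk_depth d
    JOIN pg_constraint fk
      ON fk.contype = 'f' AND fk.conrelid = d.tbl
)
SELECT tbl::regclass AS table_name,
       max(depth)    AS delete_rank
FROM fk_depth
GROUP BY tbl
ORDER BY delete_rank;   -- lowest rank first = safe delete order
```

A purge job could regenerate its DELETE statements from this ordering, so a schema change adds at most one predicate rather than a rewrite of the whole script set.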
Thanks!
David,

Your approach #1 is one that has been proven for years in both commerce and securities-trading applications. The pitfalls in approach #2 have always made it too problematic for me, regardless of the DBMS involved.

A third approach is to use a change data capture (CDC) tool. These tools manage rules-based asymmetric replication between databases: it basically provides your option #1 with a logical definition of the transfer, plus both seeding and catch-up capabilities, but without having to manage a ton of replication scripts. I've used the approach for large-scale financial-services applications where the "trade" database of active transactions has to be incredibly fast (and therefore small, with short-lived rows), and the "trade-history" database needs to maintain years of data for compliance and analysis. The asymmetric publisher-to-subscriber relationship allows for all kinds of interesting indexing and aggregation through materialized views in the history database that you could never afford in the "live" OLTP database.

Talend is the commercial open-source leader in this area for database-to-database CDC replication; my limited experience with it (though not with PostgreSQL) has been good. Debezium is based on Apache Kafka and looks promising, but requires both an extension for PostgreSQL publishers and a custom application on the subscriber side. There are a number of commercial products in this space as well. IBM's InfoSphere CDC product is the one I know best, but it is both proprietary and (when I last looked) had limited PostgreSQL support.
There is a list of other alternatives on the PostgreSQL wiki at: https://wiki.postgresql.org/wiki/Ecosystem:Replication

Cheers,
- Evan

Evan Bauer
eb@evanbauer.com
+1 646 641 2973
Skype: evanbauer

> On Aug 31, 2018, at 21:57, David Rydzewski <david.rydzewski@gmail.com> wrote: