Re: Seeking Suggestions for Best Practices: Archiving and Migrating Historical Data in PostgreSQL

From: Motog Plus
Subject: Re: Seeking Suggestions for Best Practices: Archiving and Migrating Historical Data in PostgreSQL
Date:
Msg-id: CAL5Gnivs1MKypYvrOGFLyk73KG8wmg1qAint=pkMdTkCGBEXMQ@mail.gmail.com
In reply to: Seeking Suggestions for Best Practices: Archiving and Migrating Historical Data in PostgreSQL  (Motog Plus <mplus7535@gmail.com>)
List: pgsql-admin
Hi Everyone,

Thank you all for the helpful suggestions, insights, and follow-up questions. I truly appreciate the time and effort you’ve taken to share your experiences and recommendations.

To answer one of the common questions: **yes, we are using partitioned tables**, primarily based on a timestamp column. This setup is already helping us manage and isolate historical data more effectively.

The input from this community has been incredibly valuable in helping us shape our archival approach. We’re currently evaluating a few options based on your feedback and will proceed with a solution that best balances efficiency, reliability, and security.

We may reach out again with more specific questions or for further suggestions once we finalize the approach and start implementation.

Thanks again for your support!

Best regards,  
Ramzy

On Sat, May 31, 2025, 01:01 Ron Johnson <ronljohnsonjr@gmail.com> wrote:
That's an unanswerable question, as I would not use Windows.  😁

Seriously though, since it's an image-heavy database full of PDF and TIFF files, I'd do what I did on Linux when I needed to migrate/upgrade a 6TB (including indices) db from PG 9.6 to PG 14, which took four hours:
`pg_dump -Z1 --jobs=16`
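(For context: `pg_dump --jobs` only works with the directory output format, so a fuller invocation would look something like the sketch below; the host, database, and path names are illustrative, not from the original migration.)

```bash
# Parallel, lightly compressed dump to a directory-format archive.
# --jobs requires -Fd; -Z1 keeps compression cheap, which suits
# already-compressed PDF/TIFF content.
pg_dump -h dbhost -U postgres -Fd -Z1 --jobs=16 \
        -f /backups/imagedb_dump imagedb

# Restore with the matching parallel option:
pg_restore -h newhost -U postgres --jobs=16 -d imagedb /backups/imagedb_dump
```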


On Fri, May 30, 2025 at 2:39 PM Andy Hartman <hartman60home@gmail.com> wrote:
What would you use for backup if PG is hosted on Windows?

On Fri, May 30, 2025 at 2:10 PM Ron Johnson <ronljohnsonjr@gmail.com> wrote:
Hmm... that was a few years ago, back when v12 was new.  It took about a month (mainly because they didn't want me running exports during "office hours").

There were 120 INSERT & SELECT (no UPDATE or DELETE) tables, so I was able to add indices on date columns and create by-month views.  (We migrated the dozen or so relatively small UPDATE tables on cut-over day.  On that same day, I migrated the current month's and the previous month's data in those 120 tables.)

I made separate cron jobs to:
- export views from Oracle into COPY-style tab-separated flat files, 
- lz4-compress views that had finished exporting, and
- scp files that had finished compressing to an AWS EC2 VM.

These jobs formed a pipeline, so there was always a job exporting, always a job ready to compress TSV files, and another ready to scp the lz4 files.  When there was nothing for a step to do, the job would sleep for a couple of minutes, then check whether there was more work to do.

On the AWS EC2 VM, a different cron job waited for files to finish transferring, then loaded them into the correct table. Just like the source-host jobs, the "load" job would sleep a bit and then check for more work. I applied indices manually.
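As a rough illustration of that sleep-and-poll pattern (the directory names and the `.done` marker convention are hypothetical, not the original scripts), the compress step could be a loop along these lines:

```bash
#!/usr/bin/env bash
# Hypothetical compress worker: picks up finished TSV exports, lz4-compresses
# them, and leaves the .lz4 files for the scp job.  Sleeps when idle.
set -euo pipefail

EXPORT_DIR=/data/export   # the export job writes foo.tsv, then touches foo.tsv.done
OUTBOX_DIR=/data/outbox   # the scp job picks up *.lz4 from here

while true; do
    found=0
    for marker in "$EXPORT_DIR"/*.tsv.done; do
        [ -e "$marker" ] || continue                  # glob matched nothing
        tsv=${marker%.done}
        lz4 -q "$tsv" "$OUTBOX_DIR/$(basename "$tsv").lz4"
        rm -f "$tsv" "$marker"
        found=1
    done
    # Nothing ready: nap for a couple of minutes, then look again.
    [ "$found" -eq 0 ] && sleep 120
done
```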

The AWS RDS PG12 database was about 4TB.  Snapshots were handled by AWS.  If this had been one of my on-prem systems, I'd have used pgbackrest.  (pgbackrest is impressively fast: it takes good advantage of PG's 1GB file max and globs "small" files into one big file.)

On Fri, May 30, 2025 at 12:15 PM Andy Hartman <hartman60home@gmail.com> wrote:
What was the duration, start to finish, of the migration of the 6 TB of data? And what do you use for a quick backup of the archived PG data?

Thanks.

On Fri, May 30, 2025 at 11:29 AM Ron Johnson <ronljohnsonjr@gmail.com> wrote:
On Fri, May 30, 2025 at 3:51 AM Motog Plus <mplus7535@gmail.com> wrote:
Hi Team,

We are currently planning a data archival initiative for our production PostgreSQL databases and would appreciate suggestions or insights from the community regarding best practices and proven approaches.

**Scenario:**
- We have a few large tables (several hundred million rows) where we want to archive historical data (e.g., older than 1 year).
- The archived data should be moved to a separate PostgreSQL database (on the same or a different server).
- Our goals are: efficient data movement, minimal downtime, and safe deletion from the source after successful archival.

- PostgreSQL version: 15.12
- Both source and target databases are PostgreSQL.

We explored using `COPY TO` and `COPY FROM` with CSV files uploaded to SharePoint or a similar storage system. However, our infrastructure team raised concerns around the computational load of large CSV processing and potential security implications with file transfers.

We’d like to understand:
- What approaches have worked well for you in practice?

This is how I migrated 6TB of data from an Oracle database to PostgreSQL, and then implemented quarterly archiving of the PG database:
- COPY (SELECT * FROM live_table WHERE date_fld in some_manageable_date_range) TO STDOUT.
- Compress
- scp
- COPY archive_table FROM STDIN.
- Index
- DELETE FROM live_table WHERE date_fld in some_manageable_date_range.  (This last step I only did in the PG archive process; the whole sequence is sketched below.)
 
(Naturally, the Oracle migration used Oracle-specific commands.)
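A minimal sketch of that round trip for a single table and date range, with all names (hosts, databases, `live_table`, `archive_table`, `date_fld`) as placeholders rather than the original scripts:

```bash
#!/usr/bin/env bash
# Hypothetical one-table archive pass: export a date range from the live DB,
# compress, ship it to the archive host, load, index, then delete at the source.
set -euo pipefail

RANGE="date_fld >= DATE '2024-01-01' AND date_fld < DATE '2024-04-01'"

# 1. COPY the date range out of the live table, compressing on the fly.
psql -h live-host -d livedb -c \
  "COPY (SELECT * FROM live_table WHERE $RANGE) TO STDOUT" \
  | lz4 -q > live_table_2024q1.lz4

# 2. Ship the compressed file to the archive host.
scp live_table_2024q1.lz4 archive-host:/var/tmp/

# 3. Load it into the archive table, then (re)build the index.
ssh archive-host "lz4 -dcq /var/tmp/live_table_2024q1.lz4 \
  | psql -d archivedb -c 'COPY archive_table FROM STDIN'"
ssh archive-host "psql -d archivedb -c \
  'CREATE INDEX IF NOT EXISTS archive_table_date_idx ON archive_table (date_fld)'"

# 4. Only after every step above succeeded (set -e enforces that) delete the
#    archived rows from the live table.
psql -h live-host -d livedb -c "DELETE FROM live_table WHERE $RANGE"
```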

- Are there specific tools or strategies you’d recommend for ongoing archival?

I write generic bash loops to which you pass an array that contains the table name, PK, date column and date range.

Given a list of tables, the first script did the COPY ... TO STDOUT, lz4, and scp.  Once that finished successfully, another script dropped the archive table's indices, ran the COPY ... FROM STDIN, and then the CREATE INDEX statements.  A third script did the deletes.

This works even when the live database tables are all connected via FK.  You just need to carefully order the tables in your script.
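As a sketch of that kind of driver (all table names and helper scripts below are hypothetical), the per-table parameters can live in an array, ordered so FK children come before their parents, with every return code checked:

```bash
#!/usr/bin/env bash
# Hypothetical driver: one "table|pk|date_column|cutoff" entry per table,
# ordered so FK children are archived and deleted before their parents.
set -euo pipefail

TABLES=(
  "order_items|item_id|created_at|2024-01-01"   # child of orders
  "orders|order_id|created_at|2024-01-01"
  "audit_log|log_id|logged_at|2024-01-01"
)

for entry in "${TABLES[@]}"; do
    IFS='|' read -r table pk datecol cutoff <<< "$entry"

    # Each helper is one "bite-sized" script; stop at the first failure.
    ./export_and_ship.sh "$table" "$datecol" "$cutoff"       || { echo "export failed: $table" >&2; exit 1; }
    ./load_and_index.sh  "$table" "$datecol"                 || { echo "load failed: $table"   >&2; exit 1; }
    ./delete_archived.sh "$table" "$pk" "$datecol" "$cutoff" || { echo "delete failed: $table" >&2; exit 1; }
done
```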
 
- Any performance or consistency issues we should watch out for?

My rules for scripting are "bite-sized pieces" and "check those return codes!".
 
Your insights or any relevant documentation/pointers would be immensely helpful.
 
Index support über alles.  When deleting from a table that relies on a foreign-key link to a table which _does_ have a date field, don't hesitate to join on that table.

And DELETE of bite-sized chunks is faster than people give it credit for.
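For example, a batched delete that borrows the parent table's date column might look like the sketch below (table and column names are hypothetical, and the batch size is whatever your maintenance window tolerates):

```bash
#!/usr/bin/env bash
# Hypothetical chunked delete: the child table has no date column of its own,
# so join to the parent (which does) and delete in modest batches until done.
set -euo pipefail

while true; do
    deleted=$(psql -d livedb -qAt -c "
        WITH doomed AS (
            SELECT c.child_id
            FROM   child_table c
            JOIN   parent_table p ON p.parent_id = c.parent_id
            WHERE  p.date_fld < DATE '2024-01-01'
            LIMIT  10000
        )
        DELETE FROM child_table
        WHERE  child_id IN (SELECT child_id FROM doomed)
        RETURNING 1;" | wc -l)

    echo "deleted $deleted rows"
    [ "$deleted" -eq 0 ] && break    # nothing left in this date range
done
```

The `RETURNING 1` rows counted by `wc -l` give each batch's size, so the loop stops as soon as a pass deletes nothing.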

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!


