Re: Disable vacuuming to provide data history

From: Corey Huinker
Subject: Re: Disable vacuuming to provide data history
Date:
Msg-id: CADkLM=dt3se+oXEC6FZpTbj1ezaXnfyDm3XougO6mSG2KF69DQ@mail.gmail.com
In reply to: Disable vacuuming to provide data history  (marekmosiewicz@gmail.com)
Responses: Re: Disable vacuuming to provide data history
List: pgsql-hackers
On Thu, Feb 23, 2023 at 6:04 AM <marekmosiewicz@gmail.com> wrote:
Hey,

It depends on the scenario, but there are many use cases where a malicious
data change by somebody with admin privileges could be a disaster.
That is where data history could help. A basic solution would be a
trigger that writes the previous version of a record
to some other table. A trigger, however, can be disabled or removed (a crazy
solution would be to provide permanent
triggers and tables which can only ever be inserted into).
Then there is also the possibility of modifying the tablespace directly on disk.

But Postgres has the ability to not overwrite records when two concurrent
transactions modify data, in order to provide MVCC.

So what about permanently non-vacuumable tables? Add some xid log
tables with a hash of each record chained to the previous hash.
I think that would be a serious additional advantage for the best open source
relational database.

Best regards,
   Marek Mosiewicz

What you are describing sounds like the "system versioning" flavor of "temporal" tables. It's a part of the SQL Standard, but PostgreSQL has yet to implement it in core. Basically, every row has a start_timestamp and end_timestamp field. Updating a row sets the end_timestamp of the old version and inserts a new one with a start_timestamp matching the end_timestamp of the previous row. Once a record has a non-null [1] end_timestamp, it is not possible to update that row via SQL. Regular SQL statements effectively have an "AND end_timestamp IS NULL" filter on them, so the old rows are not visible without specifically invoking temporal features to get point-in-time queries. At the implementation level, this probably means a table with two partitions: one for live rows, all having null end_timestamps, and one for archived rows, which is effectively append-only.
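The update-closes-old-row, insert-new-row mechanic described above can be sketched in a few lines. This is a minimal home-rolled illustration (not PostgreSQL's or the SQL Standard's actual implementation), using Python's stdlib sqlite3 for self-containment; the `accounts` table and its columns are hypothetical names chosen for the example:

```python
import sqlite3

# Home-rolled sketch of the system-versioning pattern: rows are never
# updated in place; the live version is "closed" and a new version inserted.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        id       INTEGER,
        balance  INTEGER,
        start_ts TEXT NOT NULL,
        end_ts   TEXT            -- NULL means "current version"
    )
""")

def versioned_update(conn, acct_id, new_balance, now):
    # Close out the live row version by setting its end_ts...
    conn.execute(
        "UPDATE accounts SET end_ts = ? WHERE id = ? AND end_ts IS NULL",
        (now, acct_id))
    # ...and insert the new version whose start_ts matches that end_ts.
    conn.execute(
        "INSERT INTO accounts (id, balance, start_ts, end_ts) "
        "VALUES (?, ?, ?, NULL)",
        (acct_id, new_balance, now))

conn.execute("INSERT INTO accounts VALUES (1, 100, '2023-01-01', NULL)")
versioned_update(conn, 1, 250, '2023-06-01')

# Regular queries carry the implicit "AND end_ts IS NULL" filter:
live = conn.execute(
    "SELECT balance FROM accounts WHERE id = 1 AND end_ts IS NULL").fetchall()
# Point-in-time queries can still walk the full history:
history = conn.execute(
    "SELECT balance, start_ts, end_ts FROM accounts "
    "WHERE id = 1 ORDER BY start_ts").fetchall()
```

Here `live` contains only the current balance, while `history` shows both versions with contiguous timestamp ranges, which is exactly what makes "what did this row look like a year ago?" queries possible.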

This strategy is common practice for chain of custody and auditing purposes, either as a feature of the RDBMS or home-rolled. I have also seen it used for developing forecasting models (ex "what would this model have told us to do if we had run it a year ago?").

A few years ago, I personally thought about implementing a hash-chain feature, but my research at the time concluded that:

* Few customers were interested in going beyond what was required for regulatory compliance
* Once compliant, any divergence from established procedures, even an unambiguous improvement, only invited re-examination of those and adjacent procedures, so they avoided it
* They could get the same validation by comparing against a secured backup and out-of-band audit "logs" (most would call them "reports")
* They were of the opinion that if a bad actor got admin access, it was "game over" anyway
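For reference, the hash-chain idea from the original message (each audit record storing a hash over its contents plus the previous record's hash) can be sketched roughly like this. This is a plain-Python illustration with invented names, not a proposal for how core would do it, and the last bullet's caveat still applies: an admin who controls the storage can rewrite the whole chain unless its head is anchored out of band.

```python
import hashlib
import json

GENESIS = "0" * 64  # seed "previous hash" for the first entry

def chain_append(log, record):
    """Append a record; its hash covers the record plus the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    payload = json.dumps(record, sort_keys=True)  # canonical serialization
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    log.append({"record": record, "hash": entry_hash})

def chain_verify(log):
    """Recompute every hash; tampering with any entry breaks all later links."""
    prev_hash = GENESIS
    for entry in log:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if expected != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True

audit_log = []
chain_append(audit_log, {"xid": 1001, "op": "UPDATE", "row_id": 1})
chain_append(audit_log, {"xid": 1002, "op": "DELETE", "row_id": 2})
```

After building the log, `chain_verify(audit_log)` returns True; editing any earlier record in place makes it return False, which is the tamper-evidence property being discussed.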

The world may have changed since then, but even if there is now interest, I wonder if that isn't better implemented at the OS level rather than the RDBMS level.

 [1] Some implementations don't use null; they use an end-timestamp set to a date implausibly far in the future (3999-12-31, for example), but the concept remains that once the column is set to a real timestamp, the row isn't visible to update statements.

In the pgsql-hackers list, by date sent:

Previous
From: "Imseih (AWS), Sami"
Date:
Message: Doc update for pg_stat_statements normalization
Next
From: Heikki Linnakangas
Date:
Message: Re: BRIN indexes vs. SK_SEARCHARRAY (and preprocessing scan keys)