Re: WIP: System Versioned Temporal Table

From Corey Huinker
Тема Re: WIP: System Versioned Temporal Table
Date
Msg-id CADkLM=e=16pyNwHteAXp3r67fJmWz0BYcppphVH9wSSU4kThnQ@mail.gmail.com
In reply to Re: WIP: System Versioned Temporal Table  (Trevor Gross <t.gross35@gmail.com>)
Responses Re: WIP: System Versioned Temporal Table  (Vik Fearing <vik@postgresfriends.org>)
List pgsql-hackers

> 2. Putting data in a side table. This makes DROP SYSTEM VERSIONING
> fairly trivial, but it complicates many DDL commands (please make a
> list?) and requires the optimizer to know about this and cater to it,
> possibly complicating plans. Neither issue is insurmountable, but it
> becomes more intrusive.

I'd vouch for this being the way to go; you completely sidestep issues
like partitioning, unique constraints, optimization, etc. That is
especially true when, 90% of the time, SELECTs will only be looking at
currently-active data. MDB seems to have gone with the single-table
approach (unless you partition), and I've run into a bug where I can't
add a unique constraint because historical rows violate it.
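
To make that concrete, here's roughly the shape of what I ran into, in
MariaDB-style syntax (table and column names invented; exact behavior may
vary by version):

```sql
-- MariaDB-style single-table versioning (not Postgres syntax).
CREATE TABLE products (
    id  INT PRIMARY KEY,
    sku VARCHAR(20)
) WITH SYSTEM VERSIONING;

INSERT INTO products VALUES (1, 'A-1');
UPDATE products SET sku = 'A-2' WHERE id = 1;  -- old ('A-1') row is kept as history
UPDATE products SET sku = 'A-1' WHERE id = 1;  -- 'A-1' now exists both live and in history

-- With history in the same table, adding a constraint like this has to
-- reckon with rows that only exist historically:
ALTER TABLE products ADD CONSTRAINT sku_uniq UNIQUE (sku);
```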

#### System versioning & Application versioning
I saw that there is an intent to harmonize system versioning with
application versioning. I haven't read the AV thread, so I'm not sure
whether that meant splitting tables for both application versioning and
system versioning; to me it seems like it would be good to use a
separate table for SV but keep AV in the same table. Reasons include:

The proposed AV uses just one table.
 
- ISO states only one AV config per table, but there's no reason this
always has to be the case; maybe you're storing products that are
active for a period of time, EOL for a period of time, and obsolete
for a period of time. If ISO sometime decides >1 AV config is OK,
there would be a mess trying to split that into tables.

The proposed AV (so far) allows for that.
 
- DB users who are allowed to change AV items likely won't be allowed
to rewrite history by changing SV items. My proposed schema would keep
these separate.
- Table schemas change, and all (SV active) AV items would logically
need to fit the active schema or be updated to do so. It's a different
story for SV: nothing there should ever need to be changed.

Yeah, there's a mess (which you state below) about what happens if you create a table and then rename a column, or drop a column and later add back a same-named column of another type, etc. In theory, this means that the valid set of columns and their types changes according to the time range specified. I may not be remembering correctly, but I believe Vik stated that the SQL spec seemed to imply that you had to track all those things.
 
- Partitioning for AV tables isn't as clear as with SV and is likely
better to be user-defined

So this was something I was asking various parties about at PgConf NYC just a few weeks ago. Supposing that the main reason for SV is a regulatory concern, what tolerance do regulators have for the ability to manipulate the SV side-table? Is it possible to directly insert rows into one? If not, then moving rows into a new partition becomes impossible, and you'd be stuck with the partitioning strategy (if any) that you defined at SV creation time.

The feedback I got was "well, you're already a superuser; if a regulator had a problem with that, then they would have required that the SV table's storage be outside the server: either a foreign table, a CSV foreign data wrapper of some sort, or a trigger writing to non-db storage (which wouldn't even need SV)."

From that, I concluded that every single AV partition would have its own SV table, which could in turn be partitioned. In a sense, it might be helpful to think of the SV tables as partitions of the main table, and the period definition would effectively be the constraint that prunes the SV partition.
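
A rough emulation of the layout I'm picturing, with plain tables and
invented names (no new syntax assumed):

```sql
-- Live data: row_end would always be "infinity" here, so it need not be stored.
CREATE TABLE orders (
    order_id  bigint PRIMARY KEY,
    total     numeric,
    row_start timestamptz NOT NULL DEFAULT now()
);

-- History side-table, acting much like a partition of the logical table.
CREATE TABLE orders_history (
    order_id  bigint NOT NULL,
    total     numeric,
    row_start timestamptz NOT NULL,
    row_end   timestamptz NOT NULL,
    CHECK (row_start < row_end)
);

-- A point-in-time read only needs to scan the side that can contain the
-- timestamp, much the way constraint exclusion prunes partitions.
SELECT order_id, total
  FROM orders_history
 WHERE row_start <= timestamptz '2021-06-01'
   AND row_end   >  timestamptz '2021-06-01'
UNION ALL
SELECT order_id, total
  FROM orders
 WHERE row_start <= timestamptz '2021-06-01';
```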
 

Sorry for acronyms, SV=system versioning, AV=application versioning

In general, I think AV should be treated literally as extra rows in
the main DB, plus the extra PK element and shortcut functions. SV
though, needs to have a lot more nuance.
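
As a sketch of what "extra rows plus an extra PK element" could look like
in plain SQL (invented names; no PERIOD syntax assumed):

```sql
-- Application-time versioning as ordinary rows in the same table.
CREATE TABLE prices (
    sku        text,
    price      numeric NOT NULL,
    valid_from date NOT NULL,
    valid_to   date NOT NULL,
    PRIMARY KEY (sku, valid_from),   -- the "extra PK element"
    CHECK (valid_from < valid_to)
);

-- "Shortcut function": what was the price of a SKU on a given day?
SELECT price
  FROM prices
 WHERE sku = 'A-1'
   AND valid_from <= date '2022-01-01'
   AND valid_to   >  date '2022-01-01';
```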

#### ALTER TABLE
On to ideas about how ALTER TABLE could work. I don't think the
question "Do schema changes need to be tracked?" was ever answered. I'm
generally in favor of saying that it should be possible to recreate
the table exactly as it was, schema and all, at a specific point in
time (perhaps via a view) using a fancy combination of SELECT ... AS
and such - but it doesn't need to be straightforward. In any case, no
data should ever be deleted by ALTER TABLE. As someone pointed out
earlier, speed and storage space of ALTER TABLE are likely low
considerations for system versioned tables.

- ADD COLUMN easy, add the column to both the current and historical
table, all null in historical
- DROP COLUMN delete the column from the current table. Historical is
difficult, because what happens if a new column with the same name is
added? Maybe `DROP COLUMN col1` would rename col1 to _col1_1642929683
(epoch time) in the historical table or something like that.
- RENAME COLUMN is a bit tricky too - from a usability standpoint, the
historical table should be renamed as well. A quick thought is maybe
`RENAME col1 TO new_name` would perform the rename in the historical
table, but also create _col1_1642929683 as an alias to new_name to
track that there was a change. I don't think there would be any name
violations in the history table because there would never be a column
name in history that isn't in current (because of the rename described
with DROP).
- Changing column data type: ouch. This needs to be mainly planned for
cases where data types are incompatible, possibly optimized for times
when they are compatible. Seems like another _col1_1642929683 rename
would be in order, and a new col1 created with the new datatype, and a
historical SELECT would automatically merge the two. Possible
optimization: if the old type fits into the new type, just change the
data type in history and make _col1_1642929683 an alias to it.
- Change defaults, nullability, constraints, etc: I think these can
safely be done for the current table only. Realistically, historical
tables could probably skip all checks, always (except their tuple PK),
since trying to enforce them would just be opening the door to bugs.
Trying to think of any times this isn't true.
- FKs: I'm generally in the same boat as above, thinking that these
don't need to affect historical tables. Section 2.5 in the paper I
link below discusses period joins, but I don't think any special
behavior is needed for now. Perhaps references could be kept in
history but not enforced
- Changing PK / adding/removing more columns to PK: Annoying and not
easily dealt with. Maybe just disallow
- Triggers: no effect on historical
- DROP TABLE bye bye, history & all

You seem to have covered all the bases, and the only way I can think to sensibly track all of those things is to allow for multiple SV tables, and every time the main table is altered, you simply start fresh with a new, empty SV table. You'd probably also slap a constraint on the previous SV table to reflect the fact that no rows newer than X will ever be entered there, which would further aid constraint exclusion.
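
Roughly, something like this (plain SQL, invented names and timestamps) is
what I have in mind for the fresh-SV-table-per-ALTER idea:

```sql
-- History captured before col1 was renamed; the constraint records that
-- nothing newer than the ALTER will ever land here, which aids pruning.
CREATE TABLE t_history_1 (
    id        int,
    col1      text,
    row_start timestamptz NOT NULL,
    row_end   timestamptz NOT NULL,
    CHECK (row_end <= timestamptz '2022-01-23 00:00:00+00')
);

-- History captured after the rename starts out empty.
CREATE TABLE t_history_2 (
    id        int,
    new_name  text,
    row_start timestamptz NOT NULL,
    row_end   timestamptz NOT NULL,
    CHECK (row_start >= timestamptz '2022-01-23 00:00:00+00')
);
```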
 
Things like row level security add extra complication but can probably
be disregarded. Maybe just have a `select history` permission or
similar.
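
If history lives in a side table, the permission could even fall out of
existing GRANT machinery; a dedicated privilege would need new syntax (the
commented line below is purely hypothetical):

```sql
-- With a separate history table, ordinary GRANTs already cover it:
GRANT SELECT ON orders_history TO auditor;

-- A dedicated privilege would need new, currently nonexistent syntax:
-- GRANT SELECT HISTORY ON orders TO auditor;
```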

+1 
 
An interesting idea could be to automatically add system versioning to
information_schema whenever it is added to a table. This would provide
a way to easily query historical DDL. It would also help solve how to
keep historical FKs. This would make it possible to perfectly recreate
system versioned parts of your database at any period of time, schema
and data both.
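
If the SQL:2011 FOR SYSTEM_TIME syntax worked against a versioned catalog,
querying historical DDL might look something like this (hypothetical;
nothing like it exists today):

```sql
-- Hypothetical: what did the table look like a year ago?
SELECT column_name, data_type
  FROM information_schema.columns
       FOR SYSTEM_TIME AS OF now() - interval '1 year'
 WHERE table_name = 'products'
 ORDER BY ordinal_position;
```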

Interesting...
 
#### Misc
- Seems like a good idea to include MDB's option to exclude columns
from versioning (`WITHOUT SYSTEM VERSIONING` as a column argument).
This is relatively nuanced and I'm not sure if it's officially part of
ISO, but it's probably helpful when frequently updating small columns
in rows that also contain BLOBs. Easy enough to implement: just omit
the column in the historical table.
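
For reference, the MariaDB spelling is roughly as follows (MariaDB syntax,
not Postgres; names invented):

```sql
-- MariaDB-style: changes that only touch view_count generate no history rows.
CREATE TABLE docs (
    id         INT PRIMARY KEY,
    body       TEXT,
    view_count INT WITHOUT SYSTEM VERSIONING
) WITH SYSTEM VERSIONING;
```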

First I've heard of it. Others will know more.
 
- I thought I saw somewhere that somebody was discussing adding both
row_start and row_end to the PK. Why would this be? Row_end should be
all that's needed to keep the key unique, but maybe I misread.

I don't think they need to be part of the PK at all. The main table has the PK that it knows of, and the SV tables are indexed independently.

In fact, I don't think row_end needs to be an actual stored value in the main table, because it will never be anything other than null. How we represent such an attribute is another question, but the answer to that possibly ties into how we implement the virtual side of GENERATED ALWAYS AS...
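
For comparison, the SQL:2011-flavored declaration spells the period columns
as generated, which is part of why row_end in the live table feels virtual
to me (a sketch only, not syntax Postgres accepts today):

```sql
-- SQL:2011-style system versioning declaration (sketch only).
CREATE TABLE products (
    id        INT PRIMARY KEY,
    sku       VARCHAR(20),
    row_start TIMESTAMP(6) GENERATED ALWAYS AS ROW START,
    row_end   TIMESTAMP(6) GENERATED ALWAYS AS ROW END,
    PERIOD FOR SYSTEM_TIME (row_start, row_end)
) WITH SYSTEM VERSIONING;
-- In the live table, row_end never holds a real end time, so it could be
-- a virtual attribute rather than a stored column.
```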
 

#### Links
- I haven't seen it linked here yet but this paper does a phenomenal
deep dive into SV and AV
https://sigmodrecord.org/publications/sigmodRecord/1209/pdfs/07.industry.kulkarni.pdf

My link is different but this seems to be the same PDF that has been cited earlier for both SV and AV.
 
- It's not perfect, but MDB's system versioning is pretty well thought
out. You get a good idea of their thought process going through this
page; it's worth a read:
https://mariadb.com/kb/en/system-versioned-tables/#excluding-columns-from-versioning

#### Finally, the end
There's a heck of a lot of thought that could go into this thing; it's
probably worth making sure there's formal agreement on what is to be
done before coding starts (PGEP for Postgres Enhancement Proposal,
like PEP? Not sure if something like that exists, but it probably
should). Large parts of the existing patch could likely be reused for
whatever is decided.

Thanks for the input, it helps us get some momentum on this.

 
