Обсуждение: Understanding the differences between Temporal tables, CDC and Time Series.

Поиск
Список
Период
Сортировка

Understanding the differences between Temporal tables, CDC and Time Series.

От
SQL Padawan
Дата:

Good morning to everyone,

I'm looking into Temporal Tables (TT - i.e. System/Application time versioning) but I would like to understand the differences between TTs and two other features (which appear similar in many ways) - Change Data Capture and Time Series.

Are Temporal Tables (in a sense) a query framework framework around CDC?

And if that's the case, what then are Time Series - which are specifically designed to host (time) changing data?

Are Temporal Tables (again, in a sense) Time Series databases which only record changes in time(-stamps)?

I'm unclear as to the "philosophical" distinctions here and would be grateful if anybody could explain the diffence(s) between them?

I did look at this myself and found these posts on StackOverflow, but I don't feel as if they've fundamentally explained the difference(s).



Any ideas, discussion, references, URLs welcome.

Thx,

SQLP!


Sent with ProtonMail Secure Email.


Re: Understanding the differences between Temporal tables, CDC and Time Series.

От
Peter Eisentraut
Дата:
On 10.11.21 07:43, SQL Padawan wrote:
> I'm looking into Temporal Tables (TT - i.e. System/Application time 
> versioning) but I would like to understand the differences between TTs 
> and two other features (which appear similar in many ways) - Change Data 
> Capture and Time Series.
> 
> Are Temporal Tables (in a sense) a query framework framework around CDC?
> 
> And if that's the case, what then are Time Series - which are 
> specifically designed to host (time) changing data?
> 
> Are Temporal Tables (again, in a sense) Time Series databases which only 
> record changes in time(-stamps)?
> 
> I'm unclear as to the "philosophical" distinctions here and would be 
> grateful if anybody could explain the diffence(s) between them?

A time series database contains data records containing a time stamp (a 
point in time).  These records represent things that happened, for 
example log files, measurements, events.  This data is then usually used 
for analytics, such as, how many web site visitors did we have per hour 
over the last week.

A temporal database contains records that have a timestamp range (in 
PostgreSQL terms, also known as period or interval, delimited by two 
points in time).  There are (at least) two ways to use this.  One is 
that the database system automatically records when a record was 
inserted, updated, and deleted, so that historical versions of records 
are automatically kept, and you can then query, what was the value of 
this at this time, or more generally what was the result of this query 
at this time.  Use cases here are record-keeping, auditing, and the 
like.  The other is that the database users themselves enter when a 
record is supposed to be valid, and the database system takes this into 
account in queries.  So you could have product prices or salaries that 
automatically change at a certain time.  This is basically a tool to 
make development of certain database applications more convenient.

So these two concepts are really totally separate and independent 
things, except that they have a word based on "time" in their name.



Re: Understanding the differences between Temporal tables, CDC and Time Series.

От
SQL Padawan
Дата:

Hi Peter, and thanks for answering.


> > I'm unclear as to the "philosophical" distinctions here and would be
> > grateful if anybody could explain the diffence(s) between them?

> A time series database contains data records containing a time stamp

> A temporal database contains records that have a timestamp range


OK - so we have two stock "ticker" dbs - one a Time Series db (TS-db) and one a Temporal db (T-db).

All of the code below is on a fiddle here

https://dbfiddle.uk/?rdbms=postgres_13&fiddle=207f38e5c7d6e7861c402c2c4926840c

CREATE TABLE ts_portfolio  -- time series
(
  txn_id  INTEGER NOT NULL,
  company TEXT NOT NULL,
  price   INTEGER NOT NULL,
  txn_ts  TIMESTAMP NOT NULL
);

with a few records


INSERT INTO ts_portfolio VALUES
(1234,       'DB',     10,  '07/11/21 00:12:00'),    -- record inserted on 07/11/21
(2345,       'DB',     20,  '08/11/21 00:12:00'),    -- record inserted on 08/11/21
(3456,       'DB',     30,  '09/11/21 00:12:00'),    -- record inserted on 09/11/21
(4567,       'DB',     40,  '10/11/21 00:12:00'),    -- record inserted on 10/11/21
(5678,       'DB',     50,  '11/11/21 00:12:00');    -- record inserted on 11/11/21


So, for example, in order to query the share price AS OF 00:00:00 on the 09/11

SELECT company, price
FROM ts_portfolio
WHERE company = 'DB'
AND txn_ts <= '09/11/21 00:00:00'
ORDER BY txn_ts DESC
LIMIT 1;

We pick up the price from the latest transaction either before or exactly on 09/11/2021 00:00:00

Things become tricky when we want to see the prices on a given day: 09/11/2021

--
-- UNION better here? See EXPLAIN in fiddle - not important for this debate!
--

WITH cte1 AS
(
  SELECT txn_id AS t_start
  FROM ts_portfolio
  WHERE company = 'DB'
  AND txn_ts <= '09/11/21 00:00:00'
  ORDER BY txn_ts DESC
  LIMIT 1
), cte2 AS
(
  SELECT txn_id AS t_end
  FROM ts_portfolio
  WHERE company = 'DB'
  AND txn_ts <= '09/11/21 23:59:59.999999'
  ORDER BY txn_ts DESC
  LIMIT 1
)
SELECT
  t.txn_id, t.company, t.price
FROM
  ts_portfolio t
WHERE txn_id BETWEEN (SELECT t_start FROM cte1) AND (SELECT t_end FROM cte2);


txn_id    company    price
2345    DB    20
3456    DB    30

which is correct - from 08/11 midday till 09/11 midday, the price was 20 and then at midday 09/11 it rose to 30!


Becoming complex.

However, a Temporal (versioned) table would only require one extra field:

  txn_t_from TIMESTAMP NOT NULL,
  txn_t_to   TIMESTAMP NOT NULL  -- extra field - see fiddle for data inserts and updates!


Each INSERT (behind the scenes) is an INSERT and an UPDATE.

INSERT INTO tdb_portfolio VALUES
(1000, 'DB',  5, '01/01/1900',        '07/11/21 00:12:00'),  -- inserted in the past, updated 07/11 @ 12:00
(1234, 'DB', 10, '07/11/21 00:12:00', '08/11/21 00:12:00'),  -- record inserted on 07/11/21 - updated 08/11 @ 12:00
(2345, 'DB', 20, '08/11/21 00:12:00', '09/11/21 00:12:00'),  -- record inserted on 08/11/21 - and so on...
(3456, 'DB', 30, '09/11/21 00:12:00', '10/11/21 00:12:00'),  -- record inserted on 09/11/21
(4567, 'DB', 40, '10/11/21 00:12:00', '11/11/21 00:12:00'),  -- record inserted on 10/11/21
(5678, 'DB', 50, '11/11/21 00:12:00', '19/01/2038');  -- record inserted on 11/11/21 - never updated (effectively
INFINITYuntil next update) 



The queries become simpler:

SELECT txn_id, company, price
FROM tdb_portfolio AS OF '09/11/2021 00:00:00';

and the complex one above collapses to:

SELECT txn_id, company, price FROM tdb_portfolio
FOR SYSTEM_TIME BETWEEN ('09/11/2021 00:00:00' AND '09/11/2021 23:59:59.999999');

or maybe if there was a DATE(SYSTEM_TIME) function, ... FOR DATE(SYSTEM_TIME) = '09/11/2021';


I'm assuming that (when properly introduced into PostgreSQL), there will be optimisations for these sorts of query.

Do temporal tables bring anything else "to the party" - the augmented functionality is a nice-to-have, but hardly
earth-shattering?

Have I missed out on anything important?

Thx, SQLP!