How to properly query lots of rows based on timestamps?

From: Thorsten Schöning

Hi all,

I have a table containing around 95 million rows, which pretty much
only stores a timestamp plus IDs of related tables that contain the
actual data.

> CREATE TABLE clt_rec
> (
>   id BIGSERIAL NOT NULL,
>   oms_rec     BIGINT NOT NULL,
>   captured_at TIMESTAMP(6) WITH TIME ZONE NOT NULL,
>   rssi        SMALLINT NOT NULL,
>   CONSTRAINT pk_clt_rec PRIMARY KEY (id),
>   CONSTRAINT fk_clt_rec_oms_rec FOREIGN KEY (oms_rec) REFERENCES "oms_rec" ("id"),
>   CONSTRAINT uk_clt_rec_oms_rec UNIQUE (oms_rec)
> );

In many use cases I need to search all of those rows by timestamp to
find rows at arbitrary points in the past: sometimes it's only 15
minutes into the past, sometimes 2 years, sometimes it's finding the
first day of each month over 15 months for some of those telegrams,
etc. So I pretty often need to compare those timestamps, and some
queries simply take multiple seconds, which especially adds up when
multiple slightly different queries need to be executed one after
another. Below are two excerpts of Postgres' query plans.
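For reference, the queries behind those plans are roughly of the
following shape (simplified; the select list here is only
illustrative, not the original one):

> SELECT c.captured_at, c.rssi, o.id  -- illustrative select list
> FROM clt_rec c
> JOIN oms_rec o ON o.id = c.oms_rec
> WHERE c.captured_at >= '2020-08-01 00:00:00+02'::timestamptz - '00:00:00'::interval
>   AND c.captured_at <= '2020-08-01 00:00:00+02'::timestamptz + '1 day'::interval;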

Plan 1:

> ->  Nested Loop  (cost=1.14..343169.49 rows=43543 width=20) (actual time=0.313..113.974 rows=34266 loops=3)
>     ->  Parallel Index Scan using idx_clt_rec_captured_at on clt_rec  (cost=0.57..3437.90 rows=43543 width=24) (actual time=0.153..20.192 rows=34266 loops=3)
>         Index Cond: ((captured_at >= ('2020-08-01 00:00:00+02'::timestamp with time zone - '00:00:00'::interval)) AND (captured_at <= ('2020-08-01 00:00:00+02'::timestamp with time zone + '1 day'::interval)))
>     ->  Index Scan using pk_oms_rec on oms_rec  (cost=0.57..7.80 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=102799)
>         Index Cond: (id = clt_rec.oms_rec)

Plan 2:

> ->  Nested Loop  (cost=1.14..836381.50 rows=111934 width=20) (actual time=0.379..911.697 rows=334465 loops=3)
>     ->  Parallel Index Scan using idx_clt_rec_captured_at on clt_rec  (cost=0.57..8819.57 rows=111934 width=24) (actual time=0.193..154.176 rows=334465 loops=3)
>         Index Cond: ((captured_at >= ('2020-08-28 10:21:06.968+02'::timestamp with time zone - '14 days'::interval)) AND (captured_at <= ('2020-08-28 10:21:06.968+02'::timestamp with time zone + '00:00:00'::interval)))
>     ->  Index Scan using pk_oms_rec on oms_rec  (cost=0.57..7.39 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=1003394)
>         Index Cond: (id = clt_rec.oms_rec)

Postgres seems to properly use the available indexes, parallel
workers and so on. But looking at the actual times, compared to all
the other parts of the query, comparing those timestamps simply takes
the most time.

I've looked into this topic and found statements that one shouldn't
put too many rows into such an index[1], or it will be ignored
entirely. But according to the plan, that doesn't seem to be the case
for me. OTOH, my index really is simply on the column containing the
timestamp; there is no function reducing things to dates or the like
to lower the number of rows.

>  CREATE INDEX idx_clt_rec_captured_at ON clt_rec USING btree ( captured_at );

So where does the actual runtime in the above plan come from? Is it
simply that comparing lots of timestamps takes as long as it takes?
Or is it already the arithmetic on those timestamps?
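If it helps, I could also re-run things with buffer statistics to see
whether the time goes into fetching pages rather than into the
comparisons themselves; a sketch of what I'd run:

> EXPLAIN (ANALYZE, BUFFERS)
> SELECT c.captured_at, c.rssi, o.id
> FROM clt_rec c
> JOIN oms_rec o ON o.id = c.oms_rec
> WHERE c.captured_at >= '2020-08-28 10:21:06.968+02'::timestamptz - '14 days'::interval
>   AND c.captured_at <= '2020-08-28 10:21:06.968+02'::timestamptz;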

Thanks!

[1]: https://stackoverflow.com/questions/15977741/postgres-is-ignoring-a-timestamp-index-why

Kind regards,

Thorsten Schöning

--
Thorsten Schöning       E-Mail: Thorsten.Schoening@AM-SoFT.de
AM-SoFT IT-Systeme      http://www.AM-SoFT.de/

Phone.............05151-  9468- 55
Fax...............05151-  9468- 88
Mobile............0178-8 9468- 04

AM-SoFT GmbH IT-Systeme, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 207 694 - Managing Director: Andreas Muchow




Re: How to properly query lots of rows based on timestamps?

From: Alban Hertroys

> On 29 Aug 2020, at 10:24, Thorsten Schöning <tschoening@am-soft.de> wrote:
>
> Hi all,
>
> I have a table containing around 95 million rows, which pretty much
> only stores a timestamp plus IDs of related tables that contain the
> actual data.
>
>> CREATE TABLE clt_rec
>> (
>>  id BIGSERIAL NOT NULL,
>>  oms_rec     BIGINT NOT NULL,
>>  captured_at TIMESTAMP(6) WITH TIME ZONE NOT NULL,
>>  rssi        SMALLINT NOT NULL,
>>  CONSTRAINT pk_clt_rec PRIMARY KEY (id),
>>  CONSTRAINT fk_clt_rec_oms_rec FOREIGN KEY (oms_rec) REFERENCES "oms_rec" ("id"),
>>  CONSTRAINT uk_clt_rec_oms_rec UNIQUE (oms_rec)
>> );
>
> In many use cases I need to search all of those rows by timestamp to
> find rows at arbitrary points in the past: sometimes it's only 15
> minutes into the past, sometimes 2 years, sometimes it's finding the
> first day of each month over 15 months for some of those telegrams,
> etc. So I pretty often need to compare those timestamps, and some
> queries simply take multiple seconds, which especially adds up when
> multiple slightly different queries need to be executed one after
> another. Below are two excerpts of Postgres' query plans.
>
> Plan 1:
>
>> ->  Nested Loop  (cost=1.14..343169.49 rows=43543 width=20) (actual time=0.313..113.974 rows=34266 loops=3)
>>    ->  Parallel Index Scan using idx_clt_rec_captured_at on clt_rec  (cost=0.57..3437.90 rows=43543 width=24) (actual time=0.153..20.192 rows=34266 loops=3)
>>        Index Cond: ((captured_at >= ('2020-08-01 00:00:00+02'::timestamp with time zone - '00:00:00'::interval)) AND (captured_at <= ('2020-08-01 00:00:00+02'::timestamp with time zone + '1 day'::interval)))
>>    ->  Index Scan using pk_oms_rec on oms_rec  (cost=0.57..7.80 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=102799)
>>        Index Cond: (id = clt_rec.oms_rec)

What happens here is that the planner looks up the lower and upper
boundaries; everything in between those index nodes is a candidate
record. Next, it loops over those to match the other condition of your
query (id = clt_rec.oms_rec). You didn’t tell whether there’s an index
on that column.

You’d probably see a performance improvement were you to create an
index on (captured_at, id). If your Postgres version is somewhat
recent, that could even lead to an Index Only Scan.
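If the join is what your query mostly needs clt_rec for, covering the
column it actually reads may matter more than adding id; on PG 11 and
later you could also add it as a non-key column with INCLUDE. A sketch
(untested, and assuming the query only reads captured_at and oms_rec
from clt_rec):

-- covering-index sketch; the index name is illustrative
CREATE INDEX idx_clt_rec_captured_at_oms_rec
    ON clt_rec USING btree (captured_at) INCLUDE (oms_rec);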


> Plan 2:
>
>> ->  Nested Loop  (cost=1.14..836381.50 rows=111934 width=20) (actual time=0.379..911.697 rows=334465 loops=3)
>>    ->  Parallel Index Scan using idx_clt_rec_captured_at on clt_rec  (cost=0.57..8819.57 rows=111934 width=24) (actual time=0.193..154.176 rows=334465 loops=3)
>>        Index Cond: ((captured_at >= ('2020-08-28 10:21:06.968+02'::timestamp with time zone - '14 days'::interval)) AND (captured_at <= ('2020-08-28 10:21:06.968+02'::timestamp with time zone + '00:00:00'::interval)))
>>    ->  Index Scan using pk_oms_rec on oms_rec  (cost=0.57..7.39 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=1003394)
>>        Index Cond: (id = clt_rec.oms_rec)

And this situation is very much the same issue, apart from the larger number of candidate records.

> Postgres seems to properly use the available indexes, parallel
> workers and so on. But looking at the actual times, compared to all
> the other parts of the query, comparing those timestamps simply takes
> the most time.

It only needs to compare 2 timestamps.

> I've looked into this topic and found statements that one shouldn't
> put too many rows into such an index[1], or it will be ignored
> entirely. But according to the plan, that doesn't seem to be the case
> for me. OTOH, my index really is simply on the column containing the
> timestamp; there is no function reducing things to dates or the like
> to lower the number of rows.
>
>> CREATE INDEX idx_clt_rec_captured_at ON clt_rec USING btree ( captured_at );

Try this:
CREATE INDEX idx_clt_rec_captured_at ON clt_rec USING btree ( captured_at, id );


Alban Hertroys
--
There is always an exception to always.







Re: How to properly query lots of rows based on timestamps?

From: Tom Lane

Thorsten Schöning <tschoening@am-soft.de> writes:
> I have a table containing around 95 million rows, which pretty much
> only stores a timestamp plus IDs of related tables that contain the
> actual data.

>> CREATE TABLE clt_rec
>> (
>> id BIGSERIAL NOT NULL,
>> oms_rec     BIGINT NOT NULL,
>> captured_at TIMESTAMP(6) WITH TIME ZONE NOT NULL,
>> rssi        SMALLINT NOT NULL,
>> CONSTRAINT pk_clt_rec PRIMARY KEY (id),
>> CONSTRAINT fk_clt_rec_oms_rec FOREIGN KEY (oms_rec) REFERENCES "oms_rec" ("id"),
>> CONSTRAINT uk_clt_rec_oms_rec UNIQUE (oms_rec)
>> );

TBH, this seems like a pretty awful data design.  If you included the
timestamp column into oms_rec, and had an index on it, then you would
not need a join at all.
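Something along these lines, as a sketch (column and index names are
illustrative, and the backfill is only hinted at in a comment):

-- add the timestamp to oms_rec and index it
ALTER TABLE oms_rec ADD COLUMN captured_at TIMESTAMP(6) WITH TIME ZONE;
-- UPDATE oms_rec o SET captured_at = c.captured_at
--   FROM clt_rec c WHERE c.oms_rec = o.id;
CREATE INDEX idx_oms_rec_captured_at ON oms_rec USING btree (captured_at);

-- after that, range queries hit a single index and need no join:
SELECT id FROM oms_rec
WHERE captured_at >= '2020-08-01 00:00:00+02'::timestamptz
  AND captured_at <  '2020-08-02 00:00:00+02'::timestamptz;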

> Postgres seems to properly use the available indexes, parallel
> workers and so on. But looking at the actual times, compared to all
> the other parts of the query, comparing those timestamps simply takes
> the most time.

Timestamp comparison reduces to comparison of int64's, so it's
hard to imagine that anything could be much cheaper.  The part of
your plan that is actually eating the most cycles is the repeated
index probes into oms_rec:

>>     ->  Index Scan using pk_oms_rec on oms_rec  (cost=0.57..7.39 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=1003394)

.002 * 1003394 = 2006.788, which of course can't be right, since the
whole query took 911 ms; but there's not much accuracy in this
per-loop measurement.  In any case, the 155 msec spent scanning
clt_rec would be entirely unnecessary if the timestamp existed in
oms_rec.  We can also bet that the index traversal costs would be
quite a bit less: in this query, we are re-descending pk_oms_rec
from the root, 1003394 times, which is a lot of cycles that wouldn't
be needed with a single scan on a timestamp index.

            regards, tom lane



Re: How to properly query lots of rows based on timestamps?

From: Thorsten Schöning

Hello Tom Lane,
on Saturday, 29 August 2020 at 17:19, you wrote:

> TBH, this seems like a pretty awful data design.  If you included the
> timestamp column into oms_rec, and had an index on it, then you would
> not need a join at all.

clt_rec and oms_rec are different types of data, and only the former
includes the timestamp; that's simply how the stored data works. So if
anything, oms_rec would need to be embedded into clt_rec entirely.
While I have considered that in the past already, I wasn't sure it is
the best approach to model those two different kinds of data.

I'll make a note and consider refactoring at some point. Thanks for
your opinion!

Kind regards,

Thorsten Schöning





Re: How to properly query lots of rows based on timestamps?

From: Thorsten Schöning

Hello Alban Hertroys,
on Saturday, 29 August 2020 at 11:29, you wrote:

> [...] Next, it loops over those to match the other condition of
> your query (id = clt_rec.oms_rec). You didn’t tell whether there’s
> an index on that column.

Which column, oms_rec.id or clt_rec.oms_rec? The former has one
because it's the primary key, and the plan says so as well:

> Index Scan using pk_oms_rec on oms_rec

clt_rec.oms_rec, OTOH, is only marked as unique; no index was created
for it explicitly.

> CONSTRAINT uk_clt_rec_oms_rec UNIQUE (oms_rec)

But doesn't that mean an index is available behind the scenes,
maintained by Postgres? So it should have been used if it provided
any benefit?

> Adding a unique constraint will automatically create a unique B-tree
> index on the column or group of columns listed in the constraint.

https://www.postgresql.org/docs/9.4/ddl-constraints.html
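One way to double-check what actually exists behind the scenes would
be the pg_indexes view:

> SELECT indexname, indexdef
> FROM pg_indexes
> WHERE tablename = 'clt_rec';

That should list an implicitly created unique index backing
uk_clt_rec_oms_rec.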

> You’d probably see a performance improvement were you to create an
> index on (captured_at, id). If your Postgres version is somewhat
> recent, that could even lead to an Index Only Scan.

I tried that using the statement at the bottom of your mail, and it
didn't seem to change anything, even on Postgres 11:

> ->  Nested Loop  (cost=1.14..343169.49 rows=43543 width=20) (actual time=0.228..95.554 rows=34266 loops=3)
>     ->  Parallel Index Scan using idx_clt_rec_captured_at on clt_rec  (cost=0.57..3437.90 rows=43543 width=24) (actual time=0.119..16.895 rows=34266 loops=3)
>           Index Cond: ((captured_at >= ('2020-08-01 00:00:00+02'::timestamp with time zone - '00:00:00'::interval)) AND (captured_at <= ('2020-08-01 00:00:00+02'::timestamp with time zone + '1 day'::interval)))
>     ->  Index Scan using pk_oms_rec on oms_rec  (cost=0.57..7.80 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=102799)
>           Index Cond: (id = clt_rec.oms_rec)

vs. with your suggested change:

> ->  Nested Loop  (cost=1.14..513397.11 rows=43543 width=20) (actual time=0.236..97.044 rows=34266 loops=3)
>     ->  Parallel Index Scan using idx_clt_rec_captured_at on clt_rec  (cost=0.57..173665.52 rows=43543 width=24) (actual time=0.183..17.464 rows=34266 loops=3)
>           Index Cond: ((captured_at >= ('2020-08-01 00:00:00+02'::timestamp with time zone - '00:00:00'::interval)) AND (captured_at <= ('2020-08-01 00:00:00+02'::timestamp with time zone + '1 day'::interval)))
>     ->  Index Scan using pk_oms_rec on oms_rec  (cost=0.57..7.80 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=102799)
>           Index Cond: (id = clt_rec.oms_rec)

The good news is that having either of those indexes in place at all
makes a huge difference compared to having none of them. :-D
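Maybe the missing piece for an Index Only Scan is that the scan
actually needs clt_rec.oms_rec for the join, which (captured_at, id)
doesn't cover, and that the visibility map has to be current. If I
retest, it would probably be along these lines (an untested sketch):

> -- cover the column the join actually reads:
> CREATE INDEX idx_clt_rec_captured_at_oms ON clt_rec (captured_at, oms_rec);
> -- refresh statistics and the visibility map:
> VACUUM (ANALYZE) clt_rec;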

Kind regards,

Thorsten Schöning





Re: How to properly query lots of rows based on timestamps?

From: Thorsten Schöning

Hello Tom Lane,
on Saturday, 29 August 2020 at 17:19, you wrote:

> Timestamp comparison reduces to comparison of int64's, so it's
> hard to imagine that anything could be much cheaper.[...]

Do you think it matters much whether the timestamp-related
calculations are done outside the query, by the executing Java code,
instead?

> ('2020-08-01 00:00:00+02'::timestamp with time zone - '00:00:00'::interval)

In Java, I currently only have the timestamp and the intervals as
ISO 8601 expressions like P1D, P14D etc., and leave their parsing to
Postgres as much as possible. This makes calculations like the one
above necessary before actually comparing against the stored
timestamps.

I would expect it not to make much difference whether Postgres or
Java calculates the necessary timestamps, as long as Postgres is
clever enough to NOT calculate per row, but only once per query.
Which I guess it is? Is there any hint of such things in the plan? I
didn't see or understand one.
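For what it's worth, Postgres parses those ISO 8601 expressions
directly, so the intervals could be passed through from Java as-is:

> SELECT 'P14D'::interval;          -- same as '14 days'
> SELECT now() - 'P14D'::interval;  -- a lower bound like in the plans above

And from what I understand, such constant expressions are evaluated
once to form the index scan's start and stop keys, not once per row.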

Kind regards,

Thorsten Schöning
