Thread: Filtering before join with date_trunc()


Filtering before join with date_trunc()

From:
"Phil Endecott"
Date:
Dear Experts,

I have a few tables with "raw" timestamps like this:

+-------------------------------+----------+
|             time              | pressure |
+-------------------------------+----------+
| 2018-09-14 00:00:07.148378+00 |  1007.52 |
| 2018-09-14 00:10:07.147506+00 |  1007.43 |
| 2018-09-14 00:20:07.147533+00 |  1007.28 |
+-------------------------------+----------+

For each of these tables I have a view which rounds the timestamp 
to the nearest minute, and ensures there is only one row per minute:

 SELECT date_trunc('minute'::text, tbl."time") AS "time",
    max(tbl.pressure) AS pressure
   FROM tbl
  GROUP BY (date_trunc('minute'::text, tbl."time"))
  ORDER BY (date_trunc('minute'::text, tbl."time"));

I then join these tables on the rounded time:

 SELECT COALESCE(rain."time", pressures."time", temperatures."time") AS "time",
    rain.rain,
    pressures.pressure,
    temperatures.temperature
   FROM rain
     FULL JOIN pressures USING ("time")
     FULL JOIN temperatures USING ("time");

+------------------------+------+----------+-------------+
|          time          | rain | pressure | temperature |
+------------------------+------+----------+-------------+
| 2018-09-14 00:00:00+00 |    0 |  1007.52 |      11.349 |
| 2018-09-14 00:10:00+00 |    0 |  1007.43 |     11.2317 |
| 2018-09-14 00:20:00+00 |    0 |  1007.28 |     11.2317 |
+------------------------+------+----------+-------------+

The COALESCE for time and the full joins are needed because some 
columns may be missing for some minutes.

Now I'd like to find the values for a particular short time period:

SELECT * FROM readings
WHERE "time" BETWEEN '2018-10-01T00:00:00' AND '2018-10-01T24:00:00'

This works, but it is inefficient; it seems to create all the rounded 
data, do the join on all of it, and then filter on the time period.  
Ideally it would filter the raw data, and then need to round and join 
far fewer rows.

It would not be difficult for me to round the timestamps when inserting 
the data, and also ensure that there is only one row for each minute. 
But I've done some experiments and even if I remove all the rounding and 
replace the full joins with regular joins, it still does sequential 
scans on at least one of the tables:

Nested Loop  (cost=12.95..144.99 rows=135 width=20)
   Join Filter: (x_rain."time" = x_pressures."time")
   ->  Hash Join  (cost=12.67..97.83 rows=135 width=24)
         Hash Cond: (x_temperatures."time" = x_rain."time")
         ->  Seq Scan on x_temperatures  (cost=0.00..67.50 rows=4350 width=12)
         ->  Hash  (cost=10.98..10.98 rows=135 width=12)
               ->  Index Scan using x_rain_by_time on x_rain  (cost=0.28..10.98 rows=135 width=12)
                      Index Cond: (("time" >= '2018-10-01 00:00:00+00'::timestamp with time zone) AND ("time" <= '2018-10-02 00:00:00+00'::timestamp with time zone))
 
   ->  Index Scan using x_pressures_by_time on x_pressures  (cost=0.28..0.34 rows=1 width=12)
         Index Cond: ("time" = x_temperatures."time")

Maybe that is because the tables are currently relatively small (a 
few thousand rows) and it believes that sequential scans are faster. 
(I have sometimes wished for an "explain" variant that tells me what 
query plan it would choose if all the tables were 100X larger.)
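(There is no such EXPLAIN variant, but one workaround is to synthesize a scaled-up data set and ask the planner directly. A sketch, with made-up table and column names:)

```sql
-- Sketch (table/column names are hypothetical): build a copy of the
-- data at roughly 100x the row count with generate_series(), then ask
-- the planner what it would do at that size.
CREATE TABLE t1_big AS
SELECT '2018-01-01'::timestamptz + (i * interval '10 seconds') AS "time",
       random() * 10 AS value1
  FROM generate_series(1, 400000) AS g(i);

CREATE INDEX t1_big_time ON t1_big ("time");
ANALYZE t1_big;  -- make sure the planner sees the new size and stats

EXPLAIN SELECT * FROM t1_big
 WHERE "time" BETWEEN '2018-01-02' AND '2018-01-03';
```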

Is there anything I can do to make this more efficient when the tables 
are larger?


Thanks for any suggestions.


Regards, Phil.




Re: Filtering before join with date_trunc()

From:
Francisco Olarte
Date:
Hi Phil:

On Mon, Oct 15, 2018 at 5:57 PM, Phil Endecott
<spam_from_pgsql_lists@chezphil.org> wrote:
...
> For each of these tables I have a view which rounds the timestamp
> to the nearest minute, and ensures there is only one row per minute:
...
> I then join these tables on the rounded time:
....
> Now I'd like to find the values for a particular short time period:

For what I propose I assume the SHORT time is really short....

...
> This works, but it is inefficient; it seems to create all the rounded
> data, do the join on all of it, and then filter on the time period.
> Ideally it would filter the raw data, and then need to round and join
> far fewer rows.

I think you are expecting too much; that is, you are expecting the
server to know it can expand your time period into a (possibly) wider
one which covers the original raw data, and push that condition down.
In my experience that is normally beyond what the planner will do.

...

> Is there anything I can do to make this more efficient when the tables
> are larger?

If your periods are really short you could try to replace the time
condition on the views with a time condition on the tables, and do the
rounding and grouping afterwards. I mean, use a half-open interval to
catch the relevant chunks of the tables and then join the short
results.

I think that with a trio of CTEs, each selecting with a time interval
in the WHERE and doing the date_trunc()/max() GROUP BY, you should be
able to do three index scans producing short results which can then be
full-joined and coalesced. If you want the interval from $A to $B
(rounded to minutes), do something like:

WITH pressures AS (
  SELECT date_trunc('minute'::text, tbl."time") AS "time",
         max(tbl.pressure) AS pressure
    FROM tbl
   -- Chop out the relevant time range.  There may be easier ways to
   -- write this condition if you are generating the text, but always
   -- use a half-open interval.
   WHERE "time" >= $A AND "time" < $B + '1 minute'::interval
   GROUP BY 1  -- Easier to type; and no ORDER BY here (I normally
               -- label ORDER BY in views as a code smell)
),
-- yada, yada... repeat for temperatures, rain

SELECT COALESCE(rain."time", pressures."time", temperatures."time") AS "time",
       rain.rain,
       pressures.pressure,
       temperatures.temperature
  FROM rain
       FULL JOIN pressures USING ("time")
       FULL JOIN temperatures USING ("time")
 ORDER BY 1;  -- The ORDER BY goes here; I suspect your example got
              -- ordered by chance, not by design.

(Use other names; I just copied the view names, lazy me.)


Francisco Olarte.


Re: Filtering before join with date_trunc()

From:
Adrian Klaver
Date:
On 10/15/18 8:57 AM, Phil Endecott wrote:
> Dear Experts,
> 
> I have a few tables with "raw" timestamps like this:
> 
> +-------------------------------+----------+
> |             time              | pressure |
> +-------------------------------+----------+
> | 2018-09-14 00:00:07.148378+00 |  1007.52 |
> | 2018-09-14 00:10:07.147506+00 |  1007.43 |
> | 2018-09-14 00:20:07.147533+00 |  1007.28 |
> +-------------------------------+----------+
> 
> For each of these tables I have a view which rounds the timestamp
> to the nearest minute, and ensures there is only one row per minute:
> 
>   SELECT date_trunc('minute'::text, tbl."time") AS "time",
>      max(tbl.pressure) AS pressure
>     FROM tbl
>    GROUP BY (date_trunc('minute'::text, tbl."time"))
>    ORDER BY (date_trunc('minute'::text, tbl."time"));
> 
> I then join these tables on the rounded time:
> 
>   SELECT COALESCE(rain."time", pressures."time", temperatures."time") AS "time",
>      rain.rain,
>      pressures.pressure,
>      temperatures.temperature
>     FROM rain
>       FULL JOIN pressures USING ("time")
>       FULL JOIN temperatures USING ("time");
> 
> +------------------------+------+----------+-------------+
> |          time          | rain | pressure | temperature |
> +------------------------+------+----------+-------------+
> | 2018-09-14 00:00:00+00 |    0 |  1007.52 |      11.349 |
> | 2018-09-14 00:10:00+00 |    0 |  1007.43 |     11.2317 |
> | 2018-09-14 00:20:00+00 |    0 |  1007.28 |     11.2317 |
> +------------------------+------+----------+-------------+
> 
> The COALESCE for time and the full joins are needed because some
> columns may be missing for some minutes.
> 
> Now I'd like to find the values for a particular short time period:
> 
> SELECT * FROM readings
> WHERE "time" BETWEEN '2018-10-01T00:00:00' AND '2018-10-01T24:00:00'

Is readings a table or a view?

If it is a view, is the SELECT COALESCE ... query above the view's definition?

> 
> This works, but it is inefficient; it seems to create all the rounded
> data, do the join on all of it, and then filter on the time period.
> Ideally it would filter the raw data, and then need to round and join
> far fewer rows.
> 
> It would not be difficult for me to round the timestamps when inserting
> the data, and also ensure that there is only one row for each minute.
> But I've done some experiments and even if I remove all the rounding and
> replace the full joins with regular joins, it still does sequential
> scans on at least one of the tables:
> 
> Nested Loop  (cost=12.95..144.99 rows=135 width=20)
>     Join Filter: (x_rain."time" = x_pressures."time")
>     ->  Hash Join  (cost=12.67..97.83 rows=135 width=24)
>           Hash Cond: (x_temperatures."time" = x_rain."time")
>           ->  Seq Scan on x_temperatures  (cost=0.00..67.50 rows=4350 width=12)
>           ->  Hash  (cost=10.98..10.98 rows=135 width=12)
>                 ->  Index Scan using x_rain_by_time on x_rain  (cost=0.28..10.98 rows=135 width=12)
>                       Index Cond: (("time" >= '2018-10-01 00:00:00+00'::timestamp with time zone) AND ("time" <= '2018-10-02 00:00:00+00'::timestamp with time zone))
>     ->  Index Scan using x_pressures_by_time on x_pressures  (cost=0.28..0.34 rows=1 width=12)
>           Index Cond: ("time" = x_temperatures."time")
> 
> Maybe that is because the tables are currently relatively small (a
> few thousands rows) and it believes that sequential scans are faster.
> (I have sometimes wished for an "explain" variant that tells me what
> query plan it would choose if all the tables were 100X larger.)
> 
> Is there anything I can do to make this more efficient when the tables
> are larger?
> 
> 
> Thanks for any suggestions.
> 
> 
> Regards, Phil.
> 
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: Filtering before join with date_trunc()

From:
Tom Lane
Date:
"Phil Endecott" <spam_from_pgsql_lists@chezphil.org> writes:
> ...
> For each of these tables I have a view which rounds the timestamp
> to the nearest minute, and ensures there is only one row per minute:

>  SELECT date_trunc('minute'::text, tbl."time") AS "time",
>     max(tbl.pressure) AS pressure
>    FROM tbl
>   GROUP BY (date_trunc('minute'::text, tbl."time"))
>   ORDER BY (date_trunc('minute'::text, tbl."time"));

> I then join these tables on the rounded time:

>  SELECT COALESCE(rain."time", pressures."time", temperatures."time") AS "time",
>     rain.rain,
>     pressures.pressure,
>     temperatures.temperature
>    FROM rain
>      FULL JOIN pressures USING ("time")
>      FULL JOIN temperatures USING ("time");

> The COALESCE for time and the full joins are needed because some
> columns may be missing for some minutes.

> Now I'd like to find the values for a particular short time period:

> SELECT * FROM readings
> WHERE "time" BETWEEN '2018-10-01T00:00:00' AND '2018-10-01T24:00:00'

> This works, but it is inefficient; it seems to create all the rounded
> data, do the join on all of it, and then filter on the time period.
> Ideally it would filter the raw data, and then need to round and join
> far fewer rows.

You're expecting too much.  In the first place, it's generally impossible
to push constraints down through full joins, because that could change the
results.  (For example, if we remove a row from one join input, that could
result in getting unwanted null-extended rows from the other join input.)
Maybe with a whole lot of analysis we could prove that applying the "same"
constraint to both join keys doesn't break anything, but there's no such
intelligence there now --- in general, we've expended little if any effort
on optimizing full joins.  It's also not very clear to me that we can do
anything at all with pushing down constraints that are expressed in terms
of a JOIN USING merged column; they don't really constrain either input
individually.
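(A minimal illustration of that hazard, using hypothetical two-row inputs: filtering only one input of a FULL JOIN manufactures a null-extended row that the original query never produced.)

```sql
-- a and b each hold the keys 1 and 2.
WITH a(k) AS (VALUES (1), (2)),
     b(k) AS (VALUES (1), (2))
SELECT k FROM a FULL JOIN b USING (k) WHERE k = 1;
-- one row: k = 1

-- Naively pushing the filter into a alone removes a.k = 2, so b.k = 2
-- now surfaces as a null-extended row the query above never produced.
WITH a(k) AS (VALUES (1), (2)),
     b(k) AS (VALUES (1), (2))
SELECT k
  FROM (SELECT k FROM a WHERE k = 1) AS a2 FULL JOIN b USING (k)
 ORDER BY k;
-- two rows: k = 1 and k = 2
```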

> But I've done some experiments and even if I remove all the rounding and
> replace the full joins with regular joins, it still does sequential
> scans on at least one of the tables:

I think you're also expecting the system to deduce that it can apply an
inequality on one join column to the other one.  It doesn't; only equality
constraints have any sort of transitivity logic.

So you'll need to write out the BETWEEN separately for each table,
and put it below the full join, which means you won't be able to
use those nice views :-(
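(A sketch of that rewrite: the period filter repeated per table in a subquery below the FULL JOINs. The raw-table names rain_raw, pressure_raw, temperature_raw and the bounds are placeholders; the raw-side filter is widened by a minute so rounding cannot drop edge rows.)

```sql
-- Sketch: the time filter written out for each table, below the FULL
-- JOINs, instead of relying on the views.
SELECT COALESCE(r."time", p."time", t."time") AS "time",
       r.rain, p.pressure, t.temperature
  FROM (SELECT date_trunc('minute', "time") AS "time", max(rain) AS rain
          FROM rain_raw
         WHERE "time" >= '2018-10-01'
           AND "time" <  '2018-10-02'::timestamptz + interval '1 minute'
         GROUP BY 1) AS r
  FULL JOIN
       (SELECT date_trunc('minute', "time") AS "time", max(pressure) AS pressure
          FROM pressure_raw
         WHERE "time" >= '2018-10-01'
           AND "time" <  '2018-10-02'::timestamptz + interval '1 minute'
         GROUP BY 1) AS p USING ("time")
  FULL JOIN
       (SELECT date_trunc('minute', "time") AS "time", max(temperature) AS temperature
          FROM temperature_raw
         WHERE "time" >= '2018-10-01'
           AND "time" <  '2018-10-02'::timestamptz + interval '1 minute'
         GROUP BY 1) AS t USING ("time")
 ORDER BY 1;
```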

            regards, tom lane


Re: Filtering before join with date_trunc()

From:
"Phil Endecott"
Date:
Thanks all for the replies.

Tom Lane wrote:
> You're expecting too much.

That often seems to be the case.

> I think you're also expecting the system to deduce that it can apply an
> inequality on one join column to the other one.  It doesn't; only equality
> constraints have any sort of transitivity logic.
>
> So you'll need to write out the BETWEEN separately for each table,
> and put it below the full join, which means you won't be able to
> use those nice views :-(

Here's an example:

create table t1 ("time" timestamptz, value1 numeric);
create index t1_time on t1("time");
\copy t1 from ......

create table t2 ("time" timestamptz, value2 numeric);
create index t2_time on t2("time");
\copy t2 from ......

explain select * from t1 join t2 using("time") where "time" between '2018-10-01' and '2018-10-02';
+------------------------------------------------------------------------------------------------+
|                                           QUERY PLAN                                           |
+------------------------------------------------------------------------------------------------+
| Hash Join  (cost=12.99..101.03 rows=138 width=21)                                              |
|   Hash Cond: (t2."time" = t1."time")                                                           |
|   ->  Seq Scan on t2  (cost=0.00..70.11 rows=4411 width=15)                                    |
|   ->  Hash  (cost=11.18..11.18 rows=145 width=14)                                              |
|         ->  Index Scan using t1_time on t1  (cost=0.28..11.18 rows=145 width=14)               |
|               Index Cond: (("time" >= '2018-10-01 00:00:00+00'::timestamp with time zone) AND .|
|.("time" <= '2018-10-02 00:00:00+00'::timestamp with time zone))                                |
+------------------------------------------------------------------------------------------------+

explain with q1 as (select * from t1 where "time" between '2018-10-01' and '2018-10-02'),
             q2 as (select * from t2 where "time" between '2018-10-01' and '2018-10-02')
select * from q1 join q2 using("time");

+------------------------------------------------------------------------------------------------+
|                                           QUERY PLAN                                           |
+------------------------------------------------------------------------------------------------+
| Hash Join  (cost=26.60..31.41 rows=136 width=72)                                               |
|   Hash Cond: (q1."time" = q2."time")                                                           |
|   CTE q1                                                                                       |
|     ->  Index Scan using t1_time on t1  (cost=0.28..11.18 rows=145 width=14)                   |
|           Index Cond: (("time" >= '2018-10-01 00:00:00+00'::timestamp with time zone) AND ("ti.|
|.me" <= '2018-10-02 00:00:00+00'::timestamp with time zone))                                    |
|   CTE q2                                                                                       |
|     ->  Index Scan using t2_time on t2  (cost=0.28..11.00 rows=136 width=15)                   |
|           Index Cond: (("time" >= '2018-10-01 00:00:00+00'::timestamp with time zone) AND ("ti.|
|.me" <= '2018-10-02 00:00:00+00'::timestamp with time zone))                                    |
|   ->  CTE Scan on q1  (cost=0.00..2.90 rows=145 width=40)                                      |
|   ->  Hash  (cost=2.72..2.72 rows=136 width=40)                                                |
|         ->  CTE Scan on q2  (cost=0.00..2.72 rows=136 width=40)                                |
+------------------------------------------------------------------------------------------------+


So.... as you say, even if I strip out all of the complexity of approximate 
timestamps and missing values, it's never going to push the BETWEEN filter 
down below the join.  Even with just a few thousand rows I see a 5X speedup 
from the second query, with its explicit filtering below the join.

This is rather disappointing.  Am I the only person who's ever wanted to do 
this?


Regards, Phil.