Обсуждение: [HACKERS] ASOF join

Поиск
Список
Период
Сортировка

[HACKERS] ASOF join

От
Konstantin Knizhnik
Дата:
I wonder if there were some discussion/attempts to add ASOF join to Postgres  (sorry, may be there is better term for it, I am refereeing KDB definition: http://code.kx.com/wiki/Reference/aj ).
Such kind of join can be useful when we need to associate two timeseries. It is quite popular in trading:

//join the eq price and size for the op trade time
a::aj[`underlyingSym`time;select time, underlyingSym, sym, putorcall, ex, cond, price, seqnum, contracts, contractsTraded from t;eqtrades];

...and not only. Below is one example of how people now manually coding ASOF join:

select
    count(*),
    count(*)
        filter (where timedelta_prev < -30),
    count(*)
        filter (where ride_prev = ride_next),
    ... -- many different aggregates
from
    (
        select
            p.provider_id,
            p.ts,
            (
                select extract(epoch from t.ts - p.ts)
                from providers_positions t
                where p.provider_id = t.provider_id and t.ts < p.ts and t.source = 'gps'
                order by t.ts desc
                limit 1
            ) as timedelta_prev,
            (
                select extract(epoch from t.ts - p.ts)
                from providers_positions t
                where p.provider_id = t.provider_id and t.ts > p.ts and t.source = 'gps'
                order by t.ts
                limit 1
            ) as timedelta,
            (
                select ride_id
                from providers_positions t
                where p.provider_id = t.provider_id and t.ts < p.ts and t.source = 'gps'
                order by t.ts desc
                limit 1
            ) as ride_prev,
            (
                select ride_id
                from providers_positions t
                where p.provider_id = t.provider_id and t.ts > p.ts and t.source = 'gps'
                order by t.ts
                limit 1
            ) as ride_next
        from (
                 select
                     provider_id,
                     ts,
                     event_name
                 from
                     lvl2_681_parsed p
             ) p
        where
            p.event_name = 'GPS signal restored'
       -- offset 0
    ) z;

Without OFFSET 0 this query generates awful execution plans with hundreds (!) of subplans  corresponding to the subqueries.
Number of subplans (most of them are the same) is equal number of occurrences of timedelta, timedelta_prev, ... columns in target aggregates.
OFFSET 0 reduce number of subplans to 4. And I expect that using LATERAL join can reduce it to two and even without "OFFSET 0" trick.
But in any case - it is very complex and unnatural way of expressing this not so complex query.
With ASOF join is can be written much simpler.

Also Postgres is implementing this query using nested loop with index scan, which is definitely not the most efficient strategy.
The best way to implement ASOF join is to use something like mergejoin. Usually there are indexes for both timeseries, so what we need is to merge two ordered sets using ASOF join rules.
It will require minimal changes in SQL syntax, just adding ASOF keyword seems to be enough:

   select  * from Trades ASOF JOIN EqTrades USING (underlyingSym,time);

It seems to me that adding ASOF joins should not require huge amount of work and can be done with minimal number of changes in executor and optimizer.
But may be there are some problems/challenges which I do not realize now?

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] ASOF join

От
Thomas Munro
Дата:
On Fri, Jun 16, 2017 at 4:20 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> I wonder if there were some discussion/attempts to add ASOF join to Postgres
> (sorry, may be there is better term for it, I am refereeing KDB definition:
> http://code.kx.com/wiki/Reference/aj ).

Interesting idea.  Also in Pandas:

http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.merge_asof.html#pandas.merge_asof

I suppose you could write a function that pulls tuples out of a bunch
of cursors and zips them together like this, as a kind of hand-coded
special merge join "except that we match on nearest key rather than
equal keys" (as they put it).

I've written code like this before in a trading context, where we
called that 'previous tick interpolation', and in a scientific context
where other kinds of interpolation were called for (so not really
matching a tuple but synthesising one if no exact match).  If you view
the former case as a kind of degenerate case of interpolation then it
doesn't feel like a "join" as we know it, but clearly it is.  I had
never considered before that such things might belong inside the
database as a kind of join operator.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] ASOF join

От
David Fetter
Дата:
On Fri, Jun 16, 2017 at 11:51:34AM +1200, Thomas Munro wrote:
> On Fri, Jun 16, 2017 at 4:20 AM, Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
> > I wonder if there were some discussion/attempts to add ASOF join to Postgres
> > (sorry, may be there is better term for it, I am refereeing KDB definition:
> > http://code.kx.com/wiki/Reference/aj ).
> 
> Interesting idea.  Also in Pandas:
> 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.merge_asof.html#pandas.merge_asof
> 
> I suppose you could write a function that pulls tuples out of a bunch
> of cursors and zips them together like this, as a kind of hand-coded
> special merge join "except that we match on nearest key rather than
> equal keys" (as they put it).
> 
> I've written code like this before in a trading context, where we
> called that 'previous tick interpolation', and in a scientific context
> where other kinds of interpolation were called for (so not really
> matching a tuple but synthesising one if no exact match).  If you view
> the former case as a kind of degenerate case of interpolation then it
> doesn't feel like a "join" as we know it, but clearly it is.  I had
> never considered before that such things might belong inside the
> database as a kind of join operator.

If you turn your head sideways, it's very similar to the range merge
join Jeff Davis proposed.  https://commitfest.postgresql.org/14/1106/

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] ASOF join

От
Konstantin Knizhnik
Дата:

On 16.06.2017 19:07, David Fetter wrote:
> On Fri, Jun 16, 2017 at 11:51:34AM +1200, Thomas Munro wrote:
>> On Fri, Jun 16, 2017 at 4:20 AM, Konstantin Knizhnik
>> <k.knizhnik@postgrespro.ru> wrote:
>>> I wonder if there were some discussion/attempts to add ASOF join to Postgres
>>> (sorry, may be there is better term for it, I am refereeing KDB definition:
>>> http://code.kx.com/wiki/Reference/aj ).
>> Interesting idea.  Also in Pandas:
>>
>> http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.merge_asof.html#pandas.merge_asof

I attached simple patch adding ASOF join to Postgres. Right now it 
support only outer join and requires USING clause (consequently it is 
not possible to join two tables which joi keys has different names. May 
be it is also possible to support ON clause with condition written like 
o.k1 = i.k2 AND o.k2 = i.k2 AND ... AND o.kN >= i.kN
But such notation can be confusing, because join result includes only 
one matching inner record with kN smaller or equal than kN of outer 
record and not all such records.
As alternative we can add specia

If people fin such construction really useful, I will continue work on it.


>>
>> I suppose you could write a function that pulls tuples out of a bunch
>> of cursors and zips them together like this, as a kind of hand-coded
>> special merge join "except that we match on nearest key rather than
>> equal keys" (as they put it).
>>
>> I've written code like this before in a trading context, where we
>> called that 'previous tick interpolation', and in a scientific context
>> where other kinds of interpolation were called for (so not really
>> matching a tuple but synthesising one if no exact match).  If you view
>> the former case as a kind of degenerate case of interpolation then it
>> doesn't feel like a "join" as we know it, but clearly it is.  I had
>> never considered before that such things might belong inside the
>> database as a kind of join operator.
> If you turn your head sideways, it's very similar to the range merge
> join Jeff Davis proposed.  https://commitfest.postgresql.org/14/1106/

May be, but I do not understand how to limit result to contain exactly 
one (last) inner tuple for each outer tuple.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] ASOF join

От
Thomas Munro
Дата:
On Mon, Jun 19, 2017 at 11:57 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> I attached simple patch adding ASOF join to Postgres. Right now it support
> only outer join and requires USING clause (consequently it is not possible
> to join two tables which joi keys has different names. May be it is also
> possible to support ON clause with condition written like o.k1 = i.k2 AND
> o.k2 = i.k2 AND ... AND o.kN >= i.kN
> But such notation can be confusing, because join result includes only one
> matching inner record with kN smaller or equal than kN of outer record and
> not all such records.
> As alternative we can add specia

Hmm.  Yeah, I see the notational problem.  It's hard to come up with a
new syntax that has SQL nature.  What if... we didn't use a new syntax
at all, but recognised existing queries that are executable with this
strategy?  Queries like this:

WITH ticks(time, price) AS      (VALUES ('2017-07-20 12:00:00'::timestamptz, 100.00),              ('2017-07-21
11:00:00'::timestamptz,150.00)),    times(time) AS      (VALUES ('2017-07-19 12:00:00'::timestamptz),
('2017-07-2012:00:00'::timestamptz),              ('2017-07-21 12:00:00'::timestamptz),              ('2017-07-22
12:00:00'::timestamptz))

SELECT times.time, previous_tick.price FROM times LEFT JOIN LATERAL (SELECT * FROM ticks                     WHERE
ticks.time<= times.time                     ORDER BY ticks.time DESC LIMIT 1) previous_tick ON trueORDER BY
times.time;
         time          | price
------------------------+--------2017-07-19 12:00:00+12 |2017-07-20 12:00:00+12 | 100.002017-07-21 12:00:00+12 |
150.002017-07-2212:00:00+12 | 150.00
 
(4 rows)

I haven't used LATERAL much myself but I've noticed that it's often
used to express this type of thing.  "Get me the latest ... as of time
...".

It'd a bit like the way we recognise EXISTS (...) as a semi-join and
execute it with a join operator instead of having a SEMI JOIN syntax.
On the other hand it's a bit more long winded, extreme and probably
quite niche.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] ASOF join

От
Konstantin Knizhnik
Дата:

On 21.06.2017 11:00, Thomas Munro wrote:
> Hmm.  Yeah, I see the notational problem.  It's hard to come up with a
> new syntax that has SQL nature.  What if... we didn't use a new syntax
> at all, but recognised existing queries that are executable with this
> strategy?  Queries like this:
>
> WITH ticks(time, price) AS
>         (VALUES ('2017-07-20 12:00:00'::timestamptz, 100.00),
>                 ('2017-07-21 11:00:00'::timestamptz, 150.00)),
>       times(time) AS
>         (VALUES ('2017-07-19 12:00:00'::timestamptz),
>                 ('2017-07-20 12:00:00'::timestamptz),
>                 ('2017-07-21 12:00:00'::timestamptz),
>                 ('2017-07-22 12:00:00'::timestamptz))
>
> SELECT times.time, previous_tick.price
>    FROM times
>    LEFT JOIN LATERAL (SELECT * FROM ticks
>                        WHERE ticks.time <= times.time
>                        ORDER BY ticks.time DESC LIMIT 1) previous_tick ON true
>   ORDER BY times.time;
>
>            time          | price
> ------------------------+--------
>   2017-07-19 12:00:00+12 |
>   2017-07-20 12:00:00+12 | 100.00
>   2017-07-21 12:00:00+12 | 150.00
>   2017-07-22 12:00:00+12 | 150.00
> (4 rows)
>
> I haven't used LATERAL much myself but I've noticed that it's often
> used to express this type of thing.  "Get me the latest ... as of time
> ...".
>
> It'd a bit like the way we recognise EXISTS (...) as a semi-join and
> execute it with a join operator instead of having a SEMI JOIN syntax.
> On the other hand it's a bit more long winded, extreme and probably
> quite niche.
Thank you for this idea. I agree that it is the best way of implementing 
ASOF join - just as optimization of standard SQL query.
But do you think that still it will be good idea to extend SQL syntax 
with ASOF JOIN ... USING ... clause? It will significantly simplify 
writing queries like above
and IMHO doesn't introduce some confusions with standard SQL syntax. My 
primary idea of suggesting ASOF join for Postgres was not  just building 
more efficient plan (using merge join instead of nested loop) but also 
simplifying writing of such queries. Or do you think that nobody will be 
interested in non-standard SQL extensions?

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] ASOF join

От
Thomas Munro
Дата:
On Mon, Jun 19, 2017 at 11:57 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> On 16.06.2017 19:07, David Fetter wrote:
>> If you turn your head sideways, it's very similar to the range merge
>> join Jeff Davis proposed.  https://commitfest.postgresql.org/14/1106/
>
> May be, but I do not understand how to limit result to contain exactly one
> (last) inner tuple for each outer tuple.

Yeah, it's somehow related but it's not the same thing.  I guess you
can think of the keys in the ASOF case as modelling range starts, with
range ends implied by the record with next higher/lower key.
Concretely, if every 'tick' row from my example in a nearby message
had a time but also an end time to be set when a new tick is inserted
so that each tick row had the complete effective time range for that
tick, then you could rewrite the query as "get me the tick whose
effective time range overlaps with each value of times.time" and get a
nice range merge join with Jeff's patch.  That sort of thing might be
very useful for SQL:2011 temporal query-style stuff, but it's not what
Konstantin wants to do: he wants to merge time series where the
effective range is not explicitly stored in every row.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] ASOF join

От
Thomas Munro
Дата:
On Wed, Jun 21, 2017 at 9:46 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> Thank you for this idea. I agree that it is the best way of implementing
> ASOF join - just as optimization of standard SQL query.

Great.  I think this part definitely has potential.

> But do you think that still it will be good idea to extend SQL syntax with
> ASOF JOIN ... USING ... clause? It will significantly simplify writing
> queries like above
> and IMHO doesn't introduce some confusions with standard SQL syntax. My
> primary idea of suggesting ASOF join for Postgres was not  just building
> more efficient plan (using merge join instead of nested loop) but also
> simplifying writing of such queries. Or do you think that nobody will be
> interested in non-standard SQL extensions?

I can see the appeal, but I expect it to be difficult to convince the
project to accept a non-standard syntax for a niche use case that can
be expressed already.  Q is super terse and designed for time series
data.  SQL is neither of those things.

Some first reactions to the syntaxes you mentioned:

1.  times LEFT ASOF JOIN ticks ON ticks.time <= times.time
2.  times LEFT ASOF JOIN ticks USING (time)
3.  times LEFT ASOF JOIN ticks USING (ticks.time, times.time)

The USING ideas don't seem to be general enough, because there is no
place to say whether to use a lower or higher value if there is no
match, or did I miss something?  Relying on an ORDER BY clause in the
query to control the meaning of the join seems too weird, and making
it always (for example) <= would be an arbitrary limitation.  The
first syntax at least has enough information: when you say one of <,
>, <=, >= you also imply the search order.  I'm not sure if there are
any problems with that, perhaps when combined with other quals.

The equivalent nearly-standard syntax is definitely quite verbose, but
it has the merit of being absolutely explicit about which row from
'ticks' will be selected:
 times LEFT JOIN LATERAL (SELECT * FROM ticks                           WHERE ticks.time <= times.time
     ORDER BY ticks.time DESC LIMIT 1) x ON true
 

-- 
Thomas Munro
http://www.enterprisedb.com