Patch: Range Merge Join

Поиск

Список

Период

Сортировка

От	Thomas
Тема	Patch: Range Merge Join
Дата	9 июня 2021 г. 15:05:35
Msg-id	CAMWfgiAsJgVrkbrsv740Y2=2duO4rVYRhaD08EhFqBuJFmBH1A@mail.gmail.com обсуждение исходный текст
Ответы	Re: Patch: Range Merge Join
Список	pgsql-hackers

Дерево обсуждения

Hi Hackers,

More than a year ago we submitted a patch that offered two primitives
(ALIGN and NORMALIZE) to support the processing of temporal data with range types. During the ensuing discussion we decided to withdraw the original patch
and to split it into smaller parts.

In the context of my BSc thesis, we started working and implementing a Range Merge Join (RMJ), which is key for most temporal operations. The RMJ is a useful operator in its own right and it greatly benefits any possible temporal extension.

We have implemented the Range Merge Join algorithm by extending the
existing Merge Join to also support range conditions, i.e., BETWEEN-AND
or @> (containment for range types). Range joins contain a containment
condition and may have (optional) equality conditions. For example the
following query joins employees with a department and work period with
events on a specific day for that department:

SELECT emps.name, emps.dept, events.event, events.day
FROM emps JOIN events ON emps.dept = events.dept
AND events.day <@ emps.eperiod;

The resulting query plan is as follows:

QUERY PLAN

----------------------------------------------------------------------------------------------
Range Merge Join (cost=106.73..118.01 rows=3 width=100) (actual rows=6
loops=1)
Merge Cond: (emps.dept = events.dept)
Range Cond: (events.day <@ emps.eperiod)
-> Sort (cost=46.87..48.49 rows=650 width=96) (actual rows=5 loops=1)
Sort Key: emps.dept, emps.eperiod
Sort Method: quicksort Memory: 25kB
-> Seq Scan on emps (cost=0.00..16.50 rows=650 width=96) (actual
rows=5 loops=1)
-> Sort (cost=59.86..61.98 rows=850 width=68) (actual rows=6 loops=1)
Sort Key: events.dept, events.day
Sort Method: quicksort Memory: 25kB
-> Seq Scan on events (cost=0.00..18.50 rows=850 width=68)
(actual rows=5 loops=1)
Planning Time: 0.077 ms
Execution Time: 0.092 ms
(13 rows)

Example queries and instances of tables can be found at the end of the mail.

The range merge join works with range types using <@ and also scalar data types
using "a.ts BETWEEN b.ts AND b.te" or "b.ts <= a.ts AND a.ts <= b.te".
Currently, PostgreSQL does not provide specialized join algorithms for range
conditions (besides index nested loops), or Hash Join and Merge Joins that
evaluate an equality condition only.

Our idea is to have a separate range_cond besides the merge_cond for the
Merge Join that stores the potential range conditions of a query. The state
diagram of the Merge Join is then extended to also take into consideration
the range_cond. See the simplified state diagram of the Range Merge Join as
an extension of the Merge Join in the attachment. These additions besides a
boolean check have no effect on the Marge Join when no range condition is
present.

We provide extensive testing results and further information, including the
full BSc Thesis (technical report), describing the implementation and tests
in detail on http://tpg.inf.unibz.it/project-rmj and
http://tpg.inf.unibz.it/downloads/rmj-report.pdf.

We performed several experiments and show that depending on the selectivity of
the range condition the range merge join outperforms existing execution
algorithms up to an order of magnitude. We found that the range merge join that
needs to find range_cond from inequalities, incurs only a very small overhead
in planning time in some TPCH queries (see Table 5.3 in the technical report)
and in general only a very small overhead for a large number of joins or many
inequality conditions (see Figure 5.1). To check the overhead of our extension
for the traditional merge join execution time, we executed the TPCH queries
using the merge join (hash join disabled) and found no statistically
significant difference (see Table 5.4).

We are looking forward to your feedback and any suggestions to improve the patch.

Best Regards,

Thomas Mannhart

Attachments: State Diagram and Patch

OPEN POINTS AND TODOs:

- Currently we do not consider parallelization
- Not all cases for input sort orders are considered yet

EXAMPLE QUERIES:

The first query uses a range condition using BETWEEN AND only and no
equality condition.

----------------------------------------------------------------------------------------------

DROP TABLE IF EXISTS marks;
DROP TABLE IF EXISTS grades;

CREATE TABLE marks (name text, snumber numeric, mark numeric);
CREATE TABLE grades (mmin numeric, mmax numeric, grade numeric);

INSERT INTO marks (name, snumber, mark) VALUES
('Anton', 1232, 23.5),
('Thomas', 4356, 95),
('Michael', 1125, 72),
('Hans', 3425, 90);

INSERT INTO grades (mmin, mmax, grade) VALUES
(0.0, 18, 1),
(18.5, 36, 2),
(36.5, 54, 3),
(54.5, 72, 4),
(72.5, 90, 5),
(90.5, 100, 6);

EXPLAIN(ANALYZE, TIMING FALSE)
SELECT marks.name, marks.snumber, grades.grade
FROM marks JOIN grades ON marks.mark BETWEEN grades.mmin AND grades.mmax;

QUERY PLAN

-----------------------------------------------------------------------------------------------
Range Merge Join (cost=93.74..920.13 rows=46944 width=96) (actual rows=16
loops=1)
Range Cond: ((marks.mark >= grades.mmin) AND (marks.mark <= grades.mmax))
-> Sort (cost=46.87..48.49 rows=650 width=96) (actual rows=12 loops=1)
Sort Key: grades.mmin
Sort Method: quicksort Memory: 25kB
-> Seq Scan on grades (cost=0.00..16.50 rows=650 width=96)
(actual rows=12 loops=1)
-> Sort (cost=46.87..48.49 rows=650 width=96) (actual rows=21 loops=1)
Sort Key: marks.mark
Sort Method: quicksort Memory: 25kB
-> Seq Scan on marks (cost=0.00..16.50 rows=650 width=96)
(actual rows=8 loops=1)
Planning Time: 0.078 ms
Execution Time: 0.068 ms
(12 rows)

----------------------------------------------------------------------------------------------

The second query uses a range and an equality condition and joins the
relations using contained in (<@).

----------------------------------------------------------------------------------------------

DROP TABLE IF EXISTS emps;
DROP TABLE IF EXISTS events;

CREATE TABLE emps (name text, dept text, eperiod daterange);
CREATE TABLE events (event text, dept text, day date);

INSERT INTO emps (name, dept, eperiod) VALUES
('Anton', 'Sales', '(2020-01-01, 2020-03-31)'),
('Thomas', 'Marketing', '(2020-01-01, 2020-06-30)'),
('Michael', 'Marketing', '(2020-03-01, 2020-12-31)'),
('Hans', 'Sales', '(2020-01-01, 2020-12-31)'),
('Thomas', 'Accounting', '(2020-07-01, 2020-12-31)');

INSERT INTO events (event, dept, day) VALUES
('Fair CH', 'Marketing', '2020-03-05'),
('Presentation', 'Sales', '2020-06-15'),
('Fair IT', 'Marketing', '2020-08-03'),
('Balance Report', 'Accounting', '2020-08-03'),
('Product launch', 'Marketing', '2020-10-15');

EXPLAIN(ANALYZE, TIMING FALSE)
SELECT emps.name, emps.dept, events.event, events.day
FROM emps JOIN events ON emps.dept = events.dept
AND events.day <@ emps.eperiod;

QUERY PLAN

----------------------------------------------------------------------------------------------
Range Merge Join (cost=106.73..118.01 rows=3 width=100) (actual rows=6
loops=1)
Merge Cond: (emps.dept = events.dept)
Range Cond: (events.day <@ emps.eperiod)
-> Sort (cost=46.87..48.49 rows=650 width=96) (actual rows=5 loops=1)
Sort Key: emps.dept, emps.eperiod
Sort Method: quicksort Memory: 25kB
-> Seq Scan on emps (cost=0.00..16.50 rows=650 width=96) (actual
rows=5 loops=1)
-> Sort (cost=59.86..61.98 rows=850 width=68) (actual rows=6 loops=1)
Sort Key: events.dept, events.day
Sort Method: quicksort Memory: 25kB
-> Seq Scan on events (cost=0.00..18.50 rows=850 width=68)
(actual rows=5 loops=1)
Planning Time: 0.077 ms
Execution Time: 0.092 ms
(13 rows)

----------------------------------------------------------------------------------------------

Вложения

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Patch: Range Merge Join

Вложения