Обсуждение: Eliminating unnecessary left joins
Hi,
When using views built with left joins, and then querying against these views, there are a lot of join in the plan that are not necessary, because I don't select/use any column of each table in the views every time. Tables that are left joined and never referenced anywhere else in the query should be removed from the plan. I think this can be done without any other analyzation or catalog lookup, so it is a quite cheap optimization step, and doing it won't influence the result, but the query will run faster.
This way with a complex query against these views usually the half of the join can be eliminated, and the plan will be quite more optimal.
Why left join a table if never used/referenced in the query?
How easy is to teach Postgres to this?
I would like to help somehow to introduce this feature as soon as possible. What should I do?
Thanks,
Otto
"Ottó Havasvölgyi" <havasvolgyi.otto@gmail.com> writes: > When using views built with left joins, and then querying against these > views, there are a lot of join in the plan that are not necessary, because I > don't select/use any column of each table in the views every time. Tables > that are left joined and never referenced anywhere else in the query should > be removed from the plan. That might cause you to get the wrong number of copies of some rows --- what if a row of the left table should join to multiple rows on the right? regards, tom lane
Tom Lane wrote: > "Ottó Havasvölgyi" <havasvolgyi.otto@gmail.com> writes: > >> When using views built with left joins, and then querying against these >> views, there are a lot of join in the plan that are not necessary, because I >> don't select/use any column of each table in the views every time. Tables >> that are left joined and never referenced anywhere else in the query should >> be removed from the plan. >> > > That might cause you to get the wrong number of copies of some rows --- > what if a row of the left table should join to multiple rows on the right? > That would be trouble. But I've seen quite some cases where the right can contain only zero or one row, because of PK constraints. In this case, elimination would be safe. Regards, Andreas
On Fri, 2007-04-06 at 19:38 -0400, Tom Lane wrote: > "Ottó Havasvölgyi" <havasvolgyi.otto@gmail.com> writes: > > When using views built with left joins, and then querying against these > > views, there are a lot of join in the plan that are not necessary, because I > > don't select/use any column of each table in the views every time. Tables > > that are left joined and never referenced anywhere else in the query should > > be removed from the plan. > > That might cause you to get the wrong number of copies of some rows --- > what if a row of the left table should join to multiple rows on the right? In the case that PKs match between the tables, then exclusion is safe. This would enable vertical partitioning, so is a very desirable feature. If this was possible, it would be a commonly used optimisation. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
2007/4/7, Andreas Pflug <pgadmin@pse-consulting.de>: > Tom Lane wrote: > >> "Ottó Havasvölgyi" <havasvolgyi.otto@gmail.com> writes: >> >>> When using views built with left joins, and then querying against these >>> views, there are a lot of join in the plan that are not necessary, because I >>> don't select/use any column of each table in the views every time. Tables >>> that are left joined and never referenced anywhere else in the query should >>> be removed from the plan. >> >> That might cause you to get the wrong number of copies of some rows --- >> what if a row of the left table should join to multiple rows on the right? > >That would be trouble. But I've seen quite some cases where the right >can contain only zero or one row, because of PK constraints. In this > case, elimination would be safe. I would like to mention that this kind of structure is used by Hibernate (ORM for Java/.NET) for mapping class hierarchies. I can attest that this optimization is supported by MS-SQL and I think (not tested) also by Oracle. To recapitulate, the optimization would be: Remove left outer joined tables from the join list, if they are not used by the query, and the join attributes are a key for it (I assume an equality join). Typical example: PARENT_CLASS (PK: ID) CHILD_CLASS (PK: ID) In query: SELECT P.ID FROM PARENT_CLASS P LEFT OUTER JOIN CHILD_CLASS C ON P.ID = C.ID; the join on CHILD_CLASS can be eliminated, because the join attribute ID is a key for it, and none of its attributes are used in the query. Hibernate: <url:http://www.hibernate.org/> Hibernate Inheritance Mapping: <url:http://www.hibernate.org/hib_docs/reference/en/html/inheritance.html> greetings, Nicolas -- Nicolas Barbier http://www.gnu.org/philosophy/no-word-attachments.html
Sorry, I have left out the PK requirement.
What Nicolas wrote is right, I also use an O/R mapper and inheritance is solved with vertical partitioning. The tables are connected to each other with the PK. And the mapper defines views for each class with left joins. The mapper generates queries based on these views. A high fraction of the joins would be eliminated almost in every query.
My simple example:
Class hierarchy and fields:
Shape (ID, X, Y)
|
+-Circle (ID, Radius)
|
+-Rectangle (ID, Width, Height)
The mapper creates 3 tables with the columns next to the class name.
And it creates 3 views. One of them:
RectangleView: SELECT r."ID" as "ID", s."X" as "X", s."Y" as "Y", r."Width" as "Width", r."Height" as "Height" FROM "Rectangle" r LEFT JOIN "Shape" s ON ( r.ID=s.ID)
Now if I query Rectangle object IDs, whose Width is greater than 5, it will generate this:
SELECT "ID" FROM RectangleView WHERE "Width">5
In this case I don't need to left join the Shape table, because X and Y columns are not used.
The other typical situation is when I execute more complex, not-O/Rmapper-generated SQL commands based on these views for reporting. For example the average width of rectangles whose height is greater than 10.
----------------------------------------------------
This optimization should be also applied to subqueries.
Is this optimization relatively easy to introduce?
I would gladly work on this, but unfortunately I don't know the codebase at all.
I would really appreciate if someone competent implemented this feature in 8.4.
Thank you in advance,
Otto
2007/4/7, Ottó Havasvölgyi <havasvolgyi.otto@gmail.com>: > My simple example: > > Class hierarchy and fields: > Shape (ID, X, Y) > | > +-Circle (ID, Radius) > | > +-Rectangle (ID, Width, Height) > > The mapper creates 3 tables with the columns next to the class name. > And it creates 3 views. One of them: > > RectangleView: SELECT r."ID" as "ID", s."X" as "X", s."Y" as "Y", r."Width" > as "Width", r."Height" as "Height" FROM "Rectangle" r LEFT JOIN "Shape" s ON > ( r.ID=s.ID) I find this view definition a bit strange: why is there a left outer join? I expect there to be a FK from Rectangle.ID to Shape.ID ("all rectangles are shapes"), which makes the definition totally equivalent with one in which a normal join is used (whether attributes of Shape are used or not). The main use case I see for the original optimization is ORMs that join in a whole hierarchy, even when only a part of it is needed. I guess that that is rather common. The ORM that I use does exactly this, because the main target-DBMSs (MS-SQL and Oracle) do the optimization for it. Example (somewhat less contrived than my previous one): Imagine an implementation of the typical "books that are borrowed by people" n-m relationship, using three tables ("Book", "Borrowed", "Person"). Let's find all books that have been borrowed by a certain person. The "non-ORM" version would be something like: SELECT Book.* FROM Book JOIN Borrowed ON Borrowed.book_id = Book.id WHERE Borrowed.person_id = <x>; Now assume that Borrowed is a class hierarchy mapped into multiple tables by a typical ORM. The query would probably become something like: SELECT Book.* FROM Book JOIN Borrowed_Parent ON Borrowed_Parent.book_id = Book.id LEFT JOIN Borrowed_Child1 ON Borrowed_Child1.id= Borrowed_Parent.id LEFT JOIN Borrowed_Child2 ON Borrowed_Child2.id = Borrowed_Parent.id (...) WHERE Borrowed_Parent.person_id = <x>; It is clear that the children of the hierarchy are needlessly joined in (as the only attribute that is actually needed is person_id, which is on the parent level). It is not always trivial for the ORM to find that out, without writing stuff that looks suspiciously similar to a DBMS optimizer. Maybe it is debatable whether this optimization should be done by the application (i.e. the ORM) or by the DBMS. I am personally in favor of doing it in the DBMS. greetings, Nicolas -- Nicolas Barbier http://www.gnu.org/philosophy/no-word-attachments.html
My mapper joins the parent classes' tables to the current class' table in the view. In the ShapeView only ID, X, and Y is selected from the shape table, and none of the child tables are touched, opposite to your sample. But even though all Shape objects (circles and rectangles too) are in the resultset as Shape objects. I see this storage model quite consistent.
You are right, that this can be done with inner join too, this is an option in the mapper. Oracle and MSSQL performs this left join optimization, so it is usually used with left join by other mapper users. I have asked them (the developers of the mapper) to perform this optimization at mapper level because not all DBMSs supported this optimization, but it seemed they didn't like this idea... And then I came here. This optimization would be useful for every Postgres users.
To be honest I also find your sample strange, more exactly that *sibling* child tables are left joined to the parent. Maybe because the storage model is different than in my mapper.
In my case the left joined parent tables should be excluded by the optimizer if possible.
Best regards,
Otto
2007/4/8, Nicolas Barbier <nicolas.barbier@gmail.com>:
2007/4/7, Ottó Havasvölgyi < havasvolgyi.otto@gmail.com>:
> My simple example:
>
> Class hierarchy and fields:
> Shape (ID, X, Y)
> |
> +-Circle (ID, Radius)
> |
> +-Rectangle (ID, Width, Height)
>
> The mapper creates 3 tables with the columns next to the class name.
> And it creates 3 views. One of them:
>
> RectangleView: SELECT r."ID" as "ID", s."X" as "X", s."Y" as "Y", r."Width"
> as "Width", r."Height" as "Height" FROM "Rectangle" r LEFT JOIN "Shape" s ON
> ( r.ID=s.ID)
I find this view definition a bit strange: why is there a left outer
join? I expect there to be a FK from Rectangle.ID to Shape.ID ("all
rectangles are shapes"), which makes the definition totally equivalent
with one in which a normal join is used (whether attributes of Shape
are used or not).
The main use case I see for the original optimization is ORMs that
join in a whole hierarchy, even when only a part of it is needed. I
guess that that is rather common. The ORM that I use does exactly
this, because the main target-DBMSs (MS-SQL and Oracle) do the
optimization for it.
Example (somewhat less contrived than my previous one):
Imagine an implementation of the typical "books that are borrowed by
people" n-m relationship, using three tables ("Book", "Borrowed",
"Person"). Let's find all books that have been borrowed by a certain
person.
The "non-ORM" version would be something like:
SELECT Book.*
FROM
Book
JOIN Borrowed ON Borrowed.book_id = Book.id
WHERE Borrowed.person_id = <x>;
Now assume that Borrowed is a class hierarchy mapped into multiple
tables by a typical ORM. The query would probably become something
like:
SELECT Book.*
FROM
Book
JOIN Borrowed_Parent ON Borrowed_Parent.book_id = Book.id
LEFT JOIN Borrowed_Child1 ON Borrowed_Child1.id = Borrowed_Parent.id
LEFT JOIN Borrowed_Child2 ON Borrowed_Child2.id = Borrowed_Parent.id
(...)
WHERE Borrowed_Parent.person_id = <x>;
It is clear that the children of the hierarchy are needlessly joined
in (as the only attribute that is actually needed is person_id, which
is on the parent level). It is not always trivial for the ORM to find
that out, without writing stuff that looks suspiciously similar to a
DBMS optimizer.
Maybe it is debatable whether this optimization should be done by the
application (i.e. the ORM) or by the DBMS. I am personally in favor of
doing it in the DBMS.
greetings,
Nicolas
--
Nicolas Barbier
http://www.gnu.org/philosophy/no-word-attachments.html
I agree with others that the way that query is constructed is a bit odd, but it does bring another optimization to mind: when doing an inner-join between a parent and child table when RI is defined between them, if the query only refers to the child table you can drop the parent table from the join, because each row in the child table must have one and only one row in the parent. Use-case: I'll often use views to make it easier to query several related tables, but not all queries against the views need to hit every table. IE: if a table has several status fields that have RI to parent tables that describe what each status is, you sometimes will query for the status description, sometimes not. I suspect that checking to see if tables have the right unique keys or RI would add a noticeable amount of extra work to query planning, so we might want a GUC to disable it. On Apr 7, 2007, at 12:45 PM, Ottó Havasvölgyi wrote: > Sorry, I have left out the PK requirement. > What Nicolas wrote is right, I also use an O/R mapper and > inheritance is solved with vertical partitioning. The tables are > connected to each other with the PK. And the mapper defines views > for each class with left joins. The mapper generates queries based > on these views. A high fraction of the joins would be eliminated > almost in every query. > > My simple example: > > Class hierarchy and fields: > Shape (ID, X, Y) > | > +-Circle (ID, Radius) > | > +-Rectangle (ID, Width, Height) > > The mapper creates 3 tables with the columns next to the class name. > And it creates 3 views. One of them: > > RectangleView: SELECT r."ID" as "ID", s."X" as "X", s."Y" as "Y", > r."Width" as "Width", r."Height" as "Height" FROM "Rectangle" r > LEFT JOIN "Shape" s ON ( r.ID=s.ID) > > Now if I query Rectangle object IDs, whose Width is greater than 5, > it will generate this: > > SELECT "ID" FROM RectangleView WHERE "Width">5 > > In this case I don't need to left join the Shape table, because X > and Y columns are not used. > > > The other typical situation is when I execute more complex, not-O/ > Rmapper-generated SQL commands based on these views for reporting. > For example the average width of rectangles whose height is greater > than 10. > ---------------------------------------------------- > > This optimization should be also applied to subqueries. > > > > Is this optimization relatively easy to introduce? > > I would gladly work on this, but unfortunately I don't know the > codebase at all. > I would really appreciate if someone competent implemented this > feature in 8.4. > > Thank you in advance, > Otto > -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim,
Maybe odd, but simpler to optimize this way.
Your idea would be also a very good optimization, there was already a discussion about that here: http://archives.postgresql.org/pgsql-performance/2006-01/msg00151.php , but that time Tom refused it because it was too expensive and rare. Maybe now he has a different opinion.
However, left join optimization is lot simpler and cheaper, and can be useful not only for O/R mappers, but for efficient vertical partitioning as Simon mentioned.
Best regards,
Otto
2007/4/12, Jim Nasby <decibel@decibel.org>:
I agree with others that the way that query is constructed is a bit
odd, but it does bring another optimization to mind: when doing an
inner-join between a parent and child table when RI is defined
between them, if the query only refers to the child table you can
drop the parent table from the join, because each row in the child
table must have one and only one row in the parent.
Use-case: I'll often use views to make it easier to query several
related tables, but not all queries against the views need to hit
every table. IE: if a table has several status fields that have RI to
parent tables that describe what each status is, you sometimes will
query for the status description, sometimes not.
I suspect that checking to see if tables have the right unique keys
or RI would add a noticeable amount of extra work to query planning,
so we might want a GUC to disable it.
On Apr 7, 2007, at 12:45 PM, Ottó Havasvölgyi wrote:
> Sorry, I have left out the PK requirement.
> What Nicolas wrote is right, I also use an O/R mapper and
> inheritance is solved with vertical partitioning. The tables are
> connected to each other with the PK. And the mapper defines views
> for each class with left joins. The mapper generates queries based
> on these views. A high fraction of the joins would be eliminated
> almost in every query.
>
> My simple example:
>
> Class hierarchy and fields:
> Shape (ID, X, Y)
> |
> +-Circle (ID, Radius)
> |
> +-Rectangle (ID, Width, Height)
>
> The mapper creates 3 tables with the columns next to the class name.
> And it creates 3 views. One of them:
>
> RectangleView: SELECT r."ID" as "ID", s."X" as "X", s."Y" as "Y",
> r."Width" as "Width", r."Height" as "Height" FROM "Rectangle" r
> LEFT JOIN "Shape" s ON ( r.ID=s.ID)
>
> Now if I query Rectangle object IDs, whose Width is greater than 5,
> it will generate this:
>
> SELECT "ID" FROM RectangleView WHERE "Width">5
>
> In this case I don't need to left join the Shape table, because X
> and Y columns are not used.
>
>
> The other typical situation is when I execute more complex, not-O/
> Rmapper-generated SQL commands based on these views for reporting.
> For example the average width of rectangles whose height is greater
> than 10.
> ----------------------------------------------------
>
> This optimization should be also applied to subqueries.
>
>
>
> Is this optimization relatively easy to introduce?
>
> I would gladly work on this, but unfortunately I don't know the
> codebase at all.
> I would really appreciate if someone competent implemented this
> feature in 8.4.
>
> Thank you in advance,
> Otto
>
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
> Maybe odd, but simpler to optimize this way. > > Your idea would be also a very good optimization, there was > already a discussion about that here: > http://archives.postgresql.org/pgsql-performance/2006-01/msg00 > 151.php, but that time Tom refused it because it was too > expensive and rare. Maybe now he has a different opinion. > However, left join optimization is lot simpler and cheaper, > and can be useful not only for O/R mappers, but for efficient > vertical partitioning as Simon mentioned. For the views use case there is a simple solution without the expensive optimization: If you have a PK FK relationship simply rewrite the view to use a left join instead of a join. Since there is always one row on the outer (PK) side it makes no difference to the result set. And then the left join optimization can be used. Andreas
On Wed, 11 Apr 2007, Jim Nasby wrote: > I agree with others that the way that query is constructed is a bit > odd, but it does bring another optimization to mind: when doing an > inner-join between a parent and child table when RI is defined > between them, if the query only refers to the child table you can > drop the parent table from the join, because each row in the child > table must have one and only one row in the parent. I don't think that's quite true without qualifications. First, I think it needs to be an immediate constraint (and I don't remember how we handle set constraints inside functions that might be called from a statement, so it might need to be not deferrable). Second, I think you also need to take care of NULLs since child rows with NULLs in the key pass the constraint but have no rows in the parent and would get culled by the inner join. Also, there's a possible issue that constraints do not actually guarantee that they always hold true, merely that they hold true at particular times. I don't know if it's possible to get a statement executed such that it would see the table state between the action and constraint check or if such is allowed by spec.
I have this exact problem a lot. There are actually cases where you can eliminate regular joins, not just left joins. For example: CREATE TABLE partner (id serial,name varchar(40) not null,primary key (id) ); CREATE TABLE project (id serial,name varchar(40) not null,partner_id integer not null referencesproject (id) ); CREATE VIEW project_view AS SELECT p.id, p.name, p.partner_id, pp.name AS partner FROM project p, partner pp WHERE p.partner_id = pp.id; If someone does a select from project_view and does not select the partner column, the join can be eliminated, because the not null and foreign key constraints on the partner_id column guarantee that there will always be exactly one matching row in the project table. If you didn't have the NOT NULL constraint on the partner_id column, you'd have to write the view this way, as described in the original email: CREATE VIEW project_view AS SELECT p.id, p.name, p.partner_id, pp.name AS partner FROM project p LEFT JOIN partner pp ON p.partner_id = pp.id; In this example, I just have one join, so the benefit to eliminating it is minimal (unless the tables are very large). But in the real application, project_view joins the project table against six other tables using inner joins (all against the primary keys of those other tables) and four additional tables using left joins (also against the primary keys of those other tables). Most queries only use a subset of these columns - a typical query requires evaluating only about three of the ten joins. ...Robert
> I have this exact problem a lot. There are actually cases > where you can eliminate regular joins, not just left joins. > For example: > > CREATE TABLE partner ( > id serial, > name varchar(40) not null, > primary key (id) > ); > > CREATE TABLE project ( > id serial, > name varchar(40) not null, > partner_id integer not null references project (id) ^^^^^^^ -- I assume typo, should be partner > ); > > CREATE VIEW project_view AS > SELECT p.id, p.name, p.partner_id, pp.name AS partner FROM > project p, partner pp WHERE p.partner_id = pp.id; Same advice to you: 1. add not null to your id's 2. CREATE VIEW project_view AS SELECT p.id, p.name, p.partner_id, pp.name AS partner FROM project p left outer join partnerpp ON p.partner_id = pp.id; 3. wait (or implement :-) the left join optimization in pg Andreas
Hi,
Could you Bruce please add a TODO item for this feature?
The description could look something like this:
Eliminate the table T from the query/subquery if the following requirements are satisfied:
1. T is left joined
2. T is referenced only in the join expression where it is left joined
3. the left join's join expression is a simple equality expression like T1.C1=T2.C2; T1!=T2 and (T==T1 or T==T2)
4. the column of T in the join exression is the primary key of T
----------------------------------------------------
I hope it is comlete.
I think this is the simplest case, so we should start with this.
Thanks,
Otto
2007/4/16, Ottó Havasvölgyi <havasvolgyi.otto@gmail.com>: > Eliminate the table T from the query/subquery if the following requirements > are satisfied: > 1. T is left joined > 2. T is referenced only in the join expression where it is left joined > 3. the left join's join expression is a simple equality expression like > T1.C1=T2.C2; T1!=T2 and (T==T1 or T==T2) > 4. the column of T in the join exression is the primary key of T Condition 4 should be: the column of T in the join expression is a key of T (i.e. it doesn't need to be the PK, a UNIQUE constraint would be enough). This process can be done recursively (implementation doesn't have to be recursive, of course), to eliminate whole sub-trees of the join tree. Nicolas -- Nicolas Barbier http://www.gnu.org/philosophy/no-word-attachments.html
But then what about the null values? Perhaps unique + notnull is better?
Otto
2007/4/20, Nicolas Barbier <nicolas.barbier@gmail.com>:
2007/4/16, Ottó Havasvölgyi <havasvolgyi.otto@gmail.com >:
> Eliminate the table T from the query/subquery if the following requirements
> are satisfied:
> 1. T is left joined
> 2. T is referenced only in the join expression where it is left joined
> 3. the left join's join expression is a simple equality expression like
> T1.C1=T2.C2; T1!=T2 and (T==T1 or T==T2)
> 4. the column of T in the join exression is the primary key of T
Condition 4 should be: the column of T in the join expression is a key
of T (i.e. it doesn't need to be the PK, a UNIQUE constraint would be
enough).
This process can be done recursively (implementation doesn't have to
be recursive, of course), to eliminate whole sub-trees of the join
tree.
Nicolas
--
Nicolas Barbier
http://www.gnu.org/philosophy/no-word-attachments.html