Обсуждение: Poor performance on 9.4.4 when matching multiple columns from left side of LATERAL join

Поиск
Список
Период
Сортировка

Poor performance on 9.4.4 when matching multiple columns from left side of LATERAL join

От
Steven Grimm
Дата:
We have a table, call it "multi_id", that contains columns with IDs of
various kinds of objects in my system, and another table that's a
generic owner/key/value store for object attributes (think configuration
settings, and I'll refer to this table as "settings"). To wit:

---------------------------------------------
CREATE TABLE multi_id (
   id1 INTEGER PRIMARY KEY,
   id2 INTEGER,
   id3 INTEGER
);
CREATE TABLE settings (
   owner_id INTEGER,
   setting_id INTEGER,
   setting_value TEXT,
   PRIMARY KEY (owner_id, setting_id)
);
CREATE UNIQUE INDEX multi_id_idx_id1 ON multi_id (id1, id2);
CREATE UNIQUE INDEX multi_id_idx_id2 ON multi_id (id2, id1);
CREATE INDEX settings_idx_setting_id ON settings (setting_id,
setting_value);
---------------------------------------------

We want to find all the rows from multi_id where any of the IDs
(including its primary key) have a certain setting with a certain value.

LATERAL seemed like the tool for the job, so we tried the following:

---------------------------------------------
SELECT mid.id1
FROM multi_id AS mid,
LATERAL (
     SELECT 1
     FROM settings
     WHERE setting_id = 1
     AND setting_value = 'common_1'
     AND owner_id IN (mid.id1, mid.id2, mid.id3)
) AS setting_matcher;
---------------------------------------------

When we're searching for a common value, this query takes a LONG time.
It turns out the culprit is the IN clause in the subquery. If I change
"owner_id IN (mid.id1, mid.id2, mid.id3)" to "owner_id = mid.id1", the
query executes in about 1/900 the time. It remains that fast if I change
mid.id1 to mid.id2 or mid.id3, meaning if I do a UNION of those three
queries to get the same result set as the query above, the whole thing
is roughly 300x faster.

Execution plan for the IN version followed by the = version (for just
one of the IDs):

---------------------------------------------
Nested Loop  (cost=5.39..8107.18 rows=285 width=4) (actual
time=1.230..6456.567 rows=4499 loops=1)
    Join Filter: (settings.owner_id = ANY (ARRAY[mid.id1, mid.id2,
mid.id3]))
    Rows Removed by Join Filter: 22495501
    ->  Seq Scan on multi_id mid  (cost=0.00..78.00 rows=5000 width=12)
(actual time=0.010..1.385 rows=5000 loops=1)
    ->  Materialize  (cost=5.39..310.66 rows=95 width=4) (actual
time=0.000..0.263 rows=4500 loops=5000)
          ->  Bitmap Heap Scan on settings  (cost=5.39..310.19 rows=95
width=4) (actual time=1.207..3.210 rows=4500 loops=1)
                Recheck Cond: ((setting_id = 1) AND (setting_value =
'common_1'::text))
                Heap Blocks: exact=1405
                ->  Bitmap Index Scan on settings_idx_setting_id
(cost=0.00..5.37 rows=95 width=0) (actual time=0.930..0.930 rows=4500
loops=1)
                      Index Cond: ((setting_id = 1) AND (setting_value =
'common_1'::text))
  Planning time: 0.178 ms
  Execution time: 6456.897 ms


  Hash Join  (cost=145.98..472.93 rows=103 width=4) (actual
time=2.677..6.890 rows=4500 loops=1)
    Hash Cond: (settings.owner_id = mid.id1)
    ->  Bitmap Heap Scan on settings  (cost=5.48..330.50 rows=103
width=4) (actual time=1.194..3.477 rows=4500 loops=1)
          Recheck Cond: ((setting_id = 1) AND (setting_value =
'common_1'::text))
          Heap Blocks: exact=1405
          ->  Bitmap Index Scan on settings_idx_setting_id
(cost=0.00..5.45 rows=103 width=0) (actual time=0.854..0.854 rows=4500
loops=1)
                Index Cond: ((setting_id = 1) AND (setting_value =
'common_1'::text))
    ->  Hash  (cost=78.00..78.00 rows=5000 width=4) (actual
time=1.463..1.463 rows=5000 loops=1)
          Buckets: 1024  Batches: 1  Memory Usage: 176kB
          ->  Seq Scan on multi_id mid  (cost=0.00..78.00 rows=5000
width=4) (actual time=0.007..0.717 rows=5000 loops=1)
  Planning time: 0.311 ms
  Execution time: 7.166 ms
---------------------------------------------

What am I doing wrong in the IN version of the query, if anything?

I wrote a script to populate a test database with a simplified version
of our real data model and demonstrate the behavior I'm seeing:
https://gist.github.com/sgrimm-sg/2722068ef844d3e02129

Thanks!



Re: Poor performance on 9.4.4 when matching multiple columns from left side of LATERAL join

От
"David G. Johnston"
Дата:
On Fri, Nov 13, 2015 at 11:25 PM, Steven Grimm <sgrimm@thesegovia.com> wrote:
We want to find all the rows from multi_id where any of the IDs (including its primary key) have a certain setting with a certain value.

LATERAL seemed like the tool for the job, so we tried the following:

---------------------------------------------
SELECT mid.id1
FROM multi_id AS mid,
LATERAL (
    SELECT 1
    FROM settings
    WHERE setting_id = 1
    AND setting_value = 'common_1'
    AND owner_id IN (mid.id1, mid.id2, mid.id3)
) AS setting_matcher;
---------------------------------------------
 
​IN semantics w.r.t NULL can result in atrocious performance in some instances.  I cannot speak to this one in particular but I'm curious if
[...]
WHERE setting_id = 1
AND setting_value = 'common_1'
AND (
owner_id = mid.id1
OR
owner_id = mid.id2
OR
owner_id = mid.id3
)​
 

placed into an EXISTS would work any better.  It seems pointless to include a LATERAL if you are not going to output any of the fields from the laterally joined relation.  If you want a join I'm not sure that INNER wouldn't be just as good, with an ON clause of (owner_id = mid.id1 OR owner_id = mid.id2 OR owner_id = mid.id3)

David J.

Re: Poor performance on 9.4.4 when matching multiple columns from left side of LATERAL join

От
Steven Grimm
Дата:


November 13, 2015 at 10:48 PM
 
​IN semantics w.r.t NULL can result in atrocious performance in some instances.  I cannot speak to this one in particular but I'm curious if
[...]
WHERE setting_id = 1
AND setting_value = 'common_1'
AND (
owner_id = mid.id1
OR
owner_id = mid.id2
OR
owner_id = mid.id3
)​
 

placed into an EXISTS would work any better.

Better, but still bad. Execution time on my test system goes from 6583ms to 4394ms, whereas the version with just one "=" takes 12ms.

  It seems pointless to include a LATERAL if you are not going to output any of the fields from the laterally joined relation.

True. Our actual query is more complex than the stripped-down one here. I wanted to reduce it to the bare minimum to demonstrate the performance problem so as to make it easier to figure out what was going on. Including columns from the LATERAL query or leaving them out doesn't have much impact on execution time.

-Steve

Re: Poor performance on 9.4.4 when matching multiple columns from left side of LATERAL join

От
David Rowley
Дата:
On 14 November 2015 at 19:25, Steven Grimm <sgrimm@thesegovia.com> wrote:

Execution plan for the IN version followed by the = version (for just one of the IDs):

---------------------------------------------
Nested Loop  (cost=5.39..8107.18 rows=285 width=4) (actual time=1.230..6456.567 rows=4499 loops=1)
   Join Filter: (settings.owner_id = ANY (ARRAY[mid.id1, mid.id2, mid.id3]))
   Rows Removed by Join Filter: 22495501
   ->  Seq Scan on multi_id mid  (cost=0.00..78.00 rows=5000 width=12) (actual time=0.010..1.385 rows=5000 loops=1)
   ->  Materialize  (cost=5.39..310.66 rows=95 width=4) (actual time=0.000..0.263 rows=4500 loops=5000)
         ->  Bitmap Heap Scan on settings  (cost=5.39..310.19 rows=95 width=4) (actual time=1.207..3.210 rows=4500 loops=1)
               Recheck Cond: ((setting_id = 1) AND (setting_value = 'common_1'::text))
               Heap Blocks: exact=1405
               ->  Bitmap Index Scan on settings_idx_setting_id  (cost=0.00..5.37 rows=95 width=0) (actual time=0.930..0.930 rows=4500 loops=1)
                     Index Cond: ((setting_id = 1) AND (setting_value = 'common_1'::text))
 Planning time: 0.178 ms
 Execution time: 6456.897 ms


 Hash Join  (cost=145.98..472.93 rows=103 width=4) (actual time=2.677..6.890 rows=4500 loops=1)
   Hash Cond: (settings.owner_id = mid.id1)
   ->  Bitmap Heap Scan on settings  (cost=5.48..330.50 rows=103 width=4) (actual time=1.194..3.477 rows=4500 loops=1)
         Recheck Cond: ((setting_id = 1) AND (setting_value = 'common_1'::text))
         Heap Blocks: exact=1405
         ->  Bitmap Index Scan on settings_idx_setting_id  (cost=0.00..5.45 rows=103 width=0) (actual time=0.854..0.854 rows=4500 loops=1)
               Index Cond: ((setting_id = 1) AND (setting_value = 'common_1'::text))
   ->  Hash  (cost=78.00..78.00 rows=5000 width=4) (actual time=1.463..1.463 rows=5000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 176kB
         ->  Seq Scan on multi_id mid  (cost=0.00..78.00 rows=5000 width=4) (actual time=0.007..0.717 rows=5000 loops=1)
 Planning time: 0.311 ms
 Execution time: 7.166 ms
---------------------------------------------

What am I doing wrong in the IN version of the query, if anything?


In order to allow hash or merge joins, there needs to be at least 1 common equality expression in the join condition. Your IN query does not have this.

Notice that with your IN() query that the IN() clause is evaluated in the nested loop's join condition:

Join Filter: (settings.owner_id = ANY (ARRAY[mid.id1, mid.id2, mid.id3]))

If you think of what a nested loop is, basically for each outer row it must scan each inner row until it finds a match. Since this is a SEMI join, it can stop once the first match is found, and then move on to the next outer row. It then goes and does all that again with the next outer row until there's no outer rows left. This is why the IN version is so slow. In you case 22495501 + 4499 comparisons of that join condition have been made, which is probably not too unreasonable for 6.5 seconds.

The problem is that the optimizer is unable to use hash join or merge joins when you have the IN() condition as the join condition, the reason for this is that you're effectively saying to join on any of 3 conditions: settings.owner_id = mid.id1 OR settings.owner_id = mid.id2 OR settings.owner_id = mid.id3. If you think how a hash join works then 1 hash table is no good here, as you effectively have 3 possible keys for the hash table, the executor would have to build 3 tables to make that possible, but we only ever build 1 in PostgreSQL. As you may know, a hash table is a very efficient data structure for key value lookups, but there can only be a single key.  Merge join has the same problem because it's only possible to have a single sort order.

You might think that all this sounds terrible, but the main reason that you've got this issue is down to the design of the multi_id table. If you just had a single ID and perhaps another column to store the 1,2,3 (the need for this is not quite clear to me), also perhaps your PK to reference some other table, you could just write the query as:

select id from multi_id where id in(select owner_id from settings where  setting_id = 1 and setting_value = 'common_1');

Quite possibly this would produce a plan with a hash join and execute very quickly.

With your current schema the only option that I can see is to build a query such as:

select id1 from multi_id where id1 in (select owner_id from settings where  setting_id = 1 and setting_value = 'common_1')
union
select id2 from multi_id where id2 in (select owner_id from settings where  setting_id = 1 and setting_value = 'common_1')
union
select id3 from multi_id where id3 in (select owner_id from settings where  setting_id = 1 and setting_value = 'common_1')

but then you're suffering from having to perhaps build the same hash table 3 times, and also the overhead of the UNION getting rid of duplicate rows. You could perhaps make it slightly better with a common table expression such as:

with cte (select owner_id from settings where  setting_id = 1 and setting_value = 'common_1') as
select id1 from multi_id where id1 in (select owner_id from cte)
union
select id2 from multi_id where id2 in (select owner_id from cte)
union
select id3 from multi_id where id3 in (select owner_id from cte);

but you still have the union overhead.
 
--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Poor performance on 9.4.4 when matching multiple columns from left side of LATERAL join

От
Thomas Kellerer
Дата:
Steven Grimm schrieb am 14.11.2015 um 07:25:
> We have a table, call it "multi_id", that contains columns with IDs of various kinds of objects in my system,
>and another table that's a generic owner/key/value store for object attributes (think configuration settings,
>and I'll refer to this table as "settings"). To wit:
>
> ---------------------------------------------
> CREATE TABLE multi_id (
>    id1 INTEGER PRIMARY KEY,
>    id2 INTEGER,
>    id3 INTEGER
> );
> CREATE TABLE settings (
>    owner_id INTEGER,
>    setting_id INTEGER,
>    setting_value TEXT,
>    PRIMARY KEY (owner_id, setting_id)
> );
> CREATE UNIQUE INDEX multi_id_idx_id1 ON multi_id (id1, id2);
> CREATE UNIQUE INDEX multi_id_idx_id2 ON multi_id (id2, id1);
> CREATE INDEX settings_idx_setting_id ON settings (setting_id, setting_value);
> ---------------------------------------------
>
> We want to find all the rows from multi_id where any of the IDs (including its primary key) have a certain setting
witha certain value. 
>
> LATERAL seemed like the tool for the job, so we tried the following:
>
> ---------------------------------------------
> SELECT mid.id1
> FROM multi_id AS mid,
> LATERAL (
>      SELECT 1
>      FROM settings
>      WHERE setting_id = 1
>      AND setting_value = 'common_1'
>      AND owner_id IN (mid.id1, mid.id2, mid.id3)
> ) AS setting_matcher;
> ---------------------------------------------
>

The above is actualy a CROSS JOIN between multi_id and settings which generates duplicate values for id1 and is
probablynot what you want 

I _think_ what you are after is something like this:

   with sett as (
     SELECT owner_id
     FROM settings
     WHERE setting_id = 1
     AND setting_value = 'common_1'
   )
   select mid.id1
   from multi_id as mid
   where exists (SELECT 1
                 FROM sett
                 WHERE owner_id = mid.id1)
   or exists (SELECT 1
              FROM sett
              where owner_id = mid.id2)
   or exists (SELECT 1
              FROM sett
              where owner_id = mid.id3);


This returns the same result as your original query (when I apply a DISTINCT on it to remove the duplicate ids).
It runs in 23ms on my computer, your cross join takes roughly 4 seconds.

This is the plan from your statement: http://explain.depesz.com/s/EyjJ
This is the plan for my statement: http://explain.depesz.com/s/Dt7x



Re: Poor performance on 9.4.4 when matching multiple columns from left side of LATERAL join

От
Steven Grimm
Дата:

November 14, 2015 at 12:32 AM
The problem is that the optimizer is unable to use hash join or merge joins when you have the IN() condition as the join condition, the reason for this is that you're effectively saying to join on any of 3 conditions: settings.owner_id = mid.id1 OR settings.owner_id = mid.id2 OR settings.owner_id = mid.id3. If you think how a hash join works then 1 hash table is no good here, as you effectively have 3 possible keys for the hash table, the executor would have to build 3 tables to make that possible, but we only ever build 1 in PostgreSQL. As you may know, a hash table is a very efficient data structure for key value lookups, but there can only be a single key.  Merge join has the same problem because it's only possible to have a single sort order.

Thanks, that's the key thing I was missing. I was expecting it to see that there were exactly three conditions (as opposed to a variable number) and evaluate "build 3 hash tables" as a possible execution plan. Knowing that it doesn't do that completely explains the behavior I'm seeing.

It is puzzling that if, as suggested by someone else in the thread, I expand IN(a,b,c) to (x = a OR x = b OR x = c) it gets substantially faster, though still obviously falls afoul of the problem you describe above (~4 seconds instead of ~6 seconds). Should those two be equivalent?

-Steve

Re: Poor performance on 9.4.4 when matching multiple columns from left side of LATERAL join

От
Tom Lane
Дата:
Steven Grimm <sgrimm@thesegovia.com> writes:
> It is puzzling that if, as suggested by someone else in the thread, I
> expand IN(a,b,c) to (x = a OR x = b OR x = c) it gets substantially
> faster, though still obviously falls afoul of the problem you describe
> above (~4 seconds instead of ~6 seconds). Should those two be equivalent?

The parser actually will do that expansion for you, when the IN-list items
contain variables ... but its definition of "variable" for this purpose is
"contain_vars_of_level(expr, 0)" so the outer-level Vars you've got in
this LATERAL subquery formulation don't trigger that behavior.  I seem to
remember writing it that way intentionally, but this example makes me
think maybe excluding outer-level Vars wasn't such a hot idea.  It will
remain a ScalarArrayOpExpr even if the query later gets flattened to the
point where the Vars aren't outer-level anymore, which is probably not
what we want it to be.

            regards, tom lane