Обсуждение: full outer performance problem
Hi,
I'm having problems with the query optimizer and FULL OUTER JOIN on
PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins.
I might be naive, but I think that it should be possible?
I have two BIG tables (virtually identical) with 3 NOT NULL columns
Station_id, TimeObs, Temp_XXXX, with unique indexes on (Station_id,
TimeObs) and valid ANALYSE (set statistics=100). I want to join the two
tables with a FULL OUTER JOIN.
When I specify the query as:
SELECT station_id, timeobs,temp_grass, temp_dry_at_2m
FROM temp_dry_at_2m a
FULL OUTER JOIN temp_grass b
USING (station_id, timeobs)
WHERE station_id = 52981
AND timeobs = '2004-1-1 0:0:0'
I get the correct results
station_id | timeobs | temp_grass | temp_dry_at_2m
------------+---------------------+------------+----------------
52944 | 2004-01-01 00:10:00 | | -1.1
(1 row)
BUT LOUSY performance, and the following EXPLAIN:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Full Join (cost=1542369.83..1618958.58 rows=6956994 width=32) (actual time=187176.408..201436.264 rows=1
loops=1)
Merge Cond: (("outer".station_id = "inner".station_id) AND ("outer".timeobs = "inner".timeobs))
Filter: ((COALESCE("outer".station_id, "inner".station_id) = 52981) AND (COALESCE("outer".timeobs, "inner".timeobs)
='2004-01-01 00:00:00'::timestamp without time zone))
-> Sort (cost=1207913.44..1225305.93 rows=6956994 width=16) (actual time=145748.253..153851.607 rows=6956994
loops=1)
Sort Key: a.station_id, a.timeobs
-> Seq Scan on temp_dry_at_2m a (cost=0.00..117549.94 rows=6956994 width=16) (actual time=0.049..54226.770
rows=6956994loops=1)
-> Sort (cost=334456.38..340472.11 rows=2406292 width=16) (actual time=31668.876..34491.123 rows=2406292 loops=1)
Sort Key: b.station_id, b.timeobs
-> Seq Scan on temp_grass b (cost=0.00..40658.92 rows=2406292 width=16) (actual time=0.052..5484.489
rows=2406292loops=1)
Total runtime: 201795.989 ms
(10 rows)
If I change the query (note the "b."s)
explain analyse SELECT b.station_id, b.timeobs,temp_grass, temp_dry_at_2m
FROM temp_dry_at_2m a
FULL OUTER JOIN temp_grass b
USING (station_id, timeobs)
WHERE b.station_id = 52981
AND b.timeobs = '2004-1-1 0:0:0'
I seem to destroy the FULL OUTER JOIN and get wrong results (nothing)
If I had happend to use "a.", and not "b.", I would have gotten correct
results (by accident).
The "a." variant gives this EXPLAIN:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=0.00..11.97 rows=1 width=20) (actual time=0.060..0.067 rows=1 loops=1)
-> Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m a (cost=0.00..5.99 rows=1 width=16) (actual
time=0.033..0.036rows=1 loops=1)
Index Cond: ((station_id = 52981) AND (timeobs = '2004-01-01 00:00:00'::timestamp without time zone))
-> Index Scan using temp_grass_idx on temp_grass b (cost=0.00..5.96 rows=1 width=16) (actual time=0.018..0.021
rows=1loops=1)
Index Cond: (("outer".station_id = b.station_id) AND ("outer".timeobs = b.timeobs))
Total runtime: 0.140 ms
(6 rows)
Why will PostgreSQL not use the same plan for both these queries - they
are virtually identical??
I have tried to formulate the problem with left joins, but this demands
from me that I know which table has all the values (and thus has to go
first), and in practice no such table excists.
TIA,
Kim Bisgaard.
On Wed, Jun 08, 2005 at 11:37:40 +0200,
Kim Bisgaard <kib+pg@dmi.dk> wrote:
> Hi,
>
> I'm having problems with the query optimizer and FULL OUTER JOIN on
> PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins.
> I might be naive, but I think that it should be possible?
>
> I have two BIG tables (virtually identical) with 3 NOT NULL columns
> Station_id, TimeObs, Temp_XXXX, with unique indexes on (Station_id,
> TimeObs) and valid ANALYSE (set statistics=100). I want to join the two
> tables with a FULL OUTER JOIN.
>
> When I specify the query as:
>
> SELECT station_id, timeobs,temp_grass, temp_dry_at_2m
> FROM temp_dry_at_2m a
> FULL OUTER JOIN temp_grass b
> USING (station_id, timeobs)
> WHERE station_id = 52981
> AND timeobs = '2004-1-1 0:0:0'
>
> I get the correct results
>
> station_id | timeobs | temp_grass | temp_dry_at_2m
> ------------+---------------------+------------+----------------
> 52944 | 2004-01-01 00:10:00 | | -1.1
> (1 row)
>
> BUT LOUSY performance, and the following EXPLAIN:
>
> QUERY PLAN
>
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Merge Full Join (cost=1542369.83..1618958.58 rows=6956994 width=32)
> (actual time=187176.408..201436.264 rows=1 loops=1)
> Merge Cond: (("outer".station_id = "inner".station_id) AND
> ("outer".timeobs = "inner".timeobs))
> Filter: ((COALESCE("outer".station_id, "inner".station_id) = 52981) AND
> (COALESCE("outer".timeobs, "inner".timeobs) = '2004-01-01
> 00:00:00'::timestamp without time zone))
> -> Sort (cost=1207913.44..1225305.93 rows=6956994 width=16) (actual
> time=145748.253..153851.607 rows=6956994 loops=1)
> Sort Key: a.station_id, a.timeobs
> -> Seq Scan on temp_dry_at_2m a (cost=0.00..117549.94
> rows=6956994 width=16) (actual time=0.049..54226.770 rows=6956994
> loops=1)
> -> Sort (cost=334456.38..340472.11 rows=2406292 width=16) (actual
> time=31668.876..34491.123 rows=2406292 loops=1)
> Sort Key: b.station_id, b.timeobs
> -> Seq Scan on temp_grass b (cost=0.00..40658.92 rows=2406292
> width=16) (actual time=0.052..5484.489 rows=2406292 loops=1)
> Total runtime: 201795.989 ms
> (10 rows)
Someone else will need to comment on why Postgres can't use a more
efficient plan. What I think will work for you is to restrict
the station_id and timeobs on each side and then do a full join.
You can try something like the sample query below (which hasn't been tested):
SELECT station_id, timeobs, temp_grass, temp_dry_at_2m
FROM
(SELECT station_id, timeobs, temp_dry_at_2m
FROM temp_dry_at_2m
WHERE
station_id = 52981
AND
timeobs = '2004-1-1 0:0:0') a
FULL OUTER JOIN
(SELECT station_id, timeobs, temp_grass
FROM temp_grass
WHERE
station_id = 52981
AND
timeobs = '2004-1-1 0:0:0') b
USING (station_id, timeobs)
Hi Bruno,
Thanks for the moral support! I feel so too - but I am confident it will
show up soon.
W.r.t. your rewrite of the query, I get this "ERROR: could not devise a
query plan for the given query" but no further details - I will try google
Regards,
Kim.
Bruno Wolff III wrote:
>On Wed, Jun 08, 2005 at 11:37:40 +0200,
> Kim Bisgaard <kib+pg@dmi.dk> wrote:
>
>
>>Hi,
>>
>>I'm having problems with the query optimizer and FULL OUTER JOIN on
>>PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins.
>>I might be naive, but I think that it should be possible?
>>
>>I have two BIG tables (virtually identical) with 3 NOT NULL columns
>>Station_id, TimeObs, Temp_XXXX, with unique indexes on (Station_id,
>>TimeObs) and valid ANALYSE (set statistics=100). I want to join the two
>>tables with a FULL OUTER JOIN.
>>
>>When I specify the query as:
>>
>>SELECT station_id, timeobs,temp_grass, temp_dry_at_2m
>> FROM temp_dry_at_2m a
>> FULL OUTER JOIN temp_grass b
>> USING (station_id, timeobs)
>> WHERE station_id = 52981
>> AND timeobs = '2004-1-1 0:0:0'
>>
>>I get the correct results
>>
>>station_id | timeobs | temp_grass | temp_dry_at_2m
>>------------+---------------------+------------+----------------
>> 52944 | 2004-01-01 00:10:00 | | -1.1
>>(1 row)
>>
>>BUT LOUSY performance, and the following EXPLAIN:
>>
>> QUERY PLAN
>>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>Merge Full Join (cost=1542369.83..1618958.58 rows=6956994 width=32)
>>(actual time=187176.408..201436.264 rows=1 loops=1)
>> Merge Cond: (("outer".station_id = "inner".station_id) AND
>> ("outer".timeobs = "inner".timeobs))
>> Filter: ((COALESCE("outer".station_id, "inner".station_id) = 52981) AND
>> (COALESCE("outer".timeobs, "inner".timeobs) = '2004-01-01
>> 00:00:00'::timestamp without time zone))
>> -> Sort (cost=1207913.44..1225305.93 rows=6956994 width=16) (actual
>> time=145748.253..153851.607 rows=6956994 loops=1)
>> Sort Key: a.station_id, a.timeobs
>> -> Seq Scan on temp_dry_at_2m a (cost=0.00..117549.94
>> rows=6956994 width=16) (actual time=0.049..54226.770 rows=6956994
>> loops=1)
>> -> Sort (cost=334456.38..340472.11 rows=2406292 width=16) (actual
>> time=31668.876..34491.123 rows=2406292 loops=1)
>> Sort Key: b.station_id, b.timeobs
>> -> Seq Scan on temp_grass b (cost=0.00..40658.92 rows=2406292
>> width=16) (actual time=0.052..5484.489 rows=2406292 loops=1)
>>Total runtime: 201795.989 ms
>>(10 rows)
>>
>>
>
>Someone else will need to comment on why Postgres can't use a more
>efficient plan. What I think will work for you is to restrict
>the station_id and timeobs on each side and then do a full join.
>You can try something like the sample query below (which hasn't been tested):
>SELECT station_id, timeobs, temp_grass, temp_dry_at_2m
> FROM
> (SELECT station_id, timeobs, temp_dry_at_2m
> FROM temp_dry_at_2m
> WHERE
> station_id = 52981
> AND
> timeobs = '2004-1-1 0:0:0') a
> FULL OUTER JOIN
> (SELECT station_id, timeobs, temp_grass
> FROM temp_grass
> WHERE
> station_id = 52981
> AND
> timeobs = '2004-1-1 0:0:0') b
> USING (station_id, timeobs)
>
>---------------------------(end of broadcast)---------------------------
>TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly
>
>
>
Kim Bisgaard <kib+pg@dmi.dk> writes:
> SELECT station_id, timeobs,temp_grass, temp_dry_at_2m
> FROM temp_dry_at_2m a
> FULL OUTER JOIN temp_grass b
> USING (station_id, timeobs)
> WHERE station_id = 52981
> AND timeobs = '2004-1-1 0:0:0'
> explain analyse SELECT b.station_id, b.timeobs,temp_grass, temp_dry_at_2m
> FROM temp_dry_at_2m a
> FULL OUTER JOIN temp_grass b
> USING (station_id, timeobs)
> WHERE b.station_id = 52981
> AND b.timeobs = '2004-1-1 0:0:0'
> Why will PostgreSQL not use the same plan for both these queries - they
> are virtually identical??
Because they're semantically completely different. The second query is
effectively a RIGHT JOIN, because join rows in which b is all-null will
be thrown away by the WHERE. The optimizer sees this (note your second
plan doesn't use a Full Join step anywhere) and is able to produce a
much better plan. Full outer join is difficult to optimize, in part
because we have no choice but to use a merge join for it --- the other
join types don't support full join.
regards, tom lane
Kim Bisgaard <kib+pg@dmi.dk> writes:
> W.r.t. your rewrite of the query, I get this "ERROR: could not devise a
> query plan for the given query" but no further details - I will try google
Which PG version are you using again? That should be fixed in 7.4.3
and later.
regards, tom lane
Quoting Tom Lane <tgl@sss.pgh.pa.us>: > Kim Bisgaard <kib+pg@dmi.dk> writes: > > SELECT station_id, timeobs,temp_grass, temp_dry_at_2m > > FROM temp_dry_at_2m a > > FULL OUTER JOIN temp_grass b > > USING (station_id, timeobs) > > WHERE station_id = 52981 > > AND timeobs = '2004-1-1 0:0:0' > > > explain analyse SELECT b.station_id, b.timeobs,temp_grass, temp_dry_at_2m > > FROM temp_dry_at_2m a > > FULL OUTER JOIN temp_grass b > > USING (station_id, timeobs) > > WHERE b.station_id = 52981 > > AND b.timeobs = '2004-1-1 0:0:0' > > > Why will PostgreSQL not use the same plan for both these queries - they > > are virtually identical?? > > Because they're semantically completely different. The second query is > effectively a RIGHT JOIN, because join rows in which b is all-null will > be thrown away by the WHERE. The optimizer sees this (note your second > plan doesn't use a Full Join step anywhere) and is able to produce a > much better plan. Full outer join is difficult to optimize, in part > because we have no choice but to use a merge join for it --- the other > join types don't support full join. > > regards, tom lane > Yes I am aware that they are not "identical", they also give different results, but the data nessesary to compute the results is (0-2 rows, 0-1 row from each table), and thus ideally have the potential to have similar performance - to my head anyway, but I may not have grasped the complete picture yet :-) Regards, Kim.
Quoting Tom Lane <tgl@sss.pgh.pa.us>: > Kim Bisgaard <kib+pg@dmi.dk> writes: > > W.r.t. your rewrite of the query, I get this "ERROR: could not devise a > > query plan for the given query" but no further details - I will try google > > Which PG version are you using again? That should be fixed in 7.4.3 > and later. > > regards, tom lane > Its 7.4.1. I am in the process (may take a while yet) of installing 8.0.3 on the same hardware in order to have a parallel system. Time is a finite meassure :-) I must admit I would rather have the first query perform, that have this workaround function ;-) Regards, Kim.