Optimising "full outer join where" for muti-row to multi-column view

Поиск
Список
Период
Сортировка
От Phil Endecott
Тема Optimising "full outer join where" for muti-row to multi-column view
Дата
Msg-id 1167337796666@dmwebmail.belize.chezphil.org
обсуждение исходный текст
Ответы Re: Optimising "full outer join where" for muti-row to multi-column view  (Martijn van Oosterhout <kleptog@svana.org>)
Список pgsql-general
Dear PostgreSQL experts,

I have a database that records the EXIF data for a collection of
photos.  (EXIF is a method for embedding arbitary name-value data
in a JPEG, and digital cameras typically use it to record things
like exposure information.)  My exif table looks something like
this:

 photo_id |           tag            |          value
----------+--------------------------+--------------------------
     1001 | DateTime                 | 2006:10:26 11:19:29
     1001 | Orientation              | top - left
     1001 | PixelXDimension          | 3888
     1001 | PixelYDimension          | 2592
     1002 | DateTimeOriginal         | 2006:10:26 13:34:06
     1002 | Orientation              | left - bottom
     1002 | PixelXDimension          | 3888
     1002 | PixelYDimension          | 2592

photo_id and tag together form the primary key.  The data also
includes many tags that I'm not currently interested in.  From
this I create a view containing only the tags of interest:

 photo_id |  orientation  |      datetime       | xsize | ysize
----------+---------------+---------------------+-------+-------
     1001 | top - left    | 2006:10:30 11:19:29 | 3888  | 2592
     1002 | left - bottom | 2006:10:26 13:34:06 | 3888  | 2592

My first attempt did this in the obvious way by joining the
exif table with itself once per column:

create view photo_info_v as
select photo_id,
       e1.value as orientation,
       e2.value as datetime,
       e3.value as xsize,
       e4.value as ysize
  from exif e1
  join exif e2 using(photo_id)
  join exif e3 using(photo_id)
  join exif e4 using(photo_id)
  where e1.tag='Orientation'
    and e2.tag='DateTime'
    and e3.tag='PixelXDimension'
    and e4.tag='PixelYDimension';

This works well for one important query, where I find one photo's
information from the view:

photos=> explain analyse select * from photo_info_v where photo_id=1201;
                                                              QUERY PLAN
              

--------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..12.09 rows=1 width=60) (actual time=1.198..1.290 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..9.07 rows=1 width=46) (actual time=0.953..1.009 rows=1 loops=1)
         ->  Nested Loop  (cost=0.00..6.04 rows=1 width=32) (actual time=0.693..0.731 rows=1 loops=1)
               ->  Index Scan using exif_pkey on exif e1  (cost=0.00..3.02 rows=1 width=18) (actual time=0.384..0.394
rows=1loops=1) 
                     Index Cond: ((photo_id = 1201) AND (tag = 'Orientation'::text))
               ->  Index Scan using exif_pkey on exif e4  (cost=0.00..3.02 rows=1 width=18) (actual time=0.189..0.205
rows=1loops=1) 
                     Index Cond: ((1201 = photo_id) AND (tag = 'PixelYDimension'::text))
         ->  Index Scan using exif_pkey on exif e3  (cost=0.00..3.02 rows=1 width=18) (actual time=0.186..0.194 rows=1
loops=1)
               Index Cond: ((1201 = photo_id) AND (tag = 'PixelXDimension'::text))
   ->  Index Scan using exif_pkey on exif e2  (cost=0.00..3.02 rows=1 width=18) (actual time=0.171..0.195 rows=1
loops=1)
         Index Cond: ((1201 = photo_id) AND (tag = 'DateTime'::text))
 Total runtime: 3.064 ms

However, I might just want one column from the view:

photos=> explain analyse select orientation from photo_info_v where photo_id=1201;
                                                              QUERY PLAN
              

--------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..12.09 rows=1 width=14) (actual time=1.266..1.353 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..9.07 rows=1 width=18) (actual time=1.009..1.062 rows=1 loops=1)
         ->  Nested Loop  (cost=0.00..6.04 rows=1 width=18) (actual time=0.752..0.787 rows=1 loops=1)
               ->  Index Scan using exif_pkey on exif e1  (cost=0.00..3.02 rows=1 width=18) (actual time=0.410..0.418
rows=1loops=1) 
                     Index Cond: ((photo_id = 1201) AND (tag = 'Orientation'::text))
               ->  Index Scan using exif_pkey on exif e4  (cost=0.00..3.02 rows=1 width=4) (actual time=0.183..0.199
rows=1loops=1) 
                     Index Cond: ((1201 = photo_id) AND (tag = 'PixelYDimension'::text))
         ->  Index Scan using exif_pkey on exif e3  (cost=0.00..3.02 rows=1 width=4) (actual time=0.168..0.176 rows=1
loops=1)
               Index Cond: ((1201 = photo_id) AND (tag = 'PixelXDimension'::text))
   ->  Index Scan using exif_pkey on exif e2  (cost=0.00..3.02 rows=1 width=4) (actual time=0.168..0.191 rows=1
loops=1)
         Index Cond: ((1201 = photo_id) AND (tag = 'DateTime'::text))
 Total runtime: 3.123 ms

I only wanted the orientation information, which could be found
using this faster query:

photos=> explain analyse select value from exif where photo_id=1201 and tag='Orientation';
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Index Scan using exif_pkey on exif  (cost=0.00..3.02 rows=1 width=14) (actual time=0.504..0.536 rows=1 loops=1)
   Index Cond: ((photo_id = 1201) AND (tag = 'Orientation'::text))
 Total runtime: 1.298 ms

Instead it builds up all four columns of the view and discards three
of them.  It has to do this, because if data for any one of the four
columns did not exist then the whole row would not exist in the view.
I know that all of the data is always present, but PostgreSQL doesn't
know that, and I don't think there is any way that I can tell it that
it does.

To try and avoid this, I tried using a full outer join in the view
definition.  In this case the row would always be present in the view,
even if the data for the other columns were not present.  I hoped that
the query would then be optimised to look up only the orientation
information:

create view photo_info_v as
select photo_id,
       e1.value as orientation,
       e2.value as datetime,
       e3.value as xsize,
       e4.value as ysize
  from exif e1
  full outer join exif e2 using(photo_id)
  full outer join exif e3 using(photo_id)
  full outer join exif e4 using(photo_id)
  where e1.tag='Orientation'
    and e2.tag='DateTime'
    and e3.tag='PixelXDimension'
    and e4.tag='PixelYDimension';

photos=> explain analyse select orientation from photo_info_v where photo_id=1201;
                                                                         QUERY PLAN
                                    

------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=2431.73..3198.49 rows=7 width=14) (actual time=2256.194..2345.563 rows=1 loops=1)
   Merge Cond: ("outer".photo_id = "inner"."?column5?")
   Join Filter: (COALESCE("inner"."?column5?", "outer".photo_id) = 1201)
   ->  Index Scan using exif_pkey on exif e4  (cost=0.00..740.62 rows=1229 width=4) (actual time=1.937..462.014
rows=1229loops=1) 
         Index Cond: (tag = 'PixelYDimension'::text)
   ->  Sort  (cost=2431.73..2434.99 rows=1302 width=26) (actual time=1766.523..1802.629 rows=1229 loops=1)
         Sort Key: COALESCE(COALESCE(e1.photo_id, e2.photo_id), e3.photo_id)
         ->  Merge Join  (cost=1601.15..2364.38 rows=1302 width=26) (actual time=931.551..1604.342 rows=1229 loops=1)
               Merge Cond: ("outer".photo_id = "inner"."?column4?")
               ->  Index Scan using exif_pkey on exif e3  (cost=0.00..740.62 rows=1229 width=4) (actual
time=0.240..519.953rows=1229 loops=1) 
                     Index Cond: (tag = 'PixelXDimension'::text)
               ->  Sort  (cost=1601.15..1604.37 rows=1286 width=22) (actual time=931.092..971.617 rows=1229 loops=1)
                     Sort Key: COALESCE(e1.photo_id, e2.photo_id)
                     ->  Merge Join  (cost=0.00..1534.74 rows=1286 width=22) (actual time=0.467..797.332 rows=1229
loops=1)
                           Merge Cond: ("outer".photo_id = "inner".photo_id)
                           ->  Index Scan using exif_pkey on exif e1  (cost=0.00..740.62 rows=1229 width=18) (actual
time=0.193..271.369rows=1229 loops=1) 
                                 Index Cond: (tag = 'Orientation'::text)
                           ->  Index Scan using exif_pkey on exif e2  (cost=0.00..740.62 rows=1229 width=4) (actual
time=0.118..395.892rows=1229 loops=1) 
                                 Index Cond: (tag = 'DateTime'::text)
 Total runtime: 2350.601 ms

Ooops!  It has got worse not better.  The runtime has increased by
three orders of magnitude, because it seems to compute most of the
view before filtering.  It boils down to this query:

photos=> select e1.value as orientation
photos->   from exif e1
photos->   full outer join exif e2 using(photo_id)
photos->   full outer join exif e3 using(photo_id)
photos->   full outer join exif e4 using(photo_id)
photos->   where e1.tag='Orientation'
photos->     and e2.tag='DateTime'
photos->     and e3.tag='PixelXDimension'
photos->     and e4.tag='PixelYDimension'
photos->     and photo_id=1201;

I would like it to optimise away the unused joins, leaving:

photos=> explain analyse select e1.value as orientation
photos->   from exif e1
photos->   where e1.tag='Orientation'
photos->   and photo_id=1201;

Is there a good reason why it cannot do this?  Remeber that photo_id and
tag are the primary key for exif, so the view cannot have more than one
row per photo_id.


This is not the first problem I have had with non-trivial views, where
PostgreSQL has failed to simplify a query on the view as I had hoped it
would, either because the semantics of SQL mean that it is unable to
(with no way of describing the additional constraints that apply to
that data and could make the simplification possible), or because the
query optimiser doesn't detect a possible optimisation.  Back in 7.4
days I resorted to materialised views updated using triggers, which
involved enormously more work than a normal view.  Has anything been
done, or is anything planned, to make this easier?

I imagine that the basic problem here, projecting data from multiple
rows into a single row with multiple columns, is a fairly common one.
Are there any tricks to doing this that anyone can share?

I'm using PostgreSQL 8.1.

Many thanks,

Phil.

(You are welcome to CC: me in any replies.)




В списке pgsql-general по дате отправления:

Предыдущее
От: Dave Page
Дата:
Сообщение: Re: Backup Restore
Следующее
От: Martijn van Oosterhout
Дата:
Сообщение: Re: Optimising "full outer join where" for muti-row to multi-column view