Обсуждение: Need help optimizing this query

Поиск
Список
Период
Сортировка

Need help optimizing this query

От
"Pat Maddox"
Дата:
I've got a query that's taking forever (as will be obvious when you
see it and the explain output).  I can't figure out what indexes to
add to make this run faster.  I'd appreciate any help.

Pat



SELECT
 SUM(CASE WHEN (hit IS TRUE AND logged_in IS TRUE) THEN 1 ELSE 0 END)
AS count_hits_console,
 SUM(CASE WHEN (hit IS TRUE AND logged_in IS FALSE) THEN 1 ELSE 0 END)
AS count_hits_remote,
 SUM(CASE WHEN (played IS TRUE AND logged_in IS TRUE) THEN 1 ELSE 0
END) AS count_played_console,
 SUM(CASE WHEN (played IS TRUE AND logged_in IS FALSE) THEN 1 ELSE 0
END) AS count_played_remote,
 SUM(CASE WHEN (downloaded IS TRUE AND logged_in IS TRUE) THEN 1 ELSE
0 END) AS count_downloaded_console,
 SUM(CASE WHEN (downloaded IS TRUE AND logged_in IS FALSE) THEN 1 ELSE
0 END) AS count_downloaded_remote,
 SUM(CASE WHEN ((played IS TRUE OR downloaded IS TRUE) AND logged_in
IS TRUE) THEN assets.size ELSE 0 END) as download_size_console,
 SUM(CASE WHEN ((played IS TRUE OR downloaded IS TRUE) AND logged_in
IS FALSE) THEN assets.size ELSE 0 END) as download_size_remote,
videos.id, videos.title, videos.guid FROM video_views, assets, videos
WHERE videos.company_id=1 AND video_views.video_id=videos.id AND
video_views.asset_id=assets.id GROUP BY videos.id, videos.title,
videos.guid ORDER BY count_hits_remote DESC LIMIT 100




Limit  (cost=139735.51..139735.68 rows=69 width=64)
   ->  Sort  (cost=139735.51..139735.68 rows=69 width=64)
         Sort Key: sum(CASE WHEN ((video_views.hit IS TRUE) AND
(video_views.logged_in IS FALSE)) THEN 1 ELSE 0 END)
         ->  HashAggregate  (cost=139731.33..139733.40 rows=69 width=64)
               ->  Hash Join  (cost=1164.79..138880.04 rows=30956 width=64)
                     Hash Cond: (video_views.asset_id = assets.id)
                     ->  Hash Join  (cost=324.39..137343.13 rows=30956 width=60)
                           Hash Cond: (video_views.video_id = videos.id)
                           ->  Seq Scan on video_views
(cost=0.00..114500.13 rows=5922413 width=12)
                           ->  Hash  (cost=323.52..323.52 rows=69 width=52)
                                 ->  Bitmap Heap Scan on videos
(cost=64.90..323.52 rows=69 width=52)
                                       Recheck Cond: (company_id = 1)
                                       ->  Bitmap Index Scan on
complete_videos_without_deleted_at  (cost=0.00..64.88 rows=69 width=0)
                                             Index Cond: (company_id = 1)
                     ->  Hash  (cost=645.18..645.18 rows=15618 width=12)
                           ->  Seq Scan on assets  (cost=0.00..645.18
rows=15618 width=12)

Re: Need help optimizing this query

От
"Pat Maddox"
Дата:
On 7/18/07, Pat Maddox <pergesu@gmail.com> wrote:
> I've got a query that's taking forever (as will be obvious when you
> see it and the explain output).  I can't figure out what indexes to
> add to make this run faster.  I'd appreciate any help.
>
> Pat
>
>
>
> SELECT
>  SUM(CASE WHEN (hit IS TRUE AND logged_in IS TRUE) THEN 1 ELSE 0 END)
> AS count_hits_console,
>  SUM(CASE WHEN (hit IS TRUE AND logged_in IS FALSE) THEN 1 ELSE 0 END)
> AS count_hits_remote,
>  SUM(CASE WHEN (played IS TRUE AND logged_in IS TRUE) THEN 1 ELSE 0
> END) AS count_played_console,
>  SUM(CASE WHEN (played IS TRUE AND logged_in IS FALSE) THEN 1 ELSE 0
> END) AS count_played_remote,
>  SUM(CASE WHEN (downloaded IS TRUE AND logged_in IS TRUE) THEN 1 ELSE
> 0 END) AS count_downloaded_console,
>  SUM(CASE WHEN (downloaded IS TRUE AND logged_in IS FALSE) THEN 1 ELSE
> 0 END) AS count_downloaded_remote,
>  SUM(CASE WHEN ((played IS TRUE OR downloaded IS TRUE) AND logged_in
> IS TRUE) THEN assets.size ELSE 0 END) as download_size_console,
>  SUM(CASE WHEN ((played IS TRUE OR downloaded IS TRUE) AND logged_in
> IS FALSE) THEN assets.size ELSE 0 END) as download_size_remote,
> videos.id, videos.title, videos.guid FROM video_views, assets, videos
> WHERE videos.company_id=1 AND video_views.video_id=videos.id AND
> video_views.asset_id=assets.id GROUP BY videos.id, videos.title,
> videos.guid ORDER BY count_hits_remote DESC LIMIT 100
>
>
>
>
> Limit  (cost=139735.51..139735.68 rows=69 width=64)
>    ->  Sort  (cost=139735.51..139735.68 rows=69 width=64)
>          Sort Key: sum(CASE WHEN ((video_views.hit IS TRUE) AND
> (video_views.logged_in IS FALSE)) THEN 1 ELSE 0 END)
>          ->  HashAggregate  (cost=139731.33..139733.40 rows=69 width=64)
>                ->  Hash Join  (cost=1164.79..138880.04 rows=30956 width=64)
>                      Hash Cond: (video_views.asset_id = assets.id)
>                      ->  Hash Join  (cost=324.39..137343.13 rows=30956 width=60)
>                            Hash Cond: (video_views.video_id = videos.id)
>                            ->  Seq Scan on video_views
> (cost=0.00..114500.13 rows=5922413 width=12)
>                            ->  Hash  (cost=323.52..323.52 rows=69 width=52)
>                                  ->  Bitmap Heap Scan on videos
> (cost=64.90..323.52 rows=69 width=52)
>                                        Recheck Cond: (company_id = 1)
>                                        ->  Bitmap Index Scan on
> complete_videos_without_deleted_at  (cost=0.00..64.88 rows=69 width=0)
>                                              Index Cond: (company_id = 1)
>                      ->  Hash  (cost=645.18..645.18 rows=15618 width=12)
>                            ->  Seq Scan on assets  (cost=0.00..645.18
> rows=15618 width=12)
>


Here are the indexes I already have on the table:

    "video_views_pkey" PRIMARY KEY, btree (id)
    "index_video_views_on_asset_id" btree (asset_id)
    "index_video_views_on_video_id" btree (video_id)
    "index_video_views_on_video_id_and_asset_id_and_created_at" btree
(video_id, created_at, asset_id)

Re: Need help optimizing this query

От
Michael Glaesemann
Дата:
On 7/18/07, Pat Maddox <pergesu@gmail.com> wrote:
> I've got a query that's taking forever (as will be obvious when you
> see it and the explain output).  I can't figure out what indexes to
> add to make this run faster.  I'd appreciate any help.
>

I'm curious why it's choosing to use hash joins rather than taking
advantage of the indexes you have on the foreign key columns. What
are the table definitions? Are hit, logged_in, played, downloaded all
columns of videos_views?

Have you ANALYZEd these tables? What is the EXPLAIN ANALYZE output
for this query? You only provided the EXPLAIN output, which doesn't
compare the plan with the actual query cost.

I found it quite difficult to read you query. I reformatted it and
also used some SQL functions to abstract away the CASE expressions.
These SQL functions will probably be inlined so there should be very
little overhead. If you have a boolean column, you don't need to test
IS TRUE or IS FALSE: you can just use the value itself. I also find
it helpful to separate the join conditions (in the JOIN clause) from
the restrictions (the WHERE clause), which I've done below.

CREATE OR REPLACE FUNCTION ternary(BOOLEAN, INTEGER, INTEGER)
RETURNS INTEGER
LANGUAGE SQL AS $_$SELECT CASE WHEN $1 THEN $2 ELSE $3 END$_$;

CREATE OR REPLACE FUNCTION value_when(BOOLEAN, INTEGER)
RETURNS INTEGER
LANGUAGE SQL AS $_$SELECT ternary($1,$2,0)$_$;

CREATE OR REPLACE FUNCTION one_when(BOOLEAN)
RETURNS INTEGER
LANGUAGE SQL as $_$SELECT value_when($1,1)$_$;

SELECT
     SUM (one_when(hit AND logged_in)) AS count_hits_console
     , SUM (one_when(hit AND NOT logged_in)) AS count_hits_remote
     , SUM (one_when(played AND logged_in)) AS count_played_console
     , SUM (one_when(played AND NOT logged_in)) AS count_played_remote
     , SUM (one_when(downloaded AND logged_in)) AS
count_downloaded_console
     , SUM (one_when(downloaded AND NOT logged_in)) AS
count_downloaded_remote
     , SUM (value_when((played OR downloaded) AND logged_in,
assets.size))
         as download_size_console
     , SUM (value_when((played OR downloaded) AND NOT logged_in),
assets.size)
         as download_size_remote
     , videos.id
     , videos.title
     , videos.guid
FROM video_views
JOIN assets ON (video_views.video_id=videos.id)
JOIN videos ON (video_views.asset_id=assets.id)
WHERE videos.company_id=1
GROUP BY videos.id
     , videos.title
     , videos.guid
ORDER BY count_hits_remote
DESC LIMIT 100



Michael Glaesemann
grzm seespotcode net



Re: Need help optimizing this query

От
"Pat Maddox"
Дата:
On 7/18/07, Michael Glaesemann <grzm@seespotcode.net> wrote:
>
> On 7/18/07, Pat Maddox <pergesu@gmail.com> wrote:
> > I've got a query that's taking forever (as will be obvious when you
> > see it and the explain output).  I can't figure out what indexes to
> > add to make this run faster.  I'd appreciate any help.
> >
>
> I'm curious why it's choosing to use hash joins rather than taking
> advantage of the indexes you have on the foreign key columns. What
> are the table definitions? Are hit, logged_in, played, downloaded all
> columns of videos_views?
>
> Have you ANALYZEd these tables? What is the EXPLAIN ANALYZE output
> for this query? You only provided the EXPLAIN output, which doesn't
> compare the plan with the actual query cost.
>
> I found it quite difficult to read you query. I reformatted it and
> also used some SQL functions to abstract away the CASE expressions.
> These SQL functions will probably be inlined so there should be very
> little overhead. If you have a boolean column, you don't need to test
> IS TRUE or IS FALSE: you can just use the value itself. I also find
> it helpful to separate the join conditions (in the JOIN clause) from
> the restrictions (the WHERE clause), which I've done below.
>
> CREATE OR REPLACE FUNCTION ternary(BOOLEAN, INTEGER, INTEGER)
> RETURNS INTEGER
> LANGUAGE SQL AS $_$SELECT CASE WHEN $1 THEN $2 ELSE $3 END$_$;
>
> CREATE OR REPLACE FUNCTION value_when(BOOLEAN, INTEGER)
> RETURNS INTEGER
> LANGUAGE SQL AS $_$SELECT ternary($1,$2,0)$_$;
>
> CREATE OR REPLACE FUNCTION one_when(BOOLEAN)
> RETURNS INTEGER
> LANGUAGE SQL as $_$SELECT value_when($1,1)$_$;
>
> SELECT
>      SUM (one_when(hit AND logged_in)) AS count_hits_console
>      , SUM (one_when(hit AND NOT logged_in)) AS count_hits_remote
>      , SUM (one_when(played AND logged_in)) AS count_played_console
>      , SUM (one_when(played AND NOT logged_in)) AS count_played_remote
>      , SUM (one_when(downloaded AND logged_in)) AS
> count_downloaded_console
>      , SUM (one_when(downloaded AND NOT logged_in)) AS
> count_downloaded_remote
>      , SUM (value_when((played OR downloaded) AND logged_in,
> assets.size))
>          as download_size_console
>      , SUM (value_when((played OR downloaded) AND NOT logged_in),
> assets.size)
>          as download_size_remote
>      , videos.id
>      , videos.title
>      , videos.guid
> FROM video_views
> JOIN assets ON (video_views.video_id=videos.id)
> JOIN videos ON (video_views.asset_id=assets.id)
> WHERE videos.company_id=1
> GROUP BY videos.id
>      , videos.title
>      , videos.guid
> ORDER BY count_hits_remote
> DESC LIMIT 100
>
>
>
> Michael Glaesemann
> grzm seespotcode net
>
>
>

Michael,

I tried your SQL but it didn't work - it was missing the videos table
in the FROM clause.  But when I add it, I get the error:

ERROR:  invalid reference to FROM-clause entry for table "video_views"
LINE 20: JOIN assets ON (video_views.video_id=videos.id)
                         ^
HINT:  There is an entry for table "video_views", but it cannot be
referenced from this part of the query.


Not really sure what that means.

Here are the table definitions:

twistage_development=# \d video_views
                                     Table "public.video_views"
   Column   |            Type             |
Modifiers
------------+-----------------------------+----------------------------------------------------------
 id         | integer                     | not null default
nextval('video_views_id_seq'::regclass)
 video_id   | integer                     |
 created_at | timestamp without time zone |
 asset_id   | integer                     |
 played     | boolean                     | default false
 downloaded | boolean                     | default false
 hit        | boolean                     | default false
 logged_in  | boolean                     | default false
Indexes:
    "video_views_pkey" PRIMARY KEY, btree (id)
    "index_video_views_on_asset_id" btree (asset_id)
    "index_video_views_on_video_id" btree (video_id)
    "index_video_views_on_video_id_and_asset_id_and_created_at" btree
(video_id, created_at, asset_id)

twistage_development=# \d videos
                                           Table "public.videos"
        Column         |            Type             |                      Modi
fiers
-----------------------+-----------------------------+--------------------------
---------------------------
 id                    | integer                     | not null default nextval(
'videos_id_seq'::regclass)
 title                 | character varying(255)      |
 duration              | double precision            |
 description           | text                        |
 status                | character varying(255)      |
 user_id               | integer                     |
 created_at            | timestamp without time zone |
 upload_finished       | boolean                     | default false
 publisher_name        | character varying(255)      |
 company_id            | integer                     |
 available_for_display | boolean                     | default true
 guid                  | character varying(255)      |
 main_asset_id         | integer                     |
 container_type        | character varying(255)      |
 codec                 | character varying(255)      |
 site_id               | integer                     |
 deleted_at            | timestamp without time zone |
 purged_at             | timestamp without time zone |
 remote_hits_count     | integer                     | default 0
Indexes:
    "videos_pkey" PRIMARY KEY, btree (id)
    "complete_videos_without_deleted_at" btree (company_id, status)
    "index_complete_videos" btree (deleted_at, purged_at, status, created_at, co
mpany_id)
    "index_videos_on_company_id" btree (company_id)
    "index_videos_on_deleted_at_and_status_and_site_id" btree (status, deleted_a
t, site_id)
    "index_videos_on_guid" btree (guid)
    "index_videos_on_publisher_name" btree (publisher_name)
    "index_videos_on_site_id" btree (site_id)
    "index_videos_on_user_id" btree (user_id)

twistage_development=# \d assets
                                        Table "public.assets"
     Column      |            Type             |
Modifiers
-----------------+-----------------------------+-----------------------------------------------------
 id              | integer                     | not null default
nextval('assets_id_seq'::regclass)
 video_id        | integer                     |
 video_format_id | integer                     |
 guid            | character varying(255)      |
 source_path     | character varying(255)      |
 size            | bigint                      |
 vresolution     | integer                     |
 hresolution     | integer                     |
 video_bitrate   | integer                     |
 frame_rate      | integer                     |
 container       | character varying(255)      |
 vcodec          | character varying(255)      |
 status          | character varying(255)      |
 deleted_at      | timestamp without time zone |
 audio_bitrate   | integer                     |
 acodec          | character varying(255)      |
 duration        | double precision            |
Indexes:
    "assets_pkey" PRIMARY KEY, btree (id)
    "index_assets_on_video_format_id" btree (video_format_id)
    "index_assets_on_video_id" btree (video_id)

Re: Need help optimizing this query

От
Michael Glaesemann
Дата:
On Jul 18, 2007, at 16:01 , Michael Glaesemann wrote:

> CREATE OR REPLACE FUNCTION one_when(BOOLEAN)
> RETURNS INTEGER
> LANGUAGE SQL as $_$SELECT value_when($1,1)$_$;

I forgot to add that you can cast booleans to integers, so one_when
(expr) is equivalent to expr::int:

# SELECT (true and false)::int, (true or false)::int;
int4 | int4
------+------
     0 |    1
(1 row)

Of course, this rewriting shouldn't affect the performance at all: it
should just make it easier for you to read, which does have some value.

Michael Glaesemann
grzm seespotcode net



Re: Need help optimizing this query

От
Michael Glaesemann
Дата:
On Jul 18, 2007, at 16:12 , Pat Maddox wrote:

> ERROR:  invalid reference to FROM-clause entry for table "video_views"
> LINE 20: JOIN assets ON (video_views.video_id=videos.id)
>                         ^
> HINT:  There is an entry for table "video_views", but it cannot be
> referenced from this part of the query.

It's because I mismatched the JOIN clauses during my copy-and-paste :(

> On 7/18/07, Michael Glaesemann <grzm@seespotcode.net> wrote:

>> FROM video_views
>> JOIN assets ON (video_views.video_id=videos.id)
>> JOIN videos ON (video_views.asset_id=assets.id)

This should be

FROM video_views
JOIN assets ON (video_views.asset_id=assets.id)
JOIN videos ON (video_views.video_id=videos.id)

Do you have the EXPLAIN ANALYE output of the query?

Michael Glaesemann
grzm seespotcode net



Re: Need help optimizing this query

От
"Pat Maddox"
Дата:
On 7/18/07, Michael Glaesemann <grzm@seespotcode.net> wrote:
>
> On Jul 18, 2007, at 16:12 , Pat Maddox wrote:
>
> > ERROR:  invalid reference to FROM-clause entry for table "video_views"
> > LINE 20: JOIN assets ON (video_views.video_id=videos.id)
> >                         ^
> > HINT:  There is an entry for table "video_views", but it cannot be
> > referenced from this part of the query.
>
> It's because I mismatched the JOIN clauses during my copy-and-paste :(
>
> > On 7/18/07, Michael Glaesemann <grzm@seespotcode.net> wrote:
>
> >> FROM video_views
> >> JOIN assets ON (video_views.video_id=videos.id)
> >> JOIN videos ON (video_views.asset_id=assets.id)
>
> This should be
>
> FROM video_views
> JOIN assets ON (video_views.asset_id=assets.id)
> JOIN videos ON (video_views.video_id=videos.id)
>
> Do you have the EXPLAIN ANALYE output of the query?
>
> Michael Glaesemann
> grzm seespotcode net
>
>
>

For some reason the functions you wrote are giving me trouble (there's
a BIGINT involved, I tried changing the functions around but kept
having issues).  So here's the full query, hopefully formatted better:

SELECT
    SUM(CASE WHEN (hit IS TRUE AND logged_in IS TRUE) THEN 1 ELSE 0
END) AS count_hits_console,
    SUM(CASE WHEN (hit IS TRUE AND logged_in IS FALSE) THEN 1 ELSE 0
END) AS count_hits_remote,
    SUM(CASE WHEN (played IS TRUE AND logged_in IS TRUE) THEN 1 ELSE 0
END) AS count_played_console,
    SUM(CASE WHEN (played IS TRUE AND logged_in IS FALSE) THEN 1 ELSE
0 END) AS count_played_remote,
    SUM(CASE WHEN (downloaded IS TRUE AND logged_in IS TRUE) THEN 1
ELSE 0 END) AS count_downloaded_console,
    SUM(CASE WHEN (downloaded IS TRUE AND logged_in IS FALSE) THEN 1
ELSE 0 END) AS count_downloaded_remote,
    SUM(CASE WHEN ((played IS TRUE OR downloaded IS TRUE) AND
logged_in IS TRUE) THEN assets.size ELSE 0 END) as
download_size_console,
    SUM(CASE WHEN ((played IS TRUE OR downloaded IS TRUE) AND
logged_in IS FALSE) THEN assets.size ELSE 0 END) AS
download_size_remote,
    videos.id,
    videos.title,
    videos.guid
FROM video_views
JOIN assets ON (video_views.asset_id=assets.id)
JOIN videos on (video_views.video_id=videos.id)
WHERE videos.company_id=1
GROUP BY  videos.id,
          videos.title,
          videos.guid
ORDER BY count_hits_remote DESC
LIMIT 100



and here's the EXPLAIN ANALYZE output:

 Limit  (cost=127072.90..127073.12 rows=87 width=64) (actual
time=2636.560..2636.567 rows=20 loops=1)
   ->  Sort  (cost=127072.90..127073.12 rows=87 width=64) (actual
time=2636.558..2636.562 rows=20 loops=1)
         Sort Key: sum(CASE WHEN ((video_views.hit IS TRUE) AND
(video_views.logged_in IS FALSE)) THEN 1 ELSE 0 END)
         ->  HashAggregate  (cost=127067.49..127070.10 rows=87
width=64) (actual time=2636.481..2636.506 rows=20 loops=1)
               ->  Hash Join  (cost=880.96..125995.46 rows=38983
width=64) (actual time=24.904..2635.719 rows=122 loops=1)
                     Hash Cond: (video_views.asset_id = assets.id)
                     ->  Hash Join  (cost=195.96..124433.01 rows=39009
width=60) (actual time=8.327..2618.982 rows=122 loops=1)
                           Hash Cond: (video_views.video_id = videos.id)
                           ->  Seq Scan on video_views
(cost=0.00..101352.70 rows=5998470 width=12) (actual
time=0.031..1410.231 rows=5998341 loops=1)
                           ->  Hash  (cost=194.87..194.87 rows=87
width=52) (actual time=1.001..1.001 rows=90 loops=1)
                                 ->  Bitmap Heap Scan on videos
(cost=4.93..194.87 rows=87 width=52) (actual time=0.111..0.840 rows=90
loops=1)
                                       Recheck Cond: (company_id = 1)
                                       ->  Bitmap Index Scan on
index_videos_on_company_id  (cost=0.00..4.90 rows=87 width=0) (actual
time=0.079..0.079 rows=90 loops=1)
                                             Index Cond: (company_id = 1)
                     ->  Hash  (cost=487.78..487.78 rows=15778
width=12) (actual time=16.527..16.527 rows=15778 loops=1)
                           ->  Seq Scan on assets  (cost=0.00..487.78
rows=15778 width=12) (actual time=0.023..9.601 rows=15778 loops=1)
 Total runtime: 2637.043 ms
(17 rows)


That one runs reasonably fine, because there are only 20 videos being
returned and a handful of video views associated with them.  In the
real query there are about 1k videos and a couple million views.  That
took about 80 minutes to run, according to logs.

Pat

Re: Need help optimizing this query

От
"Pat Maddox"
Дата:
On 7/18/07, Pat Maddox <pergesu@gmail.com> wrote:
> On 7/18/07, Michael Glaesemann <grzm@seespotcode.net> wrote:
> >
> > On Jul 18, 2007, at 16:12 , Pat Maddox wrote:
> >
> > > ERROR:  invalid reference to FROM-clause entry for table "video_views"
> > > LINE 20: JOIN assets ON (video_views.video_id=videos.id)
> > >                         ^
> > > HINT:  There is an entry for table "video_views", but it cannot be
> > > referenced from this part of the query.
> >
> > It's because I mismatched the JOIN clauses during my copy-and-paste :(
> >
> > > On 7/18/07, Michael Glaesemann <grzm@seespotcode.net> wrote:
> >
> > >> FROM video_views
> > >> JOIN assets ON (video_views.video_id=videos.id)
> > >> JOIN videos ON (video_views.asset_id=assets.id)
> >
> > This should be
> >
> > FROM video_views
> > JOIN assets ON (video_views.asset_id=assets.id)
> > JOIN videos ON (video_views.video_id=videos.id)
> >
> > Do you have the EXPLAIN ANALYE output of the query?
> >
> > Michael Glaesemann
> > grzm seespotcode net
> >
> >
> >
>
> For some reason the functions you wrote are giving me trouble (there's
> a BIGINT involved, I tried changing the functions around but kept
> having issues).  So here's the full query, hopefully formatted better:
>
> SELECT
>     SUM(CASE WHEN (hit IS TRUE AND logged_in IS TRUE) THEN 1 ELSE 0
> END) AS count_hits_console,
>     SUM(CASE WHEN (hit IS TRUE AND logged_in IS FALSE) THEN 1 ELSE 0
> END) AS count_hits_remote,
>     SUM(CASE WHEN (played IS TRUE AND logged_in IS TRUE) THEN 1 ELSE 0
> END) AS count_played_console,
>     SUM(CASE WHEN (played IS TRUE AND logged_in IS FALSE) THEN 1 ELSE
> 0 END) AS count_played_remote,
>     SUM(CASE WHEN (downloaded IS TRUE AND logged_in IS TRUE) THEN 1
> ELSE 0 END) AS count_downloaded_console,
>     SUM(CASE WHEN (downloaded IS TRUE AND logged_in IS FALSE) THEN 1
> ELSE 0 END) AS count_downloaded_remote,
>     SUM(CASE WHEN ((played IS TRUE OR downloaded IS TRUE) AND
> logged_in IS TRUE) THEN assets.size ELSE 0 END) as
> download_size_console,
>     SUM(CASE WHEN ((played IS TRUE OR downloaded IS TRUE) AND
> logged_in IS FALSE) THEN assets.size ELSE 0 END) AS
> download_size_remote,
>     videos.id,
>     videos.title,
>     videos.guid
> FROM video_views
> JOIN assets ON (video_views.asset_id=assets.id)
> JOIN videos on (video_views.video_id=videos.id)
> WHERE videos.company_id=1
> GROUP BY  videos.id,
>           videos.title,
>           videos.guid
> ORDER BY count_hits_remote DESC
> LIMIT 100
>
>
>
> and here's the EXPLAIN ANALYZE output:
>
>  Limit  (cost=127072.90..127073.12 rows=87 width=64) (actual
> time=2636.560..2636.567 rows=20 loops=1)
>    ->  Sort  (cost=127072.90..127073.12 rows=87 width=64) (actual
> time=2636.558..2636.562 rows=20 loops=1)
>          Sort Key: sum(CASE WHEN ((video_views.hit IS TRUE) AND
> (video_views.logged_in IS FALSE)) THEN 1 ELSE 0 END)
>          ->  HashAggregate  (cost=127067.49..127070.10 rows=87
> width=64) (actual time=2636.481..2636.506 rows=20 loops=1)
>                ->  Hash Join  (cost=880.96..125995.46 rows=38983
> width=64) (actual time=24.904..2635.719 rows=122 loops=1)
>                      Hash Cond: (video_views.asset_id = assets.id)
>                      ->  Hash Join  (cost=195.96..124433.01 rows=39009
> width=60) (actual time=8.327..2618.982 rows=122 loops=1)
>                            Hash Cond: (video_views.video_id = videos.id)
>                            ->  Seq Scan on video_views
> (cost=0.00..101352.70 rows=5998470 width=12) (actual
> time=0.031..1410.231 rows=5998341 loops=1)
>                            ->  Hash  (cost=194.87..194.87 rows=87
> width=52) (actual time=1.001..1.001 rows=90 loops=1)
>                                  ->  Bitmap Heap Scan on videos
> (cost=4.93..194.87 rows=87 width=52) (actual time=0.111..0.840 rows=90
> loops=1)
>                                        Recheck Cond: (company_id = 1)
>                                        ->  Bitmap Index Scan on
> index_videos_on_company_id  (cost=0.00..4.90 rows=87 width=0) (actual
> time=0.079..0.079 rows=90 loops=1)
>                                              Index Cond: (company_id = 1)
>                      ->  Hash  (cost=487.78..487.78 rows=15778
> width=12) (actual time=16.527..16.527 rows=15778 loops=1)
>                            ->  Seq Scan on assets  (cost=0.00..487.78
> rows=15778 width=12) (actual time=0.023..9.601 rows=15778 loops=1)
>  Total runtime: 2637.043 ms
> (17 rows)
>
>
> That one runs reasonably fine, because there are only 20 videos being
> returned and a handful of video views associated with them.  In the
> real query there are about 1k videos and a couple million views.  That
> took about 80 minutes to run, according to logs.
>
> Pat
>

Okay so it doesn't normally take 80 minutes to run.  Something funky
just happened and it took that long once.  It usually takes between
90-100 seconds.  My coworker told me it takes 80 minutes but it
appears that's an anomaly.

One thing we were wondering is if all the aggregate calculations might
be slowing it down, and if it might be faster to do six separate
queries.  The real problem there is sorting and merging the data sets.
 Merging isn't tough, but making sure that all queries obey the
desired sort order is.  What do you think?

Pat

Re: Need help optimizing this query

От
Michael Glaesemann
Дата:
On Jul 18, 2007, at 16:48 , Pat Maddox wrote:

> For some reason the functions you wrote are giving me trouble (there's
> a BIGINT involved, I tried changing the functions around but kept
> having issues).

You might try these, if you're interested.

CREATE OR REPLACE FUNCTION ternary(BOOLEAN, BIGINT, BIGINT)
RETURNS BIGINT
LANGUAGE SQL AS $_$SELECT CASE WHEN $1 THEN $2 ELSE $3 END$_$;

CREATE OR REPLACE FUNCTION value_when(BOOLEAN, BIGINT)
RETURNS BIGINT
LANGUAGE SQL AS $_$SELECT ternary($1,$2,0)$_$;

CREATE OR REPLACE FUNCTION one_when(BOOLEAN)
RETURNS BIGINT
LANGUAGE SQL as $_$SELECT value_when($1,1)$_$;

> So here's the full query, hopefully formatted better:

I'm still curious about why the planner is choosing a hash join over
using the indexes on the foreign keys, but that might be because the
tables are relatively small.

> That one runs reasonably fine, because there are only 20 videos being
> returned and a handful of video views associated with them.  In the
> real query there are about 1k videos and a couple million views.  That
> took about 80 minutes to run, according to logs.

The planner will choose different plans based on, among other things,
what it estimates the size of the result to be, so while looking at a
small example query might seem like a way to go about looking at
what's going on, it's most likely not going to give you an accurate
representation of the situation. Are you looking at two different
systems (e.g., a development system versus a production system) or
just choosing a smaller query on the same system? If you can't run
the query on your production system, you may want to take a dump of
the production system and set it up on another box. Even with a
couple million rows in the video_views table, PostgreSQL shouldn't
really blink too much, as long as the server is tuned properly, the
hardware is adequate, and the database statistics are up to date.

Michael Glaesemann
grzm seespotcode net




Re: Need help optimizing this query

От
"Pat Maddox"
Дата:
On 7/18/07, Michael Glaesemann <grzm@seespotcode.net> wrote:
>
> On Jul 18, 2007, at 16:48 , Pat Maddox wrote:
>
> > For some reason the functions you wrote are giving me trouble (there's
> > a BIGINT involved, I tried changing the functions around but kept
> > having issues).
>
> You might try these, if you're interested.
>
> CREATE OR REPLACE FUNCTION ternary(BOOLEAN, BIGINT, BIGINT)
> RETURNS BIGINT
> LANGUAGE SQL AS $_$SELECT CASE WHEN $1 THEN $2 ELSE $3 END$_$;
>
> CREATE OR REPLACE FUNCTION value_when(BOOLEAN, BIGINT)
> RETURNS BIGINT
> LANGUAGE SQL AS $_$SELECT ternary($1,$2,0)$_$;
>
> CREATE OR REPLACE FUNCTION one_when(BOOLEAN)
> RETURNS BIGINT
> LANGUAGE SQL as $_$SELECT value_when($1,1)$_$;
>
> > So here's the full query, hopefully formatted better:
>
> I'm still curious about why the planner is choosing a hash join over
> using the indexes on the foreign keys, but that might be because the
> tables are relatively small.
>
> > That one runs reasonably fine, because there are only 20 videos being
> > returned and a handful of video views associated with them.  In the
> > real query there are about 1k videos and a couple million views.  That
> > took about 80 minutes to run, according to logs.
>
> The planner will choose different plans based on, among other things,
> what it estimates the size of the result to be, so while looking at a
> small example query might seem like a way to go about looking at
> what's going on, it's most likely not going to give you an accurate
> representation of the situation. Are you looking at two different
> systems (e.g., a development system versus a production system) or
> just choosing a smaller query on the same system? If you can't run
> the query on your production system, you may want to take a dump of
> the production system and set it up on another box. Even with a
> couple million rows in the video_views table, PostgreSQL shouldn't
> really blink too much, as long as the server is tuned properly, the
> hardware is adequate, and the database statistics are up to date.
>
> Michael Glaesemann
> grzm seespotcode net
>
>
>
>

Sorry, I mentioned that it took 90 seconds to run the query but I
didn't show that EXPLAIN ANALYZE output.

Here it is, same query just with a different company_id:

 Limit  (cost=879283.07..879283.32 rows=100 width=64) (actual
time=92486.858..92486.891 rows=100 loops=1)
   ->  Sort  (cost=879283.07..879297.15 rows=5632 width=64) (actual
time=92486.856..92486.867 rows=100 loops=1)
         Sort Key: sum(CASE WHEN ((video_views.hit IS TRUE) AND
(video_views.logged_in IS FALSE)) THEN 1 ELSE 0 END)
         ->  GroupAggregate  (cost=803054.95..878932.21 rows=5632
width=64) (actual time=67145.471..92484.408 rows=730 loops=1)
               ->  Sort  (cost=803054.95..809363.98 rows=2523610
width=64) (actual time=67076.407..75441.274 rows=5799447 loops=1)
                     Sort Key: videos.id, videos.title, videos.guid
                     ->  Hash Join  (cost=1220.63..237115.16
rows=2523610 width=64) (actual time=31.230..11507.406 rows=5799447
loops=1)
                           Hash Cond: (video_views.asset_id = assets.id)
                           ->  Hash Join  (cost=535.62..179627.88
rows=2525294 width=60) (actual time=13.286..7621.950 rows=5799447
loops=1)
                                 Hash Cond: (video_views.video_id = videos.id)
                                 ->  Seq Scan on video_views
(cost=0.00..101352.70 rows=5998470 width=12) (actual
time=0.023..2840.718 rows=5998341 loops=1)
                                 ->  Hash  (cost=465.23..465.23
rows=5632 width=52) (actual time=13.216..13.216 rows=5712 loops=1)
                                       ->  Seq Scan on videos
(cost=0.00..465.23 rows=5632 width=52) (actual time=0.038..9.060
rows=5712 loops=1)
                                             Filter: (company_id = 11)
                           ->  Hash  (cost=487.78..487.78 rows=15778
width=12) (actual time=17.876..17.876 rows=15778 loops=1)
                                 ->  Seq Scan on assets
(cost=0.00..487.78 rows=15778 width=12) (actual time=0.032..10.880
rows=15778 loops=1)
 Total runtime: 92548.006 ms
(17 rows)

Re: Need help optimizing this query

От
Michael Glaesemann
Дата:
On Jul 18, 2007, at 17:34 , Pat Maddox wrote:


>               ->  Sort  (cost=803054.95..809363.98 rows=2523610
> width=64) (actual time=67076.407..75441.274 rows=5799447 loops=1)
>                     Sort Key: videos.id, videos.title, videos.guid

If I'm reading this right, it looks like a majority of the time
(about two-thirds) is spent in this sort step. This is a naive guess,
but perhaps you may want to increase your work_mem. What's your
current work_mem setting?

Michael Glaesemann
grzm seespotcode net