Re: faster way to calculate top "tags" for a "resource" based on a column

Поиск
Список
Период
Сортировка
От Jim Nasby
Тема Re: faster way to calculate top "tags" for a "resource" based on a column
Дата
Msg-id 54331003.1020505@BlueTreble.com
обсуждение исходный текст
Ответ на faster way to calculate top "tags" for a "resource" based on a column  (Jonathan Vanasco <postgres@2xlp.com>)
Ответы Re: faster way to calculate top "tags" for a "resource" based on a column  (Jonathan Vanasco <postgres@2xlp.com>)
Список pgsql-general
On 10/3/14, 11:21 AM, Jonathan Vanasco wrote:
> I've been able to fix most of my slow queries into something more acceptable, but I haven't been able to shave any
timeoff this one.  I'm hoping someone has another strategy. 
>
> I have 2 tables:
>     resource
>     resource_2_tag
>
> I want to calculate the top 25 "tag_ids" in "resource_2_tag " for resources that match a given attribute on the
"resource"table. 
>
> both tables have around 1.6million records.
>
> If the database needs to warm up and read into cache, this can take 60seconds to read the data off disk.
> If the database doesn't need to warm up, it averages 1.76seconds.
>
> The 1.76s time is troubling me.
> Searching for the discrete elements of this is pretty lightweight.
>
> here's an explain --  http://explain.depesz.com/s/PndC
>
> I tried a subquery instead of a join, and the query optimized the plan to the same.
>
> i'm hoping someone will see something that I just don't see.
>
>
>
>        Table "public.resource_2_tag"
>          Column         |  Type   | Modifiers
> -----------------------+---------+-----------
>   resource_id           | integer |
>   tag_id                | integer |
> Indexes:
>      "_idx_speed_resource_2_tag__resource_id" btree (resource_id)
>      "_idx_speed_resource_2_tag__tag_id" btree (tag_id)
>
>                                                    Table "public.resource"
>                 Column                |            Type             |                        Modifiers
>
-------------------------------------+-----------------------------+----------------------------------------------------------
>   id                                  | integer                     | not null default
nextval('resource_id_seq'::regclass)
> resource_attribute1_id               | integer                     |
> lots of other columns                |                             |
> Indexes:
>     "resource_attribute1_idx" btree (resource_attribute1_id)
>
> --------------------------------------------------------------------------------
>
> select count(*) from resource;
> -- 1669729
>
> select count(*) from resource_2_tag;
> -- 1676594
>
> select count(*) from resource where resource_attribute1_id = 614;
> -- 5184
> -- 4.386ms
>
> select id from resource where resource_attribute1_id = 614;
> -- 5184
> -- 87.303ms
>
> popping the 5k elements into an "in" clause, will run the query in around 100ms.
>
>
> EXPLAIN ANALYZE
> SELECT
>     resource_2_tag.tag_id AS resource_2_tag_tag_id,
>     count(resource_2_tag.tag_id) AS counted
> FROM
>     resource_2_tag
> JOIN resource ON resource.id = resource_2_tag.resource_id
> WHERE
>     resource.resource_attribute1_id = 614
> GROUP BY resource_2_tag.tag_id
> ORDER BY counted DESC
> LIMIT 25 OFFSET 0;
Don't join to the resource table; there's no reason to because you're not pulling anything from it.

If for some reason you do need data out of the resource table, do the LIMIT 25 first, in a sub-select:

SELECT r.*, counted
     FROM resource r
         JOIN (
             SELECT tag_id, count(*)
                 FROM resource_2_tag
                 GROUP BY tag_id
                 ORDER BY tag_id
                 LIMIT 25
         ) t ON ...
;

--
--
Jim Nasby, Data Architect, Blue Treble
Data in Trouble? Get it in Treble! http://BlueTreble.com



В списке pgsql-general по дате отправления:

Предыдущее
От: Jim Nasby
Дата:
Сообщение: Re: table versioning approach (not auditing)
Следующее
От: Jim Nasby
Дата:
Сообщение: Re: Processor usage/tuning question