Thread: Query performance


Query performance

From: Joe Van Dyk
Date:
I have an events table that records page views and purchases (type = 'viewed' or type='purchased'). I have a query that figures out "people who bought/viewed this also bought/viewed that".

It worked fine, taking about 0.1 seconds to complete, until a few hours ago when it started taking hours. Vacuum/analyze didn't help. It turned out there was one session_id with 400k rows in the system; deleting that session made the query performant again.

Is there anything I can do to make the query work better in cases like that? Missing index, or better query?

This is on 9.3.5.

The below is reproduced at the following URL if it's not formatted correctly in the email. https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt

explain select
  e1.product_id,
  e2.site_id,
  e2.product_id,
  count(nullif(e2.type = 'viewed', false)) view_count,
  count(nullif(e2.type = 'purchased', false)) purchase_count
from events e1
join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
where
  e1.product_id = '82503' and
  e1.product_id != e2.product_id
group by e1.product_id, e2.product_id, e2.site_id;

                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
   ->  Sort  (cost=828395.67..840117.89 rows=4688885 width=19)
         Sort Key: e1.product_id, e2.product_id, e2.site_id
         ->  Nested Loop  (cost=11.85..20371.14 rows=4688885 width=19)
               ->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 rows=369 width=49)
                     Recheck Cond: (product_id = '82503'::citext)
                     ->  Bitmap Index Scan on events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)
                           Index Cond: (product_id = '82503'::citext)
               ->  Index Scan using events_session_id_type_product_id_idx on events e2  (cost=0.56..51.28 rows=12 width=51)
                     Index Cond: ((session_id = e1.session_id) AND (type = e1.type))
                     Filter: (e1.product_id <> product_id)
(11 rows)

recommender_production=> \d events
                                   Table "public.events"
   Column    |           Type           |                      Modifiers
-------------+--------------------------+-----------------------------------------------------
 id          | bigint                   | not null default nextval('events_id_seq'::regclass)
 user_id     | citext                   |
 session_id  | citext                   | not null
 product_id  | citext                   | not null
 site_id     | citext                   | not null
 type        | text                     | not null
 happened_at | timestamp with time zone | not null
 created_at  | timestamp with time zone | not null
Indexes:
    "events_pkey" PRIMARY KEY, btree (id)
    "events_product_id_site_id_idx" btree (product_id, site_id)
    "events_session_id_type_product_id_idx" btree (session_id, type, product_id)
Check constraints:
    "events_session_id_check" CHECK (length(session_id::text) < 255)
    "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 'viewed'::text]))
    "events_user_id_check" CHECK (length(user_id::text) < 255)
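A quick way to spot this kind of skew before it bites (a sketch using the schema above, not something run in the original thread) is to ask which session_ids carry the most rows:

```sql
-- Hypothetical diagnostic: list the heaviest sessions. A single
-- session_id with ~400k rows is exactly what broke the planner's
-- row estimates for the self-join above.
SELECT session_id, count(*) AS event_count
FROM events
GROUP BY session_id
ORDER BY event_count DESC
LIMIT 10;
```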


Re: Query performance

From: Joe Van Dyk
Date:
On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk <joe@tanga.com> wrote:

After removing the session with 400k events, I was able to run an explain analyze; here is one of them:

Re: Query performance

From: Joe Van Dyk
Date:
Oops, didn't run vacuum analyze after deleting the events. Here is another 'explain analyze': http://explain.depesz.com/s/AviN


Re: Query performance

From: Pavel Stehule
Date:
Hi

this plan looks good

Regards

Pavel

2015-01-25 6:45 GMT+01:00 Joe Van Dyk <joe@tanga.com>:
Oops, didn't run vacuum analyze after deleting the events. Here is another 'explain analyze': http://explain.depesz.com/s/AviN

On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk <joe@tanga.com> wrote:
On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk <joe@tanga.com> wrote:
I have an events table that records page views and purchases (type = 'viewed' or type='purchased'). I have a query that figures out "people who bought/viewed this also bought/viewed that".

It worked fine, taking about 0.1 seconds to complete, until a few hours ago when it started taking hours to complete. Vacuum/analyze didn't help.  Turned out there was one session_id that had 400k rows in the system. Deleting that made the query performant again. 

Is there anything I can do to make the query work better in cases like that? Missing index, or better query?

This is on 9.3.5.

The below is reproduced at the following URL if it's not formatted correctly in the email. https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt

explain  select  e1.product_id,  e2.site_id,  e2.product_id,  count(nullif(e2.type='viewed', false)) view_count,  count(nullif(e2.type='purchased', false)) purchase_countfrom events e1join events e2 on e1.session_id = e2.session_id and e1.type = e2.typewhere  e1.product_id = '82503' and  e1.product_id != e2.product_idgroup by e1.product_id, e2.product_id, e2.site_id;                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)  ->  Sort  (cost=828395.67..840117.89 rows=4688885 width=19)        Sort Key: e1.product_id, e2.product_id, e2.site_id        ->  Nested Loop  (cost=11.85..20371.14 rows=4688885 width=19)              ->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 rows=369 width=49)                    Recheck Cond: (product_id = '82503'::citext)                    ->  Bitmap Index Scan on events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)                          Index Cond: (product_id = '82503'::citext)              ->  Index Scan using events_session_id_type_product_id_idx on events e2  (cost=0.56..51.28 rows=12 width=51)                    Index Cond: ((session_id = e1.session_id) AND (type = e1.type))                    Filter: (e1.product_id <> product_id)
(11 rows)

recommender_production=> \d events                                   Table "public.events"  Column    |           Type           |                      Modifiers
-------------+--------------------------+-----------------------------------------------------id          | bigint                   | not null default nextval('events_id_seq'::regclass)user_id     | citext                   |session_id  | citext                   | not nullproduct_id  | citext                   | not nullsite_id     | citext                   | not nulltype        | text                     | not nullhappened_at | timestamp with time zone | not nullcreated_at  | timestamp with time zone | not null
Indexes:   "events_pkey" PRIMARY KEY, btree (id)   "events_product_id_site_id_idx" btree (product_id, site_id)   "events_session_id_type_product_id_idx" btree (session_id, type, product_id)
Check constraints:   "events_session_id_check" CHECK (length(session_id::text) < 255)   "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 'viewed'::text]))   "events_user_id_check" CHECK (length(user_id::text) < 255)



After removing the session with 400k events, I was able to do an explain analyze, here is one of them:


Re: Query performance

From: Joe Van Dyk
Date:


On Sat, Jan 24, 2015 at 10:12 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Hi

this plan looks good

Regards

Pavel

Here's one that doesn't look quite as good: http://explain.depesz.com/s/SgT

Joe
 




Re: Query performance

From: Pavel Stehule
Date:


2015-01-25 7:38 GMT+01:00 Joe Van Dyk <joe@tanga.com>:



Here's one that doesn't look quite as good: http://explain.depesz.com/s/SgT

I see a possible issue

(product_id <> '81716'::citext) .. this operation is CPU expensive and possibly unnecessary

product_id should be an integer - and if it isn't, this comparison will not be very fast over 4M rows, especially on citext

try forcing the opposite cast - you will save a case-insensitive text comparison

product_id::int <> 81716

Regards

Pavel
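Applied to the original query, Pavel's suggestion would look something like this (a sketch; it assumes every product_id value is numeric, which Joe later notes is not guaranteed):

```sql
-- Keep the equality predicate on citext so the index on product_id is
-- still usable, but cast the inequality to int to avoid a
-- case-insensitive text comparison on every joined row.
select e1.product_id, e2.site_id, e2.product_id,
       count(nullif(e2.type = 'viewed', false)) view_count,
       count(nullif(e2.type = 'purchased', false)) purchase_count
from events e1
join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
where e1.product_id = '82503'
  and e1.product_id::int <> e2.product_id::int
group by e1.product_id, e2.product_id, e2.site_id;
```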


 





Re: Query performance

From: Joe Van Dyk
Date:
On Sat, Jan 24, 2015 at 11:14 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:



I see a possible issue

(product_id <> '81716'::citext) .. this operation is CPU expensive and possibly unnecessary

product_id should be an integer - and if it isn't, this comparison will not be very fast over 4M rows, especially on citext

try forcing the opposite cast - you will save a case-insensitive text comparison

product_id::int <> 81716

It might not always be an integer, just happens to be so here. Should I try text instead? I don't have to have the case-insensitive matching.

Joe
 






Re: Query performance

From: Pavel Stehule
Date:


2015-01-25 8:20 GMT+01:00 Joe Van Dyk <joe@tanga.com>:
It might not always be an integer, just happens to be so here. Should I try text instead? I don't have to have the case-insensitive matching.

text would be better

this design is unfortunate, but you probably cannot change it now

 

Joe
 

Regards

Pavel


 

Joe
 

2015-01-25 6:45 GMT+01:00 Joe Van Dyk <joe@tanga.com>:
Oops, didn't run vacuum analyze after deleting the events. Here is another 'explain analyze': http://explain.depesz.com/s/AviN

On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk <joe@tanga.com> wrote:
On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk <joe@tanga.com> wrote:
I have an events table that records page views and purchases (type = 'viewed' or type='purchased'). I have a query that figures out "people who bought/viewed this also bought/viewed that".

It worked fine, taking about 0.1 seconds to complete, until a few hours ago when it started taking hours to complete. Vacuum/analyze didn't help.  Turned out there was one session_id that had 400k rows in the system. Deleting that made the query performant again. 

Is there anything I can do to make the query work better in cases like that? Missing index, or better query?

This is on 9.3.5.

The below is reproduced at the following URL if it's not formatted correctly in the email. https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt

explain  select
  e1.product_id,
  e2.site_id,
  e2.product_id,
  count(nullif(e2.type='viewed', false)) view_count,
  count(nullif(e2.type='purchased', false)) purchase_count
from events e1
join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
where
  e1.product_id = '82503' and
  e1.product_id != e2.product_id
group by e1.product_id, e2.product_id, e2.site_id;

                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
   ->  Sort  (cost=828395.67..840117.89 rows=4688885 width=19)
         Sort Key: e1.product_id, e2.product_id, e2.site_id
         ->  Nested Loop  (cost=11.85..20371.14 rows=4688885 width=19)
               ->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 rows=369 width=49)
                     Recheck Cond: (product_id = '82503'::citext)
                     ->  Bitmap Index Scan on events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)
                           Index Cond: (product_id = '82503'::citext)
               ->  Index Scan using events_session_id_type_product_id_idx on events e2  (cost=0.56..51.28 rows=12 width=51)
                     Index Cond: ((session_id = e1.session_id) AND (type = e1.type))
                     Filter: (e1.product_id <> product_id)
(11 rows)

recommender_production=> \d events
                                    Table "public.events"
   Column    |           Type           |                      Modifiers
-------------+--------------------------+-----------------------------------------------------
 id          | bigint                   | not null default nextval('events_id_seq'::regclass)
 user_id     | citext                   |
 session_id  | citext                   | not null
 product_id  | citext                   | not null
 site_id     | citext                   | not null
 type        | text                     | not null
 happened_at | timestamp with time zone | not null
 created_at  | timestamp with time zone | not null
Indexes:
    "events_pkey" PRIMARY KEY, btree (id)
    "events_product_id_site_id_idx" btree (product_id, site_id)
    "events_session_id_type_product_id_idx" btree (session_id, type, product_id)
Check constraints:
    "events_session_id_check" CHECK (length(session_id::text) < 255)
    "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 'viewed'::text]))
    "events_user_id_check" CHECK (length(user_id::text) < 255)



After removing the session with 400k events, I was able to run an explain analyze; here is one of them:






Re: Query performance

From
Tomas Vondra
Date:
Hi,

On 25.1.2015 07:38, Joe Van Dyk wrote:
>
> Here's one that's not quite as well: http://explain.depesz.com/s/SgT

As Pavel already pointed out, the first problem is this part of the plan:

Seq Scan on events e2 (cost=0.00..120,179.60 rows=4,450,241 width=51)
(actual time=0.014..33,773.370 rows=4,450,865 loops=1)
    Filter: (product_id <> '81716'::citext)

Consuming ~33 seconds of the runtime. If you can make this faster
somehow (e.g. by getting rid of the citext cast), that'd be nice.

Another issue is that the hashjoin is batched:

   Buckets: 65536 Batches: 8 Memory Usage: 46085kB

The hash preparation takes ~40 seconds, so maybe try to give it a bit
more memory - I assume you have work_mem=64MB, so try doubling that
(ISTM 512MB should work with a single batch). Maybe this won't really
improve the performance, though. It still has to process ~4.5M rows.

Increasing the work mem could also result in switching to hash
aggregate, making the sort (~30 seconds) unnecessary.
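Tomas's work_mem suggestion can be tried per-session, without touching postgresql.conf. A sketch (the 512MB figure is his estimate, not a measured value):

```sql
-- Sketch: raise work_mem for this session only and re-run the query;
-- check whether the hash join drops to "Batches: 1" and whether the
-- planner switches from GroupAggregate+Sort to HashAggregate.
SET work_mem = '512MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT e1.product_id, e2.site_id, e2.product_id,
       count(nullif(e2.type = 'viewed', false))    AS view_count,
       count(nullif(e2.type = 'purchased', false)) AS purchase_count
FROM events e1
JOIN events e2 ON e1.session_id = e2.session_id AND e1.type = e2.type
WHERE e1.product_id = '82503'
  AND e1.product_id <> e2.product_id
GROUP BY e1.product_id, e2.product_id, e2.site_id;

RESET work_mem;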

Anyway, ISTM this works as expected, i.e.

(a) with rare product_id values the queries are fast
(b) with common product_id values the queries are slow

That's expected, because (b) needs to process much more data. I don't
think you can magically make it run as fast as (a). The best solution
might be to keep pre-aggregated results - I don't think you really
need exact answers when recommending "similar" products.

I also wonder if you really need to join the tables? I mean, what if you
do something like this:

CREATE TABLE events_aggregated AS SELECT
   site_id,
   array_agg(product_id) AS product_ids,
   count(nullif(type='viewed', false)) view_count,
   count(nullif(type='purchased', false)) purchase_count
FROM events
GROUP BY 1;

and then using intarray with GIN indexes to query this table?
Something like this:

  CREATE INDEX products_agg_idx ON events_aggregated
                       USING GIN (product_ids gin__int_ops);

  SELECT * FROM events_aggregated WHERE product_ids @> ARRAY['82503'];
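Note that intarray's gin__int_ops only applies to integer arrays, so this sketch assumes product ids are stored as integers (the thread elsewhere notes product_id "might not always be an integer"); under that assumption, the prerequisites would look like:

```sql
-- Assumes product_ids is an int[] column; gin__int_ops does not work
-- on citext[] or text[] arrays.
CREATE EXTENSION IF NOT EXISTS intarray;

CREATE INDEX products_agg_idx ON events_aggregated
  USING GIN (product_ids gin__int_ops);

-- With integer ids, the probe uses an integer array literal:
SELECT * FROM events_aggregated WHERE product_ids @> ARRAY[82503];
```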

regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Query performance

From
Marc Mamin
Date:

>I have an events table that records page views and purchases (type = 'viewed' or type='purchased'). I have a query that figures out "people who bought/viewed this also bought/viewed that".
>
>It worked fine, taking about 0.1 seconds to complete, until a few hours ago when it started taking hours to complete. Vacuum/analyze didn't help.  Turned out there was one session_id that had 400k rows in the system. Deleting that made the query performant again.
>
>Is there anything I can do to make the query work better in cases like that? Missing index, or better query?
>
>This is on 9.3.5.
>
>The below is reproduced at the following URL if it's not formatted correctly in the email. https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt

Hello,

here are 2 variations that should be somewhat faster.

 It seems you may have duplicate (site_id, session_id, product_id) rows,
 which would skew the result. In that case you'll need some more logic in the query.
 
 select
    '82503' as product_id,
    e2.site_id,
    e2.product_id,
    count(nullif(e2.type='viewed', false)) view_count,
    count(nullif(e2.type='purchased', false)) purchase_count
  from events e1
  join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
  where
    e1.product_id = '82503' and
    e2.product_id != '82503'
 group by  e2.product_id, e2.site_id;
 
 
 OR:
 
 WITH SALL as(
  select
     e2.site_id,
     e2.product_id,
     count(nullif(e2.type='viewed', false)) view_count,
     count(nullif(e2.type='purchased', false)) purchase_count
   from events e1
   join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
   where
     e1.product_id = '82503'
  group by  e2.product_id, e2.site_id
 )
 SELECT
    '82503' as product_id_1,
    site_id,
    product_id,
    view_count,
    purchase_count
 FROM SALL
 WHERE product_id != '82503';
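One way to sketch the "more logic" Marc alludes to for the duplicate-row case (this is an assumption about the intent, not his code): collapse the e1 side to one row per (session_id, type), so a session that viewed or bought product 82503 several times only contributes once to each count.

```sql
-- Sketch: deduplicate the driving side of the join so repeated
-- views/purchases of 82503 within one session don't multiply counts.
SELECT
   '82503' AS product_id,
   e2.site_id,
   e2.product_id,
   count(nullif(e2.type = 'viewed', false))    AS view_count,
   count(nullif(e2.type = 'purchased', false)) AS purchase_count
FROM (SELECT DISTINCT session_id, type
      FROM events
      WHERE product_id = '82503') e1
JOIN events e2 ON e1.session_id = e2.session_id AND e1.type = e2.type
WHERE e2.product_id != '82503'
GROUP BY e2.product_id, e2.site_id;
```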
    

 regards,
 Marc Mamin
 


>explain  select
>   e1.product_id,
>   e2.site_id,
>   e2.product_id,
>   count(nullif(e2.type='viewed', false)) view_count,
>   count(nullif(e2.type='purchased', false)) purchase_count
> from events e1
> join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
> where
>   e1.product_id = '82503' and
>   e1.product_id != e2.product_id
> group by e1.product_id, e2.product_id, e2.site_id;
>                                                         QUERY PLAN
>----------------------------------------------------------------------------------------------------------------------------
> GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
>   ->  Sort  (cost=828395.67..840117.89 rows=4688885 width=19)
>         Sort Key: e1.product_id, e2.product_id, e2.site_id
>         ->  Nested Loop  (cost=11.85..20371.14 rows=4688885 width=19)
>               ->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 rows=369 width=49)
>                     Recheck Cond: (product_id = '82503'::citext)
>                     ->  Bitmap Index Scan on events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)
>                           Index Cond: (product_id = '82503'::citext)
>               ->  Index Scan using events_session_id_type_product_id_idx on events e2  (cost=0.56..51.28 rows=12 width=51)
>                     Index Cond: ((session_id = e1.session_id) AND (type = e1.type))
>                     Filter: (e1.product_id <> product_id)
>(11 rows)
>
>recommender_production=> \d events
>                                    Table "public.events"
>   Column    |           Type           |                      Modifiers
>-------------+--------------------------+-----------------------------------------------------
> id          | bigint                   | not null default nextval('events_id_seq'::regclass)
> user_id     | citext                   |
> session_id  | citext                   | not null
> product_id  | citext                   | not null
> site_id     | citext                   | not null
> type        | text                     | not null
> happened_at | timestamp with time zone | not null
> created_at  | timestamp with time zone | not null
>Indexes:
>    "events_pkey" PRIMARY KEY, btree (id)
>    "events_product_id_site_id_idx" btree (product_id, site_id)
>    "events_session_id_type_product_id_idx" btree (session_id, type, product_id)
>Check constraints:
>    "events_session_id_check" CHECK (length(session_id::text) < 255)
>    "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 'viewed'::text]))
>    "events_user_id_check" CHECK (length(user_id::text) < 255)
>
>
>
>

Re: Query performance

From
Jim Nasby
Date:
On 1/25/15 2:03 AM, Pavel Stehule wrote:
>     It might not always be an integer, just happens to be so here.
>     Should I try text instead? I don't have to have the case-insensitive
>     matching.
>
>
> text can be better

bytea would be even better yet, because that will always be a straight
binary comparison. text will worry about conversion and what not
(though, perhaps there's a way to force that to use C or SQL instead of
something like UTF8, short of changing the encoding of the whole database).
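There is in fact such an escape hatch: since PostgreSQL 9.1 a collation can be set per column, so comparisons can use straight byte-wise ordering without re-encoding the database. A sketch, assuming the case-insensitive behaviour of citext is not actually needed for session ids:

```sql
-- Sketch: switch session_id to plain text with the "C" collation, so
-- comparisons and btree index lookups are binary, locale-independent
-- operations (assumes citext case-insensitivity is not required here).
ALTER TABLE events
  ALTER COLUMN session_id TYPE text COLLATE "C";
```

Indexes on the column would be rebuilt by the ALTER, so on a large table this needs a maintenance window.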
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: Query performance

From
Pavel Stehule
Date:


2015-01-31 2:40 GMT+01:00 Jim Nasby <Jim.Nasby@bluetreble.com>:
> On 1/25/15 2:03 AM, Pavel Stehule wrote:
>>     It might not always be an integer, just happens to be so here.
>>     Should I try text instead? I don't have to have the case-insensitive
>>     matching.
>>
>> text can be better
>
> bytea would be even better yet, because that will always be a straight binary comparison. text will worry about conversion and what not (though, perhaps there's a way to force that to use C or SQL instead of something like UTF8, short of changing the encoding of the whole database).

true,

good idea

Regards

Pavel