Discussion: Query performance
I have an events table that records page views and purchases (type = 'viewed' or type='purchased'). I have a query that figures out "people who bought/viewed this also bought/viewed that".
It worked fine, taking about 0.1 seconds to complete, until a few hours ago when it started taking hours to complete. Vacuum/analyze didn't help. Turned out there was one session_id that had 400k rows in the system. Deleting that made the query performant again.
Is there anything I can do to make the query work better in cases like that? Missing index, or better query?
This is on 9.3.5.
explain select
   e1.product_id,
   e2.site_id,
   e2.product_id,
   count(nullif(e2.type='viewed', false)) view_count,
   count(nullif(e2.type='purchased', false)) purchase_count
 from events e1
 join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
 where
   e1.product_id = '82503' and
   e1.product_id != e2.product_id
 group by e1.product_id, e2.product_id, e2.site_id;

                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
   ->  Sort  (cost=828395.67..840117.89 rows=4688885 width=19)
         Sort Key: e1.product_id, e2.product_id, e2.site_id
         ->  Nested Loop  (cost=11.85..20371.14 rows=4688885 width=19)
               ->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 rows=369 width=49)
                     Recheck Cond: (product_id = '82503'::citext)
                     ->  Bitmap Index Scan on events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)
                           Index Cond: (product_id = '82503'::citext)
               ->  Index Scan using events_session_id_type_product_id_idx on events e2  (cost=0.56..51.28 rows=12 width=51)
                     Index Cond: ((session_id = e1.session_id) AND (type = e1.type))
                     Filter: (e1.product_id <> product_id)
(11 rows)

recommender_production=> \d events
                                      Table "public.events"
   Column    |           Type           |                      Modifiers
-------------+--------------------------+-----------------------------------------------------
 id          | bigint                   | not null default nextval('events_id_seq'::regclass)
 user_id     | citext                   |
 session_id  | citext                   | not null
 product_id  | citext                   | not null
 site_id     | citext                   | not null
 type        | text                     | not null
 happened_at | timestamp with time zone | not null
 created_at  | timestamp with time zone | not null
Indexes:
    "events_pkey" PRIMARY KEY, btree (id)
    "events_product_id_site_id_idx" btree (product_id, site_id)
    "events_session_id_type_product_id_idx" btree (session_id, type, product_id)
Check constraints:
    "events_session_id_check" CHECK (length(session_id::text) < 255)
    "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 'viewed'::text]))
    "events_user_id_check" CHECK (length(user_id::text) < 255)
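[Editor's note] For readers who want to poke at the shape of this query without a Postgres instance, here is a minimal sketch of the same self-join in sqlite3, with made-up sample data. The count(nullif(cond, 0)) idiom counts only rows where cond is true: nullif yields NULL (which count ignores) when cond is false (0 stands in for Postgres's boolean false).

```python
# Self-contained sqlite3 sketch of the "also viewed/purchased" self-join.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (session_id TEXT, product_id TEXT, site_id TEXT, type TEXT);
INSERT INTO events VALUES
  ('s1', '82503', 'siteA', 'viewed'),
  ('s1', '90001', 'siteA', 'viewed'),
  ('s2', '82503', 'siteA', 'purchased'),
  ('s2', '90001', 'siteA', 'purchased');
""")

rows = con.execute("""
SELECT e1.product_id, e2.site_id, e2.product_id,
       count(nullif(e2.type = 'viewed', 0))    AS view_count,
       count(nullif(e2.type = 'purchased', 0)) AS purchase_count
FROM events e1
JOIN events e2 ON e1.session_id = e2.session_id AND e1.type = e2.type
WHERE e1.product_id = '82503' AND e1.product_id != e2.product_id
GROUP BY e1.product_id, e2.product_id, e2.site_id
""").fetchall()
print(rows)  # product 90001 was co-viewed once and co-purchased once
```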
On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk <joe@tanga.com> wrote:
After removing the session with 400k events, I was able to do an explain analyze, here is one of them:
Oops, didn't run vacuum analyze after deleting the events. Here is another 'explain analyze': http://explain.depesz.com/s/AviN
On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk <joe@tanga.com> wrote:
Hi
this plan looks well

Pavel
2015-01-25 6:45 GMT+01:00 Joe Van Dyk <joe@tanga.com>:
On Sat, Jan 24, 2015 at 10:12 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Here's one that's not quite as well: http://explain.depesz.com/s/SgT
Joe
2015-01-25 7:38 GMT+01:00 Joe Van Dyk <joe@tanga.com>:
I see a possible issue
(product_id <> '81716'::citext) .. this operation is CPU expensive and maybe nonsense

product_id should be an integer -- and if it isn't, this comparison cannot be extremely fast on 4M rows, especially with citext

try to force a cast the other way -- you will save a case-insensitive text comparison:

    product_id::int <> 81716
Regards
Pavel
On Sat, Jan 24, 2015 at 11:14 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
It might not always be an integer, just happens to be so here. Should I try text instead? I don't have to have the case-insensitive matching.
Joe
2015-01-25 8:20 GMT+01:00 Joe Van Dyk <joe@tanga.com>:
text can be better

this design is unfortunate, but you probably cannot change it now
Hi,

On 25.1.2015 07:38, Joe Van Dyk wrote:
>
> Here's one that's not quite as well: http://explain.depesz.com/s/SgT

As Pavel already pointed out, the first problem is this part of the plan:

  Seq Scan on events e2 (cost=0.00..120,179.60 rows=4,450,241 width=51)
                        (actual time=0.014..33,773.370 rows=4,450,865 loops=1)
      Filter: (product_id <> '81716'::citext)

consuming ~33 seconds of the runtime. If you can make this faster somehow (e.g. by getting rid of the citext cast), that'd be nice.

Another issue is that the hash join is batched:

  Buckets: 65536  Batches: 8  Memory Usage: 46085kB

The hash preparation takes ~40 seconds, so maybe try to give it a bit more memory - I assume you have work_mem=64MB, so try doubling that (ISTM 512MB should work with a single batch). Maybe this won't really improve the performance, though - it still has to process ~4.5M rows. Increasing the work_mem could also result in switching to a hash aggregate, making the sort (~30 seconds) unnecessary.

Anyway, ISTM this works as expected, i.e.

  (a) with rare product_id values the queries are fast
  (b) with common product_id values the queries are slow

That's expected, because (b) needs to process much more data. I don't think you can magically make it run as fast as (a). The best solution might be to keep pre-aggregated results - I don't think you really need exact answers when recommending "similar" products.

I also wonder if you really need to join the tables? I mean, what if you do something like this:

  CREATE TABLE events_aggregated AS
  SELECT
      site_id,
      array_agg(product_id) AS product_ids,
      count(nullif(type='viewed', false)) AS view_count,
      count(nullif(type='purchased', false)) AS purchase_count
  FROM events
  GROUP BY 1;

and then use intarray with GIN indexes to query this table? Something like this:

  CREATE INDEX products_agg_idx ON events_aggregated
      USING GIN (product_ids gin__int_ops);

  SELECT * FROM events_aggregated WHERE product_ids @> ARRAY['82503'];

regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
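[Editor's note] Tomas's pre-aggregation idea can be sketched outside Postgres as well. This is a hypothetical sqlite3 stand-in with made-up data: group_concat replaces Postgres's array_agg, and a comma-delimited membership test replaces the intarray @> lookup that a GIN index would serve in Postgres.

```python
# sqlite3 stand-in for a pre-aggregated recommendations table.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (session_id TEXT, product_id TEXT, site_id TEXT, type TEXT);
INSERT INTO events VALUES
  ('s1', '82503', 'siteA', 'viewed'),
  ('s1', '90001', 'siteA', 'viewed'),
  ('s2', '82503', 'siteB', 'purchased');
""")

# Build the aggregate once per site: product list plus event counts.
con.execute("""
CREATE TABLE events_aggregated AS
SELECT site_id,
       group_concat(product_id)             AS product_ids,
       count(nullif(type = 'viewed', 0))    AS view_count,
       count(nullif(type = 'purchased', 0)) AS purchase_count
FROM events
GROUP BY 1
""")

# Query time touches only the small aggregated table, no self-join.
rows = con.execute("""
SELECT site_id, view_count, purchase_count
FROM events_aggregated
WHERE ',' || product_ids || ',' LIKE '%,82503,%'
ORDER BY site_id
""").fetchall()
print(rows)
```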
>I have an events table that records page views and purchases (type = 'viewed' or type='purchased'). I have a query that figures out "people who bought/viewed this also bought/viewed that".
>
>It worked fine, taking about 0.1 seconds to complete, until a few hours ago when it started taking hours to complete. Vacuum/analyze didn't help. Turned out there was one session_id that had 400k rows in the system. Deleting that made the query performant again.
>
>Is there anything I can do to make the query work better in cases like that? Missing index, or better query?
>
>This is on 9.3.5.
>
>The below is reproduced at the following URL if it's not formatted correctly in the email. https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt
Hello,
here are 2 variations that should be somewhat faster.
It seems you may have duplicate (site_id, session_id, product_id) rows,
which would falsify the result. In that case you'll need some more logic in the query.
select
'82503' as product_id,
e2.site_id,
e2.product_id,
count(nullif(e2.type='viewed', false)) view_count,
count(nullif(e2.type='purchased', false)) purchase_count
from events e1
join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
where
e1.product_id = '82503' and
e2.product_id != '82503'
group by e2.product_id, e2.site_id;
OR:
WITH SALL as(
select
e2.site_id,
e2.product_id,
count(nullif(e2.type='viewed', false)) view_count,
count(nullif(e2.type='purchased', false)) purchase_count
from events e1
join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
where
e1.product_id = '82503'
group by e2.product_id, e2.site_id
)
SELECT
'82503' as product_id_1,
site_id,
product_id,
view_count,
purchase_count
FROM SALL
WHERE product_id != '82503';
regards,
Marc Mamin
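[Editor's note] Marc's caveat about duplicates can be illustrated with a small sketch (sqlite3, made-up data): a repeated (session_id, product_id, type) row inflates the counts, and a DISTINCT pre-pass is one possible form of the "more logic" he mentions.

```python
# Demonstrates how duplicate event rows inflate the co-occurrence counts.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (session_id TEXT, product_id TEXT, site_id TEXT, type TEXT);
INSERT INTO events VALUES
  ('s1', '82503', 'siteA', 'viewed'),
  ('s1', '90001', 'siteA', 'viewed'),
  ('s1', '90001', 'siteA', 'viewed');
""")

base = """
SELECT count(nullif(e2.type = 'viewed', 0))
FROM {src} e1
JOIN {src} e2 ON e1.session_id = e2.session_id AND e1.type = e2.type
WHERE e1.product_id = '82503' AND e2.product_id != '82503'
"""

# Plain join: the duplicated view of 90001 is counted twice.
naive = con.execute(base.format(src="events")).fetchone()[0]

# DISTINCT pre-pass collapses the duplicate before joining.
con.execute("""
CREATE TEMP VIEW dedup AS
SELECT DISTINCT session_id, product_id, site_id, type FROM events
""")
deduped = con.execute(base.format(src="dedup")).fetchone()[0]

print(naive, deduped)  # the duplicate inflates only the naive count
```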
>explain select
> e1.product_id,
> e2.site_id,
> e2.product_id,
> count(nullif(e2.type='viewed', false)) view_count,
> count(nullif(e2.type='purchased', false)) purchase_count
> from events e1
> join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
> where
> e1.product_id = '82503' and
> e1.product_id != e2.product_id
> group by e1.product_id, e2.product_id, e2.site_id;
> QUERY PLAN
>----------------------------------------------------------------------------------------------------------------------------
> GroupAggregate (cost=828395.67..945838.90 rows=22110 width=19)
> -> Sort (cost=828395.67..840117.89 rows=4688885 width=19)
> Sort Key: e1.product_id, e2.product_id, e2.site_id
> -> Nested Loop (cost=11.85..20371.14 rows=4688885 width=19)
> -> Bitmap Heap Scan on events e1 (cost=11.29..1404.31 rows=369 width=49)
> Recheck Cond: (product_id = '82503'::citext)
> -> Bitmap Index Scan on events_product_id_site_id_idx (cost=0.00..11.20 rows=369 width=0)
> Index Cond: (product_id = '82503'::citext)
> -> Index Scan using events_session_id_type_product_id_idx on events e2 (cost=0.56..51.28 rows=12 width=51)
> Index Cond: ((session_id = e1.session_id) AND (type = e1.type))
> Filter: (e1.product_id <> product_id)
>(11 rows)
>
>recommender_production=> \d events
> Table "public.events"
> Column | Type | Modifiers
>-------------+--------------------------+-----------------------------------------------------
> id | bigint | not null default nextval('events_id_seq'::regclass)
> user_id | citext |
> session_id | citext | not null
> product_id | citext | not null
> site_id | citext | not null
> type | text | not null
> happened_at | timestamp with time zone | not null
> created_at | timestamp with time zone | not null
>Indexes:
> "events_pkey" PRIMARY KEY, btree (id)
> "events_product_id_site_id_idx" btree (product_id, site_id)
> "events_session_id_type_product_id_idx" btree (session_id, type, product_id)
>Check constraints:
> "events_session_id_check" CHECK (length(session_id::text) < 255)
> "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 'viewed'::text]))
> "events_user_id_check" CHECK (length(user_id::text) < 255)
On 1/25/15 2:03 AM, Pavel Stehule wrote:
>> It might not always be an integer, just happens to be so here.
>> Should I try text instead? I don't have to have the case-insensitive
>> matching.
>
> text can be better

bytea would be even better yet, because that will always be a straight binary comparison. text will worry about conversion and what not (though perhaps there's a way to force that to use C or SQL instead of something like UTF8, short of changing the encoding of the whole database).

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
2015-01-31 2:40 GMT+01:00 Jim Nasby <Jim.Nasby@bluetreble.com>:
> On 1/25/15 2:03 AM, Pavel Stehule wrote:
>>> It might not always be an integer, just happens to be so here.
>>> Should I try text instead? I don't have to have the case-insensitive
>>> matching.
>>
>> text can be better
>
> bytea would be even better yet, because that will always be a straight binary comparison. text will worry about conversion and what not (though perhaps there's a way to force that to use C or SQL instead of something like UTF8, short of changing the encoding of the whole database).
true, good idea
Regards
Pavel