separating improperly grouped page views
От | Jeff Frost |
---|---|
Тема | separating improperly grouped page views |
Дата | |
Msg-id | Pine.LNX.4.64.0706172157410.14506@discord.home.frostconsultingllc.com обсуждение исходный текст |
Ответы |
Re: separating improperly grouped page views
|
Список | pgsql-sql |
It seems a user visit cookie's expiration has been improperly set and so the tables which represent a user's clickstream through the website are grouped improperly. I'd like to run a set-wise query to regroup them based on periods of inactivity greather than 1 hour. So, what we want to do is adjust the visit_id's in the page_view table to point to separate visits whenever the user has been idle for more than 1 hour. The query I've come up with looks like this: CREATE TABLE visit_clean AS SELECT id , CASE WHEN stamp - last_stamp > INTERVAL '1 hour' OR last_stamp IS NULL THEN nextval('tracking_cleanup_seq') ELSE currval('tracking_cleanup_seq') END AS visit_id , visit_id AS old_visit_id , uri , params , stamp , cindex , tindex , method , source_address , server_name FROM ( SELECT id , visit_id , uri , params , stamp , cindex , tindex , method , source_address , ( SELECT max(pv2.stamp) FROM page_view pv2 WHERE pv2.visit_id = pv1.visit_id AND pv2.stamp < pv1.stamp ) AS last_stamp , server_name FROM page_view pv1 ) x; It works reasonably well on a small data set. But, the plan for this is horrible on the real data. QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- SeqScan on page_view pv1 (cost=0.00..2634763281.70 rows=3588374 width=239) SubPlan -> Result (cost=367.09..367.10rows=1 width=0) InitPlan -> Limit (cost=0.00..367.09 rows=1 width=8) -> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..158215.86 rows=431 width=8) Index Cond: (stamp < $1) Filter: ((stamp IS NOT NULL) AND (visit_id = $0)) -> Result (cost=367.09..367.10 rows=1 width=0) InitPlan -> Limit (cost=0.00..367.09 rows=1 width=8) -> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..158215.86 rows=431width=8) Index Cond: (stamp < $1) Filter: ((stamp IS NOT NULL) AND(visit_id = $0)) The page_view table is 829MB in size and has 3,590,185 rows. I let the query run on my test server for about 26hrs without finishing, so I can't provide explain analyze output yet. Does anyone have a better method of separating this data out? --- Jeff Frost, Owner <jeff@frostconsultingllc.com> Frost Consulting, LLC http://www.frostconsultingllc.com/ Phone: 650-780-7908 FAX: 650-649-1954
В списке pgsql-sql по дате отправления: