Обсуждение: separating improperly grouped page views
It seems a user visit cookie's expiration has been improperly set and so the
tables which represent a user's clickstream through the website are grouped
improperly. I'd like to run a set-wise query to regroup them based on periods
of inactivity greather than 1 hour. So, what we want to do is adjust the
visit_id's in the page_view table to point to separate visits whenever the
user has been idle for more than 1 hour. The query I've come up with looks
like this:
CREATE TABLE visit_clean AS
SELECT id , CASE WHEN stamp - last_stamp > INTERVAL '1 hour' OR last_stamp IS NULL THEN
nextval('tracking_cleanup_seq') ELSE currval('tracking_cleanup_seq') END AS visit_id , visit_id AS
old_visit_id , uri , params , stamp , cindex , tindex , method , source_address ,
server_name FROM ( SELECT id , visit_id , uri , params , stamp , cindex ,
tindex , method , source_address , ( SELECT max(pv2.stamp) FROM page_view pv2
WHERE pv2.visit_id = pv1.visit_id AND pv2.stamp < pv1.stamp ) AS last_stamp
, server_name FROM page_view pv1 ) x;
It works reasonably well on a small data set. But, the plan for this is
horrible on the real data.
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
SeqScan on page_view pv1 (cost=0.00..2634763281.70 rows=3588374 width=239) SubPlan -> Result
(cost=367.09..367.10rows=1 width=0) InitPlan -> Limit (cost=0.00..367.09 rows=1 width=8)
-> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..158215.86 rows=431 width=8)
Index Cond: (stamp < $1) Filter: ((stamp IS NOT NULL) AND (visit_id = $0))
-> Result (cost=367.09..367.10 rows=1 width=0) InitPlan -> Limit (cost=0.00..367.09 rows=1
width=8) -> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..158215.86
rows=431width=8) Index Cond: (stamp < $1) Filter: ((stamp IS NOT NULL)
AND(visit_id = $0))
The page_view table is 829MB in size and has 3,590,185 rows.
I let the query run on my test server for about 26hrs without finishing, so I
can't provide explain analyze output yet.
Does anyone have a better method of separating this data out?
---
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954
Jeff Frost <jeff@frostconsultingllc.com> writes:
> [ umpteen million iterations of: ]
> -> Limit (cost=0.00..367.09 rows=1 width=8)
> -> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..158215.86
rows=431width=8)
> Index Cond: (stamp < $1)
> Filter: ((stamp IS NOT NULL) AND (visit_id = $0))
Perhaps an index on (visit_id, stamp) would help. This one is doing the
best it can, but if the visit_id's you want are thinly scattered, it'll
still suck...
regards, tom lane
On Mon, 18 Jun 2007, Tom Lane wrote:
> Jeff Frost <jeff@frostconsultingllc.com> writes:
>> [ umpteen million iterations of: ]
>> -> Limit (cost=0.00..367.09 rows=1 width=8)
>> -> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..158215.86
rows=431width=8)
>> Index Cond: (stamp < $1)
>> Filter: ((stamp IS NOT NULL) AND (visit_id = $0))
>
> Perhaps an index on (visit_id, stamp) would help. This one is doing the
> best it can, but if the visit_id's you want are thinly scattered, it'll
> still suck...
Good idea Tom! In fact the planner seems to like that much better:
Seq Scan on page_view pv1 (cost=0.00..11529031.34 rows=3580205 width=239)
vs Seq Scan on page_view pv1 (cost=0.00..2622541458.55 rows=3596473 width=237)
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
SeqScan on page_view pv1 (cost=0.00..11529031.34 rows=3580205 width=239) SubPlan -> Result (cost=1.58..1.59
rows=1width=0) InitPlan -> Limit (cost=0.00..1.58 rows=1 width=8) -> Index
ScanBackward using page_view_visit_id_stamp_idx on page_view pv2 (cost=0.00..625.39 rows=397 width=8)
Index Cond: ((visit_id = $0) AND (stamp < $1)) Filter: (stamp IS NOT NULL) -> Result
(cost=1.58..1.59 rows=1 width=0) InitPlan -> Limit (cost=0.00..1.58 rows=1 width=8)
-> Index Scan Backward using page_view_visit_id_stamp_idx on page_view pv2 (cost=0.00..625.39 rows=397 width=8)
Index Cond: ((visit_id = $0) AND (stamp < $1)) Filter: (stamp IS NOT
NULL)
(14 rows)
Compared to:
-------------------------------------------------------------------------------------------------------------------------------
SeqScan on page_view pv1 (cost=0.00..2622541458.55 rows=3596473 width=237) SubPlan -> Result
(cost=364.56..364.57rows=1 width=0) InitPlan -> Limit (cost=0.00..364.56 rows=1 width=8)
-> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..153481.58 rows=421 width=8)
Index Cond: (stamp < $1) Filter: ((stamp IS NOT NULL) AND (visit_id = $0))
-> Result (cost=364.56..364.57 rows=1 width=0) InitPlan -> Limit (cost=0.00..364.56 rows=1
width=8) -> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..153481.58
rows=421width=8) Index Cond: (stamp < $1) Filter: ((stamp IS NOT NULL)
AND(visit_id = $0))
(14 rows)
--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954
On Sun, 17 Jun 2007, Jeff Frost wrote:
>
-------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on page_view pv1 (cost=0.00..11529031.34 rows=3580205 width=239)
> SubPlan
> -> Result (cost=1.58..1.59 rows=1 width=0)
> InitPlan
> -> Limit (cost=0.00..1.58 rows=1 width=8)
> -> Index Scan Backward using page_view_visit_id_stamp_idx
> on page_view pv2 (cost=0.00..625.39 rows=397 width=8)
> Index Cond: ((visit_id = $0) AND (stamp < $1))
> Filter: (stamp IS NOT NULL)
> -> Result (cost=1.58..1.59 rows=1 width=0)
> InitPlan
> -> Limit (cost=0.00..1.58 rows=1 width=8)
> -> Index Scan Backward using page_view_visit_id_stamp_idx
> on page_view pv2 (cost=0.00..625.39 rows=397 width=8)
> Index Cond: ((visit_id = $0) AND (stamp < $1))
> Filter: (stamp IS NOT NULL)
> (14 rows)
>
> Compared to:
>
>
>
-------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on page_view pv1 (cost=0.00..2622541458.55 rows=3596473 width=237)
> SubPlan
> -> Result (cost=364.56..364.57 rows=1 width=0)
> InitPlan
> -> Limit (cost=0.00..364.56 rows=1 width=8)
> -> Index Scan Backward using page_view_stamp_idx on
> page_view pv2 (cost=0.00..153481.58 rows=421 width=8)
> Index Cond: (stamp < $1)
> Filter: ((stamp IS NOT NULL) AND (visit_id = $0))
> -> Result (cost=364.56..364.57 rows=1 width=0)
> InitPlan
> -> Limit (cost=0.00..364.56 rows=1 width=8)
> -> Index Scan Backward using page_view_stamp_idx on
> page_view pv2 (cost=0.00..153481.58 rows=421 width=8)
> Index Cond: (stamp < $1)
> Filter: ((stamp IS NOT NULL) AND (visit_id = $0))
> (14 rows)
And throwing the ORDER BY back in reduces the cost even more!
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
SubqueryScan x (cost=0.00..5815824.15 rows=3629753 width=1186) -> Index Scan using page_view_visit_idx on page_view
pv1 (cost=0.00..5743229.09 rows=3629753 width=237) SubPlan -> Result (cost=1.51..1.52 rows=1
width=0) InitPlan -> Limit (cost=0.00..1.51 rows=1 width=8)
-> Index Scan Backward using page_view_visit_id_stamp_idx on page_view pv2 (cost=0.00..608.41 rows=402 width=8)
Index Cond: ((visit_id = $0) AND (stamp < $1)) Filter: (stamp IS
NOTNULL)
(9 rows)
Now we only have to do that index scan once. :-) I had foolishly taken that
out to see if the sort was killing me and forgot to put it back in.
So now it's:
Subquery Scan x (cost=0.00..5815824.15 rows=3629753 width=1186) vs
Seq Scan on page_view pv1 (cost=0.00..11529031.34 rows=3580205 width=239) vs
Seq Scan on page_view pv1 (cost=0.00..2622541458.55 rows=3596473 width=237)
--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954
On Sun, 17 Jun 2007, Jeff Frost wrote:
> On Mon, 18 Jun 2007, Tom Lane wrote:
>
>> Jeff Frost <jeff@frostconsultingllc.com> writes:
>>> [ umpteen million iterations of: ]
>>> -> Limit (cost=0.00..367.09 rows=1 width=8)
>>> -> Index Scan Backward using page_view_stamp_idx on
>>> page_view pv2 (cost=0.00..158215.86 rows=431 width=8)
>>> Index Cond: (stamp < $1)
>>> Filter: ((stamp IS NOT NULL) AND (visit_id =
>>> $0))
>>
>> Perhaps an index on (visit_id, stamp) would help. This one is doing the
>> best it can, but if the visit_id's you want are thinly scattered, it'll
>> still suck...
>
> Good idea Tom! In fact the planner seems to like that much better:
>
> Seq Scan on page_view pv1 (cost=0.00..11529031.34 rows=3580205 width=239)
> vs
> Seq Scan on page_view pv1 (cost=0.00..2622541458.55 rows=3596473 width=237)
Now that this is working in a reasonable amount of time to process the entire
data set, I need to work out how to process the new information that comes in
every so often and still assign it a correct visit_id until the fix for this
can get through QA. The cleanup query looks like this:
CREATE TEMP TABLE tmpmaxpvid AS SELECT MAX(id) AS id FROM
reporting.page_view_clean;
INSERT INTO reporting.page_view_clean
SELECT id , CASE WHEN stamp - last_stamp > INTERVAL '1 hour' OR last_stamp IS NULL THEN
nextval('reporting.tracking_cleanup_seq') ELSE currval('reporting.tracking_cleanup_seq') END AS visit_id
,uri , params , stamp , visit_id AS old_visit_id FROM ( SELECT id , visit_id , uri
,params , stamp , ( SELECT MAX(pv2.stamp) FROM page_view pv2 WHERE pv2.visit_id
= pv1.visit_id AND pv2.stamp < pv1.stamp ) AS last_stamp FROM page_view pv1
WHERE pv1.stamp < now() - INTERVAL '1 hour' AND pv1.id > ( SELECT CASE
WHEN id IS NULL THEN 0 ELSE id END AS id FROM
tmpmaxpvid ) ORDER BY pv1.visit_id , pv1.stamp ) x;
The problem is the page_views that straddle the time before and after the
query get an artificially different visit_id.
This case statement is what kills me:
CASE WHEN stamp - last_stamp > INTERVAL '1 hour' OR last_stamp IS NULL THEN
nextval('reporting.tracking_cleanup_seq') ELSE currval('reporting.tracking_cleanup_seq') END AS visit_id
If I change it to this:
CASE WHEN last_stamp IS NULL THEN visit_id WHEN stamp - last_stamp > INTERVAL '1 hour'
THEN nextval('reporting.tracking_cleanup_seq') ELSE currval('reporting.tracking_cleanup_seq') END AS
visit_id
Then it gives the first page_view in the sequence the correct visit_id, but
then continues on with a different visit_id for the rest.
I've also tried changing how I select which data to act on like so:
WHERE pv1.id > ( SELECT CASE WHEN id IS NULL THEN 0 ELSE id
END AS id FROM tmpmaxpvid ) AND pv1.visit_id IN ( SELECT visit_id FROM
page_viewpv3 GROUP BY visit_id HAVING max(stamp) < now() - INTERVAL '1 hour' )
ORDERBY pv1.visit_id , pv1.stamp ) x;
But that leaves me skipping some page views because they haven't reached their
conclusion and because of the id > clause, I'll never go back to see them.
Anyone have any suggestions?
--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954