Обсуждение: Possible infinite loop in query using bitmap scans
I have this report query that runs daily on a table with several hundred million rows total using pg 8.1.3 on Debian Linux on hw with dual opteron processors: SELECT count(*) FROM webhits WHERE path LIKE '/radio/tuner_%.swf' AND status = 200 AND date_recorded >= '3/10/2006'::TIMESTAMP AND date_recorded < '3/11/2006'::TIMESTAMP; Here is the explain output: QUERY PLAN ------------------------------------------------------------------------ ------------------------------------------------------------------------ ---------------------------------- Aggregate (cost=794775.08..794775.09 rows=1 width=0) -> Bitmap Heap Scan on webhits (cost=315820.45..794771.74 rows=1337 width=0) Recheck Cond: ((date_recorded >= '2006-03-10 00:00:00'::timestamp without time zone) AND (date_recorded < '2006-03-11 00:00:00'::timestamp without time zone)) Filter: (((path)::text ~~ '/radio/tuner_%.swf'::text) AND (status = 200)) -> BitmapAnd (cost=315820.45..315820.45 rows=249152 width=0) -> Bitmap Index Scan on webhits_date_idx1 (cost=0.00..140407.45 rows=15379741 width=0) Index Cond: ((date_recorded >= '2006-03-10 00:00:00'::timestamp without time zone) AND (date_recorded < '2006-03-11 00:00:00'::timestamp without time zone)) -> Bitmap Index Scan on webhits_path_idx2 (cost=0.00..175412.76 rows=15343959 width=0) Index Cond: (((path)::text >= '/radio/tuner'::character varying) AND ((path)::text < '/radio/tunes'::character varying)) According to the planner it should take <15 minutes which is typical in practice. About half the times it runs, however, it never terminates (even after days) and just spins consuming 99+% of CPU with no disk activity. This query was never a problem in postgres versions < 8.1.2, however the data has grown substantially since that time. I notice it uses the recent in-memory bitmap feature, so I wondered if it was exposing a bug. If I restart the postmaster, the query will complete in the expected time. -Casey
On Sun, Mar 12, 2006 at 11:36:23PM -0800, Casey Duncan wrote: > SELECT count(*) FROM webhits > WHERE path LIKE '/radio/tuner_%.swf' AND status = 200 > AND date_recorded >= '3/10/2006'::TIMESTAMP > AND date_recorded < '3/11/2006'::TIMESTAMP; [...] > Aggregate (cost=794775.08..794775.09 rows=1 width=0) [...] > According to the planner it should take <15 minutes which is typical in > practice. The planner's cost estimate is in units of disk page fetches, not time. The above estimate isn't 794775.09 ms (~13.25 min) but rather 794775.09 times the cost of a single page fetch, however much that is. See "Using EXPLAIN" in the "Performance Tips" chapter of the documentation. http://www.postgresql.org/docs/8.1/interactive/performance-tips.html#USING-EXPLAIN > About half the times it runs, however, it never terminates > (even after days) and just spins consuming 99+% of CPU with no disk > activity. This query was never a problem in postgres versions < 8.1.2, > however the data has grown substantially since that time. I notice it > uses the recent in-memory bitmap feature, so I wondered if it was > exposing a bug. If the problem happens half the time then you have a somewhat repeatable test case. Do you get more consistent performance if you set enable_bitmapscan to off? What's the query plan if you do that? If you narrow the search criteria so the query returns fewer rows, do you still see the problem? Can you identify a "sour spot" where the problem starts to happen? > If I restart the postmaster, the query will complete in the expected > time. Does the problem eventually start happening again? If so, after how long? How did you determine that the restart is relevant? Do you consistently see different (presumably better) performance after a restart than if you don't restart? -- Michael Fuhr
On Mar 13, 2006, at 9:50 AM, Michael Fuhr wrote: > On Sun, Mar 12, 2006 at 11:36:23PM -0800, Casey Duncan wrote: >> SELECT count(*) FROM webhits >> WHERE path LIKE '/radio/tuner_%.swf' AND status = 200 >> AND date_recorded >= '3/10/2006'::TIMESTAMP >> AND date_recorded < '3/11/2006'::TIMESTAMP; > [...] >> Aggregate (cost=794775.08..794775.09 rows=1 width=0) > [...] >> According to the planner it should take <15 minutes which is typical >> in >> practice. > > The planner's cost estimate is in units of disk page fetches, not > time. The above estimate isn't 794775.09 ms (~13.25 min) but rather > 794775.09 times the cost of a single page fetch, however much that > is. See "Using EXPLAIN" in the "Performance Tips" chapter of the > documentation. > > http://www.postgresql.org/docs/8.1/interactive/performance- > tips.html#USING-EXPLAIN Doh! I should've known that. >> About half the times it runs, however, it never terminates >> (even after days) and just spins consuming 99+% of CPU with no disk >> activity. This query was never a problem in postgres versions < 8.1.2, >> however the data has grown substantially since that time. I notice it >> uses the recent in-memory bitmap feature, so I wondered if it was >> exposing a bug. > > If the problem happens half the time then you have a somewhat > repeatable test case. Do you get more consistent performance if > you set enable_bitmapscan to off? What's the query plan if you do > that? Here's the plan with bitmap scans off: QUERY PLAN ------------------------------------------------------------------------ ------------------------------------------------------------------------ ---------------------- Aggregate (cost=891363.71..891363.72 rows=1 width=0) -> Index Scan using webhits_date_idx1 on webhits (cost=0.00..891360.30 rows=1362 width=0) Index Cond: ((date_recorded >= '2006-03-10 00:00:00'::timestamp without time zone) AND (date_recorded < '2006-03-11 00:00:00'::timestamp without time zone)) Filter: (((path)::text ~~ '/radio/tuner_%.swf'::text) AND (status = 200)) (4 rows) The query runs to completion this way in about 40 minutes. I turned bitmap scans back on and it hangs again (I ran it for about 5 hours). > If you narrow the search criteria so the query returns fewer rows, > do you still see the problem? Can you identify a "sour spot" where > the problem starts to happen? I'll do that tomorrow, and let you know. >> If I restart the postmaster, the query will complete in the expected >> time. > > Does the problem eventually start happening again? If so, after > how long? How did you determine that the restart is relevant? Do > you consistently see different (presumably better) performance after > a restart than if you don't restart? This is a production box, so I can't restart it whenever I want and I haven't yet reproduced it elsewhere -- the data base size makes that cumbersome at best -- but once after it hung up, I restarted postgres and the report ran to completion for a couple of days then started hanging again. Today it seems to pretty consistently hang, I'll see if I can restart it overnight and test it again. -Casey
On Mar 13, 2006, at 5:25 PM, Casey Duncan wrote: [..] >>> If I restart the postmaster, the query will complete in the expected >>> time. >> >> Does the problem eventually start happening again? If so, after >> how long? How did you determine that the restart is relevant? Do >> you consistently see different (presumably better) performance after >> a restart than if you don't restart? I restarted postgres this morning and this time it didn't seem to help. That query has been running for several hours now. I'm going to let it go a while longer to see if it eventually completes, but I suspect it won't. So perhaps the prior restart was just dumb luck. I'll try some narrower queries as well. -Casey
On Mar 14, 2006, at 1:31 PM, Casey Duncan wrote: > On Mar 13, 2006, at 5:25 PM, Casey Duncan wrote: > [..] >>>> If I restart the postmaster, the query will complete in the expected >>>> time. >>> >>> Does the problem eventually start happening again? If so, after >>> how long? How did you determine that the restart is relevant? Do >>> you consistently see different (presumably better) performance after >>> a restart than if you don't restart? > > I restarted postgres this morning and this time it didn't seem to > help. That query has been running for several hours now. I'm going to > let it go a while longer to see if it eventually completes, but I > suspect it won't. So perhaps the prior restart was just dumb luck. > > I'll try some narrower queries as well. Interestingly, I can only narrow the query by about an hour (i.e., a 23 hour span) before the plan changes to a simple index scan. That may explain why this only started hanging recently since it probably only started using the bitmap scan for a day interval recently. Let me know if there is anything I should do or any info you need to further pin this down. In the mean time I will disable bitmap scans for this query. Thanks. -Casey