Discussion: performance of count(*)
I need to optimize queries that deal with some aggregates regarding resource availability. My specific problem is, I think, very closely analogous to select count(*)... where...

I know roughly how to do it: aggregated stats table, triggers appending to it, occasional updates to coalesce entries. I'd just like to see an example to confirm my own plan and see if I'm missing any details.

I'm sure I've seen references to articles on ways to do this, but all google is getting me is generic complaints about count(*) performance and suggestions to use stats for estimated total rows in a table, nothing useful for this.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
On 6.5.2011 20:45, Scott Ribe wrote:
> I need to optimize queries that deal with some aggregates regarding
> resource availability. My specific problem is, I think, very closely
> analogous to select count(*)... where...
>
> I know roughly how to do it, aggregated stats table, triggers
> appending to it, occasional updates to coalesce entries. I'd just like
> to see an example to confirm my own plan and see if I'm missing any
> details.
>
> I'm sure I've seen references to articles on ways to do this, but all
> google is getting me is generic complaints about count(*) performance
> and suggestions to use stats for estimated total rows in a table,
> nothing useful for this.

Well, I guess you got most of that right - just create a table to hold the aggregated values, and then a bunch of triggers to update it. The structure and implementation of the triggers really depend on your needs, but in general there are two approaches - eager and lazy.

Eager - the triggers immediately update the aggregates (increment a count, add a value to the sum, etc.). This is a 'primitive' and less complex solution, suitable when the update is cheap and the aggregated table is read often.

Lazy - just mark the row in the aggregated table as 'dirty' and recompute it only when it's read. This is suitable if the table is read only occasionally and/or when computing the aggregate is expensive (e.g. needs to reread the whole dataset - as for example when computing variance).

The lazy approach is less common; a great example of how to implement it is available here: http://www.pgcon.org/2008/schedule/events/69.en.html

Anyway, I'd recommend starting with the eager approach; it's much easier to implement. You can implement the lazy approach later, if you find out it's needed.

And you should strive to use the HOT feature (if you're on >= 8.4), especially with the eager approach - it often does a lot of updates and that leads to bloat of the aggregated table.
So decrease the fillfactor and do not index the columns that are updated by the triggers.

regards
Tomas
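For illustration, here is a minimal sketch of the eager approach Tomas describes, assuming a hypothetical `reservations` table whose rows are counted per `resource_id` (all table and column names here are made up for the example, not taken from Scott's schema). The lowered fillfactor and the absence of an index on the counter column follow Tomas's HOT advice:

```sql
-- Hypothetical aggregate table: one counter row per resource.
-- Lower fillfactor leaves free space on each page so that updates
-- can be HOT (same-page) updates, limiting bloat.
CREATE TABLE resource_counts (
    resource_id integer PRIMARY KEY,
    n           bigint NOT NULL DEFAULT 0   -- deliberately not indexed
) WITH (fillfactor = 70);

CREATE OR REPLACE FUNCTION maintain_resource_counts() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE resource_counts SET n = n + 1
        WHERE resource_id = NEW.resource_id;
        IF NOT FOUND THEN
            INSERT INTO resource_counts (resource_id, n)
            VALUES (NEW.resource_id, 1);
        END IF;
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE resource_counts SET n = n - 1
        WHERE resource_id = OLD.resource_id;
        RETURN OLD;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER reservations_count
AFTER INSERT OR DELETE ON reservations
FOR EACH ROW EXECUTE PROCEDURE maintain_resource_counts();
```

One caveat with this sketch: the update-then-insert branch can race if two sessions insert the very first row for the same resource_id concurrently. Seeding resource_counts with one row per resource up front sidesteps that, and also matches the write-conflict concern raised later in the thread, since every writer touching the same resource serializes on its counter row.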
On Fri, May 06, 2011 at 12:45:23PM -0600, Scott Ribe wrote:
> I need to optimize queries that deal with some aggregates regarding resource availability. My specific problem is, I think, very closely analogous to select count(*)... where...

If the WHERE clause is fairly selective and indexed, that should be fast. Not as fast as estimates based on trigger-written values in another table, of course, but reasonably fast. So the first order of business is usually to find or create indexes that will make SELECT on the same criteria fast.

It's only unqualified "SELECT count(*)" that is slow. Generally, the system table is good enough for that, I find. (Someone: "How long will this take?" Me: "There are about 400 million rows to go through." Even if you're off by 50 million at that point, it doesn't matter.)

A

--
Andrew Sullivan
ajs@crankycanuck.ca
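The "system table" Andrew refers to is the pg_class catalog, where the planner's row estimate is kept. A quick sketch of that lookup (the table name is a placeholder):

```sql
-- Approximate row count from statistics maintained by VACUUM/ANALYZE.
-- Instant, and good enough for "about 400 million rows" answers,
-- but only as fresh as the last VACUUM or ANALYZE.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'my_big_table';
```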
On May 6, 2011, at 1:39 PM, Tomas Vondra wrote:
> Anyway I'd recommend to start with the eager approach, it's much easier
> to implement. You can implement the lazy approach later, if you find out
> it's needed.

With the eager approach, I think I'm too likely to get write conflicts. Thanks for the reference to the paper, I believe that's what I was looking for.

> And you should strive to use HOT feature (if you're on >= 8.4),
> especially with the eager approach - it often does a lot of updates and
> leads to bloat of the aggregated table. So decrease the fillfactor and
> do not index the columns that are updated by the triggers.

See, that's the kind of info I'm looking for ;-)

On May 6, 2011, at 1:59 PM, Andrew Sullivan wrote:
> If the WHERE clause is fairly selective and indexed, that should be
> fast. Not as fast as estimates based on trigger-written values in
> another table, of course, but reasonably fast. So the first order of
> business is usually to find or create indexes that will make SELECT on
> the same criteria fast.

In this case, it depends on the result of a pretty complex join that involves some gnarly time calculations, and finding the unmatched rows from one side of an outer join. I really don't think there's a way to optimize the straight-up query to be faster than it is; I looked at that for a good long time, explain/analyze and all. Postgres is using the appropriate index to narrow things down as much as it can at the very beginning, it just then has to perform a heck of a lot of work to finish the join... And it's not taking ***that*** long--it's just that I want it faster!

> It's only unqualified "SELECT count(*)" that is slow. Generally, the
> system table is good enough for that, I find. (Someone: "How long
> will this take?" Me: "There are about 400 million rows to go
> through." Even if you're off by 50 million at that point, it doesn't
> matter.)
FYI, I have no need for unqualified select count(*) in this app--just doesn't happen, ever ;-)

Thanks.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
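Since Scott leans toward the lazy approach to avoid write conflicts, a hedged sketch of the dirty-flag idea may help. This is not the exact design from the PGCon talk Tomas linked, just one plausible shape of it; every name below (`availability_summary`, `reservations`, `resource_count`) is invented for illustration. Writers only flip a cheap flag; the expensive aggregate is recomputed at read time, and only when the flag is set:

```sql
-- Summary table starts dirty so the first read computes the real value.
CREATE TABLE availability_summary (
    resource_id integer PRIMARY KEY,
    n           bigint  NOT NULL DEFAULT 0,
    dirty       boolean NOT NULL DEFAULT true
);

-- Writers never touch the aggregate itself, only the flag,
-- which keeps the write-side trigger cheap.
CREATE OR REPLACE FUNCTION mark_resource_dirty() RETURNS trigger AS $$
BEGIN
    UPDATE availability_summary SET dirty = true
    WHERE resource_id = COALESCE(NEW.resource_id, OLD.resource_id);
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER reservations_dirty
AFTER INSERT OR UPDATE OR DELETE ON reservations
FOR EACH ROW EXECUTE PROCEDURE mark_resource_dirty();

-- Readers refresh a stale summary row on demand, then read it.
CREATE OR REPLACE FUNCTION resource_count(p_resource integer) RETURNS bigint AS $$
DECLARE
    result bigint;
BEGIN
    UPDATE availability_summary s
    SET n = (SELECT count(*) FROM reservations r
             WHERE r.resource_id = p_resource),
        dirty = false
    WHERE s.resource_id = p_resource AND s.dirty;

    SELECT n INTO result FROM availability_summary
    WHERE resource_id = p_resource;
    RETURN result;
END;
$$ LANGUAGE plpgsql;
```

This sketch assumes availability_summary is pre-seeded with one row per resource; note also that repeated writes between reads cost only flag updates, which is exactly the property that reduces contention relative to the eager counter.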
On Fri, May 06, 2011 at 03:43:02PM -0600, Scott Ribe wrote:
> In this case, it depends on the result of a pretty complex join that
> involves some gnarly time calculations, and finding the unmatched
> rows from one side of an outer join.

Yeah, in that case the HOT suggestions are very important. I strongly recommend you experiment in a test system with real data, and pathological cases in particular, in order to see what happens when the outlier cases inevitably, Murphy willing, crop up. That's not to say you should arrange your plans around them, but forewarned is forearmed.

A

--
Andrew Sullivan
ajs@crankycanuck.ca
On May 6, 2011, at 4:15 PM, Andrew Sullivan wrote:
> Yeah, in that case the HOT suggestions are very important. I strongly
> recommend you experiment in a test system with real data and
> pathological cases in particular, in order to see what happens when
> the outlier cases inevitably, Murphy willing, crop up. That's not to
> say you should arrange your plans for them, but forewarned is
> forearmed.

Again, thanks. The HOT tip led me down the road of paying attention to my indexes, which led me to a nice realization about how to shrink the overall footprint of the materialized aggregates ;-) Which led me to a technique to seriously minimize updates...

I didn't have to worry about bloat too much--overall activity level is not huge; the possibility of collisions on updates is mostly because users tend to work on the same very small (but ever-shifting) subset of the data at the same time, but now I think I'm really set!

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice