Data Set Growth causing 26+hour runtime, on what we believe to be very simple SQL
However, much of the time when I see the CPU at 65%-90% wait, there is very little disk access, so it's not the disk subsystem (in my opinion). I've also moved CPUs around and the SQL seems to stall regardless of which CPU the job has been given. Memory: I pulled memory to test, and again, other than this data set consuming 10 GB, there is about 700K free (I will add more memory), but since the disks are not a bottleneck and I don't appear to be swapping, I keep coming back to the data or the SQL.
I'm providing the data that I think is requested when a performance issue is observed.
There is an autovacuum running, but the queries are run on static data, so no INSERTs/UPDATEs/DELETEs happen while the reports run.
The query seems to have gotten slower as the data set grew.
Red Hat
Postgres 8.3.4
8 CPU box
10 GB of RAM

· Number of rows in the table = 100 million
· Size of table including indices = 21 GB
· Time to create a combined index on 2 columns (tagged boolean, makeid text) = more than 1 hr 30 minutes
· Data distribution = In the 98 mill records, there are 7000 unique makeid's, and 21 mill unique UID's. About 41 mill of the records have tagged=true
· Time to execute the following query with indices on makeid and tagged = 90-120 seconds. The planner uses the webid index and filters on tagged and then rechecks the webid index
SELECT COUNT(DISTINCT uid ) AS active_users FROM pixelpool.userstats WHERE makeid ='bmw-ferman' AND tagged =true
· Time to execute the same query with a combined index on makeid and tagged = 60-100 seconds. The planner uses the combined index and then filters on tagged.
SELECT COUNT(DISTINCT uid ) AS active_users FROM pixelpool.userstats WHERE makeid ='bmw-ferman' AND tagged =true
Plan
"Aggregate (cost=49467.00..49467.01 rows=1 width=8)"
" -> Bitmap Heap Scan on userstats (cost=363.49..49434.06 rows=13175 width=8)"
" Recheck Cond: (makeid = 'b1mw-ferman'::text)"
" Filter: tagged"
" -> Bitmap Index Scan on idx_retargetuserstats_makeidtag (cost=0.00..360.20 rows=13175 width=0)"
" Index Cond: ((makeid = 'b1mw-ferman'::text) AND (tagged = true))"
Any assistance would be appreciated; don't worry about slapping me around, I need to figure this out. Otherwise I'm buying new hardware where it may not be required.
Thanks
Tory
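For reference, the combined two-column index being timed above would be created with something like the following (a sketch; the name matches the index the planner reports in the plans in this thread):

```sql
-- Combined two-column index on (makeid, tagged); reported above to take
-- more than 1 hr 30 min to build on this ~100-million-row table
CREATE INDEX idx_retargetuserstats_makeidtag
    ON pixelpool.userstats (makeid, tagged);
```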
Re: Data Set Growth causing 26+hour runtime, on what we believe to be very simple SQL
Tory M Blue wrote:
> Any assistance would be appreciated, don't worry about slapping me
> around I need to figure this out. Otherwise I'm buying new hardware
> where it may not be required.

What is the reporting query that takes 26 hours? You didn't seem to include it, or any query plan information for it (EXPLAIN or EXPLAIN ANALYZE results).

What sort of activity is happening on the db concurrently with your tests? What's your max connection limit?

What're your shared_buffers and effective_cache_size settings?

Could sorts be spilling to disk? Check work_mem size and enable logging of tempfiles (see the manual).

Does an explicit ANALYZE of the problem table(s) help?

--
Craig Ringer
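A quick way to gather the numbers Craig asks about, from psql (a sketch using standard commands):

```sql
-- Current settings relevant to Craig's questions
SHOW max_connections;
SHOW shared_buffers;
SHOW effective_cache_size;
SHOW work_mem;

-- Refresh planner statistics for the problem table
ANALYZE pixelpool.userstats;
```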
Re: Data Set Growth causing 26+hour runtime, on what we believe to be very simple SQL
On 21/01/10 22:15, Tory M Blue wrote:
> · Data distribution = In the 98mill records, there are 7000 unique
> makeid's, and 21mill unique UID's. About 41mill of the records have
> tagged=true
>
> · Time to execute the following query with indices on makeid and
> tagged = 90-120 seconds. The planner uses the webid index and filters on
> tagged and then rechecks the webid index
>
> SELECT COUNT(DISTINCT uid ) AS active_users FROM
> pixelpool.userstats WHERE makeid ='bmw-ferman' AND tagged =true
>
> · Time to execute the same query with a combined index on makeid
> and tagged = 60-100 seconds. The planner uses the combined index and then
> filters tagged.

Two things:

1. You have got the combined index on (makeid, tagged) and not (tagged, makeid), haven't you? Just checking.

2. If it's mostly tagged=true you are interested in, you can always use a partial index: CREATE INDEX ... (makeid) WHERE tagged

This might be a win even if you need a second index with WHERE NOT tagged.

Also, either I've not had enough coffee yet, or a bitmap scan is an odd choice for only ~13000 rows out of 100 million.

> "  ->  Bitmap Index Scan on idx_retargetuserstats_makeidtag (cost=0.00..360.20 rows=13175 width=0)"
> "        Index Cond: ((makeid = 'b1mw-ferman'::text) AND (tagged = true))"

Otherwise, see what Craig said. I'm assuming this isn't the query that is CPU bound for a long time. Unless your table is horribly bloated, there's no reason for that judging by this plan.

--
Richard Huxton
Archonet Ltd
> > Any assistance would be appreciated, don't worry about slapping me
> > around I need to figure this out. Otherwise I'm buying new hardware
> > where it may not be required.
>
> What is the reporting query that takes 26 hours? You didn't seem to
> include it, or any query plan information for it (EXPLAIN or EXPLAIN
> ANALYZE results).

It's this query, run 6000 times with a diff makeid's:
SELECT COUNT(DISTINCT uid ) AS active_users FROM pixelpool.userstats WHERE makeid ='bmw-ferman' AND tagged =true
Plan
"Aggregate (cost=49467.00..49467.01 rows=1 width=8)"
" -> Bitmap Heap Scan on userstats (cost=363.49..49434.06 rows=13175 width=8)"
" Recheck Cond: (makeid = 'b1mw-ferman'::text)"
" Filter: tagged"
" -> Bitmap Index Scan on idx_retargetuserstats_makeidtag (cost=0.00..360.20 rows=13175 width=0)"
" Index Cond: ((makeid = 'b1mw-ferman'::text) AND (tagged = true))"
> What sort of activity is happening on the db concurrently with your
> tests? What's your max connection limit?
50 max, and there is nothing; usually one person connected, if that. Otherwise it's a cron job that bulk inserts, and then jobs run later that generate the reports off the static data. No deletes or updates happening.
> What're your shared_buffers and effective_cache_size settings?
shared_buffers = 1028MB (Had this set at 128 and 256 and just recently bumped it higher, didn't buy me anything)
maintenance_work_mem = 128MB
fsync=on
random_page_cost = 4.0
effective_cache_size = 7GB
default vac settings
> Could sorts be spilling to disk? Check work_mem size and enable logging
> of tempfiles (see the manual).
work_mem = 100MB # min 64kB
Will do. I guess it's possible, but during the queries/reports I don't see a ton of writes, mostly reads.
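For reference, tempfile logging is a single postgresql.conf setting (available from 8.3 on; a value of 0 logs every temp file regardless of size, -1, the default, disables logging):

```
# postgresql.conf (sketch): log creation of temporary files larger than
# this size in kB; 0 = log all temp files
log_temp_files = 0
```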
> Does an explicit ANALYZE of the problem table(s) help?
It didn't.
Thanks
Tory
> On 21/01/10 22:15, Tory M Blue wrote:
> > · Data distribution = In the 98mill records, there are 7000 unique
> > makeid's, and 21mill unique UID's. About 41mill of the records have
> > tagged=true
> >
> > · Time to execute the following query with indices on makeid and
> > tagged = 90-120 seconds. The planner uses the webid index and filters on
> > tagged and then rechecks the webid index
> >
> > SELECT COUNT(DISTINCT uid ) AS active_users FROM
> > pixelpool.userstats WHERE makeid ='bmw-ferman' AND tagged =true
> >
> > · Time to execute the same query with a combined index on makeid
> > and tagged = 60-100 seconds. The planner uses the combined index and then
> > filters tagged.
> Two things:
>
> 1. You have got the combined index on (makeid, tagged) and not (tagged,
> makeid), haven't you? Just checking.
Yes we do
> 2. If it's mostly tagged=true you are interested in you can always use a
> partial index: CREATE INDEX ... (makeid) WHERE tagged
>
> This might be a win even if you need a second index with WHERE NOT tagged.
A partial index doesn't seem to fit here, due to the fact that 35-40% are marked true.
Didn't think about creating a second index for false, may give that a shot.
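The pair of partial indexes under discussion might look like this (index names are invented for illustration):

```sql
-- Partial index covering the tagged=true case the reports query
CREATE INDEX userstats_makeid_tagged_idx
    ON pixelpool.userstats (makeid) WHERE tagged;

-- Optional second partial index if the tagged=false case is also queried
CREATE INDEX userstats_makeid_untagged_idx
    ON pixelpool.userstats (makeid) WHERE NOT tagged;
```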
> Also, either I've not had enough coffee yet, or a bitmap scan is an odd
> choice for only ~13000 rows out of 100 million.
>
> "  ->  Bitmap Index Scan on idx_retargetuserstats_makeidtag (cost=0.00..360.20 rows=13175 width=0)"
> "        Index Cond: ((makeid = 'b1mw-ferman'::text) AND (tagged = true))"
> Otherwise, see what Craig said.
>
> I'm assuming this isn't the query that is CPU bound for a long time. Unless
> your table is horribly bloated, there's no reason for that judging by this
> plan.
It is, but not always; only when there are 10K or more matches. And the explain estimates are unfortunately sometimes way high or way low, so the explain is hit and miss.
But the same SQL that returns maybe 500 rows is pretty fast; it's the return of 10K+ rows that seems to stall and is CPU bound.
Thanks
Tory
Re: Data Set Growth causing 26+hour runtime, on what we believe to be very simple SQL
On 22/01/10 18:03, Tory M Blue wrote:
> On Fri, Jan 22, 2010 at 1:42 AM, Richard Huxton <dev@archonet.com> wrote:
>> On 21/01/10 22:15, Tory M Blue wrote:
>> 2. If it's mostly tagged=true you are interested in you can always use a
>> partial index: CREATE INDEX ... (makeid) WHERE tagged
>> This might be a win even if you need a second index with WHERE NOT tagged.
>
> Partial index doesn't seem to fit here due to the fact that there are 35-40%
> Marked True.
>
> Didn't think about creating a second index for false, may give that a shot.

If you're mostly searching tagged=true, try the partial index - it'll mean the planner is just scanning the index for the one term.

>> Also, either I've not had enough coffee yet, or a bitmap scan is an odd
>> choice for only ~13000 rows out of 100 million.
>>
>> "  ->  Bitmap Index Scan on idx_retargetuserstats_makeidtag (cost=0.00..360.20 rows=13175 width=0)"
>> "        Index Cond: ((makeid = 'b1mw-ferman'::text) AND (tagged = true))"
>>
>> Otherwise, see what Craig said.
>>
>> I'm assuming this isn't the query that is CPU bound for a long time. Unless
>> your table is horribly bloated, there's no reason for that judging by this
>> plan.
>
> It is, but not always, only when there are 10K more matches. And the explain
> unfortunately is sometimes way high or way low, so the expalin is hit and
> miss.
>
> But the same sql that returns maybe 500 rows is pretty fast, it's the return
> of 10K+ rows that seems to stall and is CPU Bound.

Hmm - might be able to push that cross-over point up a bit by tweaking various costs, but you've got to be careful you don't end up making all your other queries worse. It'd be good to figure out what the problem is first.

Looking at the query there are four stages:
1. Scan the index, build a bitmap of heap pages with matching rows
2. Scan those pages, find the rows that match
3. Run DISTINCT on the uids
4. Count them

I wonder if it could be the DISTINCT.
What happens with a count(*) or count(uid) instead?

Also - you might find EXPLAIN ANALYZE more useful than straight EXPLAIN here. That will show actual times for each stage.

On Craig's branch of this thread, you say you call it 6000 times with different "makeid"s. Any reason why you can't join to a temp table and just do it in one query?

--
Richard Huxton
Archonet Ltd
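Richard's diagnostic suggestions, sketched against the query from the thread:

```sql
-- Drop the DISTINCT to see how much of the runtime it accounts for
SELECT COUNT(uid) FROM pixelpool.userstats
 WHERE makeid = 'bmw-ferman' AND tagged = true;

-- EXPLAIN ANALYZE runs the query for real and reports actual per-node times
EXPLAIN ANALYZE
SELECT COUNT(DISTINCT uid) AS active_users
  FROM pixelpool.userstats
 WHERE makeid = 'bmw-ferman' AND tagged = true;
```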
Re: Data Set Growth causing 26+hour runtime, on what we believe to be very simple SQL
On Fri, 22 Jan 2010, Tory M Blue wrote:
> But the same sql that returns maybe 500 rows is pretty fast, it's the return
> of 10K+ rows that seems to stall and is CPU Bound.

Okay, so you have two differing cases. Show us the EXPLAIN ANALYSE for both of them, and we will see what the difference is.

Matthew

--
The third years are wandering about all worried at the moment because they
have to hand in their final projects. Please be sympathetic to them, say
things like "ha-ha-ha", but in a sympathetic tone of voice
-- Computer Science Lecturer
Re: Data Set Growth causing 26+hour runtime, on what we believe to be very simple SQL
On Fri, Jan 22, 2010 at 10:59 AM, Tory M Blue <tmblue@gmail.com> wrote:
> On Thu, Jan 21, 2010 at 7:46 PM, Craig Ringer <craig@postnewspapers.com.au> wrote:
>>> Any assistance would be appreciated, don't worry about slapping me
>>> around I need to figure this out. Otherwise I'm buying new hardware
>>> where it may not be required.
>>
>> What is the reporting query that takes 26 hours? You didn't seem to
>> include it, or any query plan information for it (EXPLAIN or EXPLAIN
>> ANALYZE results).
>
> It's this query, run 6000 times with a diff makeid's
>
> SELECT COUNT(DISTINCT uid ) AS active_users FROM pixelpool.userstats
> WHERE makeid ='bmw-ferman' AND tagged =true

Any chance of trying this instead:

select makeid, count(distinct uid) as active_users
from pixelpool.userstats
where tagged=true
group by makeid

And seeing how long it takes? If you're limiting the total number of makeids then you could add

and makeid in (biglistofmakeidsgoeshere)

Note that a partial index of

create index xyz on pixelpool.userstats (makeid) where tagged;

might help both the original and this query.
On Fri, Jan 22, 2010 at 10:26 AM, Matthew Wakeling <matthew@flymine.org> wrote:
> On Fri, 22 Jan 2010, Tory M Blue wrote:
>> But the same sql that returns maybe 500 rows is pretty fast, it's the return
>> of 10K+ rows that seems to stall and is CPU Bound.
>
> Okay, so you have two differing cases. Show us the EXPLAIN ANALYSE for both
> of them, and we will see what the difference is.

Okay, understood.

Here is the explain plan for the query. Actual rows that the query returns is 6369.

SLOW

This is for this SQL:

SELECT COUNT(distinct uid ) AS active_users FROM
pixelpool.userstats WHERE makeid = 'gmps-armen-chevy' and tagged=true

Explain
"Aggregate  (cost=118883.96..118883.97 rows=1 width=8)"
"  ->  Bitmap Heap Scan on userstats  (cost=797.69..118850.46 rows=13399 width=8)"
"        Recheck Cond: (makeid = 'gmps-armen-chevy'::text)"
"        Filter: tagged"
"        ->  Bitmap Index Scan on idx_retargetuserstats_makeid  (cost=0.00..794.34 rows=33276 width=0)"
"              Index Cond: (makeid = 'gmps-armen-chevy'::text)"

Explain Analyze
"Aggregate  (cost=118883.96..118883.97 rows=1 width=8) (actual time=31219.376..31219.376 rows=1 loops=1)"
"  ->  Bitmap Heap Scan on userstats  (cost=797.69..118850.46 rows=13399 width=8) (actual time=281.604..31190.290 rows=19799 loops=1)"
"        Recheck Cond: (makeid = 'gmps-armen-chevy'::text)"
"        Filter: tagged"
"        ->  Bitmap Index Scan on idx_retargetuserstats_makeid  (cost=0.00..794.34 rows=33276 width=0) (actual time=258.506..258.506 rows=23242 loops=1)"
"              Index Cond: (makeid = 'gmps-armen-chevy'::text)"
"Total runtime: 31219.536 ms"

FAST

This is for this SQL:

SELECT a.makeid, COUNT(DISTINCT CASE WHEN tagged=true THEN a.uid END) AS active_users,
COUNT(DISTINCT CASE WHEN tagged=false THEN a.uid END) as unassociated
FROM pixelpool.userstats a
WHERE makeid ='gmps-oden' GROUP BY a.makeid

Explain Analyze
"GroupAggregate  (cost=802.66..119105.01 rows=1 width=23) (actual time=3813.550..3813.551 rows=1 loops=1)"
"  ->  Bitmap Heap Scan on userstats a  (cost=802.66..118855.43 rows=33276 width=23) (actual time=55.400..3807.908 rows=2606 loops=1)"
"        Recheck Cond: (makeid = 'gmps-oden'::text)"
"        ->  Bitmap Index Scan on idx_retargetuserstats_makeid  (cost=0.00..794.34 rows=33276 width=0) (actual time=51.748..51.748 rows=2677 loops=1)"
"              Index Cond: (makeid = 'gmps-oden'::text)"
"Total runtime: 3813.626 ms"

Thanks again
Tory
On Fri, Jan 22, 2010 at 11:06 AM, Tory M Blue <tmblue@gmail.com> wrote:
> Here is the explain plan for the query. Actual rows that the query
> returns is 6369
> [...]

Sorry, mis-pasted the FAST!!!

Explain
"GroupAggregate  (cost=802.66..119105.01 rows=1 width=23)"
"  ->  Bitmap Heap Scan on userstats a  (cost=802.66..118855.43 rows=33276 width=23)"
"        Recheck Cond: (makeid = 'gmps-oden'::text)"
"        ->  Bitmap Index Scan on idx_retargetuserstats_makeid  (cost=0.00..794.34 rows=33276 width=0)"
"              Index Cond: (makeid = 'gmps-oden'::text)"

Explain Analyze
"GroupAggregate  (cost=802.66..119105.01 rows=1 width=23) (actual time=3813.550..3813.551 rows=1 loops=1)"
"  ->  Bitmap Heap Scan on userstats a  (cost=802.66..118855.43 rows=33276 width=23) (actual time=55.400..3807.908 rows=2606 loops=1)"
"        Recheck Cond: (makeid = 'gmps-oden'::text)"
"        ->  Bitmap Index Scan on idx_retargetuserstats_makeid  (cost=0.00..794.34 rows=33276 width=0) (actual time=51.748..51.748 rows=2677 loops=1)"
"              Index Cond: (makeid = 'gmps-oden'::text)"
"Total runtime: 3813.626 ms"
Re: Data Set Growth causing 26+hour runtime, on what we believe to be very simple SQL
On 23/01/2010 1:59 AM, Tory M Blue wrote:
> It's this query, run 6000 times with a diff makeid's
>
> SELECT COUNT(DISTINCT uid ) AS active_users FROM
> pixelpool.userstats WHERE makeid ='bmw-ferman' AND tagged =true
>
> Plan
> "Aggregate  (cost=49467.00..49467.01 rows=1 width=8)"
> "  ->  Bitmap Heap Scan on userstats  (cost=363.49..49434.06 rows=13175 width=8)"
> "        Recheck Cond: (makeid = 'b1mw-ferman'::text)"
> "        Filter: tagged"
> "        ->  Bitmap Index Scan on idx_retargetuserstats_makeidtag  (cost=0.00..360.20 rows=13175 width=0)"
> "              Index Cond: ((makeid = 'b1mw-ferman'::text) AND (tagged = true))"

Try:

- Adding a partial index on makeid, eg:

  CREATE INDEX userstats_makeid_where_tagged_idx
  ON userstats (makeid) WHERE (tagged);

- Instead of repeating the query 6000 times in a loop, collect the data in one pass by joining against a temp table containing the makeids of interest.

  SELECT COUNT(DISTINCT u.uid) AS active_users
  FROM pixelpool.userstats u
  INNER JOIN temp_makeids m ON (u.makeid = m.makeid)
  WHERE u.tagged = true;

(If the 6000 repeats are really a correlated subquery part of a bigger query you still haven't shown, then you might be able to avoid 6000 individual passes by adjusting your outer query instead.)

--
Craig Ringer
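Craig's temp-table pass might be set up like this (a sketch; temp_makeids is hypothetical, and a GROUP BY is added so you get one count per makeid rather than a single combined count):

```sql
-- Hypothetical setup: load the ~6000 makeids of interest once,
-- then aggregate per makeid in a single pass over userstats
CREATE TEMP TABLE temp_makeids (makeid text PRIMARY KEY);
-- INSERT INTO temp_makeids VALUES ('bmw-ferman'), ...;

SELECT u.makeid, COUNT(DISTINCT u.uid) AS active_users
  FROM pixelpool.userstats u
  JOIN temp_makeids m ON (u.makeid = m.makeid)
 WHERE u.tagged
 GROUP BY u.makeid;
```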
Re: Data Set Growth causing 26+hour runtime, on what we believe to be very simple SQL
On 22/01/10 19:06, Tory M Blue wrote:
> Here is the explain plan for the query. Actual rows that the query
> returns is 6369

Actually, it processes 19,799 rows (see the actual rows= below).

> SLOW
> "  ->  Bitmap Heap Scan on userstats  (cost=797.69..118850.46
> rows=13399 width=8) (actual time=281.604..31190.290 rows=19799
> loops=1)"
> "Total runtime: 31219.536 ms"

> FAST
> "  ->  Bitmap Heap Scan on userstats a  (cost=802.66..118855.43
> rows=33276 width=23) (actual time=55.400..3807.908 rows=2606 loops=1)"
> "Total runtime: 3813.626 ms"

OK - so the first query processes 19,799 rows in 31,219 ms (about 1.5ms per row). The second processes 2,606 rows in 3,813 ms (about 1.3ms per row).

You are asking for DISTINCT user-ids, so it seems reasonable that it will take slightly longer to check a larger set of user-ids. Otherwise, both queries are the same. I'm still a little puzzled by the bitmap scan, but the planner probably knows more about your data than I do.

The main time is spent in the "bitmap heap scan", which is where it's grabbing actual row data (and presumably building a hash over the uid column). You can see how long in the "actual time": the first number (e.g. 281.604) is the time spent before it starts, and the second is the total time at finish (31190.290). If "loops" was greater than 1 you would multiply the times by the number of loops to get a total.

So - there's nothing "wrong" in the sense that the second query does the same as the first.

Let's take a step back. What you really want is your reports to be faster. You mentioned you were running this query thousands of times with a different "makeid" each time. Running it once for all possible values and stashing the results in a temp table will probably be *much* faster. The planner can just scan the whole table once and build up its results as it goes.

--
Richard Huxton
Archonet Ltd
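Richard's "run once for all values and stash the results" approach could be sketched as follows (names are illustrative):

```sql
-- One pass over the table: per-makeid distinct-user counts, stashed in a
-- temp table the report then reads from
CREATE TEMP TABLE active_users_by_makeid AS
SELECT makeid, COUNT(DISTINCT uid) AS active_users
  FROM pixelpool.userstats
 WHERE tagged
 GROUP BY makeid;

-- Each of the ~6000 report lookups becomes a cheap single-row fetch
SELECT active_users
  FROM active_users_by_makeid
 WHERE makeid = 'bmw-ferman';
```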
Re: Data Set Growth causing 26+hour runtime, on what we believe to be very simple SQL
On Mon, 25 Jan 2010, Richard Huxton wrote:
> OK - so the first query processes 19,799 rows in 31,219 ms (about 1.5ms per
> row)
>
> The second processes 2,606 rows in 3,813 ms (about 1.3ms per row).

Agreed. One query is faster than the other because it has to do an eighth the amount of work.

Matthew

--
I wouldn't be so paranoid if you weren't all out to get me!!
On Mon, Jan 25, 2010 at 3:59 AM, Matthew Wakeling <matthew@flymine.org> wrote:
> On Mon, 25 Jan 2010, Richard Huxton wrote:
>> OK - so the first query processes 19,799 rows in 31,219 ms (about 1.5ms
>> per row)
>>
>> The second processes 2,606 rows in 3,813 ms (about 1.3ms per row).
>
> Agreed. One query is faster than the other because it has to do an eighth
> the amount of work.
>
> Matthew

Thanks guys, ya this has dropped the time by half. The process is manageable now. Thanks again.

For some reason we thought this method would make it take more time, vs less. So again appreciate the help :)

Tory