Discussion: How to track down inconsistent performance?
We've got some queries that occasionally experience long run times (> 1 minute), and sometimes they're very quick (< 1 second). Our theory is that when the queries are fast it's because all the "right stuff" is already in memory, but we don't know how to actually prove that. (We think we've successfully ruled out disk contention -- iostat shows disk activity during these queries, but nothing that's excessive.)

We've turned on statistics collection-- does anybody have any hints about what things we should specifically be looking for?

Here's what explain says:

bash-2.05$ time psql quickview -c "explain select distinct os,compiler,stdlibtype,threadlib from builds where product='sourcepro_db' and visible=true order by 1 asc;"
NOTICE:  QUERY PLAN:

Unique  (cost=197093.74..197588.02 rows=4943 width=50)
  ->  Sort  (cost=197093.74..197093.74 rows=49428 width=50)
        ->  Index Scan using builds_visible_product on builds  (cost=0.00..192225.34 rows=49428 width=50)

quickview=> \d builds
                    Table "builds"
      Column       |           Type           | Modifiers
-------------------+--------------------------+-----------
 id                | integer                  | not null
 visible           | boolean                  |
 state             | character(1)             |
 evaluated         | boolean                  |
 product           | character varying(30)    |
 compiler          | character varying(30)    |
 os                | character varying(30)    |
 stdlibtype        | character varying(30)    |
 linktype          | character varying(30)    |
 threadlib         | character varying(30)    |
 exportlevel       | character varying(30)    |
 usermode          | character varying(30)    |
 postbuildclean    | character varying(30)    |
 prebuildclean     | character varying(30)    |
 submitted         | timestamp with time zone |
 started           | timestamp with time zone |
 finished          | timestamp with time zone |
 machine           | character varying(100)   |
 errors            | integer                  |
 warnings          | integer                  |
 testsattempted    | integer                  |
 testspassed       | integer                  |
 testsfailed       | integer                  |
 examplesattempted | integer                  |
 examplespassed    | integer                  |
 examplesfailed    | integer                  |
 ping              | timestamp with time zone |
 start_count       | integer                  |
 user1             | character varying(50)    |
 user2             | character varying(50)    |
 user3             | character varying(50)    |
 user4             | character varying(50)    |
 user5             | character varying(50)    |
 user6             | character varying(50)    |
 debug             | character varying(30)    |
Indexes: builds_compiler,
         builds_compiler_finished,
         builds_compiler_state,
         builds_compiler_submitted,
         builds_machine,
         builds_machine_finished,
         builds_machine_state,
         builds_machine_submitted,
         builds_os,
         builds_os_finished,
         builds_os_state,
         builds_os_submitted,
         builds_ping_finished,
         builds_pr_os_comp_std_thr_u1,
         builds_product,
         builds_product_finished,
         builds_product_state,
         builds_product_submitted,
         builds_started,
         builds_state,
         builds_visible_finished_product,
         builds_visible_product
Primary key: builds_pkey
Triggers: RI_ConstraintTrigger_25192277,
          RI_ConstraintTrigger_25192279
Well, perhaps I'm on the right track--

quickview=> select indexrelname,idx_blks_read,idx_blks_hit from pg_statio_user_indexes where indexrelname='builds_visible_product';
      indexrelname      | idx_blks_read | idx_blks_hit
------------------------+---------------+--------------
 builds_visible_product |        625760 |       522572
(1 row)

Based on the numbers that the other indices report, I'm guessing that idx_blks_read is "cache misses"-- if so, it's missing over 50% of the time. If this is in fact what's causing the problems, then maybe there's a way to reduce the size of the index?

And the index definition:

quickview=> select * from pg_indexes where indexname='builds_visible_product';
 tablename |       indexname        |                           indexdef
-----------+------------------------+------------------------------------------------------------------------------
 builds    | builds_visible_product | CREATE INDEX builds_visible_product ON builds USING btree (visible, product)
(1 row)

'visible' is a boolean, 'product' is a varchar(30), and there are about 210K records in the builds table. (I don't know if it's relevant, but there are about 39 distinct product values.)

-ron

> -----Original Message-----
> From: Ron Snyder [mailto:snyder@roguewave.com]
> Sent: Saturday, April 27, 2002 5:01 PM
> To: pgsql General List
> Subject: [GENERAL] How to track down inconsistent performance?
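The "missing over 50% of the time" reading of those counters can be checked with a line of arithmetic. (This is a quick sketch; the interpretation of idx_blks_read as blocks read from outside the shared buffer cache, i.e. cache misses, and idx_blks_hit as buffer-cache hits matches what the statistics views report.)

```python
# Cache-miss rate for builds_visible_product, using the figures
# from the pg_statio_user_indexes output above.
idx_blks_read = 625760   # blocks fetched from outside shared buffers (misses)
idx_blks_hit = 522572    # blocks found in the shared buffer cache (hits)

total = idx_blks_read + idx_blks_hit
miss_rate = idx_blks_read / total
print(f"miss rate: {miss_rate:.1%}")   # ~54.5% -- indeed "over 50%"
```

Note that idx_blks_read only means the block wasn't in PostgreSQL's own shared buffers; it may still have been served from the OS file cache rather than the disk, so this is an upper bound on true disk reads.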
Ron Snyder <snyder@roguewave.com> writes:
> 'visible' is a boolean, 'product' is a varchar(30), and there are about 210K
> records in the builds table. (I don't know if it's relevant, but there are
> about 39 distinct product values.

Hmm, a rough guess would suggest that the query will be retrieving about 1/78th of the rows (unless there are statistics you haven't mentioned that affect that). I would think that a seqscan would be faster for this. I would definitely think that the planner would think so --- are you forcing enable_seqscan off?

Since you're evidently running 7.2, I'd ask for EXPLAIN ANALYZE rather than just EXPLAIN output. It'd be even more helpful if you are able to catch both the fast and slow cases under EXPLAIN ANALYZE.

			regards, tom lane
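The 1/78 figure comes from assuming (as the reply does) a uniform distribution over the 39 products and a 50/50 split on the boolean, which a couple of lines make explicit:

```python
# Back-of-the-envelope selectivity estimate under a uniform-
# distribution assumption: 39 products x 2 values of visible.
n_rows = 210_000
n_products = 39
denom = n_products * 2              # product='X' AND visible=true
expected_rows = n_rows / denom
print(f"~1/{denom} of the table, ~{expected_rows:.0f} rows")
```

That gives roughly 2,700 rows; the planner's estimate of 49,428 rows for the sourcepro_db query is far higher, which is consistent with the skew revealed later in the thread (sourcepro_db alone accounts for about half the table).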
I learned last night that 'visible' is false in about 95% of the records. The products are not evenly distributed either-- sourcepro_db (the previous example) accounts for half of the records, and math (today's example) accounts for 3%.

Here's the output of explain analyze-- these were executed less than 30 seconds apart. These are: "explain analyze" the query; do the query; "explain analyze" the query again.

bash-2.05$ time psql quickview -c "explain analyze select distinct os,compiler,stdlibtype,threadlib from builds where product='math' and visible=true order by 1 asc;"
NOTICE:  QUERY PLAN:

Unique  (cost=33579.50..33664.14 rows=846 width=50) (actual time=27453.33..27478.75 rows=202 loops=1)
  ->  Sort  (cost=33579.50..33579.50 rows=8463 width=50) (actual time=27453.32..27462.24 rows=5934 loops=1)
        ->  Index Scan using builds_visible_product on builds  (cost=0.00..32868.90 rows=8463 width=50) (actual time=0.30..27266.32 rows=5934 loops=1)
Total runtime: 27491.93 msec

EXPLAIN

real    0m27.528s
user    0m0.000s
sys     0m0.000s

bash-2.05$ time psql quickview -c "select distinct os,compiler,stdlibtype,threadlib from builds where product='math' and visible=true order by 1 asc;" > /dev/null

real    0m0.330s
user    0m0.000s
sys     0m0.010s

bash-2.05$ time psql quickview -c "explain analyze select distinct os,compiler,stdlibtype,threadlib from builds where product='math' and visible=true order by 1 asc;"
NOTICE:  QUERY PLAN:

Unique  (cost=33579.50..33664.14 rows=846 width=50) (actual time=270.59..297.45 rows=202 loops=1)
  ->  Sort  (cost=33579.50..33579.50 rows=8463 width=50) (actual time=270.58..280.75 rows=5934 loops=1)
        ->  Index Scan using builds_visible_product on builds  (cost=0.00..32868.90 rows=8463 width=50) (actual time=0.08..90.87 rows=5934 loops=1)
Total runtime: 307.74 msec

EXPLAIN

real    0m0.336s
user    0m0.000s
sys     0m0.010s

Thanks for your help!
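The two EXPLAIN ANALYZE runs above use identical plans and identical cost estimates; only the cache state differs, and comparing the actual times shows where the cold run spends its time:

```python
# Cold-cache vs warm-cache timings, copied from the two
# EXPLAIN ANALYZE outputs above (~30 seconds apart).
cold_index_scan_ms = 27266.32   # index scan, first (cold) run
warm_index_scan_ms = 90.87      # index scan, second (warm) run
cold_total_ms = 27491.93
warm_total_ms = 307.74

print(f"index scan slowdown when cold: {cold_index_scan_ms / warm_index_scan_ms:.0f}x")
print(f"overall slowdown when cold:    {cold_total_ms / warm_total_ms:.0f}x")
```

A roughly 300x slowdown concentrated entirely in the index scan node, with no change in plan, is the classic signature of cache-miss-bound random I/O rather than a planning problem.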
-ron

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Sunday, April 28, 2002 7:59 AM
> To: Ron Snyder
> Cc: pgsql General List
> Subject: Re: [GENERAL] How to track down inconsistent performance?
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> affect that). I would think that a seqscan would be faster for this.
> I would definitely think that the planner would think so --- are you
> forcing enable_seqscan off?

Forgot to answer this-- no, we're not forcing it off. Maybe we need to force it to use a seqscan for this query?

> Since you're evidently running 7.2, I'd ask for EXPLAIN ANALYZE rather

We are still running 7.2 on this database-- we're in the middle of acceptance testing for the 7.2.1 upgrade.

-ron
Ron Snyder <snyder@roguewave.com> writes:
>> affect that). I would think that a seqscan would be faster for this.
>> I would definitely think that the planner would think so --- are you
>> forcing enable_seqscan off?

> Forgot to answer this-- no, we're not forcing it off. Maybe we need to
> force it to use a seqscan for this query?

It would be interesting to try that just for comparison purposes. I'd like to know the difference in the planner's cost estimate, as well as the actual runtime.

Your results do make it look like the difference between "fast" and "slow" cases is just whether the table and index data are in disk buffer cache or not. What shared_buffers setting are you using for Postgres? Have you tried experimenting with other values? What's the total RAM in the box? The ultimate answer might just be "you need to buy more RAM" ...

			regards, tom lane
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> It would be interesting to try that just for comparison purposes.
> I'd like to know the difference in the planner's cost estimate,
> as well as the actual runtime.

Well, assuming that I did this correctly (just turning off indexscan), the results are below. (I ran it twice to see what benefit having some of it in memory might have, but the results were almost identical.)

bash-2.05$ time psql quickview -c "set enable_indexscan=off; explain analyze select distinct os,compiler,stdlibtype,threadlib from builds where product='math' and visible=true order by 1 asc;"
NOTICE:  QUERY PLAN:

Unique  (cost=273438.18..273522.81 rows=846 width=50) (actual time=140567.87..140593.49 rows=202 loops=1)
  ->  Sort  (cost=273438.18..273438.18 rows=8463 width=50) (actual time=140567.86..140576.85 rows=5934 loops=1)
        ->  Seq Scan on builds  (cost=0.00..272727.58 rows=8463 width=50) (actual time=56.38..140347.16 rows=5934 loops=1)
Total runtime: 140620.56 msec

EXPLAIN

real    2m20.672s
user    0m0.010s
sys     0m0.000s

> Your results do make it look like the difference between "fast" and
> "slow" cases is just whether the table and index data are in disk buffer
> cache or not. What shared_buffers setting are you using for Postgres?
> Have you tried experimenting with other values? What's the total RAM
> in the box? The ultimate answer might just be "you need to buy more RAM" ...

shared_buffers is set to 8096, and the machine has 1 gig of RAM. I don't know if it matters, but there are about 170 connections at any one time, although most of them are idle at any moment. About 10 of them are typically idle inside of a transaction (if that matters).
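For scale, the quoted shared_buffers setting translates into memory like this (a sketch assuming the default 8 KB PostgreSQL block size; shared_buffers counts blocks, not bytes):

```python
# How much of the 1 GB of RAM the quoted setting dedicates to
# PostgreSQL's own buffer cache.
shared_buffers = 8096        # blocks (the value quoted above)
block_size = 8192            # bytes per block (default build setting)
ram = 1024**3                # 1 GiB total RAM in the box

cache_bytes = shared_buffers * block_size
print(f"shared buffer cache: {cache_bytes / 1024**2:.0f} MiB "
      f"({cache_bytes / ram:.0%} of RAM)")
```

That is only about 63 MiB. The remaining RAM is still useful as OS file cache, but it is shared with the ~170 backend processes and everything else on the machine, which fits the observed "fast when cached, slow when not" behavior.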
Does the following information help?

bash-2.05$ psql quickview -c "select relid,indexrelname from pg_statio_user_indexes where indexrelname='builds_visible_product';"
  relid   |      indexrelname
----------+------------------------
 25192249 | builds_visible_product
(1 row)

bash-2.05$ find . -type f -name 25192249 -exec ls -ald {} \;
-rw-------    1 pgqv     postgres 1073741824 Apr 28 15:43 ./base/23527426/25192249

We do have more memory on order, but I'd like to know of a more definite method of predicting what the settings and memory need to be. (Based on the size of the builds table in the filesystem above, I'm guessing that I've got my answer.) If we can segment the data so that the builds table isn't so big, could that lead to a more efficient use of memory? What about deleting some indices that exist but are not being used?

-ron
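One detail worth noting about the output above: in pg_statio_user_indexes, relid is the OID of the *table*, not the index (the index's OID is in indexrelid), so the 1 GB file found is the builds table itself, which matches the "size of the builds table" remark. And since PostgreSQL splits relations into segment files of at most 1 GB, a file of exactly 1073741824 bytes means the table is at least that large:

```python
# Sizing the builds table against available memory, from the ls
# output above. 1073741824 bytes is exactly the 1 GB segment
# limit, so this first segment is a lower bound on table size.
table_segment_bytes = 1_073_741_824
ram_bytes = 1024**3                  # 1 GB of RAM in the box

print(table_segment_bytes >= ram_bytes)   # the table alone fills all of RAM
```

So even before counting the 20+ indexes, the working set cannot fit in memory, which supports both the "buy more RAM" conclusion and the idea of shrinking the table or dropping unused indexes.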