Discussion: slower merge join on sorted data chosen over nested loop
In both 8.1beta2 and a build from this morning's dev snapshot, this query ran slower than expected:

select count(*)
  from "DbTranRepository" AS "dtr"
  inner join "DbTranLogRecord" AS "dtlr"
    on (    "dtlr"."countyNo" = "dtr"."countyNo"
        and "dtlr"."tranImageSeqNo" = "dtr"."tranImageSeqNo"
        and "dtlr"."tableName" = 'Cal' )
  where "dtr"."countyNo" = 40
    and "dtr"."timestampValue" between '2005-09-30 00:00:00' and '2005-10-01 00:00:00';

Fresh restore from dump.  ANALYZE run.  No other activity.  Autovacuum was disabled prior to the restore and test.  8GB dual Xeon SuSE 9.3 system hitting RAID 5.

With the OS cache flushed by dd:

 Aggregate  (cost=176218.64..176218.65 rows=1 width=0) (actual time=32081.623..32081.626 rows=1 loops=1)
   ->  Merge Join  (cost=175627.39..176218.09 rows=218 width=0) (actual time=31774.573..32071.776 rows=3007 loops=1)
         Merge Cond: ("outer"."?column3?" = "inner"."?column3?")
         ->  Sort  (cost=48913.06..48975.72 rows=25067 width=17) (actual time=5828.984..5954.706 rows=39690 loops=1)
               Sort Key: (dtr."tranImageSeqNo")::numeric
               ->  Bitmap Heap Scan on "DbTranRepository" dtr  (cost=297.07..47081.47 rows=25067 width=17) (actual time=69.056..5560.895 rows=39690 loops=1)
                     Recheck Cond: ((("countyNo")::smallint = 40) AND (("timestampValue")::timestamp with time zone >= '2005-09-30 00:00:00-05'::timestamp with time zone) AND (("timestampValue")::timestamp with time zone <= '2005-10-01 00:00:00-05'::timestamp with time zone))
                     ->  Bitmap Index Scan on "DbTranRepository_timestamp"  (cost=0.00..297.07 rows=25067 width=0) (actual time=52.675..52.675 rows=39690 loops=1)
                           Index Cond: ((("countyNo")::smallint = 40) AND (("timestampValue")::timestamp with time zone >= '2005-09-30 00:00:00-05'::timestamp with time zone) AND (("timestampValue")::timestamp with time zone <= '2005-10-01 00:00:00-05'::timestamp with time zone))
         ->  Sort  (cost=126714.34..126945.94 rows=92640 width=17) (actual time=25130.457..25551.291 rows=109605 loops=1)
               Sort Key: (dtlr."tranImageSeqNo")::numeric
               ->  Bitmap Heap Scan on "DbTranLogRecord" dtlr  (cost=1037.84..119071.84 rows=92640 width=17) (actual time=203.528..23588.122 rows=114113 loops=1)
                     Recheck Cond: ((40 = ("countyNo")::smallint) AND (("tableName")::text = 'Cal'::text))
                     ->  Bitmap Index Scan on "DbTranLogRecord_tableName"  (cost=0.00..1037.84 rows=92640 width=0) (actual time=146.412..146.412 rows=114113 loops=1)
                           Index Cond: ((40 = ("countyNo")::smallint) AND (("tableName")::text = 'Cal'::text))
 Total runtime: 32091.211 ms

With the OS cache populated by a previous run of this plan:

 Aggregate  (cost=176218.64..176218.65 rows=1 width=0) (actual time=4510.590..4510.593 rows=1 loops=1)
   ->  Merge Join  (cost=175627.39..176218.09 rows=218 width=0) (actual time=4203.451..4500.785 rows=3007 loops=1)
         Merge Cond: ("outer"."?column3?" = "inner"."?column3?")
         ->  Sort  (cost=48913.06..48975.72 rows=25067 width=17) (actual time=686.388..812.160 rows=39690 loops=1)
               Sort Key: (dtr."tranImageSeqNo")::numeric
               ->  Bitmap Heap Scan on "DbTranRepository" dtr  (cost=297.07..47081.47 rows=25067 width=17) (actual time=30.432..427.463 rows=39690 loops=1)
                     Recheck Cond: ((("countyNo")::smallint = 40) AND (("timestampValue")::timestamp with time zone >= '2005-09-30 00:00:00-05'::timestamp with time zone) AND (("timestampValue")::timestamp with time zone <= '2005-10-01 00:00:00-05'::timestamp with time zone))
                     ->  Bitmap Index Scan on "DbTranRepository_timestamp"  (cost=0.00..297.07 rows=25067 width=0) (actual time=19.089..19.089 rows=39690 loops=1)
                           Index Cond: ((("countyNo")::smallint = 40) AND (("timestampValue")::timestamp with time zone >= '2005-09-30 00:00:00-05'::timestamp with time zone) AND (("timestampValue")::timestamp with time zone <= '2005-10-01 00:00:00-05'::timestamp with time zone))
         ->  Sort  (cost=126714.34..126945.94 rows=92640 width=17) (actual time=2705.007..3124.966 rows=109605 loops=1)
               Sort Key: (dtlr."tranImageSeqNo")::numeric
               ->  Bitmap Heap Scan on "DbTranLogRecord" dtlr  (cost=1037.84..119071.84 rows=92640 width=17) (actual time=96.046..1195.302 rows=114113 loops=1)
                     Recheck Cond: ((40 = ("countyNo")::smallint) AND (("tableName")::text = 'Cal'::text))
                     ->  Bitmap Index Scan on "DbTranLogRecord_tableName"  (cost=0.00..1037.84 rows=92640 width=0) (actual time=54.917..54.917 rows=114113 loops=1)
                           Index Cond: ((40 = ("countyNo")::smallint) AND (("tableName")::text = 'Cal'::text))
 Total runtime: 4520.130 ms

Without the option to use a merge join, it picked a plan which ran in one-third to one-half the time, depending on caching:

dtr=> set enable_mergejoin = off;

With the OS cache flushed by dd:

 Aggregate  (cost=298741.48..298741.49 rows=1 width=0) (actual time=14289.207..14289.210 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..298740.94 rows=218 width=0) (actual time=143.436..14277.124 rows=3007 loops=1)
         ->  Index Scan using "DbTranRepository_timestamp" on "DbTranRepository" dtr  (cost=0.00..49419.45 rows=25067 width=17) (actual time=33.625..11510.723 rows=39690 loops=1)
               Index Cond: ((("countyNo")::smallint = 40) AND (("timestampValue")::timestamp with time zone >= '2005-09-30 00:00:00-05'::timestamp with time zone) AND (("timestampValue")::timestamp with time zone <= '2005-10-01 00:00:00-05'::timestamp with time zone))
         ->  Index Scan using "DbTranLogRecordPK" on "DbTranLogRecord" dtlr  (cost=0.00..9.93 rows=1 width=17) (actual time=0.062..0.062 rows=0 loops=39690)
               Index Cond: ((40 = (dtlr."countyNo")::smallint) AND ((dtlr."tranImageSeqNo")::numeric = ("outer"."tranImageSeqNo")::numeric))
               Filter: (("tableName")::text = 'Cal'::text)
 Total runtime: 14289.305 ms

With the OS cache populated by a previous run of this plan:

 Aggregate  (cost=298741.48..298741.49 rows=1 width=0) (actual time=1476.593..1476.596 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..298740.94 rows=218 width=0) (actual time=1.373..1464.720 rows=3007 loops=1)
         ->  Index Scan using "DbTranRepository_timestamp" on "DbTranRepository" dtr  (cost=0.00..49419.45 rows=25067 width=17) (actual time=0.083..412.268 rows=39690 loops=1)
               Index Cond: ((("countyNo")::smallint = 40) AND (("timestampValue")::timestamp with time zone >= '2005-09-30 00:00:00-05'::timestamp with time zone) AND (("timestampValue")::timestamp with time zone <= '2005-10-01 00:00:00-05'::timestamp with time zone))
         ->  Index Scan using "DbTranLogRecordPK" on "DbTranLogRecord" dtlr  (cost=0.00..9.93 rows=1 width=17) (actual time=0.019..0.019 rows=0 loops=39690)
               Index Cond: ((40 = (dtlr."countyNo")::smallint) AND ((dtlr."tranImageSeqNo")::numeric = ("outer"."tranImageSeqNo")::numeric))
               Filter: (("tableName")::text = 'Cal'::text)
 Total runtime: 1476.681 ms

The non-default postgresql.conf settings:

listen_addresses = '*'          # what IP interface(s) to listen on;
max_connections = 100           # note: increasing max_connections costs
shared_buffers = 20000          # min 16 or max_connections*2, 8KB each
work_mem = 10240                # min 64, size in KB
max_fsm_pages = 100000          # min max_fsm_relations*16, 6 bytes each
wal_buffers = 20                # min 4, 8KB each
effective_cache_size = 917504   # typically 8KB each
random_page_cost = 2            # units are one sequential page fetch
log_line_prefix = '[%m] %p %q<%u %d %r> '       # Special values:
stats_start_collector = on
stats_row_level = on
autovacuum = false              # enable autovacuum subprocess?
autovacuum_naptime = 10         # time between autovacuum runs, in secs
autovacuum_vacuum_threshold = 1         # min # of tuple updates before
autovacuum_analyze_threshold = 1        # min # of tuple updates before
autovacuum_vacuum_scale_factor = 0.2    # fraction of rel size before
autovacuum_analyze_scale_factor = 0.1   # fraction of rel size before
lc_messages = 'C'               # locale for system error message
lc_monetary = 'C'               # locale for monetary formatting
lc_numeric = 'C'                # locale for number formatting
lc_time = 'C'                   # locale for time formatting
sql_inheritance = off

If random_page_cost is reduced to 1.2, index scans are used instead of bitmap index scans.  That makes little difference for the run time with this query -- at least when its data is cached.
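A side note on two of the settings above (my arithmetic, not part of the original post): effective_cache_size and shared_buffers are counted in 8 KB pages, so the posted values come to about 7 GB and 156 MB respectively, which is plausible for an 8 GB machine:

```python
# Convert page-count GUC values to gigabytes. PAGE_SIZE_KB and pages_to_gb
# are my own names for illustration; PostgreSQL pages here are 8 KB.

PAGE_SIZE_KB = 8

def pages_to_gb(pages: int) -> float:
    """Convert a page-count setting to gigabytes."""
    return pages * PAGE_SIZE_KB / (1024 * 1024)

effective_cache_size = 917_504   # from the posted postgresql.conf
shared_buffers = 20_000

print(f"effective_cache_size = {pages_to_gb(effective_cache_size):.2f} GB")
print(f"shared_buffers       = {pages_to_gb(shared_buffers):.2f} GB")
```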
                Table "public.DbTranRepository"
      Column      |          Type          | Modifiers
------------------+------------------------+-----------
 countyNo         | "CountyNoT"            | not null
 tranImageSeqNo   | "TranImageSeqNoT"      | not null
 timestampValue   | "TimestampT"           | not null
 transactionImage | "ImageT"               |
 status           | character(1)           | not null
 queryName        | "QueryNameT"           |
 runDuration      | numeric(15,0)          |
 userId           | "UserIdT"              |
 functionalArea   | character varying(50)  |
 sourceRef        | character varying(255) |
 url              | "URLT"                 |
 tranImageSize    | numeric(15,0)          |
Indexes:
    "DbTranRepositoryPK" PRIMARY KEY, btree ("countyNo", "tranImageSeqNo")
    "DbTranRepository_timestamp" btree ("countyNo", "timestampValue")
    "DbTranRepository_userId" btree ("countyNo", "userId", "timestampValue")

             Table "public.DbTranLogRecord"
     Column     |       Type        | Modifiers
----------------+-------------------+-----------
 countyNo       | "CountyNoT"       | not null
 tranImageSeqNo | "TranImageSeqNoT" | not null
 logRecordSeqNo | "LogRecordSeqNoT" | not null
 operation      | "OperationT"      | not null
 tableName      | "TableNameT"      | not null
Indexes:
    "DbTranLogRecordPK" PRIMARY KEY, btree ("countyNo", "tranImageSeqNo", "logRecordSeqNo")
    "DbTranLogRecord_tableName" btree ("countyNo", "tableName", operation)

Need any other info?

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> In both 8.1beta2 and a build from this morning's dev snapshot, this
> query ran slower than expected:

There's a known issue that the planner tends to overestimate the cost of inner-index-scan nestloops, because it doesn't allow for the strong caching effects associated with repeated scans of the same index (in particular, that all the upper index levels tend to stay in cache).  See the archives for past inconclusive discussions about how to fix this.

However, something else caught my eye:

>  ->  Bitmap Heap Scan on "DbTranRepository" dtr  (cost=297.07..47081.47 rows=25067 width=17) (actual time=69.056..5560.895 rows=39690 loops=1)

>  ->  Index Scan using "DbTranRepository_timestamp" on "DbTranRepository" dtr  (cost=0.00..49419.45 rows=25067 width=17) (actual time=33.625..11510.723 rows=39690 loops=1)

I don't understand why the second case chose a plain index scan when there was no need for sorted output; the bitmap scan is faster both per cost estimate and in reality.  Are you sure you turned off only enable_mergejoin and not enable_bitmapscan?

Also, when you say "slower than expected", what is setting your expectation?

			regards, tom lane
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> 10/06/05 9:28 PM >>>
>> There's a known issue that the planner tends to overestimate the cost of
>> inner-index-scan nestloops, because it doesn't allow for the strong
>> caching effects associated with repeated scans of the same index (in
>> particular, that all the upper index levels tend to stay in cache).
>> See the archives for past inconclusive discussions about how to fix
>> this.

> I spent a few hours trying different searches in the archives, and
> found three very interesting threads on the topic.  All were from
> 2003.  Should I keep digging for more recent threads, or would
> these probably represent the current state of the issue?

No, I don't recall that anything much has happened with it.  The place where the rubber hits the road is this passage in genericcostestimate():

    /*
     * Estimate the number of index pages that will be retrieved.
     *
     * For all currently-supported index types, the first page of the index
     * is a metadata page, and we should figure on fetching that plus a
     * pro-rated fraction of the remaining pages.
     */
    if (index->pages > 1 && index->tuples > 0)
    {
        numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
        numIndexPages += 1;         /* count the metapage too */
        numIndexPages = ceil(numIndexPages);
    }
    else
        numIndexPages = 1.0;

    /*
     * Compute the index access cost.
     *
     * Disk cost: our generic assumption is that the index pages will be read
     * sequentially, so they have cost 1.0 each, not random_page_cost.
     */
    *indexTotalCost = numIndexPages;

An important point here: in all the currently supported index types, that last assumption is ridiculously optimistic.  Except maybe in a freshly-built btree, logically consecutive index pages won't be physically adjacent, and so an honest cost accounting would require charging random_page_cost per page.
However, that's not the direction we want the estimate to go in :-(  We could do something based on effective_cache_size to estimate the probability that the index page is already in cache and hence need not be re-read.  The main trouble with this is estimating what fraction of the effective_cache_size can fairly be assumed to be populated by each index --- without some global knowledge about what other indexes are in heavy use, it's difficult to see how to get there from here.  (OTOH, one could also level the same charge against the existing uses of effective_cache_size... maybe we could redefine it as cache available per query, or something like that, to put part of the problem on the DBA's shoulders.)

One simple tweak we could make is to not count the index metapage in the pages-to-be-fetched estimate, which could be justified on the grounds that the metapage is almost certain to be in cache.  (Note that the existing estimation technique is already logically wrong for btrees, because it's considering only the metapage and the leaf page(s) that need to be visited, and not the internal pages that need to be descended through to get to the right leaf.  However, again we can say that the upper internal pages are probably in cache and so it's fair not to count them as needing to be read.)

Whether this tweak would make the overall behavior better or worse is really hard to say.  It'd be interesting to look at some test cases though.

			regards, tom lane
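As an editorial aside: the effective_cache_size idea floated above could be sketched roughly as follows.  This is purely illustrative -- none of these function names exist in PostgreSQL, and choosing `cache_share` (the slice of the cache credited to this one index) is exactly the unsolved problem described in the message:

```python
# Illustrative sketch only: discount index page fetches by an assumed
# cache-hit probability derived from effective_cache_size. The hard part,
# as noted above, is that cache_share cannot be estimated without global
# knowledge of what else is competing for the cache.
import math

def cached_fraction(index_pages: int, effective_cache_size: int,
                    cache_share: float = 1.0) -> float:
    """Fraction of this index assumed resident in cache."""
    usable_cache = effective_cache_size * cache_share
    return min(1.0, usable_cache / index_pages)

def discounted_page_fetches(pages_touched: float, index_pages: int,
                            effective_cache_size: int,
                            cache_share: float = 1.0) -> float:
    """Charge only for the pages we expect to actually read from disk."""
    hit = cached_fraction(index_pages, effective_cache_size, cache_share)
    return math.ceil(pages_touched * (1.0 - hit))

# With cache_share = 1.0 a small index is assumed fully cached, which shows
# why some apportioning of the cache across indexes would be essential.
print(discounted_page_fetches(4, 100, 50))        # half the index cached
print(discounted_page_fetches(4, 100, 917_504))   # "fully cached" degenerate case
```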
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> A couple questions occur to me, though.  I'm not clear on why
> ceil is called -- do we need to eliminate the fraction here?

Well, you don't get to read part of a page.  In particular, fetching 1.0 index tuples requires fetching 1.0 pages, not (say) 0.01 page.  We consistently use ceil() in these sorts of estimates, and I think it's correct.

> Also, to be clear, I thought someone said that only the leaf level
> was counted -- it looks like the other levels are being factored in,
> but not as heavily as they would be accessed with logical reads,
> because we're taking the fraction of tuples for the index multiplied
> by the total number of index pages except for the metadata page.
> This should bias the cost slightly higher for additional index levels,
> shouldn't it?

True, I overstated a little in claiming that upper pages weren't counted at all.  However, the current formula drastically undercounts them.  Consider a simple int4 index: this will require 16 bytes per entry (assuming MAXALIGN 4) so there could be up to 500-some index entries per page.  At the traditional rule-of-thumb load factor of 2/3rds, that's say 340 entries per page, which will also be the fanout factor for upper index pages.  Assume a fully populated 3-level index:

	1 root page		340 entries
	340 2nd-level pages	340^2 = 115600 entries
	115600 leaf pages	340^3 = 39304000 entries

Total index size: 115942 pages, 39304000 leaf entries.

If we try to calculate the cost to fetch 1000 index tuples, the current numIndexPages calculation is

	ceil((1000 / 39304000) * (115942 - 1) + 1)

which is ceil(3.95) or 4 --- including the metapage.  So we're basically just counting the three leaf pages that the tuples occupy; the root and second level pages are too small a fraction of the index to affect the result.  (The correct estimate of course would be six or seven pages visited: metapage, root, one 2nd-level page, and 3 or 4 leaf pages depending on boundary effects.)
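The arithmetic in the example above can be re-run directly (the constants are the ones from the message: fanout 340, fully populated 3-level btree; the variable names are mine):

```python
import math

# Fully populated 3-level int4 btree from the example above.
fanout = 340
leaf_pages = fanout ** 2                      # 115600
leaf_entries = fanout ** 3                    # 39304000
index_pages = 1 + (1 + fanout + leaf_pages)   # metapage + root + 2nd level + leaves

# Current genericcostestimate() formula: pro-rate all non-meta pages by the
# fraction of tuples fetched, then add the metapage back in.
tuples_fetched = 1000
num_index_pages = math.ceil(
    (tuples_fetched / leaf_entries) * (index_pages - 1) + 1)

# A per-level count of pages actually visited: metapage, root, one 2nd-level
# page, plus the leaf pages the 1000 tuples span (3, or 4 with a boundary
# crossing).
leaves_touched = math.ceil(tuples_fetched / fanout)
per_level_count = 1 + 1 + 1 + leaves_touched

print(index_pages, num_index_pages, per_level_count)   # 115942 4 6
```

The formula's answer of 4 versus the per-level count of 6 is the undercount the message describes: the descent through root and internal pages is effectively invisible to the pro-rating.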
And actually that example still understates the problem.  You have to touch one page in each of the upper index levels to descend, regardless of how many leaf pages you are going to touch in the scan.  Do the math for the estimate of the cost to read just *one* index entry --- in this example it comes out to 1.003, which ceil()'s up to 2, ie, the metapage and the single leaf page.  With wider index entries, the fanout of upper pages decreases and so the upper pages take a larger fraction of the index --- but the tree depth increases, too, so more upper pages have to be visited to get to the first leaf page.  You still end up with a big undercount.

I recall thinking about changing the formula to more accurately count the number of pages touched; but I desisted when I realized that it would drastically increase the cost estimates for index searches, and that's surely the wrong direction to be going in.  We really can't do that until we have some usable infrastructure to allow estimating the probability that those pages are already in cache.  In the meantime, the tweaks under discussion here amount to assuming that the metapage and all upper pages are always in cache.

The current cost estimate to fetch a single tuple via indexscan is basically random_page_cost + 2, plus some near-negligible cpu costs.  Not counting the metapage would take that down to random_page_cost + 1.  This would definitely move the goalposts, particularly for people who run with smaller-than-default random_page_cost, but I'm not sure if it's enough to solve the problem.

Also, all this is really just a sideshow; I think the main problem for join estimation is that because we cost an inner-indexscan nestloop as taking N times the cost of one execution of the inner scan, we fail to account for caching effects in the table itself as well as the index.  That would cut into the random_page_cost part of the cost estimate as well as the index cost.
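To make the "random_page_cost + 2" versus "+ 1" point concrete, here is a toy restatement of that arithmetic (the function name is mine, not PostgreSQL's): index pages are charged 1.0 apiece per the sequential-read assumption in genericcostestimate(), while the single heap page costs random_page_cost.

```python
# Toy sketch of the single-tuple indexscan disk cost discussed above:
# one random heap page plus the index pages the formula counts
# (metapage + leaf, or just the leaf if the metapage tweak is applied).

def single_tuple_indexscan_cost(random_page_cost: float,
                                count_metapage: bool = True) -> float:
    index_pages = 2 if count_metapage else 1   # metapage + leaf, or leaf only
    heap_pages = 1                             # one random heap fetch
    return heap_pages * random_page_cost + index_pages * 1.0

# The metapage tweak matters more, proportionally, at low random_page_cost:
for rpc in (4.0, 2.0, 1.2):
    print(rpc, single_tuple_indexscan_cost(rpc),
          single_tuple_indexscan_cost(rpc, count_metapage=False))
```

At the default random_page_cost of 4 the tweak shaves 6.0 down to 5.0; at this poster's setting of 2 it shaves 4.0 down to 3.0, a proportionally larger change, which is the "moving the goalposts" concern above.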
For all the reasons I've cited, it's pretty hard to justify reducing the estimate for an indexscan standing on its own --- but in the context of a nestloop join, it's easier to make a case. regards, tom lane
On Mon, Oct 10, 2005 at 09:10:38PM -0400, Tom Lane wrote:
> I recall thinking about changing the formula to more accurately count
> the number of pages touched; but I desisted when I realized that it
> would drastically increase the cost estimates for index searches, and
> that's surely the wrong direction to be going in.  We really can't do
> that until we have some usable infrastructure to allow estimating the
> probability that those pages are already in cache.  In the meantime,
> the tweaks under discussion here amount to assuming that the metapage
> and all upper pages are always in cache.
>
> The current cost estimate to fetch a single tuple via indexscan is
> basically random_page_cost + 2, plus some near-negligible cpu costs.
> Not counting the metapage would take that down to random_page_cost + 1.
> This would definitely move the goalposts, particularly for people
> who run with smaller-than-default random_page_cost, but I'm not sure
> if it's enough to solve the problem.
>
> Also, all this is really just a sideshow; I think the main problem for
> join estimation is that because we cost an inner-indexscan nestloop as
> taking N times the cost of one execution of the inner scan, we fail to
> account for caching effects in the table itself as well as the index.
> That would cut into the random_page_cost part of the cost estimate as
> well as the index cost.  For all the reasons I've cited, it's pretty
> hard to justify reducing the estimate for an indexscan standing on its
> own --- but in the context of a nestloop join, it's easier to make a
> case.

One thing I noticed the last time I looked at all of this was that index correlation seems to be severely mis-weighted in scan calculations.  http://archives.postgresql.org/pgsql-performance/2005-04/msg00669.php has more info on this.  I suspect that until that issue is addressed, other changes to the cost estimates won't make any useful difference.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software        http://pervasive.com       work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461