Re: slower merge join on sorted data chosen over
От | Kevin Grittner |
---|---|
Тема | Re: slower merge join on sorted data chosen over |
Дата | |
Msg-id | s34ab442.057@gwmta.wicourts.gov обсуждение исходный текст |
Список | pgsql-hackers |
Hmmm... With that much direction, I really have no excuse not to try a change and provide some test cases, do I? A couple questions occur to me, though. I'm not clear on why ceil is called -- do we need to eliminate the fraction here? It seems to me that it wouldn't matter much except when the index is small, in which case the odds are greater that pages would be cached. It seems like it would be a less radical change to eliminate the ceil call than the increment. As a first pass, I'm inclined to try that in combination with a reduction in the configured cpu_index_tuple_cost. Does that seem reasonable? Also, to be clear, I thought someone said that only the leaf level was counted -- it looks like the other levels are being factored in, but not as heavily as they would be accessed with logical reads, because we're taking the fraction of tuples for the index multiplied by the total number of index pages except for the metadata page. This should bias the cost slightly higher for additional index levels, shouldn't it? I toyed with the idea of eliminating both the ceil and the increment, and just having a minimum of 1.0, but that seemed extreme. -Kevin >>> Tom Lane <tgl@sss.pgh.pa.us> 10/10/05 4:23 PM >>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> 10/06/05 9:28 PM >>> >> There's a known issue that the planner tends to overestimate the cost of >> inner-index-scan nestloops, because it doesn't allow for the strong >> caching effects associated with repeated scans of the same index (in >> particular, that all the upper index levels tend to stay in cache). >> See the archives for past inconclusive discussions about how to fix >> this. > I spent a few hours trying different searches in the archives, and > found three very interesting threads on the topic. All were from > 2003. Should I keep digging for more recent threads, or would > these probably represent the current state of the issue? No, I don't recall that anything much has happened with it. The place where the rubber hits the road is this passage in genericcostestimate(): /* * Estimate the number of index pages that will be retrieved. * * For all currently-supported index types,the first page of the index * is a metadata page, and we should figure on fetching that plus a * pro-rated fractionof the remaining pages. */ if (index->pages > 1 && index->tuples > 0) { numIndexPages = (numIndexTuples/ index->tuples) * (index->pages - 1); numIndexPages += 1; /* count the metapage too */ numIndexPages = ceil(numIndexPages); } else numIndexPages = 1.0; /* * Compute the index access cost. * * Disk cost: our generic assumption is that the index pages will be read * sequentially, so they have cost 1.0 each, not random_page_cost. */ *indexTotalCost = numIndexPages; An important point here: in all the currently supported index types, that last assumption is ridiculously optimistic. Except maybe in a freshly-built btree, logically consecutive index pages won't be physically adjacent, and so a honest cost accounting would require charging random_page_cost per page. However, that's not the direction we want the estimate to go in :-( We could do something based on effective_cache_size to estimate the probability that the index page is already in cache and hence need not be re-read. The main trouble with this is estimating what fraction of the effective_cache_size can fairly be assumed to be populated by each index --- without some global knowledge about what other indexes are in heavy use, it's difficult to see how to get there from here. (OTOH, one could also level the same charge against the existing uses of effective_cache_size... maybe we could redefine it as cache available per query, or something like that, to put part of the problem on the DBA's shoulders.) One simple tweak we could make is to not count the index metapage in the pages-to-be-fetched estimate, which could be justified on the grounds that the metapage is almost certain to be in cache. (Note that the existing estimation technique is already logically wrong for btrees, because it's considering only the metapage and the leaf page(s) that need to be visited, and not the internal pages that need to be descended through to get to the right leaf. However, again we can say that the upper internal pages are probably in cache and so it's fair not to count them as needing to be read.) Whether this tweak would make the overall behavior better or worse is really hard to say. It'd be interesting to look at some test cases though. regards, tom lane
В списке pgsql-hackers по дате отправления: