Re: slower merge join on sorted data chosen over

From: Kevin Grittner
Subject: Re: slower merge join on sorted data chosen over
Date:
Msg-id: s34bea4b.071@gwmta.wicourts.gov
List: pgsql-hackers
You don't get to read part of a page, but you may be dealing with
probabilities.  For example, consider a case where there are ten data
pages, and you are going to read 15% of the tuples.  There is a 50%
chance that your scan will start in the first half of a leaf page and
only need two leaf pages.  There is a 50% chance that it will start in
the second half and need to access three leaf pages.  Wouldn't it be
better to cost this at 2.5 pages rather than forcing it to either of the
possible values?
That said, a minimum of 1.0 is appropriate; that is why I was thinking
of adding the one and not using ceil.
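To make that arithmetic concrete, here is a rough sketch of the two
estimates I am comparing -- the function names are mine, purely for
illustration, not anything in the current source:

#include <math.h>

/* Current approach: round the fractional leaf page count up. */
static double
leaf_pages_ceil(double leafPages, double tupleFraction)
{
    return ceil(leafPages * tupleFraction);
}

/*
 * What I have in mind: the scan can start anywhere within a leaf page,
 * so on average it touches one extra partial page.  Adding 1.0 also
 * guarantees the one-page minimum without needing ceil().  For the
 * example above: 1.0 + 10 * 0.15 = 2.5 pages.
 */
static double
leaf_pages_expected(double leafPages, double tupleFraction)
{
    return 1.0 + leafPages * tupleFraction;
}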
I have gotten approval from my client to work on this costing issue.
Before I try modifying any code, however, I want to make sure I'm going
down the right path.  The more I think about it, the more it seems like
fudging the fundamental cost of an index scan down so that it isn't so
far off when used in a nested context (which is, as you've pointed out,
what we're talking about here) should be avoided if at all possible.
The thought occurs to me that I could come up with a heck of a lot
better solution if the genericcostestimate function included an iteration
count parameter.  I could generate a fairly accurate cost for the whole
set of iterations, then divide by the iteration count, so that the
value coming out could still be used as it has been.  (The iteration
count would essentially be a "clue" to help determine how many physical
I/Os might be needed.)  It is not immediately obvious to me how this
context info could be passed in, which is probably why it hasn't been
done already -- but if you have any suggestions in this regard, I'd be
happy to hear them.
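As a purely hypothetical sketch of the kind of thing I mean -- the
signature and the assumption that only the upper index levels stay
cached between iterations are my inventions, not anything in the
current code:

/*
 * Estimate the index page fetches over loopCount repeated scans, then
 * report the per-iteration figure so callers can keep using the value
 * the way they do today.  The first scan pays for everything; later
 * scans are assumed to find the upper index levels still in cache.
 */
static double
amortized_index_scan_pages(double pagesPerScan, double upperLevelPages,
                           double loopCount)
{
    double firstScan  = pagesPerScan;
    double repeatScan = pagesPerScan - upperLevelPages;
    double totalPages = firstScan + repeatScan * (loopCount - 1.0);

    return totalPages / loopCount;    /* per-iteration page fetches */
}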
Also, I am wondering if there is any chance that the sort/merge
technique is somehow UNDERestimated.  My experience, and what I've seen
in the archives, establishes that the relative cost of the nested index
scan can be too high compared to the cost of the scan/sort/mergejoin
technique; but I haven't seen totally convincing evidence yet regarding
which one is off.  You've stated that the caching for the upper index
levels isn't factored in with the nested technique, but when running
with everything totally cached for both techniques, our real-life app
runs three times as fast with the nested index scan versus only twice
as fast when neither technique starts with anything cached.  Something
isn't sitting right for me when I match those results to the
explanation.  What am I missing here?
Clearly, a good set of repeatable tests is critical.  I'm still
thinking about that.  Any suggestions on this are welcome.
-Kevin
>>> Tom Lane <tgl@sss.pgh.pa.us> 10/10/05 8:10 PM >>>

Well, you don't get to read part of a page.  In particular, fetching 1.0
index tuples requires fetching 1.0 pages, not (say) 0.01 page.  We
consistently use ceil() in these sorts of estimates, and I think it's
correct.


