Re: hint bit cache v6
From | Robert Haas
---|---
Subject | Re: hint bit cache v6
Date |
Msg-id | BANLkTimyKEyR16vT_0uM+PfBPDNqTspozQ@mail.gmail.com
In reply to | Re: hint bit cache v6 (Merlin Moncure <mmoncure@gmail.com>)
Responses | Re: hint bit cache v6
List | pgsql-hackers
On Thu, Jun 30, 2011 at 11:18 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> I think the basic problem is that the cache pages are so large. It's
>> hard to make them smaller because that increases the cost of accessing
>> the cache, as you already noted, but at the same time, making an
>> eviction decision on a cache that holds 64K entries based on 100 data
>> points seems like waving around a pretty large-caliber weapon in a
>> fairly uncontrolled fashion. Maybe it works OK 90% of the time, but
>> it's hard to avoid the niggling fear that someone might accidentally
>> get shot.
>
> Right...it's essentially a rolling statistical sampling of transaction
> IDs based on past accesses. Going from, say, 100 to 1000 will probably
> drop your error margin quite a bit, but I really wonder how much
> benefit you can get from going past that.

An interesting idea might be to forcibly cause a replacement every 100
tuples (or every 1000 tuples) and see if that causes a noticeable
performance regression. If it doesn't, that's a good data point.

I think the sour spot for this whole idea is likely to be the case
where you have a workload that is 90% read and 10% write, or something
like that. On an all-read workload (like pgbench -S), you're hopefully
going to converge to a state where all the hint bits are set, once
VACUUM kicks in. And on an all-write workload I think that you might
lose the effect in the noise. Not sure if we have an easy way to set up
such a workload with pgbench, though.

>> Just for the sake of argument, let's suppose we made an array with 64K
>> elements, each one representing one 64K chunk of the XID space. Each
>> element is a 4-byte unsigned integer, so the array takes up 256kB of
>> memory... probably not good, but I'm just thinking out loud here.
>> Every time we're asked about an XID, we bump the corresponding
>> counter. Periodically (say, every 64K probes) we zip through all the
>> counters and consider performing a cache replacement.
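For concreteness, the full-range counter array sketched above might look something like the following toy C version. None of these names come from the actual patch; the cache itself is reduced to a 4-slot array of resident chunk ids purely for illustration:

```c
#include <stdint.h>
#include <string.h>

#define XID_CHUNK_BITS   16            /* each chunk covers 64K XIDs */
#define NUM_CHUNKS       (64 * 1024)   /* 64K chunks -> 256kB of counters */
#define PROBES_PER_SWEEP (64 * 1024)   /* reconsider replacement every 64K probes */
#define SAFETY_MARGIN    100           /* hysteresis guard; the value is a guess */
#define CACHE_PAGES      4             /* toy cache: 4 resident chunks */

static uint32_t chunk_hits[NUM_CHUNKS];
static uint32_t probes_since_sweep;
static uint32_t cached[CACHE_PAGES] = {0, 1, 2, 3};  /* chunk ids in cache */

static int
cache_slot_of(uint32_t chunk)
{
    for (int i = 0; i < CACHE_PAGES; i++)
        if (cached[i] == chunk)
            return i;
    return -1;
}

/* Called on every visibility probe; returns the replaced slot, or -1. */
static int
hint_cache_note_probe(uint32_t xid)
{
    chunk_hits[xid >> XID_CHUNK_BITS]++;

    if (++probes_since_sweep < PROBES_PER_SWEEP)
        return -1;
    probes_since_sweep = 0;

    /* Find the hottest chunk that is NOT already cached. */
    uint32_t best = 0, best_hits = 0;
    for (uint32_t i = 0; i < NUM_CHUNKS; i++)
        if (chunk_hits[i] > best_hits && cache_slot_of(i) < 0)
        {
            best = i;
            best_hits = chunk_hits[i];
        }

    /* The coldest resident chunk is the candidate victim. */
    int victim = 0;
    for (int i = 1; i < CACHE_PAGES; i++)
        if (chunk_hits[cached[i]] < chunk_hits[cached[victim]])
            victim = i;

    /* Replace only if the winner beats the victim by the safety margin. */
    if (best_hits > chunk_hits[cached[victim]] + SAFETY_MARGIN)
    {
        cached[victim] = best;
        return victim;
    }
    return -1;
}
```

The sweep is O(NUM_CHUNKS) but runs only once per 64K probes, which is the point: the replacement decision is based on 64K observations rather than 100.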
>> We look at the
>> not-already-cached page that has the highest counter value and compare
>> it to the counter value for the cached pages. If it is higher and the
>> difference exceeds some safety margin (to protect against hysteresis),
>> then we do the replacement. I think something like this would allow
>> you to have a lot more information available to make a direct
>> apples-to-apples comparison at cache replacement time, as compared
>> with what you have now.
>
> yeah -- what you've done here is basically two things: 1. by mapping
> static ranges you get to skip the sort (but not the scan) during the
> replacement, and 2. by vastly expanding the sampling size you
> eliminate thrashing via noise in the rolling sample. This comes at a
> significant memory cost and loss of some nimbleness in the cache (i'm
> not completely sure more aggressive replacement is 'bad') -- although
> that could be mitigated by replacing at more frequent intervals.

Well, that gets at an interesting question: how often SHOULD we replace
cache pages? My gut is that it's 10K-100K and your gut is that it's
100-1000, which is a hundred-fold difference. Seems like we need some
kind of empirical data to decide what actually works well in real life.

> Your range counting strategy might work -- I think to do it right so
> that you don't have to map the entire range is going to require two
> levels of ranges such that you activate a very wide range first before
> looking at 64k subranges. That way you could reduce memory consumption
> significantly without sorting during the replacement.

I don't think you need two levels of ranges. Just keep track of the
minimum and maximum XID covered by the array (these must always be 64M
transactions apart, say, if the array has 1024 entries, and each cache
page covers 64K XIDs). When you see a given XID, and it's between the
minimum and the maximum, then bump the counter in bucket XID/64K.
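The rolling min/max window could be sketched as follows, including the advance-and-zero behavior described next. Again all names are hypothetical; the sketch assumes a 1024-entry array with 64K XIDs per bucket, so the window always spans 64M XIDs and the array behaves like a ring buffer:

```c
#include <stdint.h>
#include <string.h>

#define BUCKET_BITS  16              /* 64K XIDs per bucket */
#define NUM_BUCKETS  1024            /* window spans 1024 * 64K = 64M XIDs */

static uint32_t bucket_hits[NUM_BUCKETS];
static uint64_t window_min;          /* lowest XID covered by the array */
/* The window max is implicitly window_min + NUM_BUCKETS * 64K - 1. */

static void
note_xid(uint64_t xid)
{
    uint64_t window_max = window_min + ((uint64_t) NUM_BUCKETS << BUCKET_BITS) - 1;

    if (xid < window_min)
        return;                      /* too old: never cache these */

    if (xid > window_max)
    {
        /* Slide the window forward far enough to cover xid, zeroing
         * one entry per 64K of advance; all other counts survive. */
        uint64_t advance = ((xid - window_max) >> BUCKET_BITS) + 1;
        if (advance >= NUM_BUCKETS)
            memset(bucket_hits, 0, sizeof(bucket_hits));
        else
            for (uint64_t i = 0; i < advance; i++)
                bucket_hits[((window_min >> BUCKET_BITS) + i) % NUM_BUCKETS] = 0;
        window_min += advance << BUCKET_BITS;
    }

    /* The bucket index is fixed by the XID itself, mod the ring size,
     * so entries never move when the window slides. */
    bucket_hits[(xid >> BUCKET_BITS) % NUM_BUCKETS]++;
}
```

Keying the bucket by `(xid >> 16) % 1024` rather than by offset from the minimum is what makes the slide cheap: advancing the window recycles exactly the slots whose chunks fell off the old end.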
If you see an XID that is older than the minimum, then ignore it; we
won't consider devoting cache pages to really old XIDs. If you see an
XID that is newer than the maximum, then just increase the minimum and
maximum by enough to put the new XID in range. For every 64K increment
by which you advance the minimum and maximum, you need to zero one entry
in the array (since it's no longer covering the same portion of the XID
space), but all the other entries remain intact. (There is a small
problem here of how to work out how to initially set the minimum and
maximum to something sensible, and to make sure that they don't get out
of whack if a backend sits idle for a few billion transactions, but that
seems like it should be solvable.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company