Re: our buffer replacement strategy is kind of lame
From: Robert Haas
Subject: Re: our buffer replacement strategy is kind of lame
Msg-id: CA+TgmoZsc9HPXqct65OK6vorw7bJFKBx0mYu_EWrO=+crRGuwg@mail.gmail.com
In reply to: Re: our buffer replacement strategy is kind of lame (Simon Riggs <simon@2ndQuadrant.com>)
Responses: Re: our buffer replacement strategy is kind of lame (Simon Riggs <simon@2ndQuadrant.com>)
           Re: our buffer replacement strategy is kind of lame (Simon Riggs <simon@2ndQuadrant.com>)
List: pgsql-hackers
On Sun, Aug 14, 2011 at 6:57 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote:
>>> On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> and possibly we ought to put them all in a
>>>> linked list so that the next guy who needs a buffer can just pop one
>>>
>>> The whole point of the clock sweep algorithm is to approximate an LRU
>>> without needing to maintain a linked list. The problem with a linked
>>> list is that you need to serialize access to it, so every time you
>>> reference a buffer you need to wait on a lock for the list so you can
>>> move that buffer around in the list.
>>
>> Right, but the same doesn't apply to what I'm talking about. You just
>> put each guy into the linked list when his reference count goes down
>> to 0. When you want to allocate a buffer, you pop somebody off. If
>> his reference count is still 0, you forget about him and pop the next
>> guy, and keep going until you find a suitable victim.
>>
>> The problem with just running the clock sweep every time you need a
>> victim buffer is that you may have to pass over a large number of
>> unevictable buffers before you find someone you can actually kick out.
>> That's work that should really be getting done in the background, not
>> at the absolute last moment when we discover, hey, we need a buffer.
>
> I think Greg is right here. Those suggested changes just bring back
> the LRU.

Since the problem with LRU is that it requires moving each buffer to
the front of the list every time it's touched, and since the idea that
I proposed does not require that, I don't know what you mean by this.

> If you keep a separate list then you need to serialize access to it.

That is true. However, there are some reasons to think it would be
better than what we're doing right now.
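The push-then-revalidate scheme described above could be sketched roughly
like this. This is a hypothetical, single-threaded model, not PostgreSQL's
actual BufferDesc or freelist code; a real version would protect the list
head with a spinlock, which is omitted here to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a buffer descriptor (hypothetical fields). */
typedef struct Buf {
    int usage_count;        /* bumped on access, decayed toward 0 */
    int pin_count;          /* nonzero means unevictable right now */
    struct Buf *free_next;  /* link while on the candidate freelist */
} Buf;

static Buf *freelist_head = NULL;

/* Whoever decays usage_count to 0 (e.g. a background sweep) pushes
 * the buffer onto the candidate list. */
void freelist_push(Buf *b)
{
    b->free_next = freelist_head;
    freelist_head = b;
}

/* A backend needing a victim pops entries, discarding any whose state
 * changed (referenced or pinned again) since they were pushed. */
Buf *freelist_pop_victim(void)
{
    while (freelist_head != NULL) {
        Buf *b = freelist_head;
        freelist_head = b->free_next;
        if (b->usage_count == 0 && b->pin_count == 0)
            return b;       /* still a suitable victim */
        /* else: forget about him and pop the next guy */
    }
    return NULL;            /* list empty: fall back to the clock sweep */
}
```

Note that a stale entry costs only a pop and a check, whereas a true LRU
would have to move the buffer on every reference.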
Right now, we acquire the BufFreelistLock, and then take and release
some number of spinlocks. It seems fairly easy to construct cases where
that number is routinely over 100; even on my laptop, I could construct
cases where it was routinely in the 50-100 range. A linked list that we
can just dequeue things from could be protected by a single spinlock,
which would hopefully be somewhat faster. (In the interest of credit
where credit is due, this is pretty much the point Jim Nasby has been
making every time this argument comes up, and I've been dismissing it,
but now that I understand how much work gets done in that clock sweep
code, I'm starting to think he might be right.)

However, if it turns out that there's still too much contention over
that spinlock, we can probably fix that by maintaining N lists instead
of one, distributing buffers between them in round-robin fashion, and
having each backend choose a list based on its backend ID modulo N.
The testing I've done so far seems to indicate that spinlock contention
doesn't really become noticeable until you have 32-40 CPUs pounding
hard on the same spinlock, so even N = 4 is probably enough to make the
problem go away. But I don't even think we're going to have to go that
far right at the outset, because there's another major source of
contention in the buffer eviction path: the buffer mapping locks.

> usage_count == 0 and "unevictable" are dynamic states, so if the
> bgwriter sees those states they can change before anyone uses the
> buffer.

Yep. That's already a problem. When we pin the buffer to be evicted,
we're not yet holding the relevant buffer mapping locks. By the time we
do, someone else can have pinned the page. So there's already retry
logic here. In theory, you can imagine a backend getting starved for an
arbitrarily long period of time, because it keeps picking a victim
buffer that someone else wrenches away before we actually manage to
kick it out.
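The retry pattern just described could be sketched like this. Again a
hypothetical model, not the actual bufmgr code: try_pin() stands in for
the buffer-header lock and pin dance, and the candidate array stands in
for successive picks from the freelist or clock sweep.

```c
#include <assert.h>
#include <stddef.h>
#include <stdbool.h>

/* Simplified stand-in for a buffer descriptor (hypothetical field). */
typedef struct Buf { int pin_count; } Buf;

/* Try to take exclusive ownership of a candidate victim.  Fails if
 * someone else pinned the page first, which is the race in question. */
static bool try_pin(Buf *b)
{
    if (b->pin_count > 0)
        return false;       /* wrenched away: someone beat us to it */
    b->pin_count = 1;
    return true;
}

/* Walk successive candidates until one can actually be pinned.  In
 * theory a backend can lose this race every time, which is the
 * starvation scenario. */
Buf *get_victim(Buf **candidates, int n)
{
    for (int i = 0; i < n; i++) {
        if (try_pin(candidates[i]))
            return candidates[i];   /* we own it; safe to evict */
        /* else retry with the next candidate */
    }
    return NULL;    /* every pick was stolen; caller must keep trying */
}
```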
I haven't yet seen any evidence that that's a problem in practice, but
if it is, this idea will make it worse. It seems hard to know without
testing it, though.

The big problem with this idea is that it pretty much requires that the
work you mentioned in one of your other emails - separating the
background writer and checkpoint machinery into two separate processes
- happen first. So I'm probably going to have to code that up to see
whether this works. If you're planning to post that patch soon, I'll
wait for it. Otherwise, I'm going to have to cobble together something
that is at least good enough for testing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company