Discussion: Buffer Allocation Concurrency Limits
In December, Metin (a coworker of mine) discussed an inability to scale a simple task (parallel scans of many independent tables) to many cores (it's here). As a ramp-up task at Citus, I was asked to figure out what the heck was going on.
I have a pretty extensive writeup here (whose length is more a result of my inexperience with the workings of PostgreSQL than anything else) and am looking for some feedback.
In short, my conclusion is that a working set larger than memory results in backends piling up on BufFreelistLock; a condensed sketch of the contended code path follows the list below. As much as possible, I removed anything else that could be blamed for the slowdown:
- Hyper-Threading is disabled
- zone reclaim mode is disabled
- numactl was used to ensure interleaved allocation
- kernel.sched_migration_cost was set very high to discourage migration
- kernel.sched_autogroup_enabled was disabled
- transparent hugepage support was disabled
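To make the serialization concrete, here is a condensed sketch of the allocation path as I understand it from the 9.3 sources. This is a paraphrase from memory, not a verbatim excerpt: the freelist check, pin handling under the buffer-header spinlock, and the bgwriter bookkeeping are all elided. The point is simply that every tick of the clock hand happens under an exclusive acquisition of BufFreelistLock, so victim searches from all backends serialize on that single lock:

```c
/*
 * Condensed paraphrase of the 9.3-era StrategyGetBuffer() loop (not a
 * verbatim excerpt).  Every advance of the shared clock hand requires
 * holding BufFreelistLock exclusively, so concurrent victim searches
 * from many backends all serialize here.
 */
BufferDesc *
StrategyGetBuffer_sketch(void)
{
    BufferDesc *buf;

    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);

    /* ... freelist check elided; the list is empty once the cache is warm ... */

    for (;;)
    {
        /* Advance the shared clock hand -- only legal under the lock. */
        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
            StrategyControl->nextVictimBuffer = 0;

        LWLockRelease(BufFreelistLock);

        if (buf->refcount == 0)
        {
            if (buf->usage_count == 0)
                return buf;         /* unpinned and cold: our victim */
            buf->usage_count--;     /* count it down for a later pass */
        }

        /* No victim yet: queue up on the lock again for the next tick. */
        LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
    }
}
```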
For a way forward, I was thinking the buffer allocation code could use some of the atomics Andres added here. Rather than each worker grabbing BufFreelistLock to advance the clock hand until it finds a victim, the algorithm could be rewritten in a lock-free style, allowing workers to move the clock hand in tandem.
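As a strawman, below is a minimal, self-contained sketch of what I have in mind. It uses C11 atomics in place of the pg_atomic_* wrappers from that patchset, and it reduces buffer state to a bare usage count: pin counts, buffer-header locks, and the completePasses/wraparound bookkeeping are all ignored. Each worker claims a distinct tick of the hand with a fetch-add and then examines "its" buffer without taking any global lock:

```c
/*
 * Minimal lock-free clock-sweep sketch using C11 atomics.  All names
 * here (FakeBufferDesc, next_victim, clock_sweep_victim) are mine, not
 * PostgreSQL's.  Pinning and buffer-header locking are omitted.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NBUFFERS 16

typedef struct
{
    atomic_uint usage_count;        /* stand-in for the real usage_count */
} FakeBufferDesc;

static FakeBufferDesc buffers[NBUFFERS];
static _Atomic uint64_t next_victim; /* monotonically increasing clock hand */

/* Claim clock ticks until we find a buffer whose usage count is zero. */
static int
clock_sweep_victim(void)
{
    for (;;)
    {
        uint64_t tick = atomic_fetch_add(&next_victim, 1);
        FakeBufferDesc *buf = &buffers[tick % NBUFFERS];
        unsigned count = atomic_load(&buf->usage_count);

        for (;;)
        {
            if (count == 0)
                return (int) (tick % NBUFFERS);     /* candidate victim */

            /* Try to decrement; on failure, 'count' is reloaded for us. */
            if (atomic_compare_exchange_weak(&buf->usage_count,
                                             &count, count - 1))
                break;              /* decremented; move the hand onward */
        }
    }
}

int
main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        atomic_init(&buffers[i].usage_count, (unsigned) (i % 3));

    printf("victim buffer: %d\n", clock_sweep_victim());
    return 0;
}
```

The real code would of course still have to recheck the pin count under the buffer-header spinlock before evicting, but the hand itself never needs BufFreelistLock, so sweeps from many backends can proceed in tandem.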
Alternatively, the clock iteration could be moved off to a background process, similar to what Amit Kapila proposed here.
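In the same hedged spirit, here is a self-contained sketch of that direction, with pthreads standing in for PostgreSQL processes and LWLocks; the names and watermarks are my invention, not taken from Amit's patch. A reclaimer thread runs the clock sweep by itself (so the hand needs no lock at all) and keeps a small freelist topped up, shrinking a backend's allocation path to a short lock/pop/unlock:

```c
/* Background-reclaimer sketch: pthreads stand in for PG processes. */
#include <pthread.h>
#include <stdio.h>

#define NBUFFERS      64
#define FREELIST_MAX   8

static unsigned usage_count[NBUFFERS];  /* fake buffer usage counts */
static unsigned clock_hand;             /* touched only by the reclaimer */

static int freelist[FREELIST_MAX];
static int freelist_len;
static pthread_mutex_t freelist_lck = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t freelist_nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t freelist_low = PTHREAD_COND_INITIALIZER;

/* Clock sweep, run only by the reclaimer, so the hand needs no lock. */
static int
sweep_one_victim(void)
{
    for (;;)
    {
        int id = (int) clock_hand;
        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (usage_count[id] == 0)
            return id;
        usage_count[id]--;
    }
}

/* Reclaimer: keep the freelist topped up to FREELIST_MAX entries. */
static void *
reclaimer_main(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&freelist_lck);
    for (;;)
    {
        while (freelist_len >= FREELIST_MAX)
            pthread_cond_wait(&freelist_low, &freelist_lck);

        /* Sweep outside the lock so backends can pop meanwhile. */
        pthread_mutex_unlock(&freelist_lck);
        int victim = sweep_one_victim();
        pthread_mutex_lock(&freelist_lck);

        freelist[freelist_len++] = victim;
        pthread_cond_signal(&freelist_nonempty);
    }
    return NULL;
}

/* Backend path: no clock sweep, just pop a prepared victim. */
static int
get_victim_buffer(void)
{
    pthread_mutex_lock(&freelist_lck);
    while (freelist_len == 0)
        pthread_cond_wait(&freelist_nonempty, &freelist_lck);
    int victim = freelist[--freelist_len];
    if (freelist_len < FREELIST_MAX / 2)
        pthread_cond_signal(&freelist_low);     /* wake the reclaimer */
    pthread_mutex_unlock(&freelist_lck);
    return victim;
}

int
main(void)
{
    pthread_t reclaimer;
    for (int i = 0; i < NBUFFERS; i++)
        usage_count[i] = (unsigned) (i % 4);    /* arbitrary demo state */
    pthread_create(&reclaimer, NULL, reclaimer_main, NULL);
    for (int i = 0; i < 5; i++)
        printf("victim: %d\n", get_victim_buffer());
    return 0;   /* process exit tears down the reclaimer thread */
}
```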
Is this assessment accurate? I know 9.4 changes a lot about lock organization, but last I looked I didn’t see anything that could alleviate this contention: are there any plans to address this?
—Jason
On Tue, Apr 8, 2014 at 10:38 PM, Jason Petersen <jason@citusdata.com> wrote:
> In December, Metin (a coworker of mine) discussed an inability to scale a
> simple task (parallel scans of many independent tables) to many cores (it's
> here). As a ramp-up task at Citus, I was asked to figure out what the heck
> was going on.
>
> I have a pretty extensive writeup here (whose length is more a result of my
> inexperience with the workings of PostgreSQL than anything else) and am
> looking for some feedback.

At the moment I am not able to open the above link (here); maybe there is some problem (it's showing Service Unavailable). I will try it later.

> In short, my conclusion is that a working set larger than memory results in
> backends piling up on BufFreelistLock.

Here, when you say a working set larger than memory, do you mean shared_buffers by *memory*? I think if the data is larger than the total memory available, the effect of I/O can overshadow the effect of BufFreelistLock contention anyway.

> As much as possible, I removed anything else that could be blamed for the
> slowdown:
>
> Hyper-Threading is disabled
> zone reclaim mode is disabled
> numactl was used to ensure interleaved allocation
> kernel.sched_migration_cost was set very high to discourage migration
> kernel.sched_autogroup_enabled was disabled
> transparent hugepage support was disabled
>
> For a way forward, I was thinking the buffer allocation code could use
> some of the atomics Andres added here. Rather than each worker grabbing
> BufFreelistLock to advance the clock hand until it finds a victim, the
> algorithm could be rewritten in a lock-free style, allowing workers to move
> the clock hand in tandem.
>
> Alternatively, the clock iteration could be moved off to a background
> process, similar to what Amit Kapila proposed here.

I think both of the above ideas can be useful, but I am not sure whether they are sufficient for scaling shared buffers.

> Is this assessment accurate? I know 9.4 changes a lot about lock
> organization, but last I looked I didn't see anything that could alleviate
> this contention: are there any plans to address this?

I am planning to work on this for 9.5.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com