Discussion: Postgres 8.4 memory related parameters
hi,

We recently bought a 4 x 8-core, 128GB memory database server, and I am setting it up to replace our old 4 x 4-core, 128GB memory database server as a master. The memory related settings that we use on the old machine seem a bit wrong according to the experts on IRC:

max_connections = 600
shared_buffers = 32GB
effective_cache_size = 64GB
work_mem = 5MB
maintenance_work_mem = 1GB
wal_buffers = 64kB

We are using Ubuntu 10. In /etc/sysctl.conf:

kernel.shmmax=35433480192
kernel.shmall=8650752

Can anyone suggest better values?

thanks,
Claire
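For reference, kernel.shmall is measured in pages while kernel.shmmax is in bytes, and the two values above are consistent with each other, as a quick check shows (this assumes the usual 4 kB page size, which `getconf PAGE_SIZE` will confirm):

```shell
# shmmax is in bytes; shmall is in 4 kB pages -- the two should agree
shmmax=35433480192
page_size=4096
echo $((shmmax / page_size))   # prints 8650752, matching kernel.shmall
```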
Claire Chang <yenhsiac@yahoo.com> wrote:

> hi, We recently bought a 4 x 8-core, 128GB memory database server
> and I am setting it up to replace our old 4 x 4-core, 128GB memory
> database server as a master. The memory related settings that we
> use on the old machine seem a bit wrong according to the experts
> on IRC:
>
> max_connections = 600

You're probably going to get better performance by setting that to 2 to 3 times the number of actual cores (don't count hyperthreading for this purpose), and using a connection pooler to funnel the 600 user connections down to a smaller number of database connections.

> shared_buffers = 32GB

I seem to remember seeing some benchmarks showing that performance falls off after 10GB or 20GB on that setting.

> effective_cache_size = 64GB

Seems sane.

> work_mem = 5MB

You could bump that up, especially if you go to the connection pool.

> maintenance_work_mem = 1GB

OK, but I might double that.

> wal_buffers = 64kB

This should definitely be set to 16MB.

-Kevin
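Pulling those suggestions together, a revised postgresql.conf fragment might look like the following. This is a sketch, not a drop-in config: the exact max_connections depends on the pooler and core count, and the work_mem value here is only an illustration of "bump that up".

```
max_connections = 100          # pooler funnels ~600 clients down to this
shared_buffers = 16GB          # benchmarks suggest falloff past 10-20GB
effective_cache_size = 64GB
work_mem = 16MB                # safe to raise once a pooler caps connections
maintenance_work_mem = 2GB
wal_buffers = 16MB
```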
On 08/04/2011 03:38 PM, Kevin Grittner wrote:

> You're probably going to get better performance by setting that to 2
> to 3 times the number of actual cores (don't count hyperthreading
> for this purpose), and using a connection pooler to funnel the 600
> user connections down to a smaller number of database connections.

Your note about hyperthreading *used* to be true. I'm not sure exactly what they did to the Intel Nehalem cores, but hyperthreading actually seems to be much better now. It's not a true multiplier, but our pgbench scores were 40% to 60% higher with HT enabled up to at least 5x the number of cores. I was honestly shocked at those results, but they were consistent across multiple machines from two separate vendors.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com

See http://www.peak6.com/email_disclaimer.php for terms and conditions related to this email
On Thu, Aug 4, 2011 at 2:38 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Claire Chang <yenhsiac@yahoo.com> wrote:
>
>> hi, We recently bought a 4 x 8-core, 128GB memory database server
>> and I am setting it up to replace our old 4 x 4-core, 128GB memory
>> database server as a master. The memory related settings that we
>> use on the old machine seem a bit wrong according to the experts
>> on IRC:
>>
>> max_connections = 600
>
> You're probably going to get better performance by setting that to 2
> to 3 times the number of actual cores (don't count hyperthreading
> for this purpose), and using a connection pooler to funnel the 600
> user connections down to a smaller number of database connections.
>
>> shared_buffers = 32GB
>
> I seem to remember seeing some benchmarks showing that performance
> falls off after 10GB or 20GB on that setting.
>
>> effective_cache_size = 64GB
>
> Seems sane.
>
>> work_mem = 5MB
>
> You could bump that up, especially if you go to the connection pool.
>
>> maintenance_work_mem = 1GB
>
> OK, but I might double that.
>
>> wal_buffers = 64kB
>
> This should definitely be set to 16MB.

Agreed with everything so far. A few other points.

If you're doing a LOT of writing, and the immediate working set will fit in a smaller shared_buffers, then lower it to something in the 1 to 4GB range max. Lots of writes and a large shared_buffers do not mix well; I have gotten much better performance from lowering shared_buffers on machines that need to write a lot.

I run Ubuntu 10.04 for my big PostgreSQL servers right now. With that in mind, here are some pointers. I'd recommend adding this to rc.local:

# turns off swap
/sbin/swapoff -a

I had a few weird kswapd storms where the kernel just seemed to get confused about having 128GB of RAM and swap space. The machine was lagging very hard at odd times of the day until I just turned off swap.
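If disabling swap outright feels too drastic, a common and less invasive alternative is to tell the kernel to avoid swapping as much as possible. This is a suggestion on my part, not something tested in this thread:

```
# /etc/sysctl.conf -- strongly discourage swapping without disabling it
vm.swappiness = 0
```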
and if you have a caching RAID controller with battery backup, then I'd add a line like this:

echo noop > /sys/block/sda/queue/scheduler

for every RAID drive you have. Any other scheduler really just gets in the way of a good caching RAID controller.

There are also some parameters that affect how fast dirty caches are written out by the OS; they're worth looking into, but they made no big difference on my 128GB, 48-core, 34 x 15krpm drive system.

If you do a lot of inserts / updates / deletes, then look at making vacuum more aggressive. Also look at making the bgwriter a bit more aggressive, cranking up checkpoint_timeout, and having lots of checkpoint segments.
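Putting those OS-level pieces together, an rc.local sketch might look like this. The device names and the dirty-cache values are illustrative assumptions, not values from this thread; adjust them for your own hardware:

```shell
# use the noop elevator on every RAID-backed device; the controller's
# battery-backed cache does the real I/O scheduling
for dev in sda sdb sdc; do
    echo noop > /sys/block/$dev/queue/scheduler
done

# limit how much dirty data the OS accumulates before writing it out
echo 5 > /proc/sys/vm/dirty_ratio
echo 1 > /proc/sys/vm/dirty_background_ratio
```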
Shaun Thomas <sthomas@peak6.com> wrote:
> On 08/04/2011 03:38 PM, Kevin Grittner wrote:
>
>> You're probably going to get better performance by setting that
>> to 2 to 3 times the number of actual cores (don't count
>> hyperthreading for this purpose), and using a connection pooler
>> to funnel the 600 user connections down to a smaller number of
>> database connections.
>
> Your note about Hyperthreading *used* to be true. I'm not sure
> exactly what they did to the Intel Nehalem cores, but
> hyperthreading actually seems to be much better now. It's not a
> true multiplier, but our pgbench scores were 40% to 60% higher
> with HT enabled up to at least 5x the number of cores.

Note that I didn't recommend not *using* HT, just not counting it toward the core count for purposes of calculating how many active connections to allow.

The important question for this purpose isn't whether you ran faster with HT enabled, but where you hit the knee in the performance graph. Did you actually run faster with 5x more active connections than cores than with 3x more connections than cores? Was the load hitting disk or running entirely from cache? If the former, what was your cache hit rate, and how many spindles did you have in what configuration?

I oversimplified slightly from the formula I actually have been using based on benchmarks here, which is ((2 * core_count) + effective_spindle_count). Mostly I simplified because it's so hard to explain how to calculate an "effective_spindle_count". ;-)

Anyway, I'm always willing to take advantage of the benchmarking work of others, so I'm very curious about where performance topped out for you with HT enabled, and whether disk waits were part of the mix.

-Kevin
On 08/04/2011 04:36 PM, Kevin Grittner wrote:

> Anyway, I'm always willing to take advantage of the benchmarking
> work of others, so I'm very curious about where performance topped
> out for you with HT enabled, and whether disk waits were part of the
> mix.

Hah. Well, it peaked at 2x physical cores, where it ended up being 60% faster than true cores. It started to fall after that, until I hit 64 concurrent connections and it dropped down to 36% faster. I should also note that this is with core turbo enabled and performance-mode BIOS settings so it never goes into power-saving mode. Without those, our results were inconsistent, with variance of up to 40% per run, on top of 40% worse performance at concurrency past 2x core count.

I tested along a scale from 1 to 64 concurrent connections at a scale of 100 so it would fit in memory. I was trying to test some new X5675 cores against our old E7450s. The scary part was that a dual X5675 ended up being 2.5x faster than a quad E7450 at 24-user concurrency. It's unreal. It's a great way to save on per-core licensing fees.

We're also on an 8.2 database. We're upgrading soon, I promise. :)

--
Shaun Thomas
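For anyone wanting to reproduce this kind of sweep, the test described above would look roughly like the following pgbench invocations. The database name and per-run duration are assumptions; the scale factor of 100 and the 1-to-64 client range are from the description:

```
pgbench -i -s 100 bench            # initialize at scale 100 so it fits in RAM
for c in 1 2 4 8 12 24 48 64; do
    pgbench -c $c -T 60 bench      # one timed run per concurrency level
done
```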
Shaun Thomas <sthomas@peak6.com> wrote:

> it peaked at 2x physical cores, where it ended up being 60%
> faster than true cores.

Not sure I understand the terminology here -- is "physical cores" counting HT or not?

Thanks,
-Kevin
On 08/04/2011 04:54 PM, Kevin Grittner wrote:

>> it peaked at 2x physical cores, where it ended up being 60%
>> faster than true cores.
>
> Not sure I understand the terminology here -- "physical cores" is
> counting HT or not?

No. So with a dual X5675, that's 12 cores. My numbers peaked at 24-concurrency. At that concurrency, HT was 60% faster than non-HT. Sorry if I mixed my terminology. :)

--
Shaun Thomas
Shaun Thomas <sthomas@peak6.com> wrote:

> So with a dual X5675, that's 12 cores. My numbers peaked at
> 24-concurrency. At that concurrency, HT was 60% faster than
> non-HT. Sorry if I mixed my terminology. :)

No problem -- I appreciate the information. I just wanted to be sure I was understanding it correctly.

So, with hyperthreading turned on, the optimal number of active connections was twice the actual cores. And since the active data set was fully cached, disk spindles were not a resource which played any significant role in the test, making the "effective spindle count" zero. So this is one more data point confirming the overall accuracy of the formula I use, and providing evidence that it is not affected by use of hyperthreading if you base your numbers on actual cores.

optimal pool size = ((2 * actual core count) + effective spindle count)
optimal pool size = ((2 * 12) + 0)
optimal pool size = 24

Thanks!
-Kevin
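The rule of thumb above is simple enough to express directly. A sketch; "effective spindle count" still has to be estimated for your particular hardware and workload:

```python
def optimal_pool_size(actual_cores: int, effective_spindles: int) -> int:
    """Kevin Grittner's rule of thumb for sizing a connection pool."""
    return 2 * actual_cores + effective_spindles

# Shaun's fully cached dual X5675: 12 real cores, zero effective spindles.
print(optimal_pool_size(12, 0))   # 24, matching the worked example above
```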
On 08/05/2011 09:00 AM, Kevin Grittner wrote:

> optimal pool size = ((2 * actual core count) + effective spindle
> count)

How does that work? If your database fits in memory, your optimal TPS is only constrained by CPU. Any fetches from disk reduce your throughput from IO waits. How do you account for SSDs/PCIe cards, which act as an effective spindle multiplier?

I've seen Java apps that, through the use of several systems using Java Hibernate pool sharing, are not compatible with connection poolers such as PgBouncer. As such, they had 50x CPU count and still managed 12,000 TPS because everything in use was cached. Throw a disk seek or two in there, and it drops down to 2,000 or less. Throw in a PCIe card, and pure streams of "disk" reads remain at 12,000 TPS.

It just seems a little counter-intuitive. I totally agree that it's not optimal to have connections higher than effective threads, but *adding* spindles? I'd be more inclined to believe this:

optimal pool size = 3*cores - cores/spindles

Then, as your spindles increase, you're subtracting less and less until you reach the optimal 3x.

One disk, but on a 4-CPU system? 12 - 4 = 8. So you'd have the classic 2x cores.

On a RAID 1+0 with 4 disk pairs (still 4 CPUs)? 12 - 1 = 11.

On a giant SAN with a couple dozen disks, or a PCIe card that tests an order of magnitude faster than a 6-disk RAID? 12 - [small fraction] = 12.

It still fits your 3x rule, but seems to actually account for the fact disks suck. :p

--
Shaun Thomas
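The alternative formula above can be sketched the same way, and its worked examples check out. Treating spindles as a float (so "4 disk pairs" counts as 4) is my reading of the examples:

```python
def shaun_pool_size(cores: int, spindles: float) -> float:
    """Shaun Thomas's proposed alternative: penalize slow storage,
    converging toward 3x cores as spindle count grows."""
    return 3 * cores - cores / spindles

print(shaun_pool_size(4, 1))   # 8.0  -- one disk, 4 CPUs: the classic 2x cores
print(shaun_pool_size(4, 4))   # 11.0 -- RAID 1+0 with 4 pairs, 4 CPUs
```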
Shaun Thomas <sthomas@peak6.com> wrote:
> On 08/05/2011 09:00 AM, Kevin Grittner wrote:
>
>> optimal pool size = ((2 * actual core count) + effective spindle
>> count)
>
> How does that work? If your database fits in memory, your optimal
> TPS is only constrained by CPU. Any fetches from disk reduce your
> throughput from IO waits.

I think you're misunderstanding the purpose of the formula. I'm not saying that causing processes to wait for disk speeds things up; clearly things will run faster if the active data set is cached. What I'm saying is that if processes are blocked waiting for disk, they are not going to be using CPU, and there is room for that many additional processes to be useful, as the CPUs and other drives would otherwise be sitting idle.

> How do you account for SSDs/PCIe cards which act as an
> effective spindle multiplier?

The "effective spindle count" is basically about "how many resources are reads typically waiting on with this hardware and workload". Perhaps a better name for that could be chosen, but it's the best I've come up with.

> I'd be more inclined to believe this:
>
> optimal pool size = 3*cores - cores/spindles
>
> Then, as your spindles increase, you're subtracting less and less
> until you reach optimal 3x.

Well, to take an extreme example in another direction, let's hypothesize a machine with one CPU and 24 drives, where random disk access is the bottleneck for the workload. My formula would have 26 processes, which would typically be running with 26 blocked waiting on a read for a cache miss, while the other two processes would be serving up responses for cache hits and getting requests for the next actual disk reads ready. Your formula would have two processes struggling to get reads going on 24 spindles while also serving up cached data.

Just because disk speed sucks doesn't mean you don't want to do your disk reads in parallel; quite the opposite!

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > which would typically be running with 26 blocked waiting > on a read for a cache miss, I meant 24 there. -Kevin
On 08/05/2011 09:58 AM, Kevin Grittner wrote:

> What I'm saying is that if processes are blocked waiting for disk
> they are not going to be using CPU, and there is room for that many
> additional processes to be useful, as the CPUs and other drives
> would otherwise be sitting idle.

Haha. The way you say that made me think back on the scenario with 4 CPUs and 1 disk. Naturally that kind of system is IO starved, and will probably sit at 10% IO-wait or more at any notable activity level. I was like... "Well, so long as everything is waiting anyway, why not just increase it to 100?"

Now, typically you want to avoid context switching. Certain caveats need to be made for anything with less than two, or even four, CPUs because of various system and Postgres monitoring/maintenance threads. My own benchmarks illustrate (to me, anyway) that generally, performance peaks when PG threads equal CPU threads *however they're supplied* -- never minding the fudge factor for idling connections waiting on IO, which you said yourself can be problematic the more of them there are. :)

I'd say just put it at 2x, maybe 3x, and call it good. Realistically you won't really notice further tweaking, and a really active system would converge to CPU count through a pooler and be cached to the gills anyway.

--
Shaun Thomas
Kevin Grittner wrote:
> Claire Chang <yenhsiac@yahoo.com> wrote:
>
>> shared_buffers = 32GB
>
> I seem to remember seeing some benchmarks showing that performance
> falls off after 10GB or 20GB on that setting.

Not even quite that high. I've never heard of a setting over 10GB being anything other than worse than a smaller setting, and that was on Solaris. At this point I never consider a value over 8GB, and even that needs to be carefully matched against how heavy the writes on the server are. You just can't set shared_buffers to a huge value in PostgreSQL yet, and "huge" means ">8GB" right now.

Note that the problems you can run into with too much buffer cache are much worse with a low setting for checkpoint_segments... and this configuration doesn't change it at all from the tiny default. That should go to at least 64 on a server this size.

--
Greg Smith   2ndQuadrant US   greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.us
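As a postgresql.conf fragment, that advice for a server this size might look like the following. Only the 8GB ceiling and the 64-segment floor come from the message above; the completion-target line is a common companion setting (available since 8.3) added here as a suggestion:

```
shared_buffers = 8GB                  # practical ceiling; match against write load
checkpoint_segments = 64              # at least this on a server this size
checkpoint_completion_target = 0.9    # spread checkpoint I/O out over time
```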