Discussion: sniff test on some PG 8.4 numbers
I was hoping to just get a "gut reaction" on some pgbench numbers I have, to see if I'm in the ballpark.

OS: Scientific Linux 6.3, x86_64
Hardware: 4x real disks (not SSD) behind an LSI 9260 in RAID10, Xeon E5-2680 with hyperthreading OFF, 128GB of RAM.
Setup: PostgreSQL 8.4.13, ext4, barriers ON, disk write cache *off*, write-back enabled on the LSI.

I initialized with sizes of 100, 200, and 400. I've done some tuning of the postgresql config, but mostly I'm just trying to find out if I'm in the right ballpark.

I ran pgbench from another (similar) host:

pgbench -h BLAH -c 32 -M prepared -t 100000 -S
I get 95,000 to 100,000 tps.

pgbench -h BLAH -c 32 -M prepared -t 100000
seems to hover around 6,200 tps (size 100) to 13,700 tps (size 400).

Do these basically sniff right?

(NOTE: with barriers off, I get a slight increase - 10% - in the read-write test, and a larger *decrease* - 15% - with the read-only test @ 400. No change @ 100.)

--
Jon
> Do these basically sniff right?

Well, the read test seems reasonable. I'm impressed by the speed of the write test ... how large is the RAID card cache?

And why 8.4? Can you try 9.2?

> (NOTE: with barriers off, I get a slight increase - 10% - in the
> read-write test, and a larger *decrease* - 15% - with the read-only
> test @ 400. No change @ 100)

Oh, interesting. Can you reproduce that? I wonder what would cause read-only to drop without barriers ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Mar 5, 2013 at 7:02 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Do these basically sniff right?
>
> Well, the read test seems reasonable. I'm impressed by the speed of the
> write test ... how large is the raid card cache?
>
> And why 8.4? Can you try 9.2?

8.4 because it's what I've got, basically. I might be able to try 9.2 later, but I'm targeting 8.4 right now. 512MB of memory on the card.

>> (NOTE: with barriers off, I get a slight increase - 10% - in the
>> read-write test, and a larger *decrease* - 15% - with the read-only
>> test @ 400. No change @ 100)
>
> Oh, interesting. Can you reproduce that? I wonder what would cause
> read-only to drop without barriers ...

I'll try to test again soon. I know that if I use writethrough instead of writeback mode, the performance nosedives.

Does anybody have suggestions for stripe size? (remember: *4* disks)

--
Jon
On Tue, Mar 5, 2013 at 1:35 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>
> pgbench -h BLAH -c 32 -M prepared -t 100000 -S
> I get 95,000 to 100,000 tps.
>
> pgbench -h BLAH -c 32 -M prepared -t 100000
> seems to hover around 6,200 tps (size 100) to 13,700 (size 400)

Some followup:
The read test goes (up to) 133K tps, and the read-write test to 22K tps when performed over localhost.

--
Jon
On 3/5/13 10:00 PM, Jon Nelson wrote:
> On Tue, Mar 5, 2013 at 1:35 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>>
>> pgbench -h BLAH -c 32 -M prepared -t 100000 -S
>> I get 95,000 to 100,000 tps.
>>
>> pgbench -h BLAH -c 32 -M prepared -t 100000
>> seems to hover around 6,200 tps (size 100) to 13,700 (size 400)
>
> Some followup:
> The read test goes (up to) 133K tps, and the read-write test to 22k
> tps when performed over localhost.

All your write numbers are inflated because the test is too short. This hardware will be lucky to sustain 7,500 TPS on writes. But you're only writing 100,000 transactions, which means the entire test run isn't even hitting the database--only the WAL writes are. When your test run is finished, look at /proc/meminfo. I'd wager a large sum you'll find "Dirty:" holds hundreds of megabytes, if not gigabytes, of unwritten information. Basically, 100,000 writes on this sort of server can all be cached in Linux's write cache, and pgbench won't force them out of there. So you're not simulating sustained database writes, only how fast a burst the server can handle for a little bit.

For a write test, you must run for long enough to start and complete a checkpoint before the numbers are of any use, and 2 checkpoints are even better. The minimum useful length is a 10-minute run, so "-T 600" instead of using -t. If you want something that does every trick possible to make it hard to cheat at this, as well as letting you graph size and client data, try my pgbench-tools: https://github.com/gregs1104/pgbench-tools (Note that there is a bug in that program right now: it spawns vmstat and iostat processes, but they don't get killed at the end correctly. "killall vmstat iostat" after running is a good idea until I fix that.)

Your read test numbers are similarly inflated, but read test errors aren't as large. Around 133K TPS on select-only is probably accurate. For a read test, use "-T 30" to let it run for 30 seconds to get a more accurate number.
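Greg's /proc/meminfo check and the time-bounded rerun he suggests can be sketched as follows. A minimal example: the awk pattern assumes the standard Linux /proc/meminfo layout, and the pgbench flags are the ones already used in this thread.

```shell
# Right after a short pgbench write run, see how much dirty data is still
# sitting in the Linux page cache, unflushed (value is reported in kB):
awk '/^Dirty:/ {print "Dirty pages:", $2, "kB"}' /proc/meminfo

# Then rerun time-bounded instead of transaction-bounded, so at least one
# full checkpoint cycle lands inside the measurement window:
# pgbench -h BLAH -c 32 -M prepared -T 600
```

If "Dirty:" is still large after the run ends, the reported TPS was measuring the OS write cache, not the disks.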
The read bottleneck on your hardware is going to be the pgbench client itself, which on 8.4 runs as a single thread. On 9.0+ you can have multiple pgbench workers. It normally takes 4 to 8 of them to saturate a larger server.

I hope you're not considering deploying a new application with 8.4. Take a look at http://www.postgresql.org/support/versioning/ and you'll see 8.4 only has a little over a year before it won't get bug fixes anymore. Also, your server would really appreciate the performance gains added to 9.2. If that's a bit too leading edge for you, I don't recommend deploying at a version below 9.1 anymore.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
On Sun, Mar 10, 2013 at 10:46 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 3/5/13 10:00 PM, Jon Nelson wrote:
>> On Tue, Mar 5, 2013 at 1:35 PM, Jon Nelson <jnelson+pgsql@jamponi.net>
>> wrote:
>>>
>>> pgbench -h BLAH -c 32 -M prepared -t 100000 -S
>>> I get 95,000 to 100,000 tps.
>>>
>>> pgbench -h BLAH -c 32 -M prepared -t 100000
>>> seems to hover around 6,200 tps (size 100) to 13,700 (size 400)
>>
>> Some followup:
>> The read test goes (up to) 133K tps, and the read-write test to 22k
>> tps when performed over localhost.
>
> All your write numbers are inflated because the test is too short. This
> hardware will be lucky to sustain 7500 TPS on writes. But you're only
> writing 100,000 transactions, which means the entire test run isn't even
> hitting the database--only the WAL writes are. When your test run is
> finished, look at /proc/meminfo. I'd wager a large sum you'll find "Dirty:"
> has hundreds of megabytes, if not gigabytes, of unwritten information.
> Basically, 100,000 writes on this sort of server can all be cached in
> Linux's write cache, and pgbench won't force them out of there. So you're
> not simulating sustained database writes, only how fast of a burst the
> server can handle for a little bit.
>
> For a write test, you must run for long enough to start and complete a
> checkpoint before the numbers are of any use, and 2 checkpoints are even
> better. The minimum useful length is a 10 minute run, so "-T 600" instead
> of using -t. If you want something that does every trick possible to make
> it hard to cheat at this, as well as letting you graph size and client data,
> try my pgbench-tools: https://github.com/gregs1104/pgbench-tools (Note that
> there is a bug in that program right now, it spawns vmstat and iostat
> processes but they don't get killed at the end correctly. "killall vmstat
> iostat" after running is a good idea until I fix that).

I (briefly!)
acquired a machine identical to the last one, but this time with an Areca instead of an LSI (4 drives). The following is with ext4, nobarrier, and noatime. As noted in the original post, I have done a fair bit of system tuning. I have dirty_bytes and dirty_background_bytes set to 3GB and 2GB, respectively.

I built 9.2; using 9.2 and the following pgbench invocation:

pgbench -j 8 -c 32 -M prepared -T 600

transaction type: TPC-B (sort of)
scaling factor: 400
query mode: prepared
number of clients: 32
number of threads: 8
duration: 600 s
number of transactions actually processed: 16306693
tps = 27176.566608 (including connections establishing)
tps = 27178.518841 (excluding connections establishing)

> Your read test numbers are similarly inflated, but read test errors aren't
> as large. Around 133K TPS on select-only is probably accurate. For a read
> test, use "-T 30" to let it run for 30 seconds to get a more accurate
> number. The read bottleneck on your hardware is going to be the
> pgbench client itself, which on 8.4 is running as a single thread. On 9.0+
> you can have multiple pgbench workers. It normally takes 4 to 8 of them to
> saturate a larger server.

The 'select-only' test (same as above with '-S'):

starting vacuum...end.
transaction type: SELECT only
scaling factor: 400
query mode: prepared
number of clients: 32
number of threads: 8
duration: 600 s
number of transactions actually processed: 127513307
tps = 212514.337971 (including connections establishing)
tps = 212544.392278 (excluding connections establishing)

These are the *only* changes I've made to the config file:

shared_buffers = 32GB
wal_buffers = 16MB
checkpoint_segments = 1024

I can run either or both of these again with different options, but mostly I'm looking for a sniff test. However, I'm a bit confused now. It seems as though you say the write numbers are not believable, suggesting a value of 7,500 (roughly 1/4 of what I'm getting).
If I run the read test for 30 seconds I get - highly variable - between 300K and 400K tps. Why are these tps so high compared to your expectations?

Note: I did get better results with HT on vs. with HT off, so I've left HT on for now.

--
Jon
On Sun, Mar 10, 2013 at 7:18 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> It seems as though you say the write numbers are not believable,
> suggesting a value of 7,500 (roughly 1/4 what I'm getting). If I run
> the read test for 30 seconds I get - highly variable - between 300K
> and 400K tps. Why are these tps so high compared to your expectations?
> Note: I did get better results with HT on vs. with HT off, so I've
> left HT on for now.

Go back and re-read Greg's post. He explains why he thinks you'll sustain less: basically it's caching effects, because no pg_xlog / WAL writing is happening yet. Once you get a feel for how fast it is, run the test for 30 minutes to several hours to see how it goes. Then, when you have a weekend, just leave it running a couple of days.

Still pretty good numbers so far.
On 3/10/13 9:18 PM, Jon Nelson wrote:
> The following is with ext4, nobarrier, and noatime. As noted in the
> original post, I have done a fair bit of system tuning. I have the
> dirty_bytes and dirty_background_bytes set to 3GB and 2GB,
> respectively.

That's good, but be aware those values are still essentially unlimited write caches. A server with 4 good but regular hard drives might do as little as 10MB/s of random writes on a real workload. If 2GB of data ends up dirty, the flushing that happens at the end of a database checkpoint will need to clear all of that out of RAM. When that happens, you're looking at a 3-minute-long cache flush to push out 2GB. It's not unusual for pgbench tests to pause for over a minute straight when that happens. With your setup, where checkpoints happen every 5 minutes, this is only happening once per test run. The disruption isn't easily visible if you look at the average rate; it's outweighed by the periods where writes happen very fast because the cache isn't full yet. You have to get pgbench to plot latency over time to see them, and then analyze that data. This problem is the main reason I put together the pgbench-tools set for running things, because once you get to processing the latency files and making graphs from them, it starts to be a pain to look at the results.

> I built 9.2 and using 9.2 and the following pgbench invocation:
>
> pgbench -j 8 -c 32 -M prepared -T 600
>
> transaction type: TPC-B (sort of)
> scaling factor: 400

I misread this completely in your message before; I thought you wrote 4000. A scaling factor of 400 is making a database that's 6GB in size. Your test is basically seeing how fast the system memory and the RAID cache can move things around. In that situation, your read and write numbers are reasonable. They aren't actually telling you anything useful about the disks though, because they're barely involved here. You've sniffed the CPU, memory, and RAID controller and they smell fine.
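The latency-over-time data Greg describes can also be pulled out by hand, without pgbench-tools: pgbench's -l flag logs one line per transaction, and a short awk pass shows the worst latency in each one-second bucket. A sketch; the six-column line layout below is the per-transaction log format of 9.x-era pgbench, and pgbench_log.<pid> is its default file name.

```shell
# Log per-transaction latency (creates pgbench_log.<pid> files):
# pgbench -j 8 -c 32 -M prepared -T 600 -l

# Log columns: client_id xact_no latency_us file_no epoch_sec epoch_usec
# Print the worst latency (in ms) seen during each one-second bucket;
# checkpoint-time stalls stand out even when the average TPS looks good:
awk '{ if ($3 > worst[$5]) worst[$5] = $3 }
     END { for (s in worst) printf "%s %.1f\n", s, worst[s] / 1000.0 }' \
    pgbench_log.* | sort -n
```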
You'll need at least an order of magnitude increase in scale to get a whiff of the disks. pgbench scale numbers give approximately 16MB per scale factor. You don't actually stress the drives until that total number is at least 2X as big as RAM. We had to raise the limit on the pgbench scales recently because it only goes up to ~20,000 in earlier versions, and that's not a big enough scale to test many servers now.

On the select-only tests, much of the increase from ~100K to ~200K is probably from going from 8.4 to 9.2. There are two major and several minor tuning changes that make it much more efficient at that specific task.

> These are the *only* changes I've made to the config file:
>
> shared_buffers = 32GB
> wal_buffers = 16MB
> checkpoint_segments = 1024

Note that these are the only changes that actually impact pgbench results. The test doesn't stress very many parts of the system, such as the query optimizer.

Also be aware these values may not be practical to use in production. You can expect bad latency issues from having shared_buffers so large. All that memory has to be reconciled and written to disk if it's been modified at each checkpoint, and 32GB of such work is a lot. I have systems where we can't make shared_buffers any bigger than 4GB before checkpoint pauses get too bad.

Similarly, setting checkpoint_segments to 1024 means that you might go through 16GB of writes before a checkpoint happens. That's great for average performance ... but when that checkpoint does hit, you're facing a large random I/O backlog.

There's not much you can do about all this on the Linux side. If you drop the dirty_* parameters too much, maintenance operations like VACUUM start to get slow. Really all you can do is avoid setting shared_buffers and checkpoint_segments too high, so the checkpoint backlog never gets gigantic. The tuning you've done uses higher values than we normally recommend because it's not quite practical to deploy like that.
That and the very small database are probably why your numbers are so high.

> Note: I did get better results with HT on vs. with HT off, so I've
> left HT on for now.

pgbench select-only in particular does like hyper-threading. We get occasional reports of more memory-bound workloads actually slowing down when it's turned on. I think it's a wash and leave it on. Purchasing and management people tend to get annoyed if they discover the core count of the server is half what they thought they were buying. The potential downside of HT isn't so big that it's worth opening that can of worms, unless you've run real application-level tests to prove it hurts.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
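Greg's sizing rule above (roughly 16MB per scale-factor unit, and a dataset at least 2X RAM before the disks are truly exercised) turns into quick shell arithmetic. A back-of-the-envelope sketch, using the 128GB RAM figure from the original post:

```shell
ram_gb=128        # RAM in the machine under test (from the original post)
mb_per_scale=16   # approximate pgbench data size per scale-factor unit

# Smallest scale factor whose dataset is about 2X RAM:
min_scale=$(( ram_gb * 2 * 1024 / mb_per_scale ))
echo "disk-bound testing needs roughly scale ${min_scale}"   # 16384 here

# By contrast, the scale 400 used in this thread is only about:
echo "scale 400 is ~$(( 400 * mb_per_scale / 1024 ))GB"      # ~6GB, cache-resident
```

That ~6GB result matches Greg's observation that scale 400 fits entirely in memory and the RAID cache.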
On Sun, Mar 10, 2013 at 11:28 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 3/10/13 9:18 PM, Jon Nelson wrote:
>> The following is with ext4, nobarrier, and noatime. As noted in the
>> original post, I have done a fair bit of system tuning. I have the
>> dirty_bytes and dirty_background_bytes set to 3GB and 2GB,
>> respectively.
>
> That's good, but be aware those values are still essentially unlimited write
> caches. A server with 4 good but regular hard drives might do as little as
> 10MB/s of random writes on a real workload. If 2GB of data ends up dirty,
> the flushing that happens at the end of a database checkpoint will need to
> clear all of that out of RAM. When that happens, you're looking at a 3
> minute long cache flush to push out 2GB. It's not unusual for pgbench tests
> to pause for over a minute straight when that happens. With your setup,
> where checkpoints happen every 5 minutes, this is only happening once per
> test run. The disruption isn't easily visible if you look at the average
> rate; it's outweighed by the periods where writes happen very fast because
> the cache isn't full yet. You have to get pgbench to plot latency over time
> to see them and then analyze that data. This problem is the main reason I
> put together the pgbench-tools set for running things, because once you get
> to processing the latency files and make graphs from them it starts to be a
> pain to look at the results.

I'll try to find time for this, but it may need to wait until the weekend again.

>> I built 9.2 and using 9.2 and the following pgbench invocation:
>>
>> pgbench -j 8 -c 32 -M prepared -T 600
>>
>> transaction type: TPC-B (sort of)
>> scaling factor: 400
>
> I misread this completely in your message before; I thought you wrote 4000.
> A scaling factor of 400 is making a database that's 6GB in size. Your test
> is basically seeing how fast the system memory and the RAID cache can move
> things around.
> In that situation, your read and write numbers are
> reasonable. They aren't actually telling you anything useful about the
> disks though, because they're barely involved here. You've sniffed the CPU,
> memory, and RAID controller and they smell fine. You'll need at least an
> order of magnitude increase in scale to get a whiff of the disks.

LOL! Your phrasing is humorous and the information useful. I ran for 8.0 hours and got this:

transaction type: TPC-B (sort of)
scaling factor: 400
query mode: prepared
number of clients: 32
number of threads: 8
duration: 28800 s
number of transactions actually processed: 609250619
tps = 21154.058025 (including connections establishing)
tps = 21154.075922 (excluding connections establishing)

> pgbench scale numbers give approximately 16MB per scale factor. You don't
> actually stress the drives until that total number is at least 2X as big as
> RAM. We had to raise the limit on the pgbench scales recently because it
> only goes up to ~20,000 on earlier versions, and that's not a big enough
> scale to test many servers now.
>
> On the select-only tests, much of the increase from ~100K to ~200K is
> probably going from 8.4 to 9.2. There's two major and several minor tuning
> changes that make it much more efficient at that specific task.
>
>> These are the *only* changes I've made to the config file:
>>
>> shared_buffers = 32GB
>> wal_buffers = 16MB
>> checkpoint_segments = 1024
>
> Note that these are the only changes that actually impact pgbench results.
> The test doesn't stress very many parts of the system, such as the query
> optimizer.
>
> Also be aware these values may not be practical to use in production. You
> can expect bad latency issues due to having shared_buffers so large. All
> that memory has to be reconciled and written to disk if it's been modified
> at each checkpoint, and 32GB of such work is a lot.
> I have systems where we
> can't make shared_buffers any bigger than 4GB before checkpoint pauses get
> too bad.
>
> Similarly, setting checkpoint_segments to 1024 means that you might go
> through 16GB of writes before a checkpoint happens. That's great for
> average performance...but when that checkpoint does hit, you're facing a
> large random I/O backlog.

I thought the bgwriter mitigated most of the problems here? Often I'll see the actual checkpoints with 'sync' times typically below a few seconds (when there is anything to do at all). I can't say I've seen checkpoint pauses in my workloads.

> There's not much you can do about all this on the Linux side. If you drop
> the dirty_* parameters too much, maintenance operations like VACUUM start to
> get slow. Really all you can do is avoid setting shared_buffers and
> checkpoint_segments too high, so the checkpoint backlog never gets gigantic.
> The tuning you've done is using higher values than we normally recommend
> because it's not quite practical to deploy like that. That and the very
> small database are probably why your numbers are so high.

Mostly I do data warehouse types of workloads with very little (if any) data modification after initial load time. Extensive benchmarking of the actual applications involved has shown that - for me - a large (but not too large) shared_buffers (32GB is right about the sweet spot for me, perhaps a bit on the high side) works well. Additionally, the large checkpoint_segments value really appears to help as well (again, this is very workload dependent).

>> Note: I did get better results with HT on vs. with HT off, so I've
>> left HT on for now.
>
> pgbench select-only in particular does like hyper-threading. We get
> occasional reports of more memory-bound workloads actually slowing when it's
> turned on. I think it's a wash and leave it on.
> Purchasing and management
> people tend to get annoyed if they discover the core count of the server is
> half what they thought they were buying. The potential downside of HT isn't
> so big that it's worth opening that can of worms, unless you've run real
> application level tests to prove it hurts.

Glad to get an "it's a wash" confirmation here.

--
Jon
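As an aside, whether hyperthreading is actually on can be confirmed from a running Linux box rather than the BIOS. A quick sketch using /proc/cpuinfo; note that the "physical id"/"core id" fields may be absent inside some VMs, in which case the physical count comes out as 0:

```shell
# Count logical CPUs, then unique (package, core) pairs; logical being
# double the physical count means hyperthreading is enabled:
logical=$(grep -c '^processor' /proc/cpuinfo)
physical=$(awk -F: '/^physical id/ {p=$2} /^core id/ {print p "-" $2}' \
    /proc/cpuinfo | sort -u | wc -l)
echo "logical=${logical} physical=${physical}"
```

On the E5-2680 discussed here, HT on would show twice as many logical CPUs as physical cores.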