Discussion: Performance on new 64bit server compared to my 32bit desktop
Hi,

I'm getting strange performance results on a new database server compared to my simple desktop.

The configuration of the new server:
- OS: GNU/Linux Debian Etch x86_64
- kernel: Linux 2.6.26-2-vserver-amd64 #1 SMP Sun Jun 20 20:40:33 UTC 2010 x86_64 GNU/Linux
  (tests are on the "real server", not on a vserver)
- CPU: 2 x Six-Core AMD Opteron(tm) Processor 2427 @ 2.20GHz
- RAM: 32 GB

The configuration of my desktop PC:
- OS: GNU/Linux Debian Testing i686
- kernel: Linux 2.6.32-5-686 #1 SMP Tue Jun 1 04:59:47 UTC 2010 i686 GNU/Linux
- CPU: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz
- RAM: 2 GB

On each machine, I compiled PostgreSQL 8.4.4 (a simple ./configure && make && make install).
On each machine, I restored a small database (the compressed dump is 33 MB). Here is the output of "\d+":

 Schema |            Name            |   Type   |    Owner    |    Size    | Description
--------+----------------------------+----------+-------------+------------+-------------
 public | article                    | table    | indexwsprem | 77 MB      |
 public | article_id_seq             | sequence | indexwsprem | 8192 bytes |
 public | evt                        | table    | indexwsprem | 8192 bytes |
 public | evt_article                | table    | indexwsprem | 17 MB      |
 public | evt_article_id_seq         | sequence | indexwsprem | 8192 bytes |
 public | evt_id_seq                 | sequence | indexwsprem | 8192 bytes |
 public | firm                       | table    | indexwsprem | 1728 kB    |
 public | firm_article               | table    | indexwsprem | 17 MB      |
 public | firm_article_id_seq        | sequence | indexwsprem | 8192 bytes |
 public | firm_id_seq                | sequence | indexwsprem | 8192 bytes |
 public | publication                | table    | indexwsprem | 64 kB      |
 public | publication_article        | table    | indexwsprem | 0 bytes    |
 public | publication_article_id_seq | sequence | indexwsprem | 8192 bytes |
 public | publication_id_seq         | sequence | indexwsprem | 8192 bytes |
(14 rows)

On both machines, postgresql.conf is identical and has not been modified (the default shared_buffers seems sufficient for my simple tests).
I've enabled timing in psql, and here are the results of several "simple" queries (each executed twice to use the cache):

1- select count(*) from firm;
   server x64:
     48661
     (1 row)
     Time: 14,412 ms
   desk i686:
     48661
     (1 row)
     Time: 4,845 ms

2- select * from pg_settings;
   server x64:
     Time: 3,898 ms
   desk i686:
     Time: 1,517 ms

3- I've run "time pgbench -c 50":
   server x64:
     starting vacuum...end.
     transaction type: TPC-B (sort of)
     scaling factor: 1
     query mode: simple
     number of clients: 50
     number of transactions per client: 10
     number of transactions actually processed: 500/500
     tps = 523.034437 (including connections establishing)
     tps = 663.511008 (excluding connections establishing)

     real 0m0.984s
     user 0m0.088s
     sys 0m0.096s
   desk i686:
     starting vacuum...end.
     transaction type: TPC-B (sort of)
     scaling factor: 1
     query mode: simple
     number of clients: 50
     number of transactions per client: 10
     number of transactions actually processed: 500/500
     tps = 781.986778 (including connections establishing)
     tps = 862.809792 (excluding connections establishing)

     real 0m0.656s
     user 0m0.028s
     sys 0m0.052s

Do you think it's a 32bit/64bit difference?
On Thu, Aug 19, 2010 at 2:07 AM, Philippe Rimbault <primbault@edd.fr> wrote:
> Hi,
>
> I'm having a strange performance result on a new database server compared to
> my simple desktop.
>
> The configuration of the new server :
> - OS : GNU/Linux Debian Etch x86_64
> - kernel : Linux 2.6.26-2-vserver-amd64 #1 SMP Sun Jun 20 20:40:33 UTC
> 2010 x86_64 GNU/Linux
> (tests are on the "real server", not on a vserver)
> - CPU : 2 x Six-Core AMD Opteron(tm) Processor 2427 @ 2.20GHz
> - RAM : 32 Go
> The configuration of my desktop pc :
> - OS : GNU/Linux Debian Testing i686
> - kernel : Linux 2.6.32-5-686 #1 SMP Tue Jun 1 04:59:47 UTC 2010 i686
> GNU/Linux
> - CPU : Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz
> - RAM : 2 Go

PERFORMANCE STUFF DELETED FOR BREVITY

> Do you think it's a 32bit/64bit difference ?

No, it's likely that your desktop has much faster CPU cores than your server, and it has drives that may or may not be obeying fsync commands. Your server, OTOH, has more cores, so it's likely to do better under a real load. And assuming it has more disks on a better controller, it will also do better under heavier loads.

So how are the disks set up anyway?
On 19/08/2010 11:51, Scott Marlowe wrote:
> On Thu, Aug 19, 2010 at 2:07 AM, Philippe Rimbault<primbault@edd.fr> wrote:
>
>> Hi,
>>
>> I'm having a strange performance result on a new database server compared to
>> my simple desktop.
>>
>> The configuration of the new server :
>> - OS : GNU/Linux Debian Etch x86_64
>> - kernel : Linux 2.6.26-2-vserver-amd64 #1 SMP Sun Jun 20 20:40:33 UTC
>> 2010 x86_64 GNU/Linux
>> (tests are on the "real server", not on a vserver)
>> - CPU : 2 x Six-Core AMD Opteron(tm) Processor 2427 @ 2.20GHz
>> - RAM : 32 Go
>> The configuration of my desktop pc :
>> - OS : GNU/Linux Debian Testing i686
>> - kernel : Linux 2.6.32-5-686 #1 SMP Tue Jun 1 04:59:47 UTC 2010 i686
>> GNU/Linux
>> - CPU : Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz
>> - RAM : 2 Go
>>
> PERFORMANCE STUFF DELETED FOR BREVITY
>
>> Do you think it's a 32bit/64bit difference ?
>>
> No, it's likely that your desktop has much faster CPU cores than your
> server, and it has drives that may or may not be obeying fsync
> commands. Your server, OTOH, has more cores, so it's likely to do
> better under a real load. And assuming it has more disks on a better
> controller it will also do better under heavier loads.
>
> So how are the disks setup anyway?
>

Thanks for your reply!

The server uses an HP Smart Array P410 with a RAID 5 array of SATA 133 disks.
My desktop uses only one SATA 133 disk.
I thought that my simple queries used only memory, not disk.
I've launched a new pgbench run with many more clients and transactions:

Server:
postgres$ pgbench -c 400 -t 100
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 400
number of transactions per client: 100
number of transactions actually processed: 40000/40000
tps = 115.054386 (including connections establishing)
tps = 115.617186 (excluding connections establishing)

real 5m47.706s
user 0m27.054s
sys 0m59.804s

Desktop:
postgres$ time pgbench -c 400 -t 100
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 400
number of transactions per client: 100
number of transactions actually processed: 40000/40000
tps = 299.456785 (including connections establishing)
tps = 300.590503 (excluding connections establishing)

real 2m13.604s
user 0m5.304s
sys 0m13.469s
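For anyone comparing several runs like these, the tps lines can be pulled out of the pgbench summary mechanically. This is only a minimal sketch: the regular expression is written against the 8.4-style output shown in this thread, and the names `TPS_LINE` and `parse_tps` are made up for illustration.

```python
import re

# Matches pgbench summary lines such as
# "tps = 115.054386 (including connections establishing)".
TPS_LINE = re.compile(
    r"tps = ([0-9.]+) \((including|excluding) connections establishing\)"
)


def parse_tps(output: str) -> dict:
    """Return {'including': float, 'excluding': float} from pgbench output."""
    return {kind: float(value) for value, kind in TPS_LINE.findall(output)}


# Summary lines copied from the -c 400 -t 100 runs in this thread.
server = parse_tps(
    "tps = 115.054386 (including connections establishing)\n"
    "tps = 115.617186 (excluding connections establishing)\n"
)
desktop = parse_tps(
    "tps = 299.456785 (including connections establishing)\n"
    "tps = 300.590503 (excluding connections establishing)\n"
)

# How much faster the desktop was in this particular (scale 1) run.
ratio = desktop["excluding"] / server["excluding"]
print(f"desktop/server tps ratio: {ratio:.2f}")
```

With the numbers above the desktop comes out roughly 2.6 times faster, which is the gap the rest of the thread tries to explain.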
On 19/08/2010 12:23, Philippe Rimbault wrote:
> Thanks for your reply !
>
> The server use a HP Smart Array P410 with a Raid 5 array on Sata 133
> disk.
> My desktop only use one Sata 133 disk.
> I was thinking that my simples queries didn't use disk but only memory.
> I've launch a new pgbench with much more client and transactions :
>
> Server :
> postgres$ pgbench -c 400 -t 100
> starting vacuum...end.
> transaction type: TPC-B (sort of)
> scaling factor: 1
> query mode: simple
> number of clients: 400
> number of transactions per client: 100
> number of transactions actually processed: 40000/40000
> tps = 115.054386 (including connections establishing)
> tps = 115.617186 (excluding connections establishing)
>
> real 5m47.706s
> user 0m27.054s
> sys 0m59.804s
>
> Desktop :
> postgres$ time pgbench -c 400 -t 100
> starting vacuum...end.
> transaction type: TPC-B (sort of)
> scaling factor: 1
> query mode: simple
> number of clients: 400
> number of transactions per client: 100
> number of transactions actually processed: 40000/40000
> tps = 299.456785 (including connections establishing)
> tps = 300.590503 (excluding connections establishing)
>
> real 2m13.604s
> user 0m5.304s
> sys 0m13.469s
>

I've re-initialized pgbench with -s 400, and now the server performs (much) better than the desktop.

So ... my desktop CPU is faster if I only run small queries, but the server handles heavier loads better.
I was just surprised by the difference on my small database.

Thx
On Thu, Aug 19, 2010 at 4:23 AM, Philippe Rimbault <primbault@edd.fr> wrote:
>> So how are the disks setup anyway?
>>
>
> Thanks for your reply !
>
> The server use a HP Smart Array P410 with a Raid 5 array on Sata 133 disk.

If you can change that to RAID-10, do so now. RAID-5 is notoriously slow for database use, unless you're only gonna do reporting type queries with few updates.

> My desktop only use one Sata 133 disk.
> I was thinking that my simples queries didn't use disk but only memory.

No, but pgbench has to write to the disk.

> I've launch a new pgbench with much more client and transactions :
>
> Server :
> postgres$ pgbench -c 400 -t 100

-c 400 is HUGE. (And as you mentioned in your later email, you need to initialize with -i -s 400 for -c 400 to make sense.) Try values in the 4 to 40 range and the server should REALLY outshine your desktop as you pass 12 or 16 or so.
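The point about -s and -c can be written down as a simple check. The sketch below only encodes the rule of thumb from the pgbench documentation, namely that in the default TPC-B-like test every transaction updates one of the `scale` rows in the branches table, so running more clients than the scale factor mostly measures lock contention. The function name is made up for illustration.

```python
def pgbench_scale_is_sufficient(scale: int, clients: int) -> bool:
    """Rule of thumb for the default TPC-B-like test: the scale factor
    should be at least as large as the number of clients, because the
    branches table has only `scale` rows and every transaction updates
    one of them."""
    return scale >= clients


# (scale, clients) combinations that appear in this thread.
for scale, clients in [(1, 50), (1, 400), (400, 400)]:
    if pgbench_scale_is_sufficient(scale, clients):
        verdict = "ok"
    else:
        verdict = "mostly measures update contention"
    print(f"-s {scale} with -c {clients}: {verdict}")
```

Both of the original runs (scale 1 with 50 and 400 clients) fail this check, which is why re-initializing with -s 400 changed the picture.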
Philippe Rimbault wrote:
> I've run "time pgbench -c 50" :
> server x64 :
> starting vacuum...end.
> transaction type: TPC-B (sort of)
> scaling factor: 1
> query mode: simple
> number of clients: 50
> number of transactions per client: 10
> number of transactions actually processed: 500/500
> tps = 523.034437 (including connections establishing)
> tps = 663.511008 (excluding connections establishing)
>

As mentioned already, most of the difference you're seeing is simply that your desktop system has faster individual processor cores in it, so jobs where only a single core is being used are going to be faster on it.

The above isn't going to work very well either, because the database scale is too small and you're not running the test for very long. The things the bigger server is better at, you're not testing.

Since your smaller system has 2GB of RAM and the larger one 32GB, try this instead:

pgbench -i -s 2000
pgbench -c 24 -T 60 -S
pgbench -c 24 -T 300

That will create a much larger database, run some simple SELECT-only tests on it, and then run a write intensive one. Expect to see the server system crush the results of the desktop here. Note that this will take quite a while to run--the pgbench initialization step in particular is going to take a good fraction of an hour or more, and then the actual tests will run for 6 minutes after that. You can run more tests after that without doing the initialization step again, but if you run a lot of the write-heavy tests, eventually performance will start to degrade.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
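To see why a scale of 2000 separates the two machines, it helps to estimate the resulting database size. The sketch below assumes roughly 15 MB of data per scale unit, which is a commonly quoted rule of thumb for pgbench and not a figure from this thread; the constant and function names are made up for illustration.

```python
# Assumption: about 15 MB on disk per pgbench scale unit (rule of thumb,
# not measured on either system discussed here).
MB_PER_SCALE_UNIT = 15


def approx_db_size_gb(scale: int) -> float:
    """Very rough size of a freshly initialized pgbench database, in GB."""
    return scale * MB_PER_SCALE_UNIT / 1024


size_gb = approx_db_size_gb(2000)
print(f"scale 2000 is roughly {size_gb:.0f} GB")

# RAM sizes from the original post.
for name, ram_gb in [("desktop", 2), ("server", 32)]:
    fits = "fits in" if size_gb <= ram_gb else "does not fit in"
    print(f"{name}: database {fits} {ram_gb} GB of RAM")
```

Under that assumption the database is far larger than the desktop's RAM but can still be cached on the server, so the test exercises what the bigger machine is actually built for.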
Greg Smith wrote:
> Since your smaller system has 2GB of RAM and the larger one 32GB, try
> this instead:
>
> pgbench -i -s 2000
> pgbench -c 24 -T 60 -S
> pgbench -c 24 -T 300

Oh, and to give a somewhat more normal postgresql.conf, I'd recommend you make at least the following two changes before doing the above:

shared_buffers=256MB
checkpoint_segments=32

Those are the two parameters the pgbench test is most sensitive to, so setting them to higher values will give more realistic results.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
On Aug 19, 2010, at 11:25 AM, Greg Smith wrote:
> Philippe Rimbault wrote:
>> I've run "time pgbench -c 50" :
>> server x64 :
>> starting vacuum...end.
>> transaction type: TPC-B (sort of)
>> scaling factor: 1
>> query mode: simple
>> number of clients: 50
>> number of transactions per client: 10
>> number of transactions actually processed: 500/500
>> tps = 523.034437 (including connections establishing)
>> tps = 663.511008 (excluding connections establishing)
>>
>
> As mentioned already, most of the difference you're seeing is simply
> that your desktop system has faster individual processor cores in it, so
> jobs where only a single core are being used are going to be faster on it.
>

But the select count(*) query, cached in RAM, is 3x faster on one system than the other. The CPUs aren't 3x different performance-wise. Something else may be wrong here.

An individual Core2 Duo 2.93Ghz should be at most 50% faster than a 2.2Ghz Opteron for such a query. Unless there are some compile options that are set wrong. I would check the compile options.
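The gap being discussed can be checked with simple arithmetic on the figures from the original post. This is only a worked calculation; the variable names are made up for illustration.

```python
# Timings of "select count(*) from firm" reported in the original post (ms).
server_ms = 14.412
desktop_ms = 4.845

# Nominal clock speeds of the two CPUs (GHz).
desktop_ghz = 2.93
server_ghz = 2.20

timing_ratio = server_ms / desktop_ms
clock_ratio = desktop_ghz / server_ghz

print(f"query time ratio (server/desktop): {timing_ratio:.2f}")
print(f"clock ratio (desktop/server):      {clock_ratio:.2f}")
```

The query runs about 3 times slower on the server while the clock advantage of the desktop is only about a third, so clock speed alone does not account for the difference.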
Scott Carey wrote:
> But the select count(*) query, cached in RAM is 3x faster in one system than the other. The CPUs aren't 3x different performancewise. Something else may be wrong here.
>
> An individual Core2 Duo 2.93Ghz should be at most 50% faster than a 2.2Ghz Opteron for such a query. Unless there are some compile options that are set wrong. I would check the compile options.
>

Sure, it might be. But I've seen RAM on an Intel chip like the E7500 here (DDR3-1066 or better, around 10GB/s possible) run almost 3X as fast as what you'll find paired with an Opteron 2427 (DDR2-800, closer to 3.5GB/s). Throw in the clock differences and there you go.

I've been wandering around for years warning that the older Opterons on DDR2 running a single PostgreSQL process are dog slow compared to the same thing on Intel. So that alone might actually be enough to account for the difference. Ultimately the multi-processor stuff is what's more important to most apps, though, which is why I was hinting to properly run that instead.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
Re: Performance on new 64bit server compared to my 32bit desktop
From: Jose Ildefonso Camargo Tolosa
Hi!

On Fri, Aug 27, 2010 at 12:55 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Scott Carey wrote:
>>
>> But the select count(*) query, cached in RAM is 3x faster in one system
>> than the other. The CPUs aren't 3x different performance wise. Something
>> else may be wrong here.
>>
>> An individual Core2 Duo 2.93Ghz should be at most 50% faster than a 2.2Ghz
>> Opteron for such a query. Unless there are some compile options that are
>> set wrong. I would check the compile options.
>>
>
> Sure, it might be. But I've seen RAM on an Intel chip like the E7500 here
> (DDR3-1066 or better, around 10GB/s possible) run almost 3X as fast as what
> you'll find paired with an Opteron 2427 (DDR2-800, closer to 3.5GB/s).
> Throw in the clock differences and there you go.

Precisely! CPU core clock is not all that matters, especially when working with large datasets. Core clock only makes a difference for relatively small code (i.e., code that fits in the CPU cache) working on a relatively small dataset (i.e., one that *also* fits in the CPU cache), for example a series calculation of pi or a simple prime number generation algorithm. When it comes to large amounts of data or code, RAM starts to play a vital role, and not just "raw" RAM speed but also latency (a combination of both). Some people just go for the "fastest" RAM around without paying attention to the latency numbers; you need to get the fastest RAM with the lowest latency.

Also, nowadays, Intel has better performance than AMD, at least when comparing Athlon 64 vs Core2. I'm still saving to get a Phenom II system in order to benchmark them and see how it goes (does anyone have one of these for testing?).

>
> I've been wandering around for years warning that the older Opterons on DDR2
> running a single PostgreSQL process are dog slow compared to the same thing
> on Intel. So that alone might actually be enough to account for the
> difference. Ultimately the multi-processor stuff is what's more important
> to most apps, though, which is why I was hinting to properly run that
> instead.
Jose Ildefonso Camargo Tolosa wrote:
> Also, nowadays, Intel has better performance than AMD, at least when
> comparing Athlon 64 vs Core2, I'm still saving to get a Phenom II
> system in order to benchmark them and see how it goes (does anyone
> have one of these for testing?).
>

Things even out again when you reach the large server line from AMD that uses DDR-3 RAM; they've finally solved this problem there. Scott Marlowe has been helping me out with some tests of a new system he's got running the AMD Opteron 6172, using the STREAM memory benchmark. Intro to that and some sample numbers at http://www.advancedclustering.com/company-blog/stream-benchmarking.html

He's been seeing >75GB/s of aggregate memory bandwidth out of that monster--using gcc, so even at a disadvantage compared to the Intel one used for that report. If you're only using one or two cores, Intel still seems to have a lead; I am still working out whether that's true in every situation.

I haven't had a chance to test any of the Phenom II processors yet. From what I know of their design, I expect them to still have the same fundamental design issues that kept all AMD processors from scaling very well, memory speed wise, the last few years. You might be able to dig a system using one of them out of the list at http://www.cs.virginia.edu/stream/peecee/Bandwidth.html ; I didn't notice anything obvious that featured one.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
Greg Smith wrote:
> He's been seeing >75GB/s of aggregate memory bandwidth out of that
> monster--using gcc, so even at a disadvantage compared to the Intel
> one used for that report.

On second read this was confusing. The best STREAM results come from using the Intel compiler on Linux. The ones I've been doing and that Scott has been running are using regular gcc instead. So when the new AMD system is clearing 75GB/s in the little test set I'm trying to get automated, that's actually a conservative figure, given that a compiler swap is almost guaranteed to boost results a bit too.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
Jose Ildefonso Camargo Tolosa wrote:
> Also, nowadays, Intel has better performance than AMD, at least when
> comparing Athlon 64 vs Core2, I'm still saving to get a Phenom II
> system in order to benchmark them and see how it goes (does anyone
> have one of these for testing?).

root@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 940 Processor
stepping : 2
cpu MHz : 3000.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips : 6020.46
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

stream compiled with -O3

root@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# ./a.out
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 5031 microseconds.
   (= 5031 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        5056.0434       0.0064       0.0063       0.0064
Scale:       4950.4916       0.0065       0.0065       0.0065
Add:         5322.0173       0.0091       0.0090       0.0091
Triad:       5395.1815       0.0089       0.0089       0.0089
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

two parallel
root@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# ./a.out & ./a.out

-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        2984.2741       0.0108       0.0107       0.0108
Scale:       2945.8261       0.0109       0.0109       0.0110
Add:         3282.4631       0.0147       0.0146       0.0149
Triad:       3321.2893       0.0146       0.0145       0.0148
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        2981.4898       0.0108       0.0107       0.0108
Scale:       2943.3067       0.0109       0.0109       0.0109
Add:         3283.8552       0.0147       0.0146       0.0149
Triad:       3313.9634       0.0147       0.0145       0.0148

four parallel
root@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# ./a.out & ./a.out & ./a.out & ./a.out

-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1567.4880       0.0208       0.0204       0.0210
Scale:       1525.3401       0.0211       0.0210       0.0213
Add:         1739.7735       0.0279       0.0276       0.0282
Triad:       1763.4858       0.0274       0.0272       0.0276
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1559.0759       0.0208       0.0205       0.0210
Scale:       1536.2520       0.0211       0.0208       0.0212
Add:         1740.4503       0.0279       0.0276       0.0283
Triad:       1758.4951       0.0276       0.0273       0.0279
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1552.7271       0.0208       0.0206       0.0210
Scale:       1527.5275       0.0211       0.0209       0.0212
Add:         1737.9263       0.0279       0.0276       0.0282
Triad:       1757.3439       0.0276       0.0273       0.0278
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1515.5912       0.0213       0.0211       0.0214
Scale:       1544.7033       0.0210       0.0207       0.0212
Add:         1754.4495       0.0278       0.0274       0.0281
Triad:       1856.3659       0.0279       0.0259       0.0284
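Because the parallel runs above are separate processes, the per-process rates have to be added up to get the aggregate bandwidth. The sketch below does that for the Copy and Triad figures posted in this message; the dictionary layout is just an illustration.

```python
# Per-process STREAM rates in MB/s from the runs above,
# keyed by the number of processes started in parallel.
copy_rates = {
    1: [5056.0434],
    2: [2984.2741, 2981.4898],
    4: [1567.4880, 1559.0759, 1552.7271, 1515.5912],
}
triad_rates = {
    1: [5395.1815],
    2: [3321.2893, 3313.9634],
    4: [1763.4858, 1758.4951, 1757.3439, 1856.3659],
}

for name, rates in [("Copy", copy_rates), ("Triad", triad_rates)]:
    for processes, per_process in sorted(rates.items()):
        total = sum(per_process)
        print(f"{name:5s} x{processes}: {total:8.1f} MB/s aggregate")
```

Aggregate Copy bandwidth rises from about 5.1 GB/s with one process to about 6.2 GB/s with four, so a single process already uses most of what the memory system can deliver.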
On Aug 27, 2010, at 10:25 AM, Greg Smith wrote:
> Scott Carey wrote:
>> But the select count(*) query, cached in RAM is 3x faster in one system than the other. The CPUs aren't 3x different performance wise. Something else may be wrong here.
>>
>> An individual Core2 Duo 2.93Ghz should be at most 50% faster than a 2.2Ghz Opteron for such a query. Unless there are some compile options that are set wrong. I would check the compile options.
>>
>
> Sure, it might be. But I've seen RAM on an Intel chip like the E7500
> here (DDR3-1066 or better, around 10GB/s possible) run almost 3X as fast
> as what you'll find paired with an Opteron 2427 (DDR2-800, closer to
> 3.5GB/s). Throw in the clock differences and there you go.

The 2427 should do 12.8 GB/sec theoretical peak (dual channel 800Mhz DDR2) per processor socket (so 2x that if multithreaded and 2 sockets).

A Nehalem will do ~2x that (triple channel, 1066Mhz) and is also significantly faster clock for clock.

But a Core2 based Xeon on Socket 775 at 1066Mhz FSB? Nah... its theoretical peak bandwidth is 33% more and real world no more than 40% more. Latency and other factors might add up too. 3x just does not make sense here.

Nehalem would be another story, but Core2 was only slightly faster than Opterons of this generation and did not scale as well with more sockets.

>
> I've been wandering around for years warning that the older Opterons on
> DDR2 running a single PostgreSQL process are dog slow compared to the
> same thing on Intel.

This isn't an older Opteron, it's the 6 core, 6MB L3 cache "Istanbul". It's not the newer stuff either. The E7500 is basically the end of line Core2 before Nehalem based processors took over.

> So that alone might actually be enough to account
> for the difference. Ultimately the multi-processor stuff is what's more
> important to most apps, though, which is why I was hinting to properly
> run that instead.
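The 12.8 GB/sec figure follows from the usual peak-bandwidth formula: channels times transfers per second times bytes per transfer. The sketch below assumes the standard 64-bit (8-byte) width of a DDR memory channel; the function name is made up for illustration, and the results are theoretical peaks, not measurements.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: float,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    Assumes each channel is 64 bits (8 bytes) wide, as is standard for
    DDR2/DDR3 DIMMs.
    """
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


opteron_2427 = peak_bandwidth_gbs(channels=2, mega_transfers_per_s=800)
nehalem = peak_bandwidth_gbs(channels=3, mega_transfers_per_s=1066)

print(f"dual channel DDR2-800:    {opteron_2427:.1f} GB/s per socket")
print(f"triple channel DDR3-1066: {nehalem:.1f} GB/s per socket")
print(f"ratio: {nehalem / opteron_2427:.2f}x")
print(f"1066 vs 800 transfer rate: {1066 / 800:.2f}x")
```

This reproduces the 12.8 GB/s per socket for the Opteron and the roughly 2x figure for a triple channel Nehalem, and the 1066 to 800 transfer rate ratio gives the 33% mentioned above.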
On Mon, Aug 30, 2010 at 1:58 AM, Yeb Havinga <yebhavinga@gmail.com> wrote: > four parallel > root@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# ./a.out & ./a.out & ./a.out > & ./a.out You know you can just do "stream 4" to get 4 parallel streams right?
Scott Marlowe wrote: > On Mon, Aug 30, 2010 at 1:58 AM, Yeb Havinga <yebhavinga@gmail.com> wrote: > >> four parallel >> root@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# ./a.out & ./a.out & ./a.out >> & ./a.out >> > > You know you can just do "stream 4" to get 4 parallel streams right? > Which version is that? The stream.c source contains no argc/argv usage, though Code/Versions/Experimental has a script called Parallel_jobs that spawns n processes. -- Yeb
Scott Carey wrote:
> The 2427 should do 12.8 GB/sec theoretical peak (dual channel 800Mhz DDR2) per processor socket (so 2x that if multithreaded and 2 Sockets).
>
> A Nehalem will do ~2x that (triple channel, 1066Mhz) and is also significantly faster clock for clock.
>
> But a Core2 based Xeon on Socket 775 at 1066Mhz FSB? Nah... its theoretical peak bandwidth is 33% more and real world no more than 40% more.
> The E7500 is basically the end of line Core2 before Nehalem based processors took over.
>

Ah...from its use of DDR3, I thought that the E7500 was a low-end Nehalem. Now I see that you're right, that it's actually a high-end Wolfdale. So that does significantly decrease the margin between the two I'd expect. I agree with your figures, and that this may be back to looking a little fishy.

The other thing I normally check is whether one of the two systems has more aggressive power management turned on. The easiest way to tell on Linux is to look at /proc/cpuinfo and see if the displayed processor speed is much lower than the actual one. Many systems default to something pretty conservative here, and don't return to full speed nearly fast enough for some benchmark tests.

> This isn't an older Opteron, its 6 core, 6MB L3 cache "Istanbul". Its not the newer stuff either.
>

Everything before Magny Cours is now an older Opteron from my perspective. They've caught up with Intel again with the release of those. Everything from AMD that's come out since Intel Nehalem products started shipping in quantity (early 2009) has been a marginal product until the new M-C, and their early quad-core stuff was pretty terrible too. So I'm lumping AMD's Budapest, Shanghai, and Istanbul product lines all into a giant "slow compared to Intel during the same period" bin in my head. Fine for databases with lots of clients, not so good at executing single queries quickly.
-- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
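The /proc/cpuinfo check described above can be sketched as a small parser. It relies on the nominal speed appearing in the "model name" field as "@ x.xxGHz", which is common on Intel CPUs but often missing on AMD ones, so treat that as an assumption. The function name, the 90% threshold, and the 1600 MHz sample value are made up for illustration.

```python
import re


def throttled_cores(cpuinfo_text: str, threshold: float = 0.9) -> list:
    """Return (nominal_mhz, current_mhz) pairs for cores whose reported
    "cpu MHz" is below `threshold` times the nominal speed embedded in
    the "model name" field. Cores without a nominal speed are skipped."""
    results = []
    nominal_mhz = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "model name":
            match = re.search(r"@\s*([0-9.]+)\s*GHz", value)
            nominal_mhz = float(match.group(1)) * 1000 if match else None
        elif key == "cpu MHz" and nominal_mhz is not None:
            current_mhz = float(value)
            if current_mhz < threshold * nominal_mhz:
                results.append((nominal_mhz, current_mhz))
    return results


# Hypothetical sample: an E7500 clocked down by power management.
sample = (
    "model name : Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz\n"
    "cpu MHz    : 1600.000\n"
)
print(throttled_cores(sample))
```

On a live Linux system the text would come from reading /proc/cpuinfo while the benchmark is running, since the reported speed can change with load.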
Yeb Havinga wrote:
> model name : AMD Phenom(tm) II X4 940 Processor @ 3.00GHz
> cpu cores : 4
> stream compiled with -O3
> Function Rate (MB/s) Avg time Min time Max time
> Triad: 5395.1815 0.0089 0.0089 0.0089

For comparison's sake, an only moderately expensive desktop Intel CPU using DDR3-1600:

model name : Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
cpu cores : 4
siblings : 8

Number of Threads requested = 4
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      13666.0986       0.0108       0.0107       0.0108

8 hyper-threaded cores here. They work well for improving CPU-heavy tasks, but the memory throughput maxes out at 4 threads total.

I'm not sure if Yeb's stream was compiled to use OpenMP correctly though, because I'm not seeing "Number of Threads" in his results. Here's what works for me:

gcc -O3 -fopenmp stream.c -o stream

And then you can set:

export OMP_NUM_THREADS=4

Or whatever you want in order to control the number of threads it uses inside. Here's the way scaling works on my processor:

Number of Threads requested = 1
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:       9806.2648       0.0150       0.0149       0.0151

Number of Threads requested = 2
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      12495.2113       0.0117       0.0117       0.0118

Number of Threads requested = 3
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      13388.7187       0.0111       0.0109       0.0126

Number of Threads requested = 4
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      13695.6611       0.0107       0.0107       0.0108

Number of Threads requested = 5
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      12651.7200       0.0116       0.0116       0.0116

Number of Threads requested = 6
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      12804.7192       0.0115       0.0114       0.0117

Number of Threads requested = 7
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      12670.2525       0.0116       0.0116       0.0117

Number of Threads requested = 8
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      12468.5739       0.0119       0.0117       0.0131

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
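The scaling numbers above are easier to compare once they are expressed relative to the single-thread result. This is a small worked calculation on the Triad rates from this message; the variable names are made up for illustration.

```python
# Triad rate in MB/s by number of OpenMP threads, from the runs above.
triad_by_threads = {
    1: 9806.2648,
    2: 12495.2113,
    3: 13388.7187,
    4: 13695.6611,
    5: 12651.7200,
    6: 12804.7192,
    7: 12670.2525,
    8: 12468.5739,
}

baseline = triad_by_threads[1]
best_threads = max(triad_by_threads, key=triad_by_threads.get)
best_speedup = triad_by_threads[best_threads] / baseline

for threads, rate in sorted(triad_by_threads.items()):
    print(f"{threads} threads: {rate:10.1f} MB/s  ({rate / baseline:.2f}x)")
print(f"peak at {best_threads} threads, {best_speedup:.2f}x one thread")
```

The peak is at 4 threads and is only about 1.4 times the single-thread rate, so on this CPU one thread already gets most of the available memory bandwidth.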
Re: Performance on new 64bit server compared to my 32bit desktop
From: Jose Ildefonso Camargo Tolosa
Hi!

Thank you all for this great amount of information!

What memory/motherboard (i.e., chipset) is installed on the Phenom II one? It looks like it peaks at ~6.2GB/s with 4 threads. Also, what kernel is on it? (uname -a would be nice.)

Now, this looks like sustained memory speed. What about random memory access (where latency comes to play an important role)? http://icl.cs.utk.edu/projectsfiles/hpcc/RandomAccess/

I don't have any of these systems to test, but it would be interesting to get the random access benchmarks too. What do you think? Will the result be the same?

Once again, thanks!

Sincerely,

Ildefonso Camargo
Hi,

>> This isn't an older Opteron, its 6 core, 6MB L3 cache "Istanbul". Its not
>> the newer stuff either.
>
> Everything before Magny Cours is now an older Opteron from my perspective.

The 6-cores are identical to Magny Cours (except that Magny Cours has two of those beasts in one package).

- Clemens
Clemens Eisserer wrote:
In some ways, but not in regards to memory issues. http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon/2 has a good intro. While the inside is like two 6-core models stuck together, the external memory interface was completely reworked.
The original report here involved an Opteron 2427, correctly identified as being from the 6-core "Istanbul" architecture. All Istanbul processors use DDR2 and are quite slow at memory access compared to similar Intel Nehalem systems. The "Magny-Cours" architecture is available in 8 and 12 core variants, and the memory controller has been completely redesigned to take advantage of many banks of DDR3 at the same time; it is far faster than two of the older 6 cores working together.
http://en.wikipedia.org/wiki/List_of_AMD_Opteron_microprocessors has a good summary of the models; it's confusing. A quick chart comparing the three generations demonstrates what I just said, using the same STREAM benchmark that a few of the results posted here have already used:
http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon/5
Istanbul Opteron 2435 in this case, 21GB/s. The two Nehalem Intel Xeons, >31GB/s. New Magny-Cours, 49GB/s.
-- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
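The three STREAM figures quoted from the AnandTech chart can be put on a common scale. This is only a small Python sketch: it reads all three figures as GB/s, treats the Nehalem ">31" as a lower bound, and the dictionary keys are just labels chosen for illustration.

```python
# STREAM bandwidth figures in GB/s as quoted in the message above.
# The Nehalem value is a lower bound (">31GB/s" in the original).
bandwidth_gb_s = {
    "Istanbul Opteron 2435": 21.0,
    "Nehalem Xeon": 31.0,
    "Magny-Cours": 49.0,
}

# Express each figure relative to the Istanbul system.
baseline = bandwidth_gb_s["Istanbul Opteron 2435"]
relative = {name: rate / baseline for name, rate in bandwidth_gb_s.items()}

for name, ratio in relative.items():
    print(f"{name}: {ratio:.2f}x Istanbul")
```

On these numbers the Magny-Cours system moves somewhat more than twice the data per second of the Istanbul one, which is the gap the message is pointing at.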
Re: Performance on new 64bit server compared to my 32bit desktop
From
Jose Ildefonso Camargo Tolosa
Date:
Hi!

Thanks for the review link!

Ildefonso.

On Mon, Aug 30, 2010 at 6:01 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Clemens Eisserer wrote:
>
>> Hi,
>>
>>> This isn't an older Opteron, it's 6 core, 6MB L3 cache "Istanbul". It's not
>>> the newer stuff either.
>>
>> Everything before Magny Cours is now an older Opteron from my perspective.
>>
>> The 6-cores are identical to Magny Cours (except that Magny Cours has
>> two of those beasts in one package).
>
> In some ways, but not in regards to memory issues.
> http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon/2
> has a good intro. While the inside is like two 6-core models stuck
> together, the external memory interface was completely reworked.
>
> Original report here involved Opteron 2427, correctly identified as being
> from the 6-core "Istanbul" architecture. All Istanbul processors use DDR2
> and are quite slow at memory access compared to similar Intel Nehalem
> systems. The "Magny-Cours" architecture is available in 8 and 12 core
> variants, and the memory controller has been completely redesigned to take
> advantage of many banks of DDR3 at the same time; it is far faster than two
> of the older 6 cores working together.
>
> http://en.wikipedia.org/wiki/List_of_AMD_Opteron_microprocessors has a good
> summary of the models; it's confusing. Quick chart showing the three
> generations compared demonstrates what I just said above using the same
> STREAM benchmarking that a few results have popped out here using already:
>
> http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon/5
>
> Istanbul Opteron 2435 in this case, 21GB/s. The two Nehalem Intel Xeons,
> >31GB/s. New Magny-Cours, 49GB/s.
>
> --
> Greg Smith 2ndQuadrant US Baltimore, MD
> PostgreSQL Training, Services and Support
> greg@2ndQuadrant.com www.2ndQuadrant.us
Greg Smith wrote:
> Yeb Havinga wrote:
>> model name : AMD Phenom(tm) II X4 940 Processor @ 3.00GHz
>> cpu cores : 4
>> stream compiled with -O3
>> Function     Rate (MB/s)   Avg time   Min time   Max time
>> Triad:       5395.1815     0.0089     0.0089     0.0089
> I'm not sure if Yeb's stream was compiled to use OpenMP correctly though,
> because I'm not seeing "Number of Threads" in his results. Here's
> what works for me:
>
> gcc -O3 -fopenmp stream.c -o stream
>
> And then you can set:
>
> export OMP_NUM_THREADS=4

Then I get the following. The rather weird dip at 5 threads is consistent over multiple tries:

Threads   Triad rate (MB/s)   Avg time   Min time   Max time
1         5378.7495           0.0089     0.0089     0.0090
2         6596.1140           0.0073     0.0073     0.0073
3         7033.9806           0.0069     0.0068     0.0069
4         7007.2950           0.0069     0.0069     0.0069
5         6553.8133           0.0074     0.0073     0.0074
6         6803.6427           0.0071     0.0071     0.0071
7         6895.6909           0.0070     0.0070     0.0071
8         6931.3018           0.0069     0.0069     0.0070

Other info:
DDR2 800MHz ECC memory
MB: 790FX chipset (Asus m4a78-e)

regards,
Yeb Havinga
Yeb Havinga wrote:
> The rather wierd dip at 5 threads is consistent over multiple tries

I've seen that twice on 4 core systems now. The spot where there's just one more thread than cores seems to be the worst case for cache thrashing on a lot of these servers.

How much total RAM is in this server? Are all the slots filled? Just filling in a spreadsheet I have here with sample configs of various hardware.

Yeb's results look right to me now. That's what an AMD Phenom II X4 940 @ 3.00GHz should look like. It's a little faster, memory-wise, than my older Intel Q6600 @ 2.4GHz. So they've finally caught up with that generation of Intel's stuff. But my current desktop quad-core i860 with hyperthreading is nearly twice as fast in terms of memory access at every thread size. That's why I own one of them instead of a Phenom II X4.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
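The "nearly twice as fast at every thread size" remark can be checked against the Triad rates posted in this thread. This is only a minimal Python sketch; the variable names are illustrative, and the numbers are copied from Greg's Core i7 860 results and Yeb's Phenom II X4 940 results.

```python
# Triad rates in MB/s by thread count, copied from the results in this thread.
i7_860 = {
    1: 9806.2648, 2: 12495.2113, 3: 13388.7187, 4: 13695.6611,
    5: 12651.7200, 6: 12804.7192, 7: 12670.2525, 8: 12468.5739,
}
phenom_ii_x4_940 = {
    1: 5378.7495, 2: 6596.1140, 3: 7033.9806, 4: 7007.2950,
    5: 6553.8133, 6: 6803.6427, 7: 6895.6909, 8: 6931.3018,
}

# Ratio of Intel to AMD bandwidth at each thread count.
ratios = {n: i7_860[n] / phenom_ii_x4_940[n] for n in i7_860}

# Thread count at which each machine reaches its best rate.
best_i7 = max(i7_860, key=i7_860.get)
best_phenom = max(phenom_ii_x4_940, key=phenom_ii_x4_940.get)

for n, ratio in ratios.items():
    print(f"{n} threads: i7-860 is {ratio:.2f}x the Phenom II")
print(f"peak at {best_i7} threads (i7-860), {best_phenom} threads (Phenom II)")
```

The ratios stay a little under 2 at every thread count, and both machines lose bandwidth when going from 4 to 5 threads, which matches the dip discussed above.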
Re: Performance on new 64bit server compared to my 32bit desktop
From
Jose Ildefonso Camargo Tolosa
Date:
Hi!

On Tue, Aug 31, 2010 at 8:11 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
> Greg Smith wrote:
>> Yeb Havinga wrote:
>>> model name : AMD Phenom(tm) II X4 940 Processor @ 3.00GHz
>>> cpu cores : 4
>>> stream compiled with -O3
>>> Function     Rate (MB/s)   Avg time   Min time   Max time
>>> Triad:       5395.1815     0.0089     0.0089     0.0089
>>
>> I'm not sure if Yeb's stream was compiled to use OpenMP correctly though,
>> because I'm not seeing "Number of Threads" in his results. Here's what
>> works for me:
>>
>> gcc -O3 -fopenmp stream.c -o stream
>>
>> And then you can set:
>>
>> export OMP_NUM_THREADS=4
>
> Then I get the following. The rather weird dip at 5 threads is consistent
> over multiple tries:
>
> Threads   Triad rate (MB/s)   Avg time   Min time   Max time
> 1         5378.7495           0.0089     0.0089     0.0090
> 2         6596.1140           0.0073     0.0073     0.0073
> 3         7033.9806           0.0069     0.0068     0.0069
> 4         7007.2950           0.0069     0.0069     0.0069
> 5         6553.8133           0.0074     0.0073     0.0074
> 6         6803.6427           0.0071     0.0071     0.0071
> 7         6895.6909           0.0070     0.0070     0.0071
> 8         6931.3018           0.0069     0.0069     0.0070
>
> Other info: DDR2 800MHz ECC memory

Ok, this could explain the huge difference. I was planning on getting a GigaByte GA-890GPA-UD3H with a Phenom II X6 and this RAM: Crucial CT2KIT25664BA1339, Crucial BL2KIT25664FN1608, or something better that I find when I get enough money (depending on my budget at the moment).

> MB: 790FX chipset (Asus m4a78-e)
>
> regards,
> Yeb Havinga

Thanks for the extra info!

Ildefonso.
Jose Ildefonso Camargo Tolosa wrote:
> Ok, this could explain the huge difference. I was planning on getting
> GigaByte GA-890GPA-UD3H, with a Phenom II X6 and that ram: Crucial
> CT2KIT25664BA1339, Crucial BL2KIT25664FN1608, or something better I
> find when I get enough money (depending on my budget at the moment).

Why not pair an 8-core Magny-Cours ($280,- at newegg http://www.newegg.com/Product/Product.aspx?Item=N82E16819105266) with a Supermicro ATX board http://www.supermicro.com/Aplus/motherboard/Opteron6100/SR56x0/H8SGL-F.cfm ($264 at newegg http://www.newegg.com/Product/Product.aspx?Item=N82E16813182230&Tpk=H8SGL-F) and some memory?

regards,
Yeb Havinga
Re: Performance on new 64bit server compared to my 32bit desktop
From
Jose Ildefonso Camargo Tolosa
Date:
Hi!

On Tue, Aug 31, 2010 at 11:13 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Yeb Havinga wrote:
>>
>> The rather weird dip at 5 threads is consistent over multiple tries
>
> I've seen that twice on 4 core systems now. The spot where there's just one
> more thread than cores seems to be the worst case for cache thrashing on a
> lot of these servers.
>
> How much total RAM is in this server? Are all the slots filled? Just
> filling in a spreadsheet I have here with sample configs of various
> hardware.
>
> Yeb's results look right to me now. That's what an AMD Phenom II X4 940 @
> 3.00GHz should look like. It's a little faster, memory-wise, than my older
> Intel Q6600 @ 2.4GHz. So they've finally caught up with that generation of
> Intel's stuff. But my current desktop quad-core i860 with hyperthreading is
> nearly twice as fast in terms of memory access at every thread size. That's
> why I own one of them instead of a Phenom II X4.

Your i860? http://en.wikipedia.org/wiki/Intel_i860 Wow! :D

Now, seriously: what memory (brand/model) do the Q6600 and your newer desktop have?

I'm just too curious. The last time I was able to run benchmarks myself was with a Core 2 Duo and an Athlon 64 X2; back then, the Core 2 Duo beat the Athlon at almost anything. Nowadays it looks like AMD is playing the "more cores for the money" game, but I think that sooner or later they will catch up again, and when that happens Intel will just get another ET chip and put it on the market, and so on! :D

This is a game where the winners are: us!
Note that in that graph, the odd dips happen every 8 cores on a system with four 12-core processors. I don't know why; I would expect them every 6 cores or so.
And I have zone reclaim set to off, because it makes the Linux kernel on machines with many CPUs make pathologically unsound decisions during large file transfers.
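The zone reclaim setting mentioned here is exposed on Linux as the vm.zone_reclaim_mode sysctl, where 0 means off. Here is a minimal Python sketch for checking it; zone_reclaim_mode is an illustrative helper name, and the function returns None where the file is missing (for example on non-Linux systems).

```python
from pathlib import Path


def zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Return the kernel's zone_reclaim_mode as an int, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing, unreadable, or not a plain integer.
        return None


mode = zone_reclaim_mode()
if mode is None:
    print("zone_reclaim_mode is not available on this system")
elif mode == 0:
    print("zone reclaim is off")
else:
    print(f"zone reclaim is on (mode {mode})")
```

Changing the value is a separate step, normally done as root with sysctl, and is not covered by this sketch.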
Jose Ildefonso Camargo Tolosa wrote:
> your i860? http://en.wikipedia.org/wiki/Intel_i860 wow!. :D

That's supposed to be i7-860: http://en.wikipedia.org/wiki/List_of_Intel_Core_i7_microprocessors

It was a whole $199, so not an expensive processor.

> Now, seriously: what memory (brand/model) does the Q6600 and your
> newer desktop have?

Q6600 is running Corsair DDR2-800 (5-5-5-18): http://www.newegg.com/Product/Product.aspx?Item=N82E16820145176

i7-860 has Corsair DDR3-1600 C8 (8-8-8-24): http://www.newegg.com/Product/Product.aspx?Item=N82E16820145265

Both systems have four 2GB modules in them for 8GB total. I've been happy both with the performance of the Corsair stuff and with how their heat spreader design keeps my grubby fingers off the sensitive parts of the chips. This is all desktop memory though; the registered and ECC stuff for servers tends to be a bit slower, but for good reasons.

> I'm just too curious, last time I was able to run benchmarks myself
> was with a core2duo and a athlon 64 x2, back then: core2duo beat
> athlon at almost anything.

Yes. The point I've made a couple of times here already is that Intel pulled ahead around the Core 2 time, and AMD has been anywhere from a little to way behind ever since. And in the last 18 months that's mainly been related to the memory controller design, not the CPUs themselves. That held until these new Magny-Cours designs, where AMD finally caught back up, particularly on big servers with lots of banks of RAM.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
Scott Marlowe wrote:
> On Tue, Aug 31, 2010 at 6:41 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
>
>>> export OMP_NUM_THREADS=4
>>>
>> Then I get the following. The rather weird dip at 5 threads is consistent
>> over multiple tries:
>
> I get similar dips on my server. Especially as you make the stream
> test write a large enough chunk of data to outrun its caches.
>
> See attached png.

Interesting graph, especially since the overall impression is a roughly linear increase in memory bandwidth as more cores become active.

Just curious, what is the 8-core cpu?

--
Yeb
On Tue, Aug 31, 2010 at 12:55 PM, Yeb Havinga <yebhavinga@gmail.com> wrote:
> Scott Marlowe wrote:
>>
>> On Tue, Aug 31, 2010 at 6:41 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
>>
>>>> export OMP_NUM_THREADS=4
>>>
>>> Then I get the following. The rather wierd dip at 5 threads is consistent
>>> over multiple tries:
>>
>> I get similar dips on my server. Especially as you make the stream
>> test write a large enough chunk of data to outrun its caches.
>>
>> See attached png.
>
> Interesting graph, especially since the overall feeling is a linear like
> increase in memory bandwidth when more cores are active.
>
> Just curious, what is the 8-core cpu?

8 core = dual 2352 cpus (2x4) 2.1 GHz
12 core = dual 2427 cpus (2x6) 2.2 GHz
48 core = quad 6127 cpus (4x12) 2.1 GHz

--
To understand recursion, one must first understand recursion.