Discussion: Do I have a hardware or a software problem?
#### Pitch ######################################################################################

I previously posted this question http://archives.postgresql.org/pgsql-performance/2012-11/msg00289.php about a performance issue with an update query. The question evolved into a more general discussion about my setup, and about a lot of I/O wait that I was encountering. Since then, I have gotten a whole lot more familiar with measuring things, and now I "just" need some experienced eyes to judge which direction I should go in - do I have a hardware issue or a software issue - and what action should I take?

##### My setup #############################################################################

The use case:

At night time we are doing a LOT of data maintenance, and hence the load on the database is very different from the day time. However, we would like to be able to do some of it in the daytime; it's simply just too "heavy" on the database as it is right now. The stats shown below are from one of those "heavy" load times.

Hardware:

- 32GB RAM
- 8-core Xeon E3-1245 processor
- Two SEAGATE ST33000650NS drives (called sdc and sdd in the stats) in a software RAID1 array (called md2 in the stats)
- Two INTEL SSDSC2CW240A3 SSD drives (called sda and sdb in the stats) in a software RAID1 array (called md3 in the stats)

Software:

PostgreSQL 9.2 running on 64-bit Ubuntu 12.04 with kernel 3.2

Configuration:

# postgresql.conf (a shortlist of everything changed from the default)
data_directory = '/var/lib/postgresql/9.2/main'
hba_file = '/etc/postgresql/9.2/main/pg_hba.conf'
ident_file = '/etc/postgresql/9.2/main/pg_ident.conf'
external_pid_file = '/var/run/postgresql/9.2-main.pid'
listen_addresses = '192.168.0.2, localhost'
port = 5432
max_connections = 300
unix_socket_directory = '/var/run/postgresql'
wal_level = hot_standby
synchronous_commit = off
archive_mode = on
archive_command = 'rsync -a %p postgres@192.168.0.4:/var/lib/postgresql/9.2/wals/%f </dev/null'
max_wal_senders = 1
wal_keep_segments = 32
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
datestyle = 'iso, mdy'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
default_statistics_target = 100
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 22GB
work_mem = 160MB
wal_buffers = 4MB
checkpoint_segments = 100
shared_buffers = 4GB
checkpoint_timeout = 10min

The kernel has been tweaked like so:

vm.dirty_ratio = 10
vm.dirty_background_ratio = 1
kernel.shmmax = 8589934592
kernel.shmall = 17179869184

The pg_xlog folder has been moved onto the SSD array (md3), and symlinked back into the postgres dir.
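As a sanity check, the memory-related settings above relate like this (a small sketch using only the numbers already listed; nothing else is assumed):

```python
# Sanity-check arithmetic on the memory settings listed above.

GB = 1024 ** 3

ram             = 32 * GB          # installed RAM
shared_buffers  = 4 * GB           # from postgresql.conf
effective_cache = 22 * GB          # from postgresql.conf
shmmax          = 8589934592       # kernel.shmmax from the sysctl listing (8 GB)

# On PostgreSQL 9.2 the main shared memory segment must fit under
# kernel.shmmax, so shmmax has to exceed shared_buffers plus overhead.
assert shmmax >= shared_buffers

# effective_cache_size is only a planner hint: roughly shared_buffers
# plus the OS page cache expected to hold PostgreSQL data.
print(f"headroom left for OS cache: {(ram - shared_buffers) // GB} GB")
print(f"effective_cache_size      : {effective_cache // GB} GB")
```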
##### The stats ###############################################################

These are the typical observations/stats I see in one of these periods:

1) At top level this is what I see in New Relic:

2) When the database is loaded like this, I see a lot of queries taking up to 1000 times as long as they would when the database is not loaded so heavily.

3) sudo iostat -dmx (typical usage)

Linux 3.2.0-33-generic (master-db)  12/10/2012  _x86_64_  (8 CPU)

Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda        0.00    6.52   3.59   26.61   0.22   0.74     65.49      0.01   0.40     0.77     0.35   0.14   0.43
sdb        0.00    8.31   0.03   28.38   0.00   0.97     69.63      0.01   0.52     0.27     0.52   0.15   0.43
sdc        1.71   46.01  34.83  116.62   0.56   4.06     62.47      1.90  12.57    21.81     9.81   1.89  28.66
sdd        1.67   46.14  34.89  116.49   0.56   4.06     62.46      1.58  10.43    21.66     7.07   1.89  28.60
md1        0.00    0.00   0.00    0.00   0.00   0.00      2.69      0.00   0.00     0.00     0.00   0.00   0.00
md0        0.00    0.00   0.11    0.24   0.00   0.00      8.00      0.00   0.00     0.00     0.00   0.00   0.00
md2        0.00    0.00  72.99  161.95   1.11   4.06     45.10      0.00   0.00     0.00     0.00   0.00   0.00
md3        0.00    0.00   0.05   32.32   0.00   0.74     47.00      0.00   0.00     0.00     0.00   0.00   0.00

4) sudo iotop -oa (running for about a minute or so)

  TID  PRIO  USER      DISK READ  DISK WRITE  SWAPIN      IO>  COMMAND
  292  be/4  root         0.00 B      0.00 B  0.00 %  99.33 %  [md2_raid1]
 2815  be/4  postgres    19.51 M     25.90 M  0.00 %  45.49 %  postgres: autovacuum worker process production
32553  be/4  postgres    45.74 M      9.38 M  0.00 %  37.89 %  postgres: user production 192.168.0.3(58866) UPDATE
32570  be/4  postgres     6.91 M     35.02 M  0.00 %  16.71 %  postgres: user production 192.168.0.3(35547) idle
32575  be/4  postgres     4.06 M     43.90 M  0.00 %  16.62 %  postgres: user production 192.168.0.3(35561) SELECT
31673  be/4  postgres     4.14 M     52.16 M  0.00 %  16.24 %  postgres: user production 192.168.0.3(39112) idle
32566  be/4  postgres     4.73 M     44.95 M  0.00 %  15.66 %  postgres: user production 192.168.0.3(35531) idle
32568  be/4  postgres     4.50 M     33.84 M  0.00 %  14.62 %  postgres: user production 192.168.0.3(35543) SELECT
32573  be/4  postgres     3.20 M     34.44 M  0.00 %  13.98 %  postgres: user production 192.168.0.3(35559) idle
31590  be/4  postgres     3.23 M     29.72 M  0.00 %  13.90 %  postgres: user production 192.168.0.3(50690) idle in transaction
32577  be/4  postgres     5.09 M     25.54 M  0.00 %  13.63 %  postgres: user production 192.168.0.3(35563) idle
32565  be/4  postgres     2.06 M     35.93 M  0.00 %  13.41 %  postgres: user production 192.168.0.3(35529) SELECT
32546  be/4  postgres     4.48 M     36.49 M  0.00 %  13.39 %  postgres: user production 192.168.0.3(56927) UPDATE waiting
32569  be/4  postgres     3.50 M     26.75 M  0.00 %  12.82 %  postgres: user production 192.168.0.3(35545) INSERT
31671  be/4  postgres     4.58 M     24.45 M  0.00 %  12.76 %  postgres: user production 192.168.0.3(34841) idle in transaction
32551  be/4  postgres     3.26 M     31.77 M  0.00 %  12.06 %  postgres: user production 192.168.0.3(58864) idle in transaction
32574  be/4  postgres     5.32 M     32.92 M  0.00 %  11.70 %  postgres: user production 192.168.0.3(35560) idle
32572  be/4  postgres     3.00 M     32.66 M  0.00 %  11.66 %  postgres: user production 192.168.0.3(35558) UPDATE
32560  be/4  postgres     5.12 M     25.89 M  0.00 %  11.52 %  postgres: user production 192.168.0.3(33886) SELECT
32567  be/4  postgres     4.66 M     36.47 M  0.00 %  11.44 %  postgres: user production 192.168.0.3(35534) SELECT
32571  be/4  postgres     2.86 M     31.27 M  0.00 %  11.31 %  postgres: user production 192.168.0.3(35557) SELECT
32552  be/4  postgres     4.38 M     28.75 M  0.00 %  11.09 %  postgres: user production 192.168.0.3(58865) idle in transaction
32554  be/4  postgres     3.69 M     30.21 M  0.00 %  10.90 %  postgres: user production 192.168.0.3(58870) UPDATE
  339  be/3  root         0.00 B      2.29 M  0.00 %   9.81 %  [jbd2/md2-8]
32576  be/4  postgres     3.37 M     19.91 M  0.00 %   9.73 %  postgres: user production 192.168.0.3(35562) idle
32555  be/4  postgres     3.09 M     31.96 M  0.00 %   9.02 %  postgres: user production 192.168.0.3(58875) SELECT
27548  be/4  postgres     0.00 B     97.12 M  0.00 %   7.41 %  postgres: writer process
31445  be/4  postgres   924.00 K     27.35 M  0.00 %   7.11 %  postgres: user production 192.168.0.1(34536) idle
31443  be/4  postgres     2.54 M      4.56 M  0.00 %   6.32 %  postgres: user production 192.168.0.1(34508) idle
31459  be/4  postgres  1480.00 K     21.36 M  0.00 %   5.63 %  postgres: user production 192.168.0.1(34543) idle
 1801  be/4  postgres  1896.00 K     10.89 M  0.00 %   5.57 %  postgres: user production 192.168.0.3(34177) idle
32763  be/4  postgres  1696.00 K      6.95 M  0.00 %   5.33 %  postgres: user production 192.168.0.3(57984) SELECT
 1800  be/4  postgres     2.46 M      5.13 M  0.00 %   5.24 %  postgres: user production 192.168.0.3(34175) SELECT
 1803  be/4  postgres  1816.00 K      9.09 M  0.00 %   5.16 %  postgres: user production 192.168.0.3(34206) idle
32578  be/4  postgres     2.57 M     11.62 M  0.00 %   5.06 %  postgres: user production 192.168.0.3(35564) SELECT
31440  be/4  postgres     3.02 M      4.04 M  0.00 %   4.65 %  postgres: user production 192.168.0.1(34463) idle
32605  be/4  postgres  1844.00 K     11.82 M  0.00 %   4.49 %  postgres: user production 192.168.0.3(40399) idle
27547  be/4  postgres     0.00 B      0.00 B  0.00 %   3.93 %  postgres: checkpointer process
31356  be/4  postgres  1368.00 K      3.27 M  0.00 %   3.93 %  postgres: user production 192.168.0.1(34450) idle
32542  be/4  postgres  1180.00 K      6.05 M  0.00 %   3.90 %  postgres: user production 192.168.0.3(56859) idle
32523  be/4  postgres  1088.00 K      4.33 M  0.00 %   3.59 %  postgres: user production 192.168.0.3(48164) idle
32606  be/4  postgres  1964.00 K      6.94 M  0.00 %   3.51 %  postgres: user production 192.168.0.3(40426) SELECT
31466  be/4  postgres  1596.00 K      3.11 M  0.00 %   3.47 %  postgres: user production 192.168.0.1(34550) idle
32544  be/4  postgres  1184.00 K      4.25 M  0.00 %   3.38 %  postgres: user production 192.168.0.3(56861) idle
31458  be/4  postgres  1088.00 K   1528.00 K  0.00 %   3.33 %  postgres: user production 192.168.0.1(34541) idle
31444  be/4  postgres   884.00 K      4.23 M  0.00 %   3.27 %  postgres: user production 192.168.0.1(34510) idle
32522  be/4  postgres   408.00 K      2.98 M  0.00 %   3.27 %  postgres: user production 192.168.0.5(38361) idle
32762  be/4  postgres  1156.00 K      5.28 M  0.00 %   3.20 %  postgres: user production 192.168.0.3(57962) idle
32582  be/4  postgres  1084.00 K      3.38 M  0.00 %   2.86 %  postgres: user production 192.168.0.5(43104) idle
31353  be/4  postgres     2.04 M      3.02 M  0.00 %   2.82 %  postgres: user production 192.168.0.1(34444) idle
31441  be/4  postgres   700.00 K      2.68 M  0.00 %   2.64 %  postgres: user production 192.168.0.1(34465) idle
31462  be/4  postgres   980.00 K      3.50 M  0.00 %   2.57 %  postgres: user production 192.168.0.1(34547) idle
32709  be/4  postgres   428.00 K      3.23 M  0.00 %   2.56 %  postgres: user production 192.168.0.5(34323) idle
  685  be/4  postgres   748.00 K      3.59 M  0.00 %   2.41 %  postgres: user production 192.168.0.3(34911) idle
  683  be/4  postgres   728.00 K      3.19 M  0.00 %   2.38 %  postgres: user production 192.168.0.3(34868) idle
32765  be/4  postgres   464.00 K      3.76 M  0.00 %   2.21 %  postgres: user production 192.168.0.3(58074) idle
32760  be/4  postgres   808.00 K      6.18 M  0.00 %   2.16 %  postgres: user production 192.168.0.3(57958) idle
 1912  be/4  postgres   372.00 K      3.03 M  0.00 %   2.16 %  postgres: user production 192.168.0.5(33743) idle
31446  be/4  postgres  1004.00 K      2.09 M  0.00 %   2.16 %  postgres: user production 192.168.0.1(34539) idle
31460  be/4  postgres   584.00 K      2.74 M  0.00 %   2.10 %  postgres: user production 192.168.0.1(34545) idle

5) vmstat 1

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b    swpd    free   buff     cache   si  so    bi     bo     in     cs  us sy id wa
 1  1  573424  321080  27124  28504352   0   0   143    618      0      4   2  0 91  7
 0  1  573424  320764  27124  28504496   0   0   104  15654   3788   4961   1  0 85 14
 0  1  573424  320684  27124  28504616   0   0   276  12736   4099   5374   0  1 84 15
 0  1  573424  319672  27124  28504900   0   0    80   7746   3624   4949   2  0 82 16
 0  1  573424  319180  27124  28504972   0   0    36  12489   3653   4761   2  0 86 12
 0  1  573424  318184  27132  28505000   0   0     8  10482   3413   4898   0  0 87 13
 0  1  573424  318424  27132  28505016   0   0     0   9564   2746   4290   0  0 87 13
 0  1  573424  318308  27168  28505016   0   0    36  10562   1895   2149   0  0 87 12
 0  3  573424  318208  27168  28505020   0   0    84  18529   3035   3265   1  0 85 14
 0  1  573424  318732  27176  28505080   0   0    84  14574   2986   3231   0  0 84 16
 0  2  573424  317588  27176  28505184   0   0     4   6681   1991   2207   2  1 86 12
 0  1  573424  316852  27176  28505260   0   0    76   7670   2910   3996   2  1 85 13
 0  1  573424  316632  27184  28505256   0   0     0   7186   2661   3740   0  0 87 12
 0  1  573424  316720  27188  28505260   0   0     0   2590   1731   2474   0  0 88 12
 0  1  573424  314252  27192  28505696   0   0   460  11612   1757   2431   0  0 82 18
 0  2  573424  313504  27192  28505724   0   0     0  19656   1775   2099   0  0 83 17
 0  3  573424  313300  27196  28505780   0   0   188   6237   2746   3193   2  0 80 17
 0  2  573424  312736  27200  28506348   0   0   804  18466   5014   6430   2  1 75 23
 2 35  573424  307564  27200  28509920   0   0  3912  16280  14377  15470  14  3 28 56
 0  5  573424  282848  27208  28533964   0   0  7484  27580  22017  25938  17  3 17 63
 1  5  573424  221100  27208  28563360   0   0  2852   3120  19639  28664  12  5 52 31
 0  4  573428  229912  26704  28519184   0   4  1208   5890  13976  20851  13  3 56 28
 0  2  573448  234680  26672  28513632   0  20     0  17204   1694   2636   0  0 71 28
 3  7  573452  220836  26644  28525548   0   4  1540  36370  27928  36551  17  5 50 27
 1  3  573488  234380  26556  28517416   0  36   584  19066   8275   9467   3  2 60 36
 0  1  573488  234496  26556  28517852   0   0    56  47429   3290   4310   0  0 79 20

6) sudo lsof - a hell of a lot of output, I can post it if anyone is interested :-)

#### Notes and thoughts ##############################################################################

As you can see, even though I have moved the pg_xlog folder to the SSD array (md3), by far the largest amount of writes still goes to the regular HDDs (md2), which puzzles me - what can that be?

From stat 3) (the iostat) I notice that the SSDs don't seem to be anywhere near fully utilized - maybe something other than just pg_xlog could be moved there?

I have no idea whether the amount of reads/writes is within what my kind of hardware should be capable of, or whether it is far beyond.

In stat 4) (the iotop) it says that the RAID array (md2) is the most "waiting" part - does that taste like a root cause, or more like a symptom of some other bottleneck?

Thanks for taking the time to look at my data! :-)
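A rough back-of-the-envelope reading of the iostat sample above (a sketch; the figures are copied from the listing, and the 100-200 random IOPS capacity of a 7200 rpm drive is a general rule of thumb, not a measurement from this system):

```python
# Arithmetic on the iostat sample: how hard is the HDD mirror (md2)
# working compared to the SSD mirror (md3, pg_xlog only)?

md2_writes_per_s = 161.95   # w/s on md2 (HDD RAID1)
md2_write_mb_s   = 4.06     # wMB/s on md2
md3_writes_per_s = 32.32    # w/s on md3 (SSD RAID1)

# Average write size on md2: small, scattered writes, not big sequential ones.
avg_write_kb = md2_write_mb_s * 1024 / md2_writes_per_s
print(f"average write on md2: {avg_write_kb:.1f} KB")

# Each HDD mirror member (sdc/sdd) sees ~150 combined r+w requests/s,
# which is roughly the random-IOPS ceiling of a single 7200 rpm drive -
# consistent with the ~12 ms awaits and ~29% utilisation shown.
sdc_iops = 34.83 + 116.62   # r/s + w/s on sdc
print(f"sdc request rate: {sdc_iops:.0f} IOPS, vs md3 writes: {md3_writes_per_s:.0f}/s")
```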
On Mon, Dec 10, 2012 at 2:51 PM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

> synchronous_commit = off
>
> The pg_xlog folder has been moved onto the SSD array (md3), and symlinked
> back into the postgres dir.

With synchronous_commit = off, or with large transactions, there is probably no advantage to moving those to SSD.

> 2)
> When the database is loaded like this, I see a lot of queries taking up to
> 1000 times as long as they would when the database is not loaded so
> heavily.

What kinds of queries are they? Single-row look-ups, full table scans, etc.?

> #### Notes and thoughts
> ##############################################################################
>
> As you can see, even though I have moved the pg_xlog folder to the SSD array
> (md3), by far the largest amount of writes still goes to the regular HDDs
> (md2), which puzzles me - what can that be?

Every row you insert or non-HOT update has to do maintenance on all indexes of that table. If the rows are not inserted/updated in index order, this means every row inserted/updated dirties a randomly scattered 8KB for each of the indexes. If you have lots of indexes per table, that adds up fast.

The fact that there is much more writing than reading tells me that most of your indexes are in RAM. The amount of index you are rapidly reading and dirtying is large enough to fit in RAM, but is not large enough to fit in shared_buffers + the kernel's dirty-buffer comfort level. So you are redirtying the same blocks over and over; PG is desperately dumping them to the kernel (because shared_buffers is too small to hold them) and the kernel is desperately dumping them to disk, because vm.dirty_background_ratio is so low. There is little opportunity for write-combining, because they don't sit in memory long enough to accumulate neighbors.

How big are your indexes?
You could really crank up shared_buffers or vm.dirty_background_ratio, but doing so might cause problems with checkpoints stalling and latency spikes. That would probably not be a problem during the night, but could be during the day.

Rather than moving maintenance to the day and hoping it doesn't interfere with normal operations, I'd focus on making night-time maintenance more efficient, for example by dropping indexes (either just at night, or, if some indexes are not useful, just getting rid of them altogether), or cranking up shared_buffers at night, or maybe partitioning, or looking into pg_bulkload.

> From stat 3) (the iostat) I notice that the SSDs don't seem to be
> anywhere near fully utilized - maybe something other than just pg_xlog could
> be moved there?

I don't know how big each disk is, or how big your various categories of data are. Could you move everything to SSD? Could you move all your actively updated indexes there?

Or, more fundamentally, it looks like you spent too much on CPUs (86% idle) and not nearly enough on disks. Maybe you can fix that for less money than it will cost you in your optimization time to make the best of the disks you already have.

Cheers,

Jeff
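The index write-amplification argument can be made concrete with a little arithmetic. The update rate and index count below are hypothetical round numbers for illustration, not figures reported in this thread:

```python
# Worst-case dirty-page volume from index maintenance: every non-HOT
# update dirties one randomly scattered 8 KB block per index.

block_size = 8 * 1024        # PostgreSQL page size in bytes
updates_per_sec = 1000       # hypothetical update rate
indexes_per_table = 10       # hypothetical number of indexes on the table

dirtied = updates_per_sec * indexes_per_table * block_size
print(f"worst case: {dirtied / 1024**2:.0f} MB/s of randomly scattered writes")
```

With repeated redirtying of the same hot blocks the disks see less than this, but only if the blocks can linger in memory long enough to be written once instead of many times, which is exactly the shared_buffers / dirty-ratio point above.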
On 11/12/2012 at 00.58, Jeff Janes <jeff.janes@gmail.com> wrote:

> On Mon, Dec 10, 2012 at 2:51 PM, Niels Kristian Schjødt
> <nielskristian@autouncle.com> wrote:
>
>> synchronous_commit = off
>>
>> The pg_xlog folder has been moved onto the SSD array (md3), and symlinked
>> back into the postgres dir.
>
> With synchronous_commit = off, or with large transactions, there is
> probably no advantage to moving those to SSD.
>
>> 2)
>> When the database is loaded like this, I see a lot of queries taking up to
>> 1000 times as long as they would when the database is not loaded so
>> heavily.
>
> What kinds of queries are they? Single-row look-ups, full table scans, etc.?

Well, mostly they are updates. Like the one shown in the previous question I referenced.

>> #### Notes and thoughts
>> ##############################################################################
>>
>> As you can see, even though I have moved the pg_xlog folder to the SSD array
>> (md3), by far the largest amount of writes still goes to the regular HDDs
>> (md2), which puzzles me - what can that be?
>
> Every row you insert or non-HOT update has to do maintenance on all
> indexes of that table. If the rows are not inserted/updated in index
> order, this means every row inserted/updated dirties a randomly
> scattered 8KB for each of the indexes. If you have lots of indexes
> per table, that adds up fast.
>
> The fact that there is much more writing than reading tells me that
> most of your indexes are in RAM. The amount of index you are rapidly
> reading and dirtying is large enough to fit in RAM, but is not large
> enough to fit in shared_buffers + the kernel's dirty-buffer comfort level.

Maybe I should mention that I never see more than max 5GB out of my total 32GB being in use on the server… Can I somehow utilize more of it?
> So you are redirtying the same blocks over and over; PG is
> desperately dumping them to the kernel (because shared_buffers is too
> small to hold them) and the kernel is desperately dumping them to
> disk, because vm.dirty_background_ratio is so low. There is little
> opportunity for write-combining, because they don't sit in memory long
> enough to accumulate neighbors.
>
> How big are your indexes?

This is a size list of all my indexes: 117 MB, 118 MB, 11 MB, 12 MB, 12 MB, 12 MB, 12 MB, 140 MB, 15 MB, 15 MB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 MB, 16 MB, 176 kB, 176 kB, 17 MB, 18 MB, 19 MB, 23 MB, 240 kB, 24 MB, 256 kB, 25 MB, 25 MB, 26 MB, 26 MB, 27 MB, 27 MB, 27 MB, 27 MB, 280 MB, 2832 kB, 2840 kB, 288 kB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 29 MB, 29 MB, 3152 kB, 3280 kB, 32 kB, 32 MB, 32 MB, 3352 kB, 3456 kB, 34 MB, 36 MB, 3744 kB, 3776 kB, 37 MB, 37 MB, 3952 kB, 400 kB, 408 kB, 40 kB, 40 kB, 40 kB, 416 kB, 416 kB, 42 MB, 432 kB, 4520 kB, 4720 kB, 47 MB, 48 kB, 496 kB, 49 MB, 512 kB, 52 MB, 52 MB, 5304 kB, 5928 kB, 6088 kB, 61 MB, 6224 kB, 62 MB, 6488 kB, 64 kB, 6512 kB, 71 MB, 72 kB, 72 kB, 8192 bytes, 8400 kB, 88 MB, 95 MB, 98 MB

> You could really crank up shared_buffers or vm.dirty_background_ratio,
> but doing so might cause problems with checkpoints stalling and
> latency spikes. That would probably not be a problem during the
> night, but could be during the day.

What do you have in mind here? Tweaking what parameters to what values?

> Rather than moving maintenance to the day and hoping it doesn't
> interfere with normal operations, I'd focus on making night-time
> maintenance more efficient, for example by dropping indexes (either
> just at night, or, if some indexes are not useful, just getting rid of them
> altogether), or cranking up shared_buffers at night, or maybe
> partitioning, or looking into pg_bulkload.
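Summing that list programmatically shows the total comfortably fits in 32GB of RAM (a sketch; the string below is the list exactly as quoted above):

```python
import re

# Index sizes pasted from the message above, as one string.
sizes = ("117 MB, 118 MB, 11 MB, 12 MB, 12 MB, 12 MB, 12 MB, 140 MB, 15 MB, 15 MB, "
         "16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, "
         "16 MB, 16 MB, 176 kB, 176 kB, 17 MB, 18 MB, 19 MB, 23 MB, 240 kB, 24 MB, 256 kB, "
         "25 MB, 25 MB, 26 MB, 26 MB, 27 MB, 27 MB, 27 MB, 27 MB, 280 MB, 2832 kB, 2840 kB, "
         "288 kB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 29 MB, 29 MB, "
         "3152 kB, 3280 kB, 32 kB, 32 MB, 32 MB, 3352 kB, 3456 kB, 34 MB, 36 MB, 3744 kB, "
         "3776 kB, 37 MB, 37 MB, 3952 kB, 400 kB, 408 kB, 40 kB, 40 kB, 40 kB, 416 kB, 416 kB, "
         "42 MB, 432 kB, 4520 kB, 4720 kB, 47 MB, 48 kB, 496 kB, 49 MB, 512 kB, 52 MB, 52 MB, "
         "5304 kB, 5928 kB, 6088 kB, 61 MB, 6224 kB, 62 MB, 6488 kB, 64 kB, 6512 kB, 71 MB, "
         "72 kB, 72 kB, 8192 bytes, 8400 kB, 88 MB, 95 MB, 98 MB")

unit = {"bytes": 1, "kB": 1024, "MB": 1024 ** 2}
total = sum(float(n) * unit[u]
            for n, u in re.findall(r"([\d.]+)\s*(bytes|kB|MB)", sizes))

# Roughly 2.3 GB in total - consistent with Jeff's "< 2.5GB" remark below,
# and far smaller than the 32 GB of RAM or even the 4 GB shared_buffers.
print(f"total index size: {total / 1024**2:.0f} MB")
```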
>> From stat 3) (the iostat) I notice that the SSDs don't seem to be
>> anywhere near fully utilized - maybe something other than just pg_xlog could
>> be moved there?
>
> I don't know how big each disk is, or how big your various categories
> of data are. Could you move everything to SSD? Could you move all
> your actively updated indexes there?

With tablespaces, you mean?

> Or, more fundamentally, it looks like you spent too much on CPUs (86%
> idle) and not nearly enough on disks. Maybe you can fix that for less
> money than it will cost you in your optimization time to make the best
> of the disks you already have.

The SSDs I use are 240GB each, which will grow too small within a few months - so - how does moving the whole data dir onto four of those in a RAID5 array sound?

> Cheers,
>
> Jeff
On 12/11/2012 06:04 PM, Niels Kristian Schjødt wrote:

> Maybe I should mention that I never see more than max 5GB out of my total 32GB being in use on the server… Can I somehow utilize more of it?

For an update-mostly workload it probably won't do you tons of good so long as all your indexes fit in RAM. You're clearly severely bottlenecked on disk I/O, not RAM.

> The SSDs I use are 240GB each, which will grow too small within a
> few months - so - how does moving the whole data dir onto four of
> those in a RAID5 array sound?

Not RAID 5!

Use a RAID10 of four or six SSDs.

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
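The reason RAID5 is a poor fit for a write-heavy database is the classic small-write penalty. A sketch with a made-up per-drive IOPS figure (illustrative only):

```python
# Small-write cost: RAID5 turns each small random write into four I/Os
# (read old data, read old parity, write data, write parity), while
# RAID10 needs only two (write both mirror copies).

drive_iops = 10000   # hypothetical random-write IOPS of a single SSD
n_drives = 4         # drives in the array

raid5_write_iops  = n_drives * drive_iops / 4   # penalty factor 4
raid10_write_iops = n_drives * drive_iops / 2   # penalty factor 2

# RAID10 sustains twice the random-write rate of RAID5 on the same drives,
# and a degraded RAID5 rebuild is far more punishing as well.
print(raid5_write_iops, raid10_write_iops)
```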
On 11/12/2012 at 14.29, Craig Ringer <craig@2ndQuadrant.com> wrote:

> For an update-mostly workload it probably won't do you tons of good so
> long as all your indexes fit in RAM. You're clearly severely
> bottlenecked on disk I/O, not RAM.
>
> Not RAID 5!
>
> Use a RAID10 of four or six SSDs.

Hehe, got it - did you have a look at the SSDs I am considering building it from? http://ark.intel.com/products/66250/Intel-SSD-520-Series-240GB-2_5in-SATA-6Gbs-25nm-MLC

Are they suitable, do you think?
On Dec 11, 2012, at 5:35 PM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

> Hehe, got it - did you have a look at the SSDs I am considering building it from? http://ark.intel.com/products/66250/Intel-SSD-520-Series-240GB-2_5in-SATA-6Gbs-25nm-MLC
> Are they suitable, do you think?

I am not Craig, but I use them in production in a RAID10 array now.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
On Tue, Dec 11, 2012 at 2:04 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

>> The fact that there is much more writing than reading tells me that
>> most of your indexes are in RAM. The amount of index you are rapidly
>> reading and dirtying is large enough to fit in RAM, but is not large
>> enough to fit in shared_buffers + the kernel's dirty-buffer comfort level.
>
> Maybe I should mention that I never see more than max 5GB out of my total 32GB being in use on the server… Can I somehow utilize more of it?

What tool do you use to determine that? Is that on top of the 4GB shared_buffers, or including it? How big is your entire data set? Maybe all your data fits in 5GB (believable, as all your indexes listed below sum to < 2.5GB), so there is no need to use more.

Or maybe you have hit a bug in the 3.2 kernel. At least one of those has been frequently discussed.

>> You could really crank up shared_buffers or vm.dirty_background_ratio,
>> but doing so might cause problems with checkpoints stalling and
>> latency spikes. That would probably not be a problem during the
>> night, but could be during the day.
>
> What do you have in mind here? Tweaking what parameters to what values?

I'd set shared_buffers to 20GB (or 10GB, if that will hold all of your data) and see what happens. And probably increase checkpoint_timeout and checkpoint_segments about 3x each. Also, turn on log_checkpoints so you can see what kinds of problems those changes may be causing there (i.e. long sync times). Preferably you'd do this on some kind of pre-production or test server.

But if your database is growing so rapidly that it soon won't fit on 240GB, then cranking up shared_buffers won't do for long. If you can get your tables and all of their indexes clustered together, then you can do the updates in an order that makes IO more efficient. Maybe partitioning would help.
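Written out as a postgresql.conf fragment, that experiment would look something like this (the values are the suggested multiples of the posted config, to be tried on a test server first):

```
shared_buffers = 20GB        # or 10GB, if that already holds the whole data set
checkpoint_timeout = 30min   # ~3x the current 10min
checkpoint_segments = 300    # ~3x the current 100
log_checkpoints = on         # watch the logs for long sync times
```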
>> I don't know how big each disk is, or how big your various categories
>> of data are. Could you move everything to SSD? Could you move all
>> your actively updated indexes there?
>
> With tablespaces, you mean?

Yes. Or moving everything to SSD if it fits; then you don't have to go through and separate objects.

The UPDATE you posted in a previous thread looked like the table blocks might also be getting dirtied in a fairly random order, which means the table blocks are in the same condition as the index blocks, so maybe singling out the indexes isn't warranted.

Cheers,

Jeff
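For reference, moving objects onto the SSD array with tablespaces looks roughly like this (the tablespace name, path, and index name are placeholders, not names from this thread):

```sql
-- One-time setup: a tablespace on a directory on the SSD array
CREATE TABLESPACE ssd LOCATION '/ssd/postgresql/tblspc';

-- Move a hot index there; note this rewrites the index and
-- takes a lock on the parent table while it runs
ALTER INDEX my_hot_index SET TABLESPACE ssd;
```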
On 11/12/2012 at 18.16, Evgeny Shishkin <itparanoia@gmail.com> wrote:

> I am not Craig, but I use them in production in a RAID10 array now.

And what is your experience so far?
On Dec 11, 2012, at 10:54 PM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

> And what is your experience so far?

Increased tps by a factor of 10; the database is no longer a limiting factor of the application. And it is cheaper than brand-name rotating drives.
Are you using a hardware based raid controller with them?
Den 11/12/2012 20.11 skrev "Evgeny Shishkin" <itparanoia@gmail.com>:On Dec 11, 2012, at 10:54 PM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:And what is your experience so far?
Increased tps by a factor of 10, database no longer a limiting factor of application.And it is cheaper than brand rotating drives.Den 11/12/2012 18.16 skrev "Evgeny Shishkin" <itparanoia@gmail.com>:
On Dec 11, 2012, at 5:35 PM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
>
> Den 11/12/2012 kl. 14.29 skrev Craig Ringer <craig@2ndQuadrant.com>:
>
>> On 12/11/2012 06:04 PM, Niels Kristian Schjødt wrote:
>>>
>>> Maybe I should mention, that I never see more than max 5Gb out of my total 32Gb being in use on the server… Can I somehow utilize more of it?
>> For an update-mostly workload it probably won't do you tons of good so
>> long as all your indexes fit in RAM. You're clearly severely
>> bottlenecked on disk I/O not RAM.
>>> The SSD's I use a are 240Gb each which will grow too small within a
>>> few months - so - how does moving the whole data dir onto four of
>>> those in a RAID5 array sound?
>>
>> Not RAID 5!
>>
>> Use a RAID10 of four or six SSDs.
>>
>> --
>> Craig Ringer http://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Training & Services
>>
> Hehe got it - did you have a look at the SSD's I am considering building it of? http://ark.intel.com/products/66250/Intel-SSD-520-Series-240GB-2_5in-SATA-6Gbs-25nm-MLC
> Are they suitable do you think?
>
I am not Craig, but i use them in production in raid10 array now.
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
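Craig's "Not RAID 5!" advice follows from the standard small-write penalties: a RAID5 random write costs four I/Os (read old data, read old parity, write both), while RAID10 costs two (one write per mirror side). A back-of-envelope sketch; the per-SSD IOPS figure below is purely illustrative, not from the thread:

```shell
# Illustrative only: per-drive IOPS is assumed, not measured.
# RAID5 pays 4 I/Os per small random write; RAID10 pays 2.
disk_iops=5000
disks=4
echo "RAID10 random-write IOPS ~ $(( disk_iops * disks / 2 ))"
echo "RAID5  random-write IOPS ~ $(( disk_iops * disks / 4 ))"
```

Whatever the absolute numbers, the 2x gap in random-write capacity is inherent to the layouts, which is why RAID10 is the usual choice for write-heavy databases.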
On Dec 12, 2012, at 2:41 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
> Are you using a hardware based raid controller with them?
Yes, of course. Hardware RAID with cache and BBU is a must. You can't get fast fsync without it.
Most SSDs should offer fairly fast fsync without a hardware RAID controller, as they do write-back caching. The trick is to find ones that do write-back caching safely, so you don't get severe data corruption on power-loss.
A HW RAID controller is an absolute must for rotating magnetic media, though.
Also mdadm is a pain in the ass and is suitable only on amazon and other cloud shit.
I've personally been pretty happy with mdadm. I find the array portability it offers very useful, so I don't need to buy a second RAID controller just in case my main controller dies and I need a compatible one to get the array running again. If you don't need a BBU for safe write-back caching then mdadm has advantages over hardware RAID.
I'll certainly use mdadm over onboard fakeraid solutions or low-end hardware RAID controllers. I suspect a mid- to high end HW RAID unit will generally win.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> Or, worse, some of them do limited write caching but don't protect their write cache from power loss. Instant data corruption!

Actually most low-end SSDs don't do write caching; they do not have enough RAM for that. SandForce, for example.
I would be extremely reluctant to use low-end SSDs for a database server.
I'd want to consider whether the same money is better spent on faster, higher quality SSDs with their own fast write caches.

If we are talking about a dedicated machine for a database with SSD drives, why wouldn't anybody buy a hardware RAID for about $500-700?
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> Actually most low-end SSDs don't do write caching; they do not have enough RAM for that.
AIUI, *all* SSDs do write-caching of a sort: writes are actually flushed to the NAND media by erasing, and then overwriting the erased space, and erasing is done in fixed-size blocks, usually much larger than a filesystem's pages. The drive's controller accumulates writes in an on-board cache until it has an "erase block"'s worth of them, which are then flushed. From casual searching, a common erase block size is 256 kbytes, while filesystem-level pages are usually 4k.
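The mismatch described above can be put in numbers, using the sizes from the post (256 kB erase blocks, 4 kB filesystem pages):

```shell
# Pages the controller must accumulate to fill one erase block;
# this buffering is why some write caching is inherent to SSD design.
erase_block_kb=256
fs_page_kb=4
echo "pages per erase block: $(( erase_block_kb / fs_page_kb ))"
```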
For more than a year already I have run consumer SSDs with a $300-400 HW RAID. Cost effective and fast; maybe not very safe, but so far so good. All data protection measures from PostgreSQL are on, of course.
You're aware that many low end SSDs lie to the RAID controller about having written data, right? Even if the RAID controller sends a flush command, the SSD might cache the write in non-durable cache. If you're using such SSDs and you lose power, data corruption is extremely likely, because your SSDs are essentially ignoring fsync.
Your RAID controller's BBU won't save you, because once the disks tell the RAID controller the data has hit durable storage, the RAID controller feels free to flush it from its battery backed cache. If the disks are lying...
The only solid way to find out if this is an issue with your SSDs is to do plug-pull testing and find out.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
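Short of an actual plug-pull test, a quick timing probe can at least flag suspicious behavior: a 7200 rpm drive that truly flushes on every sync is limited to roughly its rotational rate (on the order of 100-200 syncs/s), so rates in the thousands on rotating media mean some cache is absorbing the flushes. A rough sketch, not from the thread, using GNU dd's oflag=dsync to force a synchronous write each iteration:

```shell
# Time N small synchronous writes, the pattern a WAL commit produces.
# Suspiciously high rates imply a cache (safe or unsafe) is absorbing
# the flushes rather than the medium itself.
N=100
f=$(mktemp)
start=$(date +%s%N)
i=0
while [ "$i" -lt "$N" ]; do
    dd if=/dev/zero of="$f" bs=8k count=1 oflag=dsync conv=notrunc 2>/dev/null
    i=$((i + 1))
done
end=$(date +%s%N)
rm -f "$f"
echo "$(( N * 1000000000 / (end - start) )) syncs/sec"
```

This only measures rate, not honesty; a drive with a supercapacitor-backed cache will legitimately show high numbers. Only cutting power mid-write, as suggested above, settles the question.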
On 12/11/2012 7:13 PM, Evgeny Shishkin wrote:
> Yes, i am aware of this issue. Never experienced this neither on intel 520, no ocz vertex 3.
> Have you heard of them on this list?

People have done plug-pull tests and reported the results on the list (sometime in the past couple of years).

But you don't need to do the test to know these drives are not safe. They're unsafe by design.
On Dec 12, 2012, at 6:15 AM, David Boreham <david_list@boreham.org> wrote:
> People have done plug-pull tests and reported the results on the list (sometime in the past couple of years).
>
> But you don't need to do the test to know these drives are not safe. They're unsafe by design.

Oh, there is no 100% safe system. The only way to be sure is to read the data back. Everything about system design is a tradeoff between cost and risks.
On 12/11/2012 7:20 PM, Evgeny Shishkin wrote:
> Oh, there is no 100% safe system.

In this case we're discussing specifically "safety in the event of power loss shortly after the drive indicates to the controller that it has committed a write operation". Some drives do provide 100% safety against this event, and they don't cost much more than those that don't.
On Dec 12, 2012, at 6:26 AM, David Boreham <david_list@boreham.org> wrote:
> In this case we're discussing specifically "safety in the event of power loss shortly after the drive indicates to the controller that it has committed a write operation". Some drives do provide 100% safety against this event, and they don't cost much more than those that don't.

Which drives would you recommend? Besides intel 320 and 710.
On 12/11/2012 7:38 PM, Evgeny Shishkin wrote:
> Which drives would you recommend? Besides intel 320 and 710.

Those are the only drive types we have deployed in servers at present (almost all 710, but we have some 320 for less mission-critical machines). The new DC S3700 series looks nice too, but isn't yet in the sales channel:
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html
> Yes, i am aware of this issue. Never experienced this neither on intel 520, no ocz vertex 3.
I wouldn't trust either of those drives. The 520 doesn't have Intel's "Enhanced Power Loss Data Protection"; it's going to lose its buffers if it loses power. Similarly, the Vertex 3 doesn't have any kind of power protection. See:
http://www.intel.com/content/www/us/en/solid-state-drives/ssd-320-series-power-loss-data-protection-brief.html
http://ark.intel.com/products/family/56572/Intel-SSD-500-Family
http://www.ocztechnology.com/res/manuals/OCZ_SSD_Breakdown_Q2-11_1.pdf
The only way I'd use those for a production server would be if I had synchronous replication running to another machine with trustworthy, durable storage - and if I didn't mind some downtime to restore the corrupt DB from the replica after power loss.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12/12/12 15:41, David Boreham wrote:
> On 12/11/2012 7:38 PM, Evgeny Shishkin wrote:
>> Which drives would you recommend? Besides intel 320 and 710.
> Those are the only drive types we have deployed in servers at present (almost all 710, but we have some 320 for less mission-critical machines). The new DC S3700 series looks nice too, but isn't yet in the sales channel:
> http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html

+1

The S3700 is probably the one to get (when it is available). I'd opt for the 710 if you need something now. I'd avoid the 320 - we have encountered the firmware bug whereby you get an 8MB (yes, 8MB) capacity after powerdown with a depressingly large number of them (and they were updated to the latest firmware too).

Regards

Mark
On 12/11/2012 7:49 PM, Evgeny Shishkin wrote:
> Yeah, s3700 looks promising, but sata interface is limiting factor for this drive.
> I'm looking towards SMART ssd http://www.storagereview.com/smart_storage_systems_optimus_sas_enterprise_ssd_review

What don't you like about SATA? I prefer to avoid SAS drives if possible due to the price premium for dubious benefits besides vague hand-waving "enterprise-ness" promises.
On 12/11/2012 8:11 PM, Evgeny Shishkin wrote, quoting http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review:

Intel makes the case that the S3700 is the ideal drive for entry, mainstream and performance enterprise computing, including HPC use cases. The claim is bold, largely because of the decision to go with a SATA interface, which has several limitations in the enterprise. The SATA interface tops out at a queue depth of 32 (SAS scales as high as 256 in most cases), which means that when requests go above that level, average and peak latency spike, as we saw in all of our workloads.
Another huge advantage of SAS is the ability to offer dual-port modes for high availability scenarios, where there are two controllers interfacing with the same drive at the same time. In the event one goes offline, the connection with the SSD is not lost, as it would with a standard SATA interface without additional hardware. Some SAS drives also offer wide-port configurations used to increase total bandwidth above a single-link connection. While the Intel SSD DC S3700 against other SATA competitors is very fast, the story changes when you introduce the latest MLC and SLC-based SAS SSDs, which can cope better with increased thread and queue levels.
We picked the primary post-preconditioning sections of our benchmarks after each SSD had reached steady-state. For the purposes of this section, we added the Intel SSD DC S3700 onto the throughput charts of the newest SAS high-performance SSDs. There are also significant latency differences at higher queue depths that play a significant factor, but for the sake of easy comparison we stick with raw I/O speed across varying thread and queue counts.
In a 100% 4K random write or random read scenario, the Intel SSD DC 3700 performs quite well up against the high-end SAS competition, with the second fastest 4K steady-state speed. When you switch focus to read throughput at a heavy 16T/16Q load it only offers 1/2 to 1/3 the performance of SSDs in this category.
http://www.storagereview.com/images/intel_ssd_dc_s3700_main_slc_4kwrite_throughput.png
On 12/11/2012 8:11 PM, Evgeny Shishkin wrote:
> Quoting http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review

Heh. A fine example of the kind of hand-waving of which I spoke ;)

Higher performance is certainly a benefit, although at present we can't saturate even a single 710 series drive (the application, CPU, OS, etc is the bottleneck). Similarly, while dual-porting certainly has its uses, it is not something I need.
On 11/12/2012 at 18.25, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Dec 11, 2012 at 2:04 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
>> On 11/12/2012 at 00.58, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> The fact that there is much more writing than reading tells me that most of your indexes are in RAM. The amount of index you are rapidly reading and dirtying is large enough to fit in RAM, but is not large enough to fit in shared_buffers + kernel's dirty-buffer comfort level.
>> Maybe I should mention, that I never see more than max 5Gb out of my total 32Gb being in use on the server… Can I somehow utilize more of it?
> What tool do you use to determine that? Is that on top of the 4GB shared_buffers, or including it?

Okay, I might not have made myself clear: I was talking about "physical" memory utilization. Here are the stats:

    free -m
                 total       used       free     shared    buffers     cached
    Mem:         32075      25554       6520          0         69      22694
    -/+ buffers/cache:       2791      29284
    Swap:         2046        595       1451

> How big is your entire data set? Maybe all your data fits in 5GB (believable, as all your indexes listed below sum to < 2.5GB) so there is no need to use more.

It doesn't - we are a search engine for used cars, and there are quite a lot of those out there :-) However, my indexes are almost all partial indexes, which means that they are only on cars which are still for sale, so in that sense the indexes themselves don't really grow, but the tables do.

> Or maybe you have hit a bug in the 3.2 kernel. At least one of those has been frequently discussed.

Might be true - but likely?

>>> You could really crank up shared_buffers or vm.dirty_background_ratio, but doing so might cause problems with checkpoints stalling and latency spikes. That would probably not be a problem during the night, but could be during the day.
>> What do you have in mind here? Tweaking what parameters to what values?
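As a sanity check on the free -m output above: the "-/+ buffers/cache" row is just arithmetic on the "Mem" row, since the kernel's page cache (buffers + cached) is reclaimable and counts toward effectively free memory:

```shell
# Recompute the "-/+ buffers/cache" row from the Mem row above
# (values in MB; free -m rounds, so results match up to 1 MB).
total=32075 used=25554 free=6520 buffers=69 cached=22694
echo "used minus cache: $(( used - buffers - cached )) MB"   # free reports 2791
echo "free plus cache:  $(( free + buffers + cached )) MB"   # free reports 29284
```

So only about 2.8 GB is held by processes; the "missing" memory is almost entirely filesystem cache, which is exactly what you want on a database server.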
> I'd set shared_buffers to 20GB (or 10GB, if that will hold all of your data) and see what happens. And probably increase checkpoint_timeout and checkpoint_segments about 3x each. Also, turn on log_checkpoints so you can see what kinds of problems those changes may be causing there (i.e. long sync times). Preferably you do this on some kind of pre-production or test server.

I had that before; Shaun suggested that I change it to 4GB, as he was talking about a strange behavior when larger than that on 12.04. But I can say that there has not been any notable difference between having it at 4GB and at 8GB.

> But if your database is growing so rapidly that it soon won't fit on 240GB, then cranking up shared_buffers won't do for long. If you can get your tables and all of their indexes clustered together, then you can do the updates in an order that makes IO more efficient. Maybe partitioning would help.

Can you explain a little more about this, or provide me a good link?

>>> I don't know how big each disk is, or how big your various categories of data are. Could you move everything to SSD? Could you move all your actively updated indexes there?
>> With table spaces you mean?
> Yes. Or moving everything to SSD if it fits, then you don't have to go through and separate objects.
>
> The UPDATE you posted in a previous thread looked like the table blocks might also be getting dirtied in a fairly random order, which means the table blocks are in the same condition as the index blocks, so maybe singling out the indexes isn't warranted.
>
> Cheers,
>
> Jeff
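Jeff's experiment, expressed as a postgresql.conf fragment in the style of the configuration posted above. The checkpoint values assume the stock defaults (5min timeout, 3 segments), which are not shown in the posted config, so treat the exact numbers as illustrative:

```
# Temporary settings for the bulk-update experiment, not permanent tuning
shared_buffers = 20GB          # or 10GB, if that holds the whole data set
checkpoint_timeout = 15min     # ~3x the 5min default
checkpoint_segments = 9        # ~3x the 3-segment default
log_checkpoints = on           # watch the logs for long sync times
```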
> Well, in fact I do (as you can see from my configuration). I have a similar server running with hot standby replication - and it runs two 3T HDDs in a RAID1 array.
> So - is it still very bad if I choose to put four intel 520 disks in a RAID10 array on the other production server?

So long as you have it recording to a synchronous replica on another machine and you're fully prepared to accept the small risk that you'll have total and unrecoverable data corruption on that server, with the corresponding downtime while you rebuild it from the replica, it should be OK.
Alternately, you could use PITR with a base backup to ship WAL to another machine or a reliable HDD, so you can recover all but the last checkpoint_timeout minutes of data from the base backup + WAL. There's a small window of data loss that way, but you don't need a second machine as a streaming replication follower. barman is worth checking out as a management tool for PITR backups.
If the data is fairly low-value you could even just take nightly backups and accept the risk of losing some data.
--
Craig Ringer
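For reference, the synchronous replica Craig treats as a precondition would mean flipping the posted config's synchronous_commit = off and naming a standby; the standby name below is hypothetical and must match the standby's application_name:

```
synchronous_commit = on                  # the posted config currently has this off
synchronous_standby_names = 'standby1'   # hypothetical; matches the standby's application_name
```

Note this trades write latency for durability: every commit then waits for the standby's acknowledgement, the opposite end of the spectrum from the current asynchronous setup.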
On Wed, Dec 12, 2012 at 8:46 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
>>> Maybe I should mention, that I never see more than max 5Gb out of my total 32Gb being in use on the server… Can I somehow utilize more of it?
>> What tool do you use to determine that? Is that on top of the 4GB shared_buffers, or including it?
> Okay, I might not have made myself clear: I was talking about "physical" memory utilization. Here are the stats:
>
>     free -m
>                  total       used       free     shared    buffers     cached
>     Mem:         32075      25554       6520          0         69      22694
>     -/+ buffers/cache:       2791      29284
>     Swap:         2046        595       1451

I don't know how you get 5 gig from that, though. You have 22 gig of cached file system, which for your purposes probably counts as being utilized. Although we don't know how much of this is for postgres data files, chances are it is a pretty good chunk.

>> How big is your entire data set? Maybe all your data fits in 5GB (believable, as all your indexes listed below sum to < 2.5GB) so there is no need to use more.
> It doesn't - we are a search engine for used cars, and there are quite a lot of those out there :-)

But how big? More than 22GB? (You can use \l+ in psql, or du -s on the data directory.)

> However, my indexes are almost all partial indexes, which means that they are only on cars which are still for sale, so in that sense the indexes themselves don't really grow, but the tables do.

So maybe this reverses things. If your car table is huge and the active cars are scattered randomly among all the inactive ones, then updating random active cars is going to generate a lot of randomly scattered writing which can't be combined into sequential writes. Do you have plans for archiving cars that are no longer for sale? Why do you keep them around in the first place, i.e. what types of queries do you do on inactive ones?

Unfortunately you currently can't use CLUSTER with partial indexes, otherwise that might be a good idea. You could build a full index on whatever it is you use as the criterion for the partial indexes, cluster on that, and then drop it. But the table would eventually become unclustered again, so if this works you might want to implement partitioning between active and inactive partitions so as to maintain the clustering.

>>>> You could really crank up shared_buffers or vm.dirty_background_ratio, but doing so might cause problems with checkpoints stalling and latency spikes. That would probably not be a problem during the night, but could be during the day.
>>> What do you have in mind here? Tweaking what parameters to what values?
>> I'd set shared_buffers to 20GB (or 10GB, if that will hold all of your data)
> I had that before; Shaun suggested that I change it to 4GB, as he was talking about a strange behavior when larger than that on 12.04. But I can say that there has not been any notable difference between having it at 4GB and at 8GB.

It is almost an all or nothing thing. If you need 16 or 20GB, just going from 4 to 8 isn't going to show much difference. If you can test this easily, I'd just set it to 24 or even 28GB and run the bulk update. I don't think you'd want to run a server permanently at those settings, but it is an easy way to rule in or out different theories about what is going on.

>> But if your database is growing so rapidly that it soon won't fit on 240GB, then cranking up shared_buffers won't do for long. If you can get your tables and all of their indexes clustered together, then you can do the updates in an order that makes IO more efficient. Maybe partitioning would help.
> Can you explain a little more about this, or provide me a good link?

If all your partial btree indexes are using the same WHERE clause, then your indexes are already clustered together in a sense: a partial index is kind of like a composite index with the WHERE clause as the first column. So the trick would be to get the table to be clustered on the same thing, either by partitioning or by the CLUSTER command, or something equivalent to those.

I don't know of a good link, other than the documentation (which is more about how to do it, rather than why you would want to or how to design it).

Cheers,

Jeff
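Jeff's build-cluster-drop idea as SQL. The table name "cars" and the column "sale_state" are hypothetical stand-ins for whatever the partial indexes' WHERE clause actually tests:

```sql
-- Hypothetical names throughout; adapt to the real schema.
-- Build a full index on the partial-index criterion...
CREATE INDEX cars_sale_state_full_idx ON cars (sale_state);
-- ...physically reorder the table by it (takes an exclusive lock
-- and rewrites the table, so do it in a maintenance window)...
CLUSTER cars USING cars_sale_state_full_idx;
-- ...then drop the helper index and refresh planner statistics.
-- The physical order remains until enough churn un-clusters the table.
DROP INDEX cars_sale_state_full_idx;
ANALYZE cars;
```

As Jeff notes, this decays over time, which is why partitioning active and inactive rows is the durable version of the same idea.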
On Tue, Dec 11, 2012 at 6:03 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> Most SSDs should offer fairly fast fsync without a hardware RAID controller, as they do write-back caching. The trick is to find ones that do write-back caching safely, so you don't get severe data corruption on power-loss.
>
> A HW RAID controller is an absolute must for rotating magnetic media, though.
>
> I've personally been pretty happy with mdadm. I find the array portability it offers very useful, so I don't need to buy a second RAID controller just in case my main controller dies and I need a compatible one to get the array running again. If you don't need a BBU for safe write-back caching then mdadm has advantages over hardware RAID.
>
> I'll certainly use mdadm over onboard fakeraid solutions or low-end hardware RAID controllers. I suspect a mid- to high end HW RAID unit will generally win.

Also, for sequential throughput md RAID is usually faster than most RAID controllers, even the high end Areca and LSI ones.