Discussion: Do I have a hardware or a software problem?


Do I have a hardware or a software problem?

From:
Niels Kristian Schjødt
Date:
#### Pitch ######################################################################################
I previously posted this question http://archives.postgresql.org/pgsql-performance/2012-11/msg00289.php about a performance issue with an update query. 
The question evolved into a more general discussion about my setup, and about a lot of I/O wait that I was encountering. Since then, I have gotten a whole lot more familiar with measuring things, and now I "just" need some experienced eyes to judge which direction I should go in - do I have a hardware issue, or a software issue - and what action should I take?

#####  My setup #############################################################################
The use case:
At night we do a LOT of data maintenance, so the load on the database is very different from the daytime load. However, we would like to be able to do some of that work during the day; it is simply too "heavy" on the database as things stand. The stats shown below are from one of those "heavy" load periods.

Hardware: 
  - 32 GB RAM
  - 8-core Xeon E3-1245 processor
  - Two SEAGATE ST33000650NS drives (called sdc and sdd in the stats) in a software RAID1 array (called md2 in the stats)
  - Two INTEL SSDSC2CW240A3 SSD drives (called sda and sdb in the stats) in a software RAID1 array (called md3 in the stats)

Software:
Postgres 9.2 running on 64-bit Ubuntu 12.04 with kernel 3.2

Configuration:
# postgresql.conf (a shortlist of everything changed from the default)
data_directory = '/var/lib/postgresql/9.2/main'
hba_file = '/etc/postgresql/9.2/main/pg_hba.conf'
ident_file = '/etc/postgresql/9.2/main/pg_ident.conf'
external_pid_file = '/var/run/postgresql/9.2-main.pid'
listen_addresses = '192.168.0.2, localhost'
port = 5432
max_connections = 300
unix_socket_directory = '/var/run/postgresql'
wal_level = hot_standby
synchronous_commit = off
archive_mode = on
archive_command = 'rsync -a %p postgres@192.168.0.4:/var/lib/postgresql/9.2/wals/%f </dev/null'
max_wal_senders = 1
wal_keep_segments = 32
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
datestyle = 'iso, mdy'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
default_statistics_target = 100
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 22GB
work_mem = 160MB
wal_buffers = 4MB
checkpoint_segments = 100
shared_buffers = 4GB
checkpoint_timeout = 10min

The kernel has been tweaked like so:
vm.dirty_ratio = 10
vm.dirty_background_ratio = 1
kernel.shmmax = 8589934592
kernel.shmall = 17179869184

The pg_xlog folder has been moved onto the SSD array (md3), and symlinked back into the postgres dir.

##### The stats ###############################################################
These are the typical observations/stats I see in one of these periods:

1)
At the top level, this is what I see in New Relic:

2)
When the database is loaded like this, I see a lot of queries taking up to 1000 times as long as they would when the database is not loaded so heavily.

3)
sudo iostat -dmx (typical usage)
Linux 3.2.0-33-generic (master-db)  12/10/2012  _x86_64_ (8 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     6.52    3.59   26.61     0.22     0.74    65.49     0.01    0.40    0.77    0.35   0.14   0.43
sdb               0.00     8.31    0.03   28.38     0.00     0.97    69.63     0.01    0.52    0.27    0.52   0.15   0.43
sdc               1.71    46.01   34.83  116.62     0.56     4.06    62.47     1.90   12.57   21.81    9.81   1.89  28.66
sdd               1.67    46.14   34.89  116.49     0.56     4.06    62.46     1.58   10.43   21.66    7.07   1.89  28.60
md1               0.00     0.00    0.00    0.00     0.00     0.00     2.69     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.11    0.24     0.00     0.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00   72.99  161.95     1.11     4.06    45.10     0.00    0.00    0.00    0.00   0.00   0.00
md3               0.00     0.00    0.05   32.32     0.00     0.74    47.00     0.00    0.00    0.00    0.00   0.00   0.00

4)
sudo iotop -oa (running for about a minute or so)
TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
  292    be/4 root               0.00 B      0.00 B    0.00 % 99.33 % [md2_raid1]
 2815  be/4 postgres     19.51 M     25.90 M  0.00 % 45.49 % postgres: autovacuum worker process   production
32553 be/4 postgres     45.74 M      9.38 M  0.00 % 37.89 % postgres: user production 192.168.0.3(58866) UPDATE
32570 be/4 postgres      6.91 M     35.02 M  0.00 % 16.71 % postgres: user production 192.168.0.3(35547) idle
32575 be/4 postgres      4.06 M     43.90 M  0.00 % 16.62 % postgres: user production 192.168.0.3(35561) SELECT
31673 be/4 postgres      4.14 M     52.16 M  0.00 % 16.24 % postgres: user production 192.168.0.3(39112) idle
32566 be/4 postgres      4.73 M     44.95 M  0.00 % 15.66 % postgres: user production 192.168.0.3(35531) idle
32568 be/4 postgres      4.50 M     33.84 M  0.00 % 14.62 % postgres: user production 192.168.0.3(35543) SELECT
32573 be/4 postgres      3.20 M     34.44 M  0.00 % 13.98 % postgres: user production 192.168.0.3(35559) idle
31590 be/4 postgres      3.23 M     29.72 M  0.00 % 13.90 % postgres: user production 192.168.0.3(50690) idle in transaction
32577 be/4 postgres      5.09 M     25.54 M  0.00 % 13.63 % postgres: user production 192.168.0.3(35563) idle
32565 be/4 postgres      2.06 M     35.93 M  0.00 % 13.41 % postgres: user production 192.168.0.3(35529) SELECT
32546 be/4 postgres      4.48 M     36.49 M  0.00 % 13.39 % postgres: user production 192.168.0.3(56927) UPDATE waiting
32569 be/4 postgres      3.50 M     26.75 M  0.00 % 12.82 % postgres: user production 192.168.0.3(35545) INSERT
31671 be/4 postgres      4.58 M     24.45 M  0.00 % 12.76 % postgres: user production 192.168.0.3(34841) idle in transaction
32551 be/4 postgres      3.26 M     31.77 M  0.00 % 12.06 % postgres: user production 192.168.0.3(58864) idle in transaction
32574 be/4 postgres      5.32 M     32.92 M  0.00 % 11.70 % postgres: user production 192.168.0.3(35560) idle
32572 be/4 postgres      3.00 M     32.66 M  0.00 % 11.66 % postgres: user production 192.168.0.3(35558) UPDATE
32560 be/4 postgres      5.12 M     25.89 M  0.00 % 11.52 % postgres: user production 192.168.0.3(33886) SELECT
32567 be/4 postgres      4.66 M     36.47 M  0.00 % 11.44 % postgres: user production 192.168.0.3(35534) SELECT
32571 be/4 postgres      2.86 M     31.27 M  0.00 % 11.31 % postgres: user production 192.168.0.3(35557) SELECT
32552 be/4 postgres      4.38 M     28.75 M  0.00 % 11.09 % postgres: user production 192.168.0.3(58865) idle in transaction
32554 be/4 postgres      3.69 M     30.21 M  0.00 % 10.90 % postgres: user production 192.168.0.3(58870) UPDATE
  339    be/3 root               0.00 B       2.29 M  0.00 %  9.81 % [jbd2/md2-8]
32576 be/4 postgres      3.37 M     19.91 M  0.00 %  9.73 % postgres: user production 192.168.0.3(35562) idle
32555 be/4 postgres      3.09 M     31.96 M  0.00 %  9.02 % postgres: user production 192.168.0.3(58875) SELECT
27548 be/4 postgres      0.00 B     97.12 M  0.00 %  7.41 % postgres: writer process
31445 be/4 postgres    924.00 K     27.35 M  0.00 %  7.11 % postgres: user production 192.168.0.1(34536) idle
31443 be/4 postgres      2.54 M      4.56 M  0.00 %  6.32 % postgres: user production 192.168.0.1(34508) idle
31459 be/4 postgres   1480.00 K     21.36 M  0.00 %  5.63 % postgres: user production 192.168.0.1(34543) idle
 1801 be/4 postgres   1896.00 K     10.89 M  0.00 %  5.57 % postgres: user production 192.168.0.3(34177) idle
32763 be/4 postgres   1696.00 K      6.95 M  0.00 %  5.33 % postgres: user production 192.168.0.3(57984) SELECT
 1800 be/4 postgres      2.46 M      5.13 M  0.00 %  5.24 % postgres: user production 192.168.0.3(34175) SELECT
 1803 be/4 postgres   1816.00 K      9.09 M  0.00 %  5.16 % postgres: user production 192.168.0.3(34206) idle
32578 be/4 postgres      2.57 M     11.62 M  0.00 %  5.06 % postgres: user production 192.168.0.3(35564) SELECT
31440 be/4 postgres      3.02 M      4.04 M  0.00 %  4.65 % postgres: user production 192.168.0.1(34463) idle
32605 be/4 postgres   1844.00 K     11.82 M  0.00 %  4.49 % postgres: user production 192.168.0.3(40399) idle
27547 be/4 postgres      0.00 B      0.00 B  0.00 %  3.93 % postgres: checkpointer process
31356 be/4 postgres   1368.00 K      3.27 M  0.00 %  3.93 % postgres: user production 192.168.0.1(34450) idle
32542 be/4 postgres   1180.00 K      6.05 M  0.00 %  3.90 % postgres: user production 192.168.0.3(56859) idle
32523 be/4 postgres   1088.00 K      4.33 M  0.00 %  3.59 % postgres: user production 192.168.0.3(48164) idle
32606 be/4 postgres   1964.00 K      6.94 M  0.00 %  3.51 % postgres: user production 192.168.0.3(40426) SELECT
31466 be/4 postgres   1596.00 K      3.11 M  0.00 %  3.47 % postgres: user production 192.168.0.1(34550) idle
32544 be/4 postgres   1184.00 K      4.25 M  0.00 %  3.38 % postgres: user production 192.168.0.3(56861) idle
31458 be/4 postgres   1088.00 K   1528.00 K  0.00 %  3.33 % postgres: user production 192.168.0.1(34541) idle
31444 be/4 postgres    884.00 K      4.23 M  0.00 %  3.27 % postgres: user production 192.168.0.1(34510) idle
32522 be/4 postgres    408.00 K      2.98 M  0.00 %  3.27 % postgres: user production 192.168.0.5(38361) idle
32762 be/4 postgres   1156.00 K      5.28 M  0.00 %  3.20 % postgres: user production 192.168.0.3(57962) idle
32582 be/4 postgres   1084.00 K      3.38 M  0.00 %  2.86 % postgres: user production 192.168.0.5(43104) idle
31353 be/4 postgres      2.04 M      3.02 M  0.00 %  2.82 % postgres: user production 192.168.0.1(34444) idle
31441 be/4 postgres    700.00 K      2.68 M  0.00 %  2.64 % postgres: user production 192.168.0.1(34465) idle
31462 be/4 postgres    980.00 K      3.50 M  0.00 %  2.57 % postgres: user production 192.168.0.1(34547) idle
32709 be/4 postgres    428.00 K      3.23 M  0.00 %  2.56 % postgres: user production 192.168.0.5(34323) idle
  685 be/4 postgres    748.00 K      3.59 M  0.00 %  2.41 % postgres: user production 192.168.0.3(34911) idle
  683 be/4 postgres    728.00 K      3.19 M  0.00 %  2.38 % postgres: user production 192.168.0.3(34868) idle
32765 be/4 postgres    464.00 K      3.76 M  0.00 %  2.21 % postgres: user production 192.168.0.3(58074) idle
32760 be/4 postgres    808.00 K      6.18 M  0.00 %  2.16 % postgres: user production 192.168.0.3(57958) idle
 1912 be/4 postgres    372.00 K      3.03 M  0.00 %  2.16 % postgres: user production 192.168.0.5(33743) idle
31446 be/4 postgres   1004.00 K      2.09 M  0.00 %  2.16 % postgres: user production 192.168.0.1(34539) idle
31460 be/4 postgres    584.00 K      2.74 M  0.00 %  2.10 % postgres: user production 192.168.0.1(34545) idle

5) vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  1 573424 321080  27124 28504352    0    0   143   618    0    4  2  0 91  7
 0  1 573424 320764  27124 28504496    0    0   104 15654 3788 4961  1  0 85 14
 0  1 573424 320684  27124 28504616    0    0   276 12736 4099 5374  0  1 84 15
 0  1 573424 319672  27124 28504900    0    0    80  7746 3624 4949  2  0 82 16
 0  1 573424 319180  27124 28504972    0    0    36 12489 3653 4761  2  0 86 12
 0  1 573424 318184  27132 28505000    0    0     8 10482 3413 4898  0  0 87 13
 0  1 573424 318424  27132 28505016    0    0     0  9564 2746 4290  0  0 87 13
 0  1 573424 318308  27168 28505016    0    0    36 10562 1895 2149  0  0 87 12
 0  3 573424 318208  27168 28505020    0    0    84 18529 3035 3265  1  0 85 14
 0  1 573424 318732  27176 28505080    0    0    84 14574 2986 3231  0  0 84 16
 0  2 573424 317588  27176 28505184    0    0     4  6681 1991 2207  2  1 86 12
 0  1 573424 316852  27176 28505260    0    0    76  7670 2910 3996  2  1 85 13
 0  1 573424 316632  27184 28505256    0    0     0  7186 2661 3740  0  0 87 12
 0  1 573424 316720  27188 28505260    0    0     0  2590 1731 2474  0  0 88 12
 0  1 573424 314252  27192 28505696    0    0   460 11612 1757 2431  0  0 82 18
 0  2 573424 313504  27192 28505724    0    0     0 19656 1775 2099  0  0 83 17
 0  3 573424 313300  27196 28505780    0    0   188  6237 2746 3193  2  0 80 17
 0  2 573424 312736  27200 28506348    0    0   804 18466 5014 6430  2  1 75 23
 2 35 573424 307564  27200 28509920    0    0  3912 16280 14377 15470 14  3 28 56
 0  5 573424 282848  27208 28533964    0    0  7484 27580 22017 25938 17  3 17 63
 1  5 573424 221100  27208 28563360    0    0  2852  3120 19639 28664 12  5 52 31
 0  4 573428 229912  26704 28519184    0    4  1208  5890 13976 20851 13  3 56 28
 0  2 573448 234680  26672 28513632    0   20     0 17204 1694 2636  0  0 71 28
 3  7 573452 220836  26644 28525548    0    4  1540 36370 27928 36551 17  5 50 27
 1  3 573488 234380  26556 28517416    0   36   584 19066 8275 9467  3  2 60 36
 0  1 573488 234496  26556 28517852    0    0    56 47429 3290 4310  0  0 79 20

6) sudo lsof - a hell of a lot of output, I can post it if anyone is interested :-)

#### Notes and thoughts  ##############################################################################

As you can see, even though I have moved the pg_xlog folder to the SSD array (md3), by far the largest amount of writes still goes to the regular HDDs (md2), which puzzles me - what can that be?
From stat 3) (the iostat) I notice that the SSDs don't seem to be anywhere near fully utilized - maybe something other than just pg_xlog could be moved here?
I have no idea whether the amount of reads/writes is within what my kind of hardware can reasonably handle, or far beyond it.
In stat 4) (the iotop) it says that the RAID array (md2) is the most "waiting" part - does that look like a root cause, or more like a symptom of some other bottleneck?
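
One way to attribute those md2 writes inside PostgreSQL itself is pg_stat_bgwriter (a rough sketch; the counters are cumulative since the last stats reset, so sample before and after a heavy period and compare):

-- Who is physically writing dirty shared buffers out?
--   buffers_checkpoint: written during checkpoints
--   buffers_clean:      written by the background writer
--   buffers_backend:    written directly by ordinary backends under pressure
SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean,
       buffers_backend, buffers_backend_fsync,
       stats_reset
FROM pg_stat_bgwriter;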

Thanks for taking the time to look at my data! :-)

Re: Do I have a hardware or a software problem?

From:
Evgeny Shishkin
Date:

On Dec 11, 2012, at 2:51 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

> The pg_xlog folder has been moved onto the SSD array (md3), and symlinked back into the postgres dir.


Actually, you should move the xlog onto the rotating drives, since WAL is written sequentially, and put everything else on the SSDs, because of their random I/O pattern.



Re: Do I have a hardware or a software problem?

From:
Jeff Janes
Date:
On Mon, Dec 10, 2012 at 2:51 PM, Niels Kristian Schjødt
<nielskristian@autouncle.com> wrote:

> synchronous_commit = off
>
> The pg_xlog folder has been moved onto the SSD array (md3), and symlinked
> back into the postgres dir.

With synchronous_commit = off, or with large transactions, there is
probably no advantage to moving those to SSD.


> 2)
> When the database is loaded like this, I see a lot of queries talking up to
> 1000 times as long, as they would when the database is not loaded so
> heavily.

What kinds of queries are they?  single-row look-ups, full table scans, etc.


>
> #### Notes and thoughts
> ##############################################################################
>
> As you can see, even though I have moved the pg_xlog folder to the SSD array
> (md3) the by far largest amount of writes still goes to the regular HDD's
> (md2), which puzzles me - what can that be?

Every row you insert or non-HOT update has to do maintenance on all
indexes of that table.  If the rows are not inserted/updated in index
order, this means every row inserted/updated dirties a randomly
scattered 8KB for each of the indexes.  If you have lots of indexes
per table, that adds up fast.
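
A rough way to see how much of that update traffic is HOT (and therefore skips index maintenance) is pg_stat_user_tables - a sketch, with whatever table names you have:

SELECT relname, n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 20;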

The fact that there is much more writing than reading tells me that
most of your indexes are in RAM.  The amount of index you are rapidly
reading and dirtying is large enough to fit in RAM, but is not large
enough to fit in shared_buffers + kernel's dirty-buffer comfort level.
 So you are redirtying the same blocks over and over, PG is
desperately dumping them to the kernel (because shared_buffers is too
small to hold them) and the kernel is desperately dumping them to
disk, because vm.dirty_background_ratio is so low.  There is little
opportunity for write-combining, because they don't sit in memory long
enough to accumulate neighbors.

How big are your indexes?

You could really crank up shared_buffers or vm.dirty_background_ratio,
but doing so might cause problems with checkpoints stalling and
latency spikes.  That would probably not be a problem during the
night, but could be during the day.

Rather than moving maintenance to the day and hoping it doesn't
interfere with normal operations, I'd focus on making night-time
maintenance more efficient, for example by dropping indexes (either
just at night, or if some indexes are not useful, just get rid of them
altogether), or cranking up shared_buffers at night, or maybe
partitioning or look into pg_bulkload.
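
A night-time drop/rebuild could look roughly like this (sketch only; the index and table names are placeholders, and CREATE/DROP INDEX CONCURRENTLY must be run outside a transaction block):

DROP INDEX CONCURRENTLY some_maintenance_heavy_idx;
-- ... run the heavy bulk updates here ...
CREATE INDEX CONCURRENTLY some_maintenance_heavy_idx ON some_table (some_column);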

> From stat 3) (the iostat) I notice that the SSD's doesn't seem to be
> something near fully utilized - maybe something else than just pg_xlog could
> be moved her?

I don't know how big each disk is, or how big your various categories
of data are.  Could you move everything to SSD?  Could you move all
your actively updated indexes there?

Or, more fundamentally, it looks like you spent too much on CPUs (86%
idle) and not nearly enough on disks.  Maybe you can fix that for less
money than it will cost you in your optimization time to make the best
of the disks you already have.

Cheers,

Jeff


Re: Do I have a hardware or a software problem?

From:
Niels Kristian Schjødt
Date:
Den 11/12/2012 kl. 00.58 skrev Jeff Janes <jeff.janes@gmail.com>:

> On Mon, Dec 10, 2012 at 2:51 PM, Niels Kristian Schjødt
> <nielskristian@autouncle.com> wrote:
>
>> synchronous_commit = off
>>
>> The pg_xlog folder has been moved onto the SSD array (md3), and symlinked
>> back into the postgres dir.
>
> With synchronous_commit = off, or with large transactions, there is
> probably no advantage to moving those to SSD.
>
>
>> 2)
>> When the database is loaded like this, I see a lot of queries talking up to
>> 1000 times as long, as they would when the database is not loaded so
>> heavily.
>
> What kinds of queries are they?  single-row look-ups, full table scans, etc.
Well, mostly they are updates, like the one shown in the previous question I referenced.
>>
>> #### Notes and thoughts
>> ##############################################################################
>>
>> As you can see, even though I have moved the pg_xlog folder to the SSD array
>> (md3) the by far largest amount of writes still goes to the regular HDD's
>> (md2), which puzzles me - what can that be?
>
> Every row you insert or non-HOT update has to do maintenance on all
> indexes of that table.  If the rows are not inserted/updated in index
> order, this means you every row inserted/updated dirties a randomly
> scattered 8KB for each of the indexes.  If you have lots of indexes
> per table, that adds up fast.
>
> The fact that there is much more writing than reading tells me that
> most of your indexes are in RAM.  The amount of index you are rapidly
> reading and dirtying is large enough to fit in RAM, but is not large
> enough to fit in shared_buffers + kernel's dirty-buffer comfort level.
Maybe I should mention, that I never see more than max 5Gb out of my total 32Gb being in use on the server… Can I somehow utilize more of it?
> So you are redirtying the same blocks over and over, PG is
> desperately dumping them to the kernel (because shared_buffers it too
> small to hold them) and the kernel is desperately dumping them to
> disk, because vm.dirty_background_ratio is so low.  There is little
> opportunity for write-combining, because they don't sit in memory long
> enough to accumulate neighbors.
>
> How big are your indexes?
This is a size list of all my indexes: 117 MB, 118 MB, 11 MB, 12 MB, 12 MB, 12 MB, 12 MB, 140 MB, 15 MB, 15 MB, 16 kB,
16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 MB, 16 MB, 176 kB, 176 kB, 17 MB, 18 MB,
19 MB, 23 MB, 240 kB, 24 MB, 256 kB, 25 MB, 25 MB, 26 MB, 26 MB, 27 MB, 27 MB, 27 MB, 27 MB, 280 MB, 2832 kB, 2840 kB,
288 kB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 29 MB, 29 MB, 3152 kB, 3280 kB, 32 kB, 32 MB, 32 MB,
3352 kB, 3456 kB, 34 MB, 36 MB, 3744 kB, 3776 kB, 37 MB, 37 MB, 3952 kB, 400 kB, 408 kB, 40 kB, 40 kB, 40 kB, 416 kB,
416 kB, 42 MB, 432 kB, 4520 kB, 4720 kB, 47 MB, 48 kB, 496 kB, 49 MB, 512 kB, 52 MB, 52 MB, 5304 kB, 5928 kB, 6088 kB,
61 MB, 6224 kB, 62 MB, 6488 kB, 64 kB, 6512 kB, 71 MB, 72 kB, 72 kB, 8192 bytes, 8400 kB, 88 MB, 95 MB, 98 MB
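A list like that can be generated with a query along these lines (just a sketch):

SELECT n.nspname, c.relname,
       pg_size_pretty(pg_relation_size(c.oid)) AS size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'i'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(c.oid) DESC;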
> You could really crank up shared_buffers or vm.dirty_background_ratio,
> but doing so might cause problems with checkpoints stalling and
> latency spikes.  That would probably not be a problem during the
> night, but could be during the day.
What do you have in mind here? Tweaking what parameters to what values?
> Rather than moving maintenance to the day and hoping it doesn't
> interfere with normal operations, I'd focus on making night-time
> maintenance more efficient, for example by dropping indexes (either
> just at night, or if some indexes are not useful, just get rid of them
> altogether), or cranking up shared_buffers at night, or maybe
> partitioning or look into pg_bulkload.
>
>> From stat 3) (the iostat) I notice that the SSD's doesn't seem to be
>> something near fully utilized - maybe something else than just pg_xlog could
>> be moved her?
>
> I don't know how big each disk is, or how big your various categories
> of data are.  Could you move everything to SSD?  Could you move all
> your actively updated indexes there?
With tablespaces, you mean?
> Or, more fundamentally, it looks like you spent too much on CPUs (86%
> idle) and not nearly enough on disks.  Maybe you can fix that for less
> money than it will cost you in your optimization time to make the best
> of the disks you already have.
The SSDs I use are 240 GB each, which will grow too small within a few months - so how does moving the whole data dir onto four of those in a RAID5 array sound?
>
> Cheers,
>
> Jeff



Re: Do I have a hardware or a software problem?

From:
Craig Ringer
Date:
On 12/11/2012 06:04 PM, Niels Kristian Schjødt wrote:
>
> Maybe I should mention, that I never see more than max 5Gb out of my total 32Gb being in use on the server… Can I somehow utilize more of it?
For an update-mostly workload it probably won't do you tons of good so
long as all your indexes fit in RAM. You're clearly severely
bottlenecked on disk I/O not RAM.
> The SSD's I use a are 240Gb each which will grow too small within a
> few months - so - how does moving the whole data dir onto four of
> those in a RAID5 array sound?

Not RAID 5!

Use a RAID10 of four or six SSDs.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Do I have a hardware or a software problem?

From:
Niels Kristian Schjødt
Date:
Den 11/12/2012 kl. 14.29 skrev Craig Ringer <craig@2ndQuadrant.com>:

> On 12/11/2012 06:04 PM, Niels Kristian Schjødt wrote:
>>
>> Maybe I should mention, that I never see more than max 5Gb out of my total 32Gb being in use on the server… Can I somehow utilize more of it?
> For an update-mostly workload it probably won't do you tons of good so
> long as all your indexes fit in RAM. You're clearly severely
> bottlenecked on disk I/O not RAM.
>> The SSD's I use a are 240Gb each which will grow too small within a
>> few months - so - how does moving the whole data dir onto four of
>> those in a RAID5 array sound?
>
> Not RAID 5!
>
> Use a RAID10 of four or six SSDs.
>
> --
> Craig Ringer                   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
Hehe, got it - did you have a look at the SSDs I am considering building it from?
http://ark.intel.com/products/66250/Intel-SSD-520-Series-240GB-2_5in-SATA-6Gbs-25nm-MLC
Are they suitable, do you think?



Re: Do I have a hardware or a software problem?

From:
Evgeny Shishkin
Date:
On Dec 11, 2012, at 5:35 PM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

>
> Hehe got it - did you have a look at the SSD's I am considering building it of? http://ark.intel.com/products/66250/Intel-SSD-520-Series-240GB-2_5in-SATA-6Gbs-25nm-MLC
> Are they suitable do you think?
>

I am not Craig, but I use them in production in a RAID10 array now.




Re: Do I have a hardware or a software problem?

From:
Jeff Janes
Date:
On Tue, Dec 11, 2012 at 2:04 AM, Niels Kristian Schjødt
<nielskristian@autouncle.com> wrote:
> Den 11/12/2012 kl. 00.58 skrev Jeff Janes <jeff.janes@gmail.com>:
>
>>
>> The fact that there is much more writing than reading tells me that
>> most of your indexes are in RAM.  The amount of index you are rapidly
>> reading and dirtying is large enough to fit in RAM, but is not large
>> enough to fit in shared_buffers + kernel's dirty-buffer comfort level.

> Maybe I should mention, that I never see more than max 5Gb out of my total 32Gb being in use on the server… Can I somehow utilize more of it?

What tool do you use to determine that?  Is that on top of the 4GB
shared_buffers, or including it?

How big is your entire data set?  Maybe all your data fits in 5GB
(believable, as all your indexes listed below sum to < 2.5GB) so there
is no need to use more.
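
Something like this would tell you (assuming you can run it on the production database):

SELECT pg_size_pretty(pg_database_size(current_database()));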

Or maybe you have hit a bug in the 3.2 kernel.  At least one of those
has been frequently discussed.


>> You could really crank up shared_buffers or vm.dirty_background_ratio,
>> but doing so might cause problems with checkpoints stalling and
>> latency spikes.  That would probably not be a problem during the
>> night, but could be during the day.

> What do you have in mind here? Tweaking what parameters to what values?

I'd set shared_buffers to 20GB (or 10GB, if that will hold all of your
data) and see what happens.  And probably increase checkpoint_timeout
and checkpoint_segments about 3x each. Also, turn on log_checkpoints
so you can see what kinds of problem those changes may be causing
there (i.e. long sync times).  Preferably you do this on some kind of
pre-production or test server.
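
In postgresql.conf terms that experiment might look roughly like this (the values are just the starting points discussed above, not a drop-in recommendation):

shared_buffers = 20GB            # or 10GB, if that already holds the whole data set
checkpoint_timeout = 30min       # ~3x the current 10min
checkpoint_segments = 300        # ~3x the current 100
log_checkpoints = on             # watch the log for long write/sync times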

But if your database is growing so rapidly that it soon won't fit on
240GB, then cranking up shared_buffers won't do for long.  If you can
get your tables and all of their indexes clustered together, then you
can do the updates in an order that makes IO more efficient.  Maybe
partitioning would help.


>> I don't know how big each disk is, or how big your various categories
>> of data are.  Could you move everything to SSD?  Could you move all
>> your actively updated indexes there?

> With table spaces you mean?

Yes.  Or moving everything to SSD if it fits, then you don't have to go
through and separate objects.
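
If you go the tablespace route, the mechanics are just (paths and index names below are placeholders):

-- an empty directory on the SSD array, owned by the postgres user
CREATE TABLESPACE ssd LOCATION '/ssd/pg_tblspc';
-- move one heavily updated index (takes a lock while the index is copied)
ALTER INDEX some_hot_index SET TABLESPACE ssd;
-- optionally make new objects land there by default
ALTER DATABASE production SET default_tablespace = 'ssd';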

The UPDATE you posted in a previous thread looked like the table
blocks might also be getting dirtied in a fairly random order, which
means the table blocks are in the same condition as the index blocks
so maybe singling out the indexes isn't warranted.

Cheers,

Jeff


Re: Do I have a hardware or a software problem?

From:
Niels Kristian Schjødt
Date:

And what is your experience so far?

Den 11/12/2012 18.16 skrev "Evgeny Shishkin" <itparanoia@gmail.com>:

I am not Craig, but I use them in production in a RAID10 array now.

Re: Do I have a hardware or a software problem?

From:
Evgeny Shishkin
Date:

On Dec 11, 2012, at 10:54 PM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

And what is your experience so far?

Increased tps by a factor of 10, database no longer a limiting factor of application.
And it is cheaper than brand rotating drives.




Re: Do I have a hardware or a software problem?

From:
Niels Kristian Schjødt
Date:

Are you using a hardware based raid controller with them?

Den 11/12/2012 20.11 skrev "Evgeny Shishkin" <itparanoia@gmail.com>:

On Dec 11, 2012, at 10:54 PM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

And what is your experience so far?

Increased tps by a factor of 10, database no longer a limiting factor of application.
And it is cheaper than brand rotating drives.




Re: Do I have a hardware or a software problem?

From:
Evgeny Shishkin
Date:

On Dec 12, 2012, at 2:41 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

Are you using a hardware based raid controller with them?

Yes, of course. Hardware raid with cache and bbu is a must. You can't get fast fsync without it.
Also mdadm is a pain in the ass and is suitable only on amazon and other cloud shit.




Re: Do I have a hardware or a software problem?

From:
Craig Ringer
Date:
On 12/12/2012 06:44 AM, Evgeny Shishkin wrote:

On Dec 12, 2012, at 2:41 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

Are you using a hardware based raid controller with them?

Yes, of course. Hardware raid with cache and bbu is a must. You can't get fast fsync without it.

Most SSDs should offer fairly fast fsync without a hardware RAID controller, as they do write-back caching. The trick is to find ones that do write-back caching safely, so you don't get severe data corruption on power-loss.

A HW RAID controller is an absolute must for rotating magnetic media, though.

Also mdadm is a pain in the ass and is suitable only on amazon and other cloud shit.

I've personally been pretty happy with mdadm. I find the array portability it offers very useful, so I don't need to buy a second RAID controller just in case my main controller dies and I need a compatible one to get the array running again. If you don't need a BBU for safe write-back caching then mdadm has advantages over hardware RAID.

I'll certainly use mdadm over onboard fakeraid solutions or low-end hardware RAID controllers. I suspect a mid- to high end HW RAID unit will generally win.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Do I have a hardware or a software problem?

From:
Evgeny Shishkin
Date:

On Dec 12, 2012, at 5:03 AM, Craig Ringer <craig@2ndQuadrant.com> wrote:

On 12/12/2012 06:44 AM, Evgeny Shishkin wrote:

On Dec 12, 2012, at 2:41 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

Are you using a hardware based raid controller with them?

Yes, of course. Hardware raid with cache and bbu is a must. You can't get fast fsync without it.

Most SSDs should offer fairly fast fsync without a hardware RAID controller, as they do write-back caching. The trick is to find ones that do write-back caching safely, so you don't get severe data corruption on power-loss.


Actually most low-end SSDs don't do write caching; they do not have enough RAM for that. SandForce, for example.

A HW RAID controller is an absolute must for rotating magnetic media, though.

Also mdadm is a pain in the ass and is suitable only on amazon and other cloud shit.

I've personally been pretty happy with mdadm. I find the array portability it offers very useful, so I don't need to buy a second RAID controller just in case my main controller dies and I need a compatible one to get the array running again. If you don't need a BBU for safe write-back caching then mdadm has advantages over hardware RAID.


If we are talking about a dedicated machine for a database with SSD drives, why wouldn't anybody buy a hardware RAID for about $500-700?

I'll certainly use mdadm over onboard fakeraid solutions or low-end hardware RAID controllers. I suspect a mid- to high end HW RAID unit will generally win.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Do I have a hardware or a software problem?

From:
Craig Ringer
Date:
On 12/12/2012 09:17 AM, Evgeny Shishkin wrote:

Actually most of low-end SSDs don't do write caching, they do not have enough ram for that. Sandforce for example.

Or, worse, some of them do limited write caching but don't protect their write cache from power loss. Instant data corruption!

I would be extremely reluctant to use low-end SSDs for a database server.

If we are talking about dedicated machine for database with ssd drives, why would anybody don't by hardware raid for about 500-700$?
I'd want to consider whether the same money is better spent on faster, higher quality SSDs with their own fast write caches.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Do I have a hardware or a software problem?

From:
Evgeny Shishkin
Date:

On Dec 12, 2012, at 5:29 AM, Craig Ringer <craig@2ndQuadrant.com> wrote:

On 12/12/2012 09:17 AM, Evgeny Shishkin wrote:

Actually most of low-end SSDs don't do write caching, they do not have enough ram for that. Sandforce for example.

Or, worse, some of them do limited write caching but don't protect their write cache from power loss. Instant data corruption!

I would be extremely reluctant to use low-end SSDs for a database server.

If we are talking about dedicated machine for database with ssd drives, why would anybody don't by hardware raid for about 500-700$?
I'd want to consider whether the same money is better spent on faster, higher quality SSDs with their own fast write caches.


High-quality SSDs cost $5-7 per GB, consumer-grade around $1, high-end $11.
The new Intel DC S3700 is $2-3 per GB, as far as I remember.

So far, for more than a year already, I have run consumer SSDs behind a $300-400 HW RAID controller. Cost effective and fast, maybe not very safe, but so far so good. All data protection measures in PostgreSQL are on, of course.
--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Do I have a hardware or a software problem?

From:
Rosser Schwarz
Date:
On Tue, Dec 11, 2012 at 5:17 PM, Evgeny Shishkin <itparanoia@gmail.com> wrote:
Actually most of low-end SSDs don't do write caching, they do not have enough ram for that.

AIUI, *all* SSDs do write-caching of a sort: writes are actually flushed to the NAND media by erasing, and then overwriting the erased space, and erasing is done in fixed-size blocks, usually much larger than a filesystem's pages.  The drive's controller accumulates writes in an on-board cache until it has an "erase block"'s worth of them, which are then flushed.  From casual searching, a common erase block size is 256 kbytes, while filesystem-level pages are usually 4k.

Most low-end (and even many mid-range) SSDs, including Sandforce-based drives, don't offer any form of protection (e.g., supercaps, as featured on the Intel 320 and 710-series drives) for the data in that write cache, however, which may be what you're thinking of.  I wouldn't let one of those anywhere near one of my servers, unless it was a completely disposable, load-balanced slave, and probably not even then.

rls

--
:wq

Re: Do I have a hardware or a software problem?

From:
Craig Ringer
Date:
On 12/12/2012 09:44 AM, Evgeny Shishkin wrote:
So far, more than a year already, i bought consumer ssds with 300-400$ hw raid. Cost effective and fast, may be not very safe, but so far so good. All data protection measures from postgresql are on, of course.

You're aware that many low end SSDs lie to the RAID controller about having written data, right? Even if the RAID controller sends a flush command, the SSD might cache the write in non-durable cache. If you're using such SSDs and you lose power, data corruption is extremely likely, because your SSDs are essentially ignoring fsync.

Your RAID controller's BBU won't save you, because once the disks tell the RAID controller the data has hit durable storage, the RAID controller feels free to flush it from its battery backed cache. If the disks are lying...

The only solid way to find out whether this is an issue with your SSDs is to do plug-pull testing.
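
A rough sketch of such a test, using Brad Fitzpatrick's diskchecker.pl (the tool usually suggested for this); the host name, path and size below are only placeholders:

# on a second machine that stays powered on, start the listener:
diskchecker.pl -l

# on the machine with the SSDs under test, write test data onto the array:
diskchecker.pl -s other-host create /mnt/ssd/diskchecker.test 500

# pull the power cord mid-run, boot the machine again, then verify:
diskchecker.pl -s other-host verify /mnt/ssd/diskchecker.test
# any reported errors mean an acknowledged write was lost, i.e. something
# in the stack is not honouring flushes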

-- Craig Ringer                   http://www.2ndQuadrant.com/
   PostgreSQL Development, 24x7 Support, Training & Services

Re: Do I have a hardware or a software problem?

От
Evgeny Shishkin
Дата:

On Dec 12, 2012, at 6:02 AM, Craig Ringer <craig@2ndQuadrant.com> wrote:

On 12/12/2012 09:44 AM, Evgeny Shishkin wrote:
For more than a year now I have been buying consumer SSDs behind a $300-400 hardware RAID controller. Cost-effective and fast; maybe not very safe, but so far so good. All the data-protection measures in PostgreSQL are on, of course.

You're aware that many low end SSDs lie to the RAID controller about having written data, right? Even if the RAID controller sends a flush command, the SSD might cache the write in non-durable cache. If you're using such SSDs and you lose power, data corruption is extremely likely, because your SSDs are essentially ignoring fsync.

Your RAID controller's BBU won't save you, because once the disks tell the RAID controller the data has hit durable storage, the RAID controller feels free to flush it from its battery backed cache. If the disks are lying...

The only solid way to find out whether this is an issue with your SSDs is to do plug-pull testing.


Yes, I am aware of this issue. I have never experienced it on either the Intel 520 or the OCZ Vertex 3.
Have you heard of any such reports on this list?
-- Craig Ringer                   http://www.2ndQuadrant.com/
   PostgreSQL Development, 24x7 Support, Training & Services

Re: Do I have a hardware or a software problem?

От
David Boreham
Дата:
On 12/11/2012 7:13 PM, Evgeny Shishkin wrote:
> Yes, i am aware of this issue. Never experienced this neither on intel
> 520, no ocz vertex 3.
> Have you heard of them on this list?
People have done plug-pull tests and reported the results on the list
(sometime in the past couple of years).

But you don't need to do the test to know these drives are not safe.
They're unsafe by design.




Re: Do I have a hardware or a software problem?

От
Evgeny Shishkin
Дата:
On Dec 12, 2012, at 6:15 AM, David Boreham <david_list@boreham.org> wrote:

> On 12/11/2012 7:13 PM, Evgeny Shishkin wrote:
>> Yes, i am aware of this issue. Never experienced this neither on intel 520, no ocz vertex 3.
>> Have you heard of them on this list?
> People have done plug-pull tests and reported the results on the list (sometime in the past couple of years).
>
> But you don't need to do the test to know these drives are not safe. They're unsafe by design.
>

Oh, there is no 100% safe system; the only way to be sure is to read the data back.
Everything in system design is a tradeoff between cost and risk.




Re: Do I have a hardware or a software problem?

От
David Boreham
Дата:
On 12/11/2012 7:20 PM, Evgeny Shishkin wrote:
> Oh, there is no 100% safe system.
In this case we're discussing specifically "safety in the event of power
loss shortly after the drive indicates to the controller that it has
committed a write operation". Some drives do provide 100% safety against
this event, and they don't cost much more than those that don't.




Re: Do I have a hardware or a software problem?

От
Evgeny Shishkin
Дата:
On Dec 12, 2012, at 6:26 AM, David Boreham <david_list@boreham.org> wrote:

> On 12/11/2012 7:20 PM, Evgeny Shishkin wrote:
>> Oh, there is no 100% safe system.
> In this case we're discussing specifically "safety in the event of power loss shortly after the drive indicates to the controller that it has committed a write operation". Some drives do provide 100% safety against this event, and they don't cost much more than those that don't.

Which drives would you recommend? Besides intel 320 and 710.





Re: Do I have a hardware or a software problem?

От
David Boreham
Дата:
On 12/11/2012 7:38 PM, Evgeny Shishkin wrote:
> Which drives would you recommend? Besides intel 320 and 710.
Those are the only drive types we have deployed in servers at present
(almost all 710, but we have some 320 for less mission-critical
machines). The new DC-S3700 Series looks nice too, but isn't yet in the
sales channel :
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html





Re: Do I have a hardware or a software problem?

От
Craig Ringer
Дата:
On 12/12/2012 10:13 AM, Evgeny Shishkin wrote:

Yes, I am aware of this issue. I have never experienced it on either the Intel 520 or the OCZ Vertex 3.


I wouldn't trust either of those drives. The 520 doesn't have Intel's "Enhanced Power Loss Data Protection"; it's going to lose its buffers if it loses power. Similarly, the Vertex 3 doesn't have any kind of power protection. See:

http://www.intel.com/content/www/us/en/solid-state-drives/ssd-320-series-power-loss-data-protection-brief.html
http://ark.intel.com/products/family/56572/Intel-SSD-500-Family

http://www.ocztechnology.com/res/manuals/OCZ_SSD_Breakdown_Q2-11_1.pdf

The only way I'd use those for a production server would be if I had synchronous replication running to another machine with trustworthy, durable storage - and if I didn't mind some downtime to restore the corrupt DB from the replica after power loss.
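
For reference, a minimal sketch of that kind of setup on 9.2 - the standby name and connection details are only placeholders, not a recipe:

# primary postgresql.conf
wal_level = hot_standby
max_wal_senders = 2
synchronous_standby_names = 'standby1'   # commits wait for this standby to confirm
synchronous_commit = on                  # must be on for the synchronous guarantee

# recovery.conf on the standby
standby_mode = 'on'
primary_conninfo = 'host=primary-host user=replication application_name=standby1'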

-- Craig Ringer                   http://www.2ndQuadrant.com/
   PostgreSQL Development, 24x7 Support, Training & Services

Re: Do I have a hardware or a software problem?

От
Evgeny Shishkin
Дата:

On Dec 12, 2012, at 6:41 AM, David Boreham <david_list@boreham.org> wrote:

On 12/11/2012 7:38 PM, Evgeny Shishkin wrote:
Which drives would you recommend? Besides intel 320 and 710.
Those are the only drive types we have deployed in servers at present (almost all 710, but we have some 320 for less mission-critical machines). The new DC-S3700 Series looks nice too, but isn't yet in the sales channel :
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html

Yeah, the S3700 looks promising, but the SATA interface is a limiting factor for this drive.
I'm looking towards the SMART SSD: http://www.storagereview.com/smart_storage_systems_optimus_sas_enterprise_ssd_review
but I haven't heard of it anywhere else.






Re: Do I have a hardware or a software problem?

От
Mark Kirkwood
Дата:
On 12/12/12 15:41, David Boreham wrote:
> On 12/11/2012 7:38 PM, Evgeny Shishkin wrote:
>> Which drives would you recommend? Besides intel 320 and 710.
> Those are the only drive types we have deployed in servers at present
> (almost all 710, but we have some 320 for less mission-critical
> machines). The new DC-S3700 Series looks nice too, but isn't yet in
> the sales channel :
> http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html
>
>
>
>
>
>
+1

The s3700 is probably the one to get (when it is available). I'd opt for
the 710 if you need something now. I'd avoid the 320 - we have
encountered the firmware bug whereby you get an 8MB (yes 8MB) capacity
after powerdown with a depressingly large number of them (they were
updated to the latest firmware too).
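
If you want to check which firmware a given drive is actually running, something like this should show it (smartctl is from the smartmontools package; the device name is just an example):

sudo smartctl -i /dev/sda    # prints device model, serial number and firmware version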

Regards

Mark


Re: Do I have a hardware or a software problem?

От
David Boreham
Дата:
On 12/11/2012 7:49 PM, Evgeny Shishkin wrote:
> Yeah, s3700 looks promising, but sata interface is limiting factor for
> this drive.
> I'm looking towards SMART ssd
> http://www.storagereview.com/smart_storage_systems_optimus_sas_enterprise_ssd_review
>
What don't you like about SATA ?

I prefer to avoid SAS drives if possible due to the price premium for
dubious benefits besides vague hand-waving "enterprise-ness" promises.




Re: Do I have a hardware or a software problem?

От
Evgeny Shishkin
Дата:

On Dec 12, 2012, at 7:05 AM, David Boreham <david_list@boreham.org> wrote:

On 12/11/2012 7:49 PM, Evgeny Shishkin wrote:
Yeah, s3700 looks promising, but sata interface is limiting factor for this drive.
I'm looking towards SMART ssd http://www.storagereview.com/smart_storage_systems_optimus_sas_enterprise_ssd_review

What don't you like about SATA ?

I prefer to avoid SAS drives if possible due to the price premium for dubious benefits besides vague hand-waving "enterprise-ness" promises.



Quoting http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review

Intel makes the case that the S3700 is the ideal drive for entry, mainstream and performance enterprise computing, including HPC use cases. The claim is bold, largely because of the decision to go with a SATA interface, which has several limitations in the enterprise. The SATA interface tops out at a queue depth of 32 (SAS scales as high as 256 in most cases), which means that when requests go above that level, average and peak latency spike, as we saw in all of our workloads.

Another huge advantage of SAS is the ability to offer dual-port modes for high availability scenarios, where there are two controllers interfacing with the same drive at the same time. In the event one goes offline, the connection with the SSD is not lost, as it would with a standard SATA interface without additional hardware. Some SAS drives also offer wide-port configurations used to increase total bandwidth above a single-link connection. While the Intel SSD DC S3700 against other SATA competitors is very fast, the story changes when you introduce the latest MLC and SLC-based SAS SSDs, which can cope better with increased thread and queue levels.

We picked the primary post-preconditioning sections of our benchmarks after each SSD had reached steady-state. For the purposes of this section, we added the Intel SSD DC S3700 onto the throughput charts of the newest SAS high-performance SSDs. There are also significant latency differences at higher queue depths that play a significant factor, but for the sake of easy comparison we stick with raw I/O speed across varying thread and queue counts.

In a 100% 4K random write or random read scenario, the Intel SSD DC 3700 performs quite well up against the high-end SAS competition, with the second fastest 4K steady-state speed. When you switch focus to read throughput at a heavy 16T/16Q load it only offers 1/2 to 1/3 the performance of SSDs in this category.

http://www.storagereview.com/images/intel_ssd_dc_s3700_main_slc_4kwrite_throughput.png






Re: Do I have a hardware or a software problem?

От
David Boreham
Дата:
On 12/11/2012 8:11 PM, Evgeny Shishkin wrote:
>
> Quoting
> http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review
Heh. A fine example of the kind of hand-waving of which I spoke ;)

Higher performance is certainly a benefit, although at present we can't
saturate even a single 710 series drive (the application, CPU, OS, etc
is the bottleneck). Similarly while dual-porting certainly has its uses,
it is not something I need.





Re: Do I have a hardware or a software problem?

От
Niels Kristian Schjødt
Дата:
Well, in fact I do (as you can see from my configuration). I have a similar server running with hot standby replication - and it runs two 3TB HDDs in a RAID1 array.

So - is it still very bad if I choose to put four Intel 520 disks in a RAID10 array on the other production server?

On 12/12/2012, at 03.47, Craig Ringer <craig@2ndquadrant.com> wrote:

On 12/12/2012 10:13 AM, Evgeny Shishkin wrote:

Yes, I am aware of this issue. I have never experienced it on either the Intel 520 or the OCZ Vertex 3.


I wouldn't trust either of those drives. The 520 doesn't have Intel's "Enhanced Power Loss Data Protection"; it's going to lose its buffers if it loses power. Similarly, the Vertex 3 doesn't have any kind of power protection. See:

http://www.intel.com/content/www/us/en/solid-state-drives/ssd-320-series-power-loss-data-protection-brief.html
http://ark.intel.com/products/family/56572/Intel-SSD-500-Family

http://www.ocztechnology.com/res/manuals/OCZ_SSD_Breakdown_Q2-11_1.pdf

The only way I'd use those for a production server would be if I had synchronous replication running to another machine with trustworthy, durable storage - and if I didn't mind some downtime to restore the corrupt DB from the replica after power loss.

-- Craig Ringer                   http://www.2ndQuadrant.com/
   PostgreSQL Development, 24x7 Support, Training & Services

Re: Do I have a hardware or a software problem?

От
Niels Kristian Schjødt
Дата:
On 11/12/2012, at 18.25, Jeff Janes <jeff.janes@gmail.com> wrote:

> On Tue, Dec 11, 2012 at 2:04 AM, Niels Kristian Schjødt
> <nielskristian@autouncle.com> wrote:
>> Den 11/12/2012 kl. 00.58 skrev Jeff Janes <jeff.janes@gmail.com>:
>>
>>>
>>> The fact that there is much more writing than reading tells me that
>>> most of your indexes are in RAM.  The amount of index you are rapidly
>>> reading and dirtying is large enough to fit in RAM, but is not large
>>> enough to fit in shared_buffers + kernel's dirty-buffer comfort level.
>
>> Maybe I should mention, that I never see more than max 5Gb out of my total 32Gb being in use on the server… Can I somehow utilize more of it?
>
> What tool do you use to determine that?  Is that on top of the 4GB
> shared_buffers, or including it?

Okay, I might not have made myself clear - I was talking about "physical" memory utilization. Here are the stats:
free -m
             total       used       free     shared    buffers     cached
Mem:         32075      25554       6520          0         69      22694
-/+ buffers/cache:       2791      29284
Swap:         2046        595       1451
>
> How big is your entire data set?  Maybe all your data fits in 5GB
> (believable, as all your indexes listed below sum to < 2.5GB) so there
> is no need to use more.

It doesn't - we are a search engine for used cars, and there are quite a lot of those out there :-) However, my indexes are almost all partial indexes, which means that they are only on cars which are still for sale, so in that sense the indexes themselves don't really grow, but the tables do.

>
> Or maybe you have hit an bug in the 3.2 kernel.  At least one of those
> has been frequently discussed.
>
Might be true - but likely?
>
>>> You could really crank up shared_buffers or vm.dirty_background_ratio,
>>> but doing so might cause problems with checkpoints stalling and
>>> latency spikes.  That would probably not be a problem during the
>>> night, but could be during the day.
>
>> What do you have in mind here? Tweaking what parameters to what values?
>
> I'd set shared_buffers to 20GB (or 10GB, if that will hold all of your

I had that before; Shaun suggested that I change it to 4GB, as he was talking about a strange behavior when it is larger than that on 12.04. But I can say that there has not been any notable difference between having it at 4GB and at 8GB.

> data) and see what happens.  And probably increase checkpoint_timeout
> and checkpoint_segments about 3x each. Also, turn on log_checkpoints
> so you can see what kinds of problem those changes may be causing
> there (i.e. long sync times).  Preferably you do this on some kind of
> pre-production or test server.
>
> But if your database is growing so rapidly that it soon won't fit on
> 240GB, then cranking up shared_buffers won't do for long.  If you can
> get your tables and all of their indexes clustered together, then you
> can do the updates in an order that makes IO more efficient.  Maybe
> partitioning would help.

Can you explain a little more about this, or provide me a good link?
>
>
>>> I don't know how big each disk is, or how big your various categories
>>> of data are.  Could you move everything to SSD?  Could you move all
>>> your actively updated indexes there?
>
>> With table spaces you mean?
>
> Yes.  Or moving everything to SSD if it fits, then you don't have go
> through and separate objects.
>
> The UPDATE you posted in a previous thread looked like the table
> blocks might also be getting dirtied in a fairly random order, which
> means the table blocks are in the same condition as the index blocks
> so maybe singling out the indexes isn't warranted.
>
> Cheers,
>
> Jeff



Re: Do I have a hardware or a software problem?

От
Craig Ringer
Дата:
On 13/12/2012 12:22 AM, Niels Kristian Schjødt wrote:
Well, in fact I do (as you can see from my configuration). I have a similar server running with hot standby replication - and it runs two 3TB HDDs in a RAID1 array.

So - is it still very bad if I choose to put four Intel 520 disks in a RAID10 array on the other production server?
So long as you have it recording to a synchronous replica on another machine and you're fully prepared to accept the small risk that you'll have total and unrecoverable data corruption on that server, with the corresponding downtime while you rebuild it from the replica, it should be OK.

Alternately, you could use PITR with a base backup, shipping WAL to another machine or a reliable HDD, so you can recover all but the last checkpoint_timeout minutes of data from the base backup + WAL. There's a small window of data loss that way, but you don't need a second machine as a streaming-replication follower. barman is worth checking out as a management tool for PITR backups.
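
A rough sketch of that approach - host names and paths below are placeholders only:

# take a base backup of the primary, run from the backup host (9.2 syntax):
pg_basebackup -h primary-host -U replication -D /backups/base -X fetch -P

# recovery.conf used when restoring from that base backup plus archived WAL:
restore_command = 'cp /path/to/archived_wals/%f %p'
# optionally set recovery_target_time to stop the replay at a given point in time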

If the data is fairly low-value you could even just take nightly backups and accept the risk of losing some data.

--
Craig Ringer

Re: Do I have a hardware or a software problem?

От
Jeff Janes
Дата:
On Wed, Dec 12, 2012 at 8:46 AM, Niels Kristian Schjødt
<nielskristian@autouncle.com> wrote:
>
> Den 11/12/2012 kl. 18.25 skrev Jeff Janes <jeff.janes@gmail.com>:
>
>> On Tue, Dec 11, 2012 at 2:04 AM, Niels Kristian Schjødt
>> <nielskristian@autouncle.com> wrote:
>>
>>> Maybe I should mention, that I never see more than max 5Gb out of my total 32Gb being in use on the server… Can I somehow utilize more of it?
>>
>> What tool do you use to determine that?  Is that on top of the 4GB
>> shared_buffers, or including it?
>
> Okay I might not have made myself clear, I was talking "physical" memory utilization. Here is the stats:
> free -m
>              total       used       free     shared    buffers     cached
> Mem:         32075      25554       6520          0         69      22694
> -/+ buffers/cache:       2791      29284
> Swap:         2046        595       1451

I don't know how you get 5 Gig from that, though.  You have 22 Gig of
cached file-system, which for your purposes probably counts as being
utilized.  Although we don't know how much of this is for postgres
data files, chances are it is a pretty good chunk.


>>
>> How big is your entire data set?  Maybe all your data fits in 5GB
>> (believable, as all your indexes listed below sum to < 2.5GB) so there
>> is no need to use more.
>
> It doesn't - we are a search engine for used cars, and there are quite a lot of those out there :-)

But how big?  More than 22GB?  (you can use \l+ in psql, or du -s on
the data directory)
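
For example ("cars" is only a guess at a table name here):

-- per-database size, from psql:
SELECT pg_size_pretty(pg_database_size(current_database()));

-- one table including its indexes and TOAST data:
SELECT pg_size_pretty(pg_total_relation_size('cars'));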

> However, my indexes are almost all partial indexes, which means that they are only on cars which are still for sale, so in that sense the indexes themselves don't really grow, but the tables do.

So maybe this reverses things.  If your car table is huge and the
active cars are scattered randomly among all the inactive ones, then
updating random active cars is going to generate a lot of randomly
scattered writing which can't be combined into sequential writes.

Do you have plans for archiving cars that are no longer for sale?  Why
do you keep them around in the first place, i.e. what types of queries
do you do on inactive ones?

Unfortunately you currently can't use CLUSTER with partial indexes,
otherwise that might be a good idea.  You could build a full index on
whatever it is you use as the criterion for the partial indexes,
cluster on that, and then drop it.
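
In SQL terms, something like this - table and column names are made up, adjust to whatever the partial indexes actually use:

CREATE INDEX cars_for_sale_full_idx ON cars (for_sale);   -- full index on the partial-index predicate column
CLUSTER cars USING cars_for_sale_full_idx;                -- rewrites the table in that order (takes an exclusive lock)
ANALYZE cars;
DROP INDEX cars_for_sale_full_idx;                        -- the physical ordering persists until rows churn again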

But the table would eventually become unclustered again, so if this
works you might want to implement partitioning between active and
inactive partitions so as to maintain the clustering.


>>>> You could really crank up shared_buffers or vm.dirty_background_ratio,
>>>> but doing so might cause problems with checkpoints stalling and
>>>> latency spikes.  That would probably not be a problem during the
>>>> night, but could be during the day.
>>
>>> What do you have in mind here? Tweaking what parameters to what values?
>>
>> I'd set shared_buffers to 20GB (or 10GB, if that will hold all of your
>
> I had that before; Shaun suggested that I change it to 4GB, as he was talking about a strange behavior when it is larger than that on 12.04. But I can say that there has not been any notable difference between having it at 4GB and at 8GB.

It is almost an all or nothing thing.  If you need 16 or 20GB, just
going from 4 to 8 isn't going to show much difference.  If you can
test this easily, I'd just set it to 24 or even 28GB and run the bulk
update.  I don't think you'd want to run a server permanently at those
settings, but it is an easy way to rule in or out different theories
about what is going on.
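
Pulled together, the test settings would look roughly like this in postgresql.conf (illustrative values only, not something to run permanently):

shared_buffers = 24GB            # on 9.2 kernel.shmmax must be raised above this first
checkpoint_segments = 300        # roughly 3x the current value
checkpoint_timeout = 30min       # roughly 3x the current value
log_checkpoints = on             # watch the logged write/sync times for stalls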

>> But if your database is growing so rapidly that it soon won't fit on
>> 240GB, then cranking up shared_buffers won't do for long.  If you can
>> get your tables and all of their indexes clustered together, then you
>> can do the updates in an order that makes IO more efficient.  Maybe
>> partitioning would help.
>
> Can you explain a little more about this, or provide me a good link?

If all your partial btree indexes are using the same WHERE clause,
then your indexes are already clustered together in a sense--a partial
index is kind of like a composite index with the WHERE clause as the
first column.

So the trick would be to get the table to be clustered on the same
thing--either by partitioning or by the CLUSTER command, or something
equivalent to those.  I don't know of a good link, other than the
documentation (which is more about how to do it, rather than why you
would want to or how to design it)
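
To make the analogy concrete with made-up names (a cars table with a for_sale flag and a price column):

-- partial index: only rows still for sale are indexed
CREATE INDEX cars_price_for_sale_idx ON cars (price) WHERE for_sale;

-- roughly analogous composite index: for_sale behaves like a leading column
CREATE INDEX cars_for_sale_price_idx ON cars (for_sale, price);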

Cheers,

Jeff


Re: Do I have a hardware or a software problem?

От
Scott Marlowe
Дата:
On Tue, Dec 11, 2012 at 6:03 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 12/12/2012 06:44 AM, Evgeny Shishkin wrote:
>
>
> On Dec 12, 2012, at 2:41 AM, Niels Kristian Schjødt
> <nielskristian@autouncle.com> wrote:
>
> Are you using a hardware based raid controller with them?
>
> Yes, of course. Hardware raid with cache and bbu is a must. You can't get
> fast fsync without it.
>
>
> Most SSDs should offer fairly fast fsync without a hardware RAID controller,
> as they do write-back caching. The trick is to find ones that do write-back
> caching safely, so you don't get severe data corruption on power-loss.
>
> A HW RAID controller is an absolute must for rotating magnetic media,
> though.
>
>
> Also mdadm is a pain in the ass and is suitable only on amazon and other
> cloud shit.
>
>
> I've personally been pretty happy with mdadm. I find the array portability
> it offers very useful, so I don't need to buy a second RAID controller just
> in case my main controller dies and I need a compatible one to get the array
> running again. If you don't need a BBU for safe write-back caching then
> mdadm has advantages over hardware RAID.
>
> I'll certainly use mdadm over onboard fakeraid solutions or low-end hardware
> RAID controllers. I suspect a mid- to high end HW RAID unit will generally
> win.

Also for sequential throughput md RAID is usually faster than most
RAID controllers, even the high end Areca and LSI ones.
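
If you want to sanity-check that on a given box, a quick-and-dirty sketch (device and file names are examples; put the dd file on a scratch filesystem):

sudo hdparm -t /dev/md2                                              # buffered sequential reads from the md array
dd if=/dev/zero of=/mnt/md2/ddtest bs=1M count=4096 conv=fdatasync   # ~4GB sequential write, synced at the end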