Discussion: Performance issues with SSD volume?
Hi everyone,

Due to the increasing load of my application, I recently moved my production server from a machine with SAS 15k HDDs to one with Intel SSDs (for the PostgreSQL data partition only). Both servers use RAID 10. The rest of the hardware is approximately the same (maybe slightly more recent on the new server). I did not change postgresql.conf.

At first, I ran the new server as a hot standby slave for several weeks (I have a master/slave configuration). Since everything looked good performance-wise in the slave role (latency on the /data partition decreased from ~60ms to ~3ms, and device utilization decreased significantly too), I decided to swap the server roles and promote the SSD server to master. (I also ran several benchmarks on both configurations, and the SSD config was always better than the old HDD config.)

After the change, I saw the following behavior, and I don't understand why: everything seems to work fine (load is ~6/7, where it was previously ~25/30 on the HDD server), so the SSD server is faster than the HDD one and my apps run faster too. But after some time (anywhere from 5 minutes to 2 hours), the load average suddenly increases (it can reach 150!) and does not come back down, so Postgres and my application become almost unusable (even small requests hit the statement timeout). The only way to bring the load down is to restart the pooler (pgpool) so that all requests are stopped; Postgres is then accessible again, for a while...

I tried adapting postgresql.conf parameters to the SSD config (random_page_cost = 1), but that did not change the problem. When I look at disk I/O, the RAID volume is 100% busy, which is weird because it was around 20/30% just before. Transactions per second at that moment are very low (<10).

I suspect a cache issue, but I can't explain why loading data should take longer on SSDs than on HDDs, and I don't know how to find the origin of the problem.

Does anyone have an idea :) ?

Sorry for my poor English. Thanks,
Thomas
On Mon, May 18, 2015 at 10:05 AM, Thomas SIMON <tsimon@neteven.com> wrote:
After the change, I saw the following behavior, and I don't understand why: everything seems to work fine (load is ~6/7, where it was previously ~25/30 on the HDD server), so the SSD server is faster than the HDD one and my apps run faster too. But after some time (anywhere from 5 minutes to 2 hours), the load average suddenly increases (it can reach 150!) and does not come back down, so Postgres and my application become almost unusable (even small requests hit the statement timeout).
Before we go mad, what is the OS/kernel/distro that you are using? Did it change from the previous one?
If Linux, have you disabled transparent_huge_pages and zone_reclaim_mode (two known problematic Linux features)?
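For anyone following along, checking and disabling those two at runtime usually looks like the sketch below (exact sysfs paths can vary by distro, and these settings do not persist across reboots):

cat /sys/kernel/mm/transparent_hugepage/enabled    # shows e.g. [always] madvise never
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
sysctl vm.zone_reclaim_mode                        # 0 means disabled
sysctl -w vm.zone_reclaim_mode=0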
Best regards,
--
Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres
Hi Matheus,
Thanks for the reply.
The kernels are slightly different: 3.10.23-xxxx-std-ipv6-64 (an OVH kernel) on the HDD server, and 3.2.65-1+deb7u1 on the SSD server.
I'm using Debian wheezy on both servers.
I didn't change the two parameters you mention.
Thanks,
Thomas
Hi Thomas,
The 3.2 kernel may be the #1 cause of your I/O load problem:

http://www.databasesoup.com/2014/09/why-you-need-to-avoid-linux-kernel-32.html
https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4
Have a nice day.
Koray
Hi Thomas,
This is just a wild guess. Are you using hardware RAID? Is the cache battery still working?
Cheers
Wow, thanks a lot for the advice, Koray!
This is a huge difference indeed!
I will start by upgrading my kernel ASAP :)
Thomas
Hi Wei,
Yes, I'm using hardware RAID (a MegaRAID card); the battery is OK, I had already checked the BBU :)
Cheers
Thomas
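For reference, a typical MegaCli BBU check looks something like the sketch below (adapter ID 0 is an assumption):

MegaCli64 -AdpBbuCmd -GetBbuStatus -a0          # battery state, charge, temperature
MegaCli64 -AdpBbuCmd -GetBbuCapacityInfo -a0    # remaining/full charge capacity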
> From: Koray Eyidoğan <korayey@gmail.com>
> Sent: Monday, 18 May 2015, 14:51
>
> 3.2 kernel may be #1 cause of your I/O load problem.

Likely the 3.2 kernel isn't helping, but I think we need much more information before jumping to conclusions.

You say you're IO bound, so some output from sar / iostat / dstat and pg_stat_activity etc. before and during the issue would be of use.

Also:
System memory size
SSD model numbers and how many
RAID controller
Partition alignments and stripe sizes
Kernel options
Filesystem used and mount options
IO scheduler
PostgreSQL version and configuration
Connection pool sizing

Perhaps you could throw us the output of some of these:

fdisk -l
cat /etc/mtab
cat /sys/block/<ssd device>/queue/scheduler
sysctl kernel | grep sched
sysctl vm
select * from pg_stat_activity
select name, setting from pg_settings
lspci | grep -E 'RAID|SCSI|IDE|SATA'
Hi Glyn,

I'll try to answer these points.

I've made some benchmarks, and indeed 3.2 is not helping. Not helping at all. I changed to 3.14 and the gap is quite big!
With the pgbench RW test: 3.2 --> 4200 TPS; 3.14 --> 6900 TPS under the same conditions.
With the pgbench RO test: 3.2 --> 37000 TPS; 3.14 --> 95000 TPS, same conditions too.

So things should be better, but when the server was in production, even with the bad kernel, performance was already quite good before it quickly degraded. So I too think I have another configuration problem.

> You say you're IO bound, so some output from sar / iostat / dstat and pg_stat_activity etc. before and during the issue would be of use.

My server is not in production right now, so it is difficult to replay the production load and get useful metrics. The best way I've found is to replay traffic from the logs with pgreplay. I hoped the server would fall over again while replaying this traffic, but it never happens... another thing I can't understand...

Below is my dstat output while I replay this traffic (i.e. while the server runs normally). I unfortunately have no more output from when the server's performance degraded.

[17:31:12]root@db10:~$ dstat -alrmy
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- ---load-avg--- --io/total- ------memory-usage----- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw |  1m   5m  15m | read  writ| used  buff  cach  free| int   csw
  1   1  98   0   0   0|4870k   16M|   0     0 |   0     0 |4666    35k|4.68 1.48 0.56| 419   774 |6204M 47.1M 70.9G  175G|4666    35k
 10   4  81   5   0   0| 177M  368k| 510B 9674B|   0     0 |  32k   81k|5.11 1.62 0.61|14.7k 54.0 |6232M 47.1M 71.1G  175G|  32k   81k
  3   2  92   3   0   0|  81M 7568k| 224B  738B|   0     0 |  12k   28k|5.11 1.62 0.61|8687   822 |6233M 47.1M 71.1G  175G|  12k   28k
  4   2  92   2   0   0|  87M 7844k|1046B   10k|   0     0 |  12k   25k|5.11 1.62 0.61|8031   837 |6235M 47.1M 71.2G  175G|  12k   25k
 14   5  78   3   0   0| 155M 2972k| 556B 1084B|   0     0 |  28k   69k|5.11 1.62 0.61|12.7k  176 |6645M 47.1M 71.4G  174G|  28k   70k
 19   4  73   4   0   0| 172M 3012k|1082B 9182B|   0     0 |  33k   73k|5.11 1.62 0.61|12.5k  248 |6727M 47.1M 71.5G  174G|  33k   72k
 23   5  66   6   0   0| 247M   20M| 252B  800B|   0     0 |  42k   88k|5.42 1.75 0.65|20.1k  426 |6827M 47.1M 71.8G  173G|  42k   88k
 16   3  78   4   0   0| 119M 1384k| 744B 7680B|   0     0 |  28k   47k|5.42 1.75 0.65|11.3k  113 |6883M 47.1M 71.9G  173G|  28k   47k
  5   1  92   2   0   0|  55M 5952k|6960B 8210B|   0     0 |  16k   25k|5.42 1.75 0.65|6304   676 |6890M 47.1M 72.0G  173G|  16k   25k
  5   2  91   2   0   1|  56M  568k|9593B   13k|   0     0 |  16k   26k|5.42 1.75 0.65|6383  10.0 |6892M 47.1M 72.0G  173G|  16k   26k
  5   1  92   2   0   0|  51M    0 | 546B  888B|   0     0 |  14k   24k|5.42 1.75 0.65|6091     0 |6891M 47.1M 72.1G  173G|  14k   24k
 21   5  71   4   0   0| 110M 1416k| 384B 6874B|   0     0 |  31k   58k|6.18 1.97 0.73|11.1k  203 |7023M 47.1M 72.2G  173G|  31k   58k
 19   4  74   2   0   0| 110M 2364k| 312B  846B|   0     0 |  23k   44k|6.18 1.97 0.73|9399   221 |7055M 47.2M 72.3G  173G|  22k   44k
 26   6  65   3   0   0| 110M 6040k|1770B 7068B|   0     0 |  31k   59k|6.18 1.97 0.73|10.5k  781 |7255M 47.2M 72.4G  172G|  31k   59k
 23   5  69   3   0   0| 128M 2912k|1146B 1350B|   0     0 |  27k   52k|6.18 1.97 0.73|11.2k  240 |7314M 47.2M 72.5G  172G|  27k   52k

Other things you asked:

System memory size: 256 GB
SSD model numbers and how many: 4 SSD disks; RAID 10; model INTEL SSDSC2BB480G4
RAID controller: MegaRAID SAS 2208
Partition alignments and stripe sizes: see fdisk below
Kernel options: the config file is here: ftp://ftp.ovh.net/made-in-ovh/bzImage/3.14.43/config-3.14.43-xxxx-std-ipv6-64
Filesystem used and mount options: ext4, see mtab below
IO scheduler: noop [deadline] cfq for my SSD RAID volume
PostgreSQL version and configuration: 9.3.5
max_connections = 1800
shared_buffers = 8GB
temp_buffers = 32MB
work_mem = 100MB
maintenance_work_mem = 12GB
bgwriter_lru_maxpages = 200
effective_io_concurrency = 4
wal_level = hot_standby
wal_sync_method = fdatasync
wal_writer_delay = 2000ms
commit_delay = 1000
checkpoint_segments = 80
checkpoint_timeout = 15min
checkpoint_completion_target = 0.7
archive_command = 'rsync ....'
max_wal_senders = 10
wal_keep_segments = 38600
vacuum_defer_cleanup_age = 100
hot_standby = on
max_standby_archive_delay = 5min
max_standby_streaming_delay = 5min
hot_standby_feedback = on
random_page_cost = 1.0
effective_cache_size = 240GB
log_min_error_statement = warning
log_min_duration_statement = 0
log_checkpoints = on
log_connections = on
log_disconnections = on
log_line_prefix = '%m|%u|%d|%c|'
log_lock_waits = on
log_statement = 'all'
log_timezone = 'localtime'
track_activities = on
track_functions = pl
track_activity_query_size = 8192
autovacuum_max_workers = 5
autovacuum_naptime = 30s
autovacuum_vacuum_threshold = 40
autovacuum_analyze_threshold = 20
autovacuum_vacuum_scale_factor = 0.10
autovacuum_analyze_scale_factor = 0.10
autovacuum_vacuum_cost_delay = 5ms
default_transaction_isolation = 'read committed'
max_locks_per_transaction = 128

Connection pool sizing (pgpool2):
num_init_children = 1790
max_pool = 1

I also add the MegaCli parameters:

Virtual Drive: 2 (Target Id: 2)
Name: datassd
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size: 893.25 GB
Sector Size: 512
Is VD emulated: Yes
Mirror Data: 893.25 GB
State: Optimal
Strip Size: 256 KB
Number Of Drives per span: 2
Span Depth: 2
Default Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy: Enabled
Encryption Type: None
Bad Blocks Exist: No
PI type: No PI
Is VD Cached: No

Other outputs:

fdisk -l

Disk /dev/sdc: 959.1 GB, 959119884288 bytes
255 heads, 63 sectors/track, 116606 cylinders, total 1873281024 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/mapper/vg_datassd-lv_datassd: 751.6 GB, 751619276800 bytes
255 heads, 63 sectors/track, 91379 cylinders, total 1468006400 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

cat /etc/mtab
/dev/mapper/vg_datassd-lv_datassd /datassd ext4 rw,relatime,discard,nobarrier,data=ordered 0 0
(I added the nobarrier option)

cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq

sysctl kernel | grep sched
kernel.sched_child_runs_first = 0
kernel.sched_rr_timeslice_ms = 25
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000

I've read some advice about setting kernel.sched_autogroup_enabled=0 and kernel.sched_migration_cost_ns=5000000, but these parameters are not recognized by my kernel, so I don't know what to do with that...
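One hedged way to check which sched_* knobs a given kernel actually exposes before trying to set them (knob names vary across kernel versions):

ls /proc/sys/kernel/ | grep '^sched'                # knobs this kernel exposes
sysctl -a 2>/dev/null | grep sched_migration        # prints nothing if the knob doesn't exist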
sysctl vm
vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.dirty_background_bytes = 8388608
vm.dirty_background_ratio = 0
vm.dirty_bytes = 67108864
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 0
vm.dirty_writeback_centisecs = 500
vm.drop_caches = 3
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256 256 32
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 65008
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 2
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.scan_unevictable_pages = 0
vm.stat_interval = 1
vm.swappiness = 60
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0

select * from pg_stat_activity
I've got hundreds of entries for this when I'm in production, and I can't paste it here for confidentiality reasons. It is usually around 50 million queries per day (35% selects, 55% updates, 5% inserts).

lspci | grep -E 'RAID|SCSI|IDE|SATA'
00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 06)
02:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
07:00.0 Serial Attached SCSI controller: Intel Corporation C602 chipset 4-Port SATA Storage Control Unit (rev 06)

Thanks

Thomas
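As an aside, one way around the confidentiality problem above is to share only aggregate counts from pg_stat_activity rather than query text; a minimal sketch (connection details are assumptions):

psql -d postgres -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY 2 DESC;"
psql -d postgres -c "SELECT waiting, count(*) FROM pg_stat_activity GROUP BY waiting;"   # 9.3 still has the boolean waiting column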
> From: Thomas SIMON <tsimon@neteven.com>
> Sent: Wednesday, 20 May 2015, 16:41
>
> I've made some benchmarks, and indeed 3.2 is not helping. Not helping at all.
> I changed to 3.14 and the gap is quite big!
> With the pgbench RW test: 3.2 --> 4200 TPS; 3.14 --> 6900 TPS under the same conditions.
> With the pgbench RO test: 3.2 --> 37000 TPS; 3.14 --> 95000 TPS, same conditions too.

That's a start then.

> My server is not in production right now, so it is difficult to replay the production load and get useful metrics. [...] I unfortunately have no more output from when the server's performance degraded.

It's a shame we can't get any insight into activity on the server during the issues.
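For reference, pgbench runs of the kind quoted above might look like the sketch below; the scale factor, client count, and duration are assumptions, since the thread doesn't state them:

pgbench -i -s 1000 bench                 # initialize a test database at scale 1000
pgbench -c 32 -j 8 -T 300 bench          # read/write (TPC-B-like) run
pgbench -S -c 32 -j 8 -T 300 bench       # read-only (SELECT-only) run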
> Connection pool sizing (pgpool2)
> num_init_children = 1790
> max_pool = 1

1800 is quite a lot of connections, and with max_pool=1 in pgpool you're effectively just using pgpool as a proxy (as I recall, my memory is a little fuzzy on pgpool now). Unless your app is stateful in some way or has unique users for each of those 1800 connections, you should lower the quantity of active connections. A general starting point is usually cpu cores * 2, so you could raise max_pool and divide num_init_children by the same amount.

Hard to say what you need to do without knowing what exactly you're doing though. What's the nature of the app(s)?

> Default Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
> Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

Not using your raid controller's write cache then? Not sure just how important that is with SSDs these days, but if you've got a BBU set it to "WriteBack". Also change "Cache if Bad BBU" to "No Write Cache if Bad BBU" if you do that.

> cat /etc/mtab
> /dev/mapper/vg_datassd-lv_datassd /datassd ext4 rw,relatime,discard,nobarrier,data=ordered 0 0

You could swap relatime for noatime,nodiratime.
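As a rough shell sketch of that cpu cores * 2 rule of thumb (the core count is an assumption; the thread never states it):

CORES=$(nproc)                                       # say 32 on this class of machine
echo "target active connections ~= $((CORES * 2))"   # -> 64, versus the current 1790
# e.g. num_init_children = 64 with max_pool = 1, or num_init_children
# divided and max_pool raised by the same factor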
> I've read some advice about setting kernel.sched_autogroup_enabled=0 and kernel.sched_migration_cost_ns=5000000, but these parameters are not recognized by my kernel. So I don't know what to do with that...

sched_migration_cost_ns would be called sched_migration_cost in your old 3.2 kernel; not sure why sched_autogroup_enabled wouldn't be recognized though.
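Following the relatime -> noatime,nodiratime suggestion above, a hedged way to trial it without editing /etc/fstab (device and mount point taken from the thread's mtab output):

mount -o remount,noatime,nodiratime /datassd
grep datassd /proc/mounts                       # verify the active mount options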
----- Original Message -----
> From: Glyn Astill <glynastill@yahoo.co.uk>
> To: Thomas SIMON <tsimon@neteven.com>
> Sent: Wednesday, 20 May 2015, 17:50
>
> You could swap relatime for noatime,nodiratime.

You could also see if the noop scheduler makes any improvement.
On 20/05/2015 18:50, Glyn Astill wrote:

> 1800 is quite a lot of connections, and with max_pool=1 in pgpool you're effectively just using pgpool as a proxy. [...] Hard to say what you need to do without knowing what exactly you're doing though. What's the nature of the app(s)?

Yes, we just use it as a proxy for now. We have approximately 100 different active users, each making a varying number of connections (Twisted + Zope apps). The result is ~900 idle connections for ~60 active connections, but sometimes (when stopping/starting prod) we need almost double the connections, because some Twisted services don't close their connections immediately. But this is the current (working) configuration, and I don't think my disk performance problem is related to it.

> Not using your raid controller's write cache then? Not sure just how important that is with SSDs these days, but if you've got a BBU set it to "WriteBack". Also change "Cache if Bad BBU" to "No Write Cache if Bad BBU" if you do that.

No, I had read some MegaCli docs about SSDs, and the advice was to use WriteThrough on SSD volumes (see http://wiki.mikejung.biz/LSI#Configure_LSI_Card_for_SSD_RAID, last section). The disks are already in "No Write Cache if Bad BBU" mode (it was just wrapped across two lines in my extract).

> You could swap relatime for noatime,nodiratime.

I'll swap to noatime, thanks.

> sched_migration_cost_ns would be called sched_migration_cost in your old 3.2 kernel, not sure why sched_autogroup_enabled wouldn't be recognized though.

I found "CONFIG_SCHED_AUTOGROUP is not set" in the config file for this kernel, so I guess that is the same as having it enabled=0, right? I haven't found any parameter related to migration_cost in the config file.
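For completeness, if one did want to try the WriteBack suggestion despite the LSI SSD guidance cited above, MegaCli commands along these lines should do it; the -L2 / -a0 IDs are read from the virtual drive output earlier in the thread, so verify them first (a sketch, not a recommendation):

MegaCli64 -LDSetProp WB -L2 -a0                 # switch VD 2 to WriteBack
MegaCli64 -LDSetProp NoCachedBadBBU -L2 -a0     # drop to WriteThrough if the BBU fails
MegaCli64 -LDInfo -L2 -a0                       # confirm "Current Cache Policy"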
The disk was already in noop mode:

cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq

Thomas

On 20/05/2015 19:03, Glyn Astill wrote:
> You could also see if the noop scheduler makes any improvement.
> From: Thomas SIMON <tsimon@neteven.com>
> To: Glyn Astill <glynastill@yahoo.co.uk>
> Cc: "pgsql-admin@postgresql.org" <pgsql-admin@postgresql.org>
> Sent: Thursday, 21 May 2015, 13:28
> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>
> disk was already in noop mode
>
> cat /sys/block/sdc/queue/scheduler
> noop [deadline] cfq

No, the output shows the i/o scheduler between the brackets. That says it's deadline for sdc.

Of course deadline may well be the better option for you, but it's worth testing.
> From: Thomas SIMON <tsimon@neteven.com>
> To: Glyn Astill <glynastill@yahoo.co.uk>
> Cc: "pgsql-admin@postgresql.org" <pgsql-admin@postgresql.org>
> Sent: Thursday, 21 May 2015, 13:12
> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>
> Le 20/05/2015 18:50, Glyn Astill a écrit :
>>
>>> From: Thomas SIMON <tsimon@neteven.com>
>>> To: glynastill@yahoo.co.uk
>>> Cc: "pgsql-admin@postgresql.org" <pgsql-admin@postgresql.org>
>>> Sent: Wednesday, 20 May 2015, 16:41
>>> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>>>
>>> Hi Glyn,
>>>
>>> I'll try to answer these points.
>>>
>>> I've run some benchmarks, and indeed 3.2 is not helping. Not helping at all.
>>> I changed to 3.14 and the gap is quite big!
>>> With the pgbench RW test: 3.2 --> 4200 TPS ; 3.14 --> 6900 TPS in the same
>>> conditions.
>>> With the pgbench RO test: 3.2 --> 37000 TPS ; 3.14 --> 95000 TPS, same
>>> conditions too.
>>
>> That's a start then.
>>
>>> So things should be better now, but when the server was in production, even
>>> with the bad kernel, performance was already quite good before it quickly
>>> degraded.
>>> So I too think I have another configuration problem.
>>>
>>> You say you're IO bound, so some output from sar / iostat / dstat and
>>> pg_stat_activity etc before and during the issue would be of use.
>>>
>>> -> My server is not in production right now, so it is difficult to
>>> replay the production load and get some useful metrics.
>>> The best way I've found is to replay traffic from the logs with pgreplay.
>>> I hoped the server would fall over while replaying this traffic, but it
>>> never happens ... Another thing I can't understand ...
>>>
>>> Below is my dstat output when I replay this traffic (and so when the server
>>> runs normally).
>>> I unfortunately have no more outputs from when the server's performance
>>> degraded.
>>
>> It's a shame we can't get any insight into activity on the server during
>> the issues.
>>
>>> Other things you asked:
>>>
>>> System memory size : 256 GB
>>> SSD model numbers and how many : 4 SSD disks ; RAID 10 ; model INTEL SSDSC2BB480G4
>>> Raid controller : MegaRAID SAS 2208
>>> Partition alignments and stripe sizes : see fdisk below
>>> Kernel options : the config file is here :
>>> ftp://ftp.ovh.net/made-in-ovh/bzImage/3.14.43/config-3.14.43-xxxx-std-ipv6-64
>>> Filesystem used and mount options : ext4, see mtab below
>>> IO scheduler : noop [deadline] cfq for my ssd raid volume
>>> Postgresql version and configuration : 9.3.5
>>>
>>> max_connections=1800
>>> shared_buffers=8GB
>>> temp_buffers=32MB
>>> work_mem=100MB
>>> maintenance_work_mem=12GB
>>> bgwriter_lru_maxpages=200
>>> effective_io_concurrency=4
>>> wal_level=hot_standby
>>> wal_sync_method=fdatasync
>>> wal_writer_delay=2000ms
>>> commit_delay=1000
>>> checkpoint_segments=80
>>> checkpoint_timeout=15min
>>> checkpoint_completion_target=0.7
>>> archive_command='rsync ....'
>>> [... remainder of postgresql.conf settings as quoted earlier ...]
>>>
>>> Connection pool sizing (pgpool2)
>>> num_init_children = 1790
>>> max_pool = 1
>>
>> 1800 is quite a lot of connections, and with max_pool=1 in pgpool you're
>> effectively just using pgpool as a proxy [...] What's the nature of the
>> app(s)?
>
> Yes, we just use it as a proxy for now.
> We have approximately 100 different active users, each making a varying
> number of connections (twisted + zope apps).
> The result is ~900 idle connections for ~60 active connections, but
> sometimes (when stopping/starting prod) we need almost double the
> connections, because some twisted services don't close their connections
> immediately.
> But this is the current (working) configuration, and I don't think my
> disk performance problem is related to this.

I think at this point you could do with going back and trying to reproduce the issue, then trace back up to pg_stat_activity to see what activity could be causing the disk i/o. I assume you've tried to reproduce the disk issues with a simple disk benchmark like bonnie++?
Le 21/05/2015 14:57, Glyn Astill a écrit :
>> [... full quote of the previous message trimmed ...]
> I think at this point you could do with going back and trying to reproduce
> the issue, then trace back up to pg_stat_activity to see what activity could
> be causing the disk i/o. I assume you've tried to reproduce the disk issues
> with a simple disk benchmark like bonnie++?

Yes, I think the same thing. I will probably do this tomorrow early in the morning.
I tried to reproduce the disk issues with different stress tests like bonnie, fio and tsung, and I used a more realistic scenario with pgreplay to replay my production traffic from the postgresql logfile.
However, I'm not sure how to diagnose the performance issues.
I mean, if I see the SSDs are 100% busy, how can I figure out why their behavior changed?

I'm asking myself another question, about the master/slave configuration.
For my test, I will put my SSD server back as a slave of the HDD server. After that, I will promote it to master.
In case I still have performance issues and must roll back, am I necessarily forced to completely rebuild my new slave (hdd) with pg_basebackup (and wait some hours for the files to transfer), or can I directly demote this old master to a slave without spending time on a rebuild (as the files should be the same on both servers)?

Thanks
Thomas
>> I think at this point you could do with going back and trying to reproduce
>> the issue, then trace back up to pg_stat_activity to see what activity could
>> be causing the disk i/o. I assume you've tried to reproduce the disk issues
>> with a simple disk benchmark like bonnie++?
>
> Yes, I think the same thing. I will probably do this tomorrow early
> in the morning.
> I tried to reproduce the disk issues with different stress tests like
> bonnie, fio and tsung, and I used a more realistic scenario with pgreplay
> to replay my production traffic from the postgresql logfile.
> However, I'm not sure how to diagnose the performance issues.
> I mean, if I see the SSDs are 100% busy, how can I figure out why their
> behavior changed?

Well, the disk benchmarks are purely to see what your disks are capable of, and to help with your initial tuning.

You need to trace back which processes are causing most of the IO you're seeing, as well as checking the postgresql logs; something like iotop, or dstat with the --top-bio option, might help you there.

You could also look at the pg_statio_user_tables view to narrow down which tables are being hit the hardest, which might give you some clues.

Also see here:
https://wiki.postgresql.org/wiki/Performance_Analysis_Tools
https://wiki.postgresql.org/wiki/Monitoring

> I'm asking myself another question, about the master/slave configuration.
> For my test, I will put my SSD server back as a slave of the HDD server.

Unless you've got a mainly read-only workload, you can't really test the slave properly that way, as all it's doing is replaying the wal.

> After that, I will promote it to master.
> In case I still have performance issues and must roll back, am I
> necessarily forced to completely rebuild my new slave (hdd) with
> pg_basebackup (and wait some hours for the files to transfer), or can I
> directly demote this old master to a slave without spending time on a
> rebuild (as the files should be the same on both servers)?

Yes, you will need to rebuild it, or look at pg_rewind:

https://github.com/vmware/pg_rewind
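As a sketch of that pg_statio_user_tables check, a query along these lines ranks tables by physical reads (the LIMIT is arbitrary, and it must be run in the application database):

    SELECT schemaname, relname, heap_blks_read, heap_blks_hit,
           round(100.0 * heap_blks_hit
                 / nullif(heap_blks_hit + heap_blks_read, 0), 1) AS heap_hit_pct
    FROM pg_statio_user_tables
    ORDER BY heap_blks_read DESC
    LIMIT 8;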
Le 21/05/2015 14:46, Glyn Astill a écrit :
>> From: Thomas SIMON <tsimon@neteven.com>
>> Sent: Thursday, 21 May 2015, 13:28
>>
>> disk was already in noop mode
>>
>> cat /sys/block/sdc/queue/scheduler
>> noop [deadline] cfq
>
> No, the output shows the i/o scheduler between the brackets. That says it's
> deadline for sdc.
>
> Of course deadline may well be the better option for you, but it's worth
> testing.

You're totally right. I changed the parameter to noop and redid the pgbench test, but saw no significant difference.
Le 21/05/2015 16:30, Glyn Astill a écrit :
> [...]
> You could also look at the pg_statio_user_tables view to narrow down which
> tables are being hit the hardest, which might give you some clues.

Is there something to activate in order to see something in this view? Because it's empty on my production server:

postgres=# select * from pg_statio_user_tables;
 relid | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit | toast_blks_read | toast_blks_hit | tidx_blks_read | tidx_blks_hit
-------+------------+---------+----------------+---------------+---------------+--------------+-----------------+----------------+----------------+---------------
(0 rows)

> Also see here:
> https://wiki.postgresql.org/wiki/Performance_Analysis_Tools
> https://wiki.postgresql.org/wiki/Monitoring
>
>> I'm asking myself another question, about the master/slave configuration.
>> For my test, I will put my SSD server back as a slave of the HDD server.
>
> Unless you've got a mainly read-only workload, you can't really test the
> slave properly that way, as all it's doing is replaying the wal.

Sorry, I misspoke. I meant that to promote my current SSD test server to master, I first have to set it up as a slave.

>> After that, I will promote it to master. [...]
>
> Yes, you will need to rebuild it, or look at pg_rewind:
>
> https://github.com/vmware/pg_rewind

Thanks for the reference. That could do the trick. I'll read it with interest.
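For reference, a resync with that tool would look roughly like the sketch below. The flags follow the later in-core pg_rewind; the out-of-tree vmware build for 9.3/9.4 may differ, and the data directory and connection string here are made up, so check the project README before use:

    # resynchronise the old master's data directory from the newly
    # promoted master, instead of taking a full pg_basebackup
    pg_rewind --target-pgdata=/datassd/pgdata \
              --source-server='host=newmaster port=5432 user=postgres dbname=postgres'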
> From: Thomas SIMON <tsimon@neteven.com>
> To: Glyn Astill <glynastill@yahoo.co.uk>
> Cc: "pgsql-admin@postgresql.org" <pgsql-admin@postgresql.org>
> Sent: Thursday, 21 May 2015, 17:56
> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>
> [...]
> Is there something to activate in order to see something in this view?
> Because it's empty on my production server:
>
> postgres=# select * from pg_statio_user_tables;
> (0 rows)

Looks like you need to set track_counts=on then. In fact, if you've got track_counts off then you're also not running autovacuum; that's a warning flag unless it's intentional.
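To double-check that from a live session, something like this works (the postgresql.conf line is the setting being referred to):

    -- in psql: show the live value of the setting
    SHOW track_counts;

    -- postgresql.conf
    track_counts = on    # needed for the pg_stat*/pg_statio* views and autovacuum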
Thomas

Le 22/05/2015 11:37, Glyn Astill a écrit :
> [...]
> Looks like you need to set track_counts=on then. In fact, if you've got
> track_counts off then you're also not running autovacuum; that's a warning
> flag unless it's intentional.

Arf, no, it's just me doing something dumb ... I was connected to the postgres database, not to my application's database...

Here are the results for my top tables.
  relid  | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit  | toast_blks_read | toast_blks_hit | tidx_blks_read | tidx_blks_hit
---------+------------+---------+----------------+---------------+---------------+---------------+-----------------+----------------+----------------+---------------
 2056411 | public     | table 1 |   326588572770 |  432317896616 |   12839634494 |   33230958888 |        52298815 |       20020628 |        9406665 |     153557966
 2059324 | public     | table 2 |   223323685944 |   72194867978 |   10319078550 |  103672304590 |          189135 |         262100 |          27815 |        207212
 2056368 | public     | table 3 |   198706578066 | 1259514828344 |   21083719219 | 1456776226532 |           90446 |         331680 |          30850 |        367441
 2056347 | public     | table 4 |   124086575669 |  554775474730 |    5236764894 |  614913194098 |        14978587 |       67085822 |         751951 |     134187396
 2057296 | public     | table 5 |    68120702371 |   18134568265 |    1243107028 |   10537765277 |       350646003 |       87888306 |      119113944 |     629790819
 2058460 | public     | table 6 |    35119995405 |  158177258403 |    1687014495 | 1395923096049 |          738542 |         214574 |         325765 |       1318187
 2058900 | public     | table 7 |    34205345184 |  271135546976 |   21050227933 |  252099340046 |            3924 |           6567 |            422 |         16333
 2059195 | public     | table 8 |    29169256266 |  152937546457 |    2263998316 |  127782237565 |               0 |              0 |              0 |             0

How do you interpret these values? Do you check the ratio between heap_blks_read and heap_blks_hit?
I've computed that ratio; it's about 89%.
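A sketch of how that ratio can be computed server-side rather than by hand (the hit fraction of heap block fetches across all user tables):

    SELECT round(100.0 * sum(heap_blks_hit)
                 / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0), 1)
           AS heap_hit_pct
    FROM pg_statio_user_tables;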
----- Original Message -----
> From: Thomas SIMON <tsimon@neteven.com>
> To: Glyn Astill <glynastill@yahoo.co.uk>
> Cc: "pgsql-admin@postgresql.org" <pgsql-admin@postgresql.org>
> Sent: Friday, 22 May 2015, 16:29
> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>
> [... earlier exchange trimmed ...]
>
> Arf, no, it's just me doing something dumb ...
> I was connected to the postgres database, not to my application's database...
>
> Here are the results for my top tables.
> [... pg_statio_user_tables output quoted above ...]
>
> How do you interpret these values?
> Do you check the ratio between heap_blks_read and heap_blks_hit?
> I've computed that ratio; it's about 89%.

It shows blocks fetched from disk, or possibly from the OS buffer cache, as "read", and blocks already in shared buffers as "hit".

See here: http://www.postgresql.org/docs/9.2/static/monitoring-stats.html#PG-STATIO-ALL-TABLES-VIEW

Also, this thread now appears to be split over diagnosing the issue from two different standpoints: 1) Is the i/o performance (or some other aspect) of your new SSD server worse than that of your old live spinning-disk system? 2) Are there improvements to be had in your database and application setup?

For 1) Maybe you could post the output of bonnie++ runs on the data volumes of both servers when nothing else is going off on them. If you can't do that on your production server, perhaps just the output from sar -d during a busy period might help. Hopefully that would help to clarify the specific disk performance differences for us. Other interesting tests would be the output of Greg Smith's stream-scaling script (https://github.com/gregs1104/stream-scaling). We're still missing the exact specs of your old system and the CPUs in the new one.

For 2) I think we're still missing a lot of information about what's happening in your database to be able to say much, i.e. what's the size of the database and of tables 1 - 8, and what are the typical read / write patterns and transaction rate? Perhaps you could give a simple description of what your app(s) do on a daily basis, and then post a diff of the outputs of the pg_stat_database, pg_stat_user_tables and pg_statio_user_tables views over a specific period of time.

You really need to dig into each of these yourself to try and see where your new system might be lacking.
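For reference, the suggested runs would look roughly like this — the mount point and user are assumptions, and bonnie++ wants a file set of about twice RAM (~512 GB for this 256 GB box):

    # bonnie++ against the data volume, sized at ~2x RAM
    bonnie++ -d /datassd/bench -s 512g -u postgres

    # sar: per-device activity, one sample per second for a minute
    sar -d -p 1 60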
Thomas

Le 22/05/2015 18:21, Glyn Astill a écrit :
> [... full quote of the previous message trimmed ...]
> Also, this thread now appears to be split over diagnosing the issue from two
> different standpoints: 1) Is the i/o performance (or some other aspect) of
> your new SSD server worse than that of your old live spinning-disk system?
> 2) Are there improvements to be had in your database and application setup?
>
> For 1) Maybe you could post the output of bonnie++ runs on the data volumes
> of both servers when nothing else is going off on them. [...] We're still
> missing the exact specs of your old system and the CPUs in the new one.

I'll stay on this first point for now, because that's where the origin of my problem lies, and it's the most urgent right now.
I can't do bonnie++ now, because I've already prepared the server as a slave, and I don't have sufficient disk space available (the program says the file size should be double RAM for good results).

My current sar output on the production (HDD) server is:

14:25:01          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:05:01 vg_data-lv_data   954.37   2417.35  18941.77     22.38     31.10     32.46      0.34     32.70
11:15:01 vg_data-lv_data  1155.79   8716.91  21995.77     26.57     25.15     21.74      0.40     46.70
11:25:01 vg_data-lv_data  1250.62   6478.67  23450.07     23.93     39.77     31.78      0.41     51.34
11:35:01 vg_data-lv_data   842.48   2051.11  17120.92     22.76     15.63     18.53      0.29     24.04
11:45:01 vg_data-lv_data   666.21   1403.32  14174.47     23.38     10.11     15.12      0.24     15.79
11:55:01 vg_data-lv_data   923.51   6763.36  15337.58     23.93     13.07     14.14      0.35     32.63
12:05:01 vg_data-lv_data   989.86   9148.71  16252.59     25.66     19.42     19.56      0.45     44.21
12:15:01 vg_data-lv_data  1369.24   8631.93  24737.60     24.37     35.04     25.54      0.45     61.33
12:25:01 vg_data-lv_data  1776.12   7070.01  39851.34     26.42     74.81     42.05      0.44     77.29
12:35:01 vg_data-lv_data  1529.15   6635.80  85865.14     60.49     54.11     35.34      0.48     72.89
12:45:01 vg_data-lv_data  1187.43   4528.74  40366.95     37.81     36.07     30.36      0.39     45.81
12:55:01 vg_data-lv_data   984.48   3520.06  21539.36     25.45     17.91     18.17      0.31     30.20
13:05:01 vg_data-lv_data   926.54   6304.44  16688.94     24.82     17.36     18.69      0.41     38.05
13:15:01 vg_data-lv_data  1232.46   7199.65  29852.49     30.06     40.17     32.53      0.42     51.60
13:25:01 vg_data-lv_data  1223.46   3945.05  27448.15     25.66     31.07     25.31      0.35     42.65
13:35:01 vg_data-lv_data  1126.91   2811.70  22067.19     22.08     24.33     21.55      0.32     36.00
13:45:01 vg_data-lv_data   833.33   1805.26  17274.43     22.90     24.40     29.25      0.30     25.41
13:55:02 vg_data-lv_data  1085.88   7616.75  19140.67     24.64     17.48     16.06      0.39     42.15
14:05:01 vg_data-lv_data   691.52   3852.50  13125.53     24.55      7.75     11.15      0.30     20.74
14:15:01 vg_data-lv_data  1288.88   5390.41  24171.07     22.94     33.31     25.76      0.36     46.31
14:25:01 vg_data-lv_data  1592.88   3637.77  29836.89     21.02     76.45     47.94      0.40     63.28
14:35:01 vg_data-lv_data  1652.78   9502.87  31587.68     24.86     58.97     35.58      0.44     72.46
14:45:01 vg_data-lv_data  1623.82   6249.52  34148.46     24.88     53.47     32.83      0.40     65.19
14:55:01 vg_data-lv_data  1330.44   6516.11  26828.59     25.06     55.66     41.81      0.42     55.46
Average: vg_data-lv_data  1176.55   5508.02  26324.37     27.06     33.86     28.72      0.39     45.59

The output of stream-scaling is:

CPU /sys/devices/system/cpu/cpu0 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu0 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu0 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu0 Level 3 Cache: 25600K (Unified)
CPU /sys/devices/system/cpu/cpu1 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu1 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu1 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu1 Level 3 Cache: 25600K (Unified)
... (for 40 cpus)

Total CPU system cache: 1060372480 bytes
Suggested minimum array elements needed: 481987490
WARNING: Array size may not fit into a 32 bit structure.
If stream fails to compile, you may need to uncomment the line in the script
labeled and described by the "Size clamp code" comments in the stream-scaling
script.
Array elements used: 481987490

=== CPU Core Summary ===
processor       : 39
model name      : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
cpu MHz         : 1273.925
siblings        : 20

=== Check and build stream ===
/tmp/ccuVZwf5.o: In function `tuned_STREAM_Triad._omp_fn.0':
stream.c:(.text+0x5a): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0x8b): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0x98): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0xe0): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0xe7): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0x14b): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0x15b): relocation truncated to fit: R_X86_64_32S against `.bss'
/tmp/ccuVZwf5.o: In function `tuned_STREAM_Add._omp_fn.1':
stream.c:(.text+0x1f4): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0x1fd): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0x23b): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0x242): additional relocation overflows omitted from the output
collect2: error: ld returned 1 exit status

=== Trying large memory model ===
(this can take a while to compile)

=== Testing up to 40 cores ===
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 481987490, Offset = 0
Total memory required = 11031.8 MB.
Each test is run 10 times, but only the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 1
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 520522 microseconds.
   (= 520522 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:           9552.4291       0.8205       0.8073       0.8319
Scale:          9818.6724       0.8128       0.7854       0.8333
Add:           10600.6443       1.1113       1.0912       1.1557
Triad:         10598.8774       1.1073       1.0914       1.1301
-------------------------------------------------------------
Failed Validation on array c[]
        Expected  : 148233746338593734656.000000
        Observed  : 148233747862815801344.000000
-------------------------------------------------------------

Number of Threads requested = 2
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:         26541.4003       0.4667       0.4358       0.5130
Failed Validation on array c[]
        Expected  : 148233746338593734656.000000
        Observed  : 148233747862815801344.000000

Number of Threads requested = 3
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:         30890.3475       0.3798       0.3745       0.3842
Failed Validation on array c[]
        Expected  : 148233746338593734656.000000
        Observed  : 148233747862815801344.000000

...
Number of Threads requested = 40
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:         53770.7791       0.2186       0.2151       0.2258
Failed Validation on array c[]
        Expected  : 148233746338593734656.000000
        Observed  : 148233747862815801344.000000

For hardware specs, both servers have the same CPU & RAM; that is 40 * Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, and 256 GB RAM.

I'll do the switch between master & slave tomorrow morning, so at that point I will probably have more information on the SSD server (sar, dstat reports, ...).

> For 2) I think we're still missing a lot of information about what's
> happening in your database to be able to say much, i.e. what's the size of
> the database and of tables 1 - 8, and what are the typical read / write
> patterns and transaction rate? Perhaps you could give a simple description
> of what your app(s) do on a daily basis, and then post a diff of the outputs
> of the pg_stat_database, pg_stat_user_tables and pg_statio_user_tables views
> over a specific period of time.
>
> You really need to dig into each of these yourself to try and see where your
> new system might be lacking.

A quick answer on the main figures: the db size is ~150 GB.
The standard behavior is 55% updates, 40% selects & 10% inserts; around 170 TPS.
I've set up periodic recording of pg_stat_database to get stats, and will try to analyze the other views.
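A minimal sketch of that kind of recording, with a made-up snapshot table (in practice the INSERT would run from cron so consecutive rows can be diffed):

    -- one-time setup: an empty copy of pg_stat_database plus a timestamp
    CREATE TABLE pg_stat_database_snap AS
      SELECT now() AS snap_time, d.* FROM pg_stat_database d WHERE false;

    -- run periodically, e.g. every 10 minutes
    INSERT INTO pg_stat_database_snap
      SELECT now(), d.* FROM pg_stat_database d;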
> From: Thomas SIMON <tsimon@neteven.com>
> To: Glyn Astill <glynastill@yahoo.co.uk>
> Cc: "pgsql-admin@postgresql.org" <pgsql-admin@postgresql.org>
> Sent: Tuesday, 26 May 2015, 15:44
> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>
> I can't do bonnie++ now, because I've already prepared the server as a
> slave, and I don't have sufficient disk space available (the program says
> the file size should be double RAM for good results).
>
> My current sar output on the production (HDD) server is:
> [... sar output quoted above ...]

So the i/o is read heavy; it would be interesting to see why that might be, and some insight into running queries would go a long way there.

> A quick answer on the main figures: the db size is ~150 GB.
> The standard behavior is 55% updates, 40% selects & 10% inserts; around
> 170 TPS.

Again, if this is a dedicated postgres server I'd be hoping for there to be pretty much 0 reads from disk, as your whole database should fit in the os buffer cache.

Only 170 TPS; are these large operations like bulk loads and big selects? Can you post the output of "free -m" and "cat /proc/meminfo"?
----- Original Message -----
> From: Glyn Astill <glynastill@yahoo.co.uk>
> To: Thomas SIMON <tsimon@neteven.com>
> Cc: "pgsql-admin@postgresql.org" <pgsql-admin@postgresql.org>
> Sent: Tuesday, 26 May 2015, 16:27
> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>
> [... sar output quoted above ...]
>
> So the i/o is read heavy; it would be interesting to see why that might be,
> and some insight into running queries would go a long way there.

Scratch that, I've obviously misaligned the output when reading it, and I was looking at writes when I thought I was looking at reads. Still, it would be nice to see what the reads are though.
Hi everyone,

Coming back to you after my server switch: the results are very positive!

To answer your last question, Glyn: I figured out where the reads were coming from. They are the WAL senders streaming to the slave, so they are "normal". There are no more significant reads when nothing is being shipped to the slave.

It seems that the kernel change combined with the raid optimisations has resolved the problem. No more crazy load peaks or io-bound problems on the SSD partition; SSD performance is great for now, and the pgBadger total query duration has decreased by about 35% since I switched servers.

So I guess this topic is solved now. Thanks to the many contributors who helped me, and special thanks to Glyn, who spent a lot of time answering me and helped me a lot!

Thomas

Le 26/05/2015 17:42, Glyn Astill a écrit :
> [... full quote of the previous messages trimmed ...]
>
> Scratch that, I've obviously misaligned the output when reading it, and I
> was looking at writes when I thought I was looking at reads. Still, it
> would be nice to see what the reads are though.