Discussion: Slow queries / commits, mis-configuration or hardware issues?
Hi, we are running Postgres 9.1.1 on an EC2 m1.xlarge instance. The machine is a dedicated master with 2 streaming replication nodes, and it has 16GB of RAM and 4 cores.

We're starting to see some slow queries, especially COMMITs, and they are happening more frequently. The slow queries are against seemingly well-indexed tables. I have log_min_duration_statement = 150ms.

Slow commits like:

2011-11-14 17:47:11 UTC pid:14366 (44/0-0) LOG: duration: 3062.784 ms statement: COMMIT
2011-11-14 17:47:11 UTC pid:14604 (48/0-0) LOG: duration: 2593.351 ms statement: COMMIT

These slow COMMITs are against tables that receive a large number of UPDATEs and are growing fairly rapidly.

And slow queries like:

2011-11-14 17:50:20 UTC pid:6519 (16/0-0) LOG: duration: 1694.456 ms statement: SELECT "facebook_wall_posts".* FROM "facebook_wall_posts" WHERE "facebook_wall_posts"."token" = '984c44e75975b224b38197cf8f8fc76a' LIMIT 1

query plan: http://explain.depesz.com/s/wbm

The # of rows in facebook_wall_posts is 5841. Structure of facebook_wall_posts:

                              Table "public.facebook_wall_posts"
       Column       |            Type             |                            Modifiers
--------------------+-----------------------------+------------------------------------------------------------------
 id                 | integer                     | not null default nextval('facebook_wall_posts_id_seq'::regclass)
 album_id           | integer                     | not null
 token              | character varying(32)       | not null
 fb_recipient_id    | character varying(64)       | not null
 post_id            | character varying(100)      | not null
 fb_post_created_at | timestamp without time zone |
 data               | text                        |
 created_at         | timestamp without time zone |
 updated_at         | timestamp without time zone |
 fb_post_deleted_at | timestamp without time zone |
Indexes:
    "facebook_wall_posts_pkey" PRIMARY KEY, btree (id)
    "index_facebook_wall_posts_on_token" UNIQUE, btree (token)
    "index_facebook_wall_posts_on_album_id" btree (album_id)

And another slow query:

2011-11-14 17:52:44 UTC pid:14912 (58/0-0) LOG: duration: 979.740 ms statement: SELECT "facebook_friends"."friend_id" FROM "facebook_friends" WHERE "facebook_friends"."user_id" = 9134671

Query plan: http://explain.depesz.com/s/x1q

The # of rows in facebook_friends is 27075088. Structure of facebook_friends:

                      Table "public.facebook_friends"
   Column   |            Type             |                           Modifiers
------------+-----------------------------+----------------------------------------------------------------
 id         | integer                     | not null default nextval('facebook_friends_id_seq'::regclass)
 user_id    | integer                     | not null
 friend_id  | integer                     | not null
 created_at | timestamp without time zone |
Indexes:
    "facebook_friends_pkey" PRIMARY KEY, btree (id)
    "index_facebook_friends_on_user_id_and_friend_id" UNIQUE, btree (user_id, friend_id)

We have autovacuum enabled and running, but yesterday I manually ran VACUUM on the database.

Autovacuum settings:

autovacuum                      | on
autovacuum_analyze_scale_factor | 0.1
autovacuum_analyze_threshold    | 50
autovacuum_freeze_max_age       | 200000000
autovacuum_max_workers          | 3
autovacuum_naptime              | 60
autovacuum_vacuum_cost_delay    | 20
autovacuum_vacuum_cost_limit    | -1
autovacuum_vacuum_scale_factor  | 0.2
autovacuum_vacuum_threshold     | 50

Other postgresql.conf settings:

shared_buffers = 3584MB
wal_buffers = 16MB
checkpoint_segments = 32
max_wal_senders = 10
checkpoint_completion_target = 0.9
wal_keep_segments = 1024
maintenance_work_mem = 256MB
work_mem = 88MB
shared_buffers = 3584MB
effective_cache_size = 10GB

The PGDATA dir is a RAID10 on 4 local ("ephemeral" in EC2 speak) drives.
I ran some dd tests and received the following output:

--- WRITING ---
root@sql03:/data# time sh -c "dd if=/dev/zero of=/data/tmp/bigfile bs=8k count=4000000 && sync"
4000000+0 records in
4000000+0 records out
32768000000 bytes (33 GB) copied, 670.663 s, 48.9 MB/s

real    11m52.199s
user    0m2.720s
sys     0m45.330s

--- READING ---
root@sql03:/data# time dd of=/dev/zero if=/data/tmp/bigfile bs=8k
4000000+0 records in
4000000+0 records out
32768000000 bytes (33 GB) copied, 155.429 s, 211 MB/s

real    2m35.434s
user    0m2.400s
sys     0m33.160s

I have enabled log_checkpoints and here is a recent sample from the log:

2011-11-14 17:38:48 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 15121 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 8 recycled; write=270.101 s, sync=2.989 s, total=273.112 s; sync files=60, longest=1.484 s, average=0.049 s
2011-11-14 17:39:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:43:49 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync files=82, longest=2.943 s, average=0.050 s
2011-11-14 17:44:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:48:47 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 14159 buffers (3.1%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=269.818 s, sync=2.119 s, total=271.948 s; sync files=71, longest=1.192 s, average=0.029 s
2011-11-14 17:49:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:53:47 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 11337 buffers (2.5%); 0 transaction log file(s) added, 0 removed, 7 recycled; write=269.901 s, sync=2.508 s, total=272.419 s; sync files=71, longest=1.867 s, average=0.035 s
2011-11-14 17:54:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:58:48 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 15706 buffers (3.4%); 0 transaction log file(s) added, 0 removed, 7 recycled; write=270.104 s, sync=3.612 s, total=273.727 s; sync files=67, longest=3.051 s, average=0.053 s
2011-11-14 17:59:15 UTC pid:3965 (-0) LOG: checkpoint starting: time

I've been collecting random samples from pg_stat_bgwriter:

https://gist.github.com/4faec2ca9a79ede281e1

So given all this information (if you need more just let me know), is there something fundamentally wrong or mis-configured? Do I have an I/O issue?

Thanks for any insight.
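For the record, one way to see whether the slow token lookup above is actually waiting on disk reads is to re-run it with buffer accounting. A sketch only - the database name is made up:

# hypothetical database name; the token value is taken from the log line above
psql -d app_production -c "
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM facebook_wall_posts
WHERE token = '984c44e75975b224b38197cf8f8fc76a'
LIMIT 1;"
# 'shared read=N' in the output means N blocks had to come from the OS / disk
# rather than from shared_buffers.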
On 14 November 2011, 19:16, Cody Caughlan wrote:
> shared_buffers = 3584MB
> wal_buffers = 16MB
> checkpoint_segments = 32
> max_wal_senders = 10
> checkpoint_completion_target = 0.9
> wal_keep_segments = 1024
> maintenance_work_mem = 256MB
> work_mem = 88MB
> shared_buffers = 3584MB
> effective_cache_size = 10GB

Seems reasonable, although I'd bump up the checkpoint_timeout (the 5m default is usually too low).

> The PGDATA dir is a RAID10 on 4 local ("ephemeral" in EC2 speak) drives. I ran some dd tests and received the following output:
>
> --- WRITING ---
> root@sql03:/data# time sh -c "dd if=/dev/zero of=/data/tmp/bigfile bs=8k count=4000000 && sync"
> 4000000+0 records in
> 4000000+0 records out
> 32768000000 bytes (33 GB) copied, 670.663 s, 48.9 MB/s
>
> real    11m52.199s
> user    0m2.720s
> sys     0m45.330s

This measures sequential write performance (and the same holds for the read test). We need to know the random I/O performance too - use bonnie++ or a similar tool.

Based on the AWS benchmarks I've seen so far, I'd expect about 90 MB/s for sequential reads/writes, and about twice that performance for a 4-drive RAID10. So while the reads (211 MB/s) seem perfectly OK, the writes (50MB/s) are rather slow. Have you measured this on an idle system, or while the db was running?

See for example these:

[1] http://victortrac.com/EC2_Ephemeral_Disks_vs_EBS_Volumes
[2] http://www.gabrielweinberg.com/blog/2011/05/raid0-ephemeral-storage-on-aws-ec2.html

> I have enabled log_checkpoints and here is a recent sample from the log:
> 2011-11-14 17:39:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
> 2011-11-14 17:43:49 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync files=82, longest=2.943 s, average=0.050 s

Nothing special here - this just says that the checkpoints were timed and finished on time (the default checkpoint_timeout is 5 minutes; with completion target 0.9 the expected write phase is about 270s). Probably not a checkpoint issue.

> I've been collecting random samples from pg_stat_bgwriter:
> https://gist.github.com/4faec2ca9a79ede281e1

It's a bit difficult to interpret this (please collect the data at regular intervals - e.g. every hour - and post the differences), but it seems reasonable.

> So given all this information (if you need more just let me know), is there something fundamentally wrong or mis-configured? Do I have an I/O issue?

Probably - the discrepancy between read and write performance is a bit suspicious. Try to watch the I/O performance when this happens, i.e. run "iostat -x" and watch the output (especially %util, r_await, w_await) and post several lines of it.

Tomas
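If a concrete invocation helps, something along these lines exercises random seeks as well as sequential throughput. A sketch - the directory and user are assumptions, and the test file size should be roughly twice RAM so the page cache can't absorb it:

# run as root on the RAID10 mount point, preferably while the DB is idle
bonnie++ -d /data/tmp -s 32000 -n 0 -u postgres
# -s 32000: ~32GB test file (about 2x RAM); -n 0: skip the small-file creation tests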
Thanks for your response. Please see below for answers to your questions.

On Mon, Nov 14, 2011 at 11:22 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> shared_buffers = 3584MB
>> wal_buffers = 16MB
>> checkpoint_segments = 32
>> max_wal_senders = 10
>> checkpoint_completion_target = 0.9
>> wal_keep_segments = 1024
>> maintenance_work_mem = 256MB
>> work_mem = 88MB
>> shared_buffers = 3584MB
>> effective_cache_size = 10GB
>
> Seems reasonable, although I'd bump up the checkpoint_timeout (the 5m default is usually too low).

Ok, will do.

>> The PGDATA dir is a RAID10 on 4 local ("ephemeral" in EC2 speak) drives. I ran some dd tests and received the following output:
>>
>> --- WRITING ---
>> root@sql03:/data# time sh -c "dd if=/dev/zero of=/data/tmp/bigfile bs=8k count=4000000 && sync"
>> 4000000+0 records in
>> 4000000+0 records out
>> 32768000000 bytes (33 GB) copied, 670.663 s, 48.9 MB/s
>>
>> real    11m52.199s
>> user    0m2.720s
>> sys     0m45.330s
>
> This measures sequential write performance (and the same holds for the read test). We need to know the random I/O performance too - use bonnie++ or a similar tool.
>
> Based on the AWS benchmarks I've seen so far, I'd expect about 90 MB/s for sequential reads/writes, and about twice that performance for a 4-drive RAID10. So while the reads (211 MB/s) seem perfectly OK, the writes (50MB/s) are rather slow. Have you measured this on an idle system, or while the db was running?

I ran bonnie++ on a slave node, doing active streaming replication but otherwise idle:
http://batch-files-test.s3.amazonaws.com/sql03.prod.html

bonnie++ on the master node:
http://batch-files-test.s3.amazonaws.com/sql01.prod.html

If I am reading this right (this is my first time using it), the numbers don't look too good.

> See for example these:
>
> [1] http://victortrac.com/EC2_Ephemeral_Disks_vs_EBS_Volumes
> [2] http://www.gabrielweinberg.com/blog/2011/05/raid0-ephemeral-storage-on-aws-ec2.html
>
>> I have enabled log_checkpoints and here is a recent sample from the log:
>> 2011-11-14 17:39:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
>> 2011-11-14 17:43:49 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync files=82, longest=2.943 s, average=0.050 s
>
> Nothing special here - this just says that the checkpoints were timed and finished on time (the default checkpoint_timeout is 5 minutes; with completion target 0.9 the expected write phase is about 270s). Probably not a checkpoint issue.
>
>> I've been collecting random samples from pg_stat_bgwriter:
>> https://gist.github.com/4faec2ca9a79ede281e1
>
> It's a bit difficult to interpret this (please collect the data at regular intervals - e.g. every hour - and post the differences), but it seems reasonable.

Ok, I have a cron running every hour to grab this data. I will post back in a few hours or tomorrow.

>> So given all this information (if you need more just let me know), is there something fundamentally wrong or mis-configured? Do I have an I/O issue?
>
> Probably - the discrepancy between read and write performance is a bit suspicious.
>
> Try to watch the I/O performance when this happens, i.e. run "iostat -x" and watch the output (especially %util, r_await, w_await) and post several lines of it.
Here's a gist of running "iostat -x 3" for a few minutes:

https://gist.github.com/f94d98f2ef498a522ac2

Indeed, the %iowait and await values can spike up drastically.
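(The hourly pg_stat_bgwriter collection mentioned above is nothing fancy - roughly a cron entry like the following sketch, with the output path made up:)

# /etc/cron.d/pg-stat-bgwriter - append a timestamped snapshot every hour,
# so the per-hour differences can be computed later
0 * * * * postgres psql -At -c "SELECT now(), * FROM pg_stat_bgwriter" >> /var/log/pg_stat_bgwriter.log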
On 14 November 2011, 22:58, Cody Caughlan wrote:
>> Seems reasonable, although I'd bump up the checkpoint_timeout (the 5m default is usually too low).
>
> Ok, will do.

Yes, but find out what that means and think about the possible impact first. It usually improves the checkpoint behaviour, but it increases the recovery time and you may need more checkpoint segments. And I doubt this will fix the issue you've described.

> I ran bonnie++ on a slave node, doing active streaming replication but otherwise idle:
> http://batch-files-test.s3.amazonaws.com/sql03.prod.html
>
> bonnie++ on the master node:
> http://batch-files-test.s3.amazonaws.com/sql01.prod.html
>
> If I am reading this right (this is my first time using it), the numbers don't look too good.

Are those instances equal, i.e. do they use the same RAID10 config etc.? It surprises me a bit that the slave performs much better than the master - for example the sequential reads are much faster (210MB/s vs. 60MB/s) and it handles about twice the number of seeks (345 vs. 170). But this may be skewed by the workload.

> Here's a gist of running "iostat -x 3" for a few minutes:
>
> https://gist.github.com/f94d98f2ef498a522ac2
>
> Indeed, the %iowait and await values can spike up drastically.

OK, so xvdb-xvde are the individual drives and dm-0 is the RAID10 device, right? According to the log_checkpoints info, you're writing about 15000 buffers (120MB) in 270s, i.e. about 440kB/s. But according to the iostat output you're writing up to 4MB/s, so it's not just the checkpoints. What else is going on there? How much WAL do you write?

Do you have iotop installed? That might give you a hint what processes are writing data etc. I'm a bit confused that the w/s don't add up while the r/s do.

Tomas
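One rough way to answer the "how much WAL" question on 9.1 (which has no pg_xlog_location_diff() yet) is to sample pg_current_xlog_location() twice. A sketch - the 4GB-per-logid arithmetic is a slight approximation on 9.1, but close enough for a rate estimate:

#!/bin/bash
# Estimate WAL volume written over one minute (approximate).
loc() { psql -At -c "SELECT pg_current_xlog_location()"; }
to_bytes() {                 # convert an 'X/Y' xlog location to an approximate byte offset
  local hi=${1%%/*} lo=${1##*/}
  echo $(( 16#$hi * 4294967296 + 16#$lo ))
}
a=$(loc); sleep 60; b=$(loc)
echo "~$(( ( $(to_bytes "$b") - $(to_bytes "$a") ) / 1048576 )) MB of WAL per minute"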
On Mon, Nov 14, 2011 at 2:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 14 November 2011, 22:58, Cody Caughlan wrote:
>>> Seems reasonable, although I'd bump up the checkpoint_timeout (the 5m default is usually too low).
>>
>> Ok, will do.
>
> Yes, but find out what that means and think about the possible impact first. It usually improves the checkpoint behaviour, but it increases the recovery time and you may need more checkpoint segments. And I doubt this will fix the issue you've described.

Ok, I understand the ramifications of increasing the checkpoint timeout. I will investigate more before I jump in.

>> I ran bonnie++ on a slave node, doing active streaming replication but otherwise idle:
>> http://batch-files-test.s3.amazonaws.com/sql03.prod.html
>>
>> bonnie++ on the master node:
>> http://batch-files-test.s3.amazonaws.com/sql01.prod.html
>>
>> If I am reading this right (this is my first time using it), the numbers don't look too good.
>
> Are those instances equal, i.e. do they use the same RAID10 config etc.? It surprises me a bit that the slave performs much better than the master - for example the sequential reads are much faster (210MB/s vs. 60MB/s) and it handles about twice the number of seeks (345 vs. 170). But this may be skewed by the workload.

Yes, these two machines are the same. sql01 is the master node and is quite busy; running bonnie++ on it during its normal workload spiked I/O for the duration. sql03 is a pure slave and is quite idle, save for receiving WAL segments.

>> Here's a gist of running "iostat -x 3" for a few minutes:
>>
>> https://gist.github.com/f94d98f2ef498a522ac2
>>
>> Indeed, the %iowait and await values can spike up drastically.
>
> OK, so xvdb-xvde are the individual drives and dm-0 is the RAID10 device, right? According to the log_checkpoints info, you're writing about 15000 buffers (120MB) in 270s, i.e. about 440kB/s. But according to the iostat output you're writing up to 4MB/s, so it's not just the checkpoints.
>
> What else is going on there? How much WAL do you write?

Yes, dm-0 is the RAID10 device. The WAL config is:

wal_buffers = 16MB
checkpoint_segments = 32
max_wal_senders = 10
checkpoint_completion_target = 0.9
checkpoint_timeout = 300
wal_keep_segments = 1024

> Do you have iotop installed? That might give you a hint what processes are writing data etc.

I do have iotop and have been watching it. The only I/O users are postgres and its backends; I don't see anything else consuming any I/O. By eyeballing iotop, the big consumers of disk writes are:

idle in transaction, SELECT, COMMIT

The first two are what I would think would be largely read operations (certainly the SELECT), so it's not clear why a SELECT consumes write time.

Here is the output of some pg_stat_bgwriter stats from the last couple of hours:

https://gist.github.com/41ee26caca01471a9b77

One thing that I might not have made very clear earlier is that this DB, and especially a single table, receives a very large number of UPDATEs. However, it seems to be well cached; I have shared_buffers = 3584MB and a view of pg_buffercache shows:

https://gist.github.com/53c520571290cae14613

Is it possible that we're just trying to handle too many UPDATEs and they are all trying to hit disk all at once - causing this I/O contention?

Here is a view of pg_stat_user_tables that shows the amount of live/dead tuples:

https://gist.github.com/5ac1ae7d11facd72913f
On 15.11.2011 01:13, Cody Caughlan wrote:
> The first two are what I would think would be largely read operations (certainly the SELECT), so it's not clear why a SELECT consumes write time.
>
> Here is the output of some pg_stat_bgwriter stats from the last couple of hours:
>
> https://gist.github.com/41ee26caca01471a9b77

Hmm, the difference between 22:00 and 23:00 seems a bit suspicious. There are 11 checkpoints (timed + requested), i.e. 352 segments, i.e. about 5632 MB of WAL data. The checkpoints had to write 504,135 buffers (i.e. about 4 GB of data) and the background writer an additional 10,494 buffers (about 100MB). But the backends themselves had to write 173,208 buffers. That's way too much, I guess, and it's probably one of the reasons why the queries take so long.

So your primary goal probably should be to write fewer buffers from the backends. Spread checkpoints are the most effective way, and the background writer is fine. Try to eliminate the requested checkpoints (increase the number of checkpoint segments), and to eliminate the buffers written by backends. Increase shared_buffers and watch whether that helps. If it does not, make the background writer a bit more aggressive: increase bgwriter_lru_maxpages / decrease bgwriter_delay, which should write the buffers more aggressively.

But if there was something extraordinary happening between 22:00 and 23:00 (e.g. a backup, a batch job), this may be complete nonsense.

> One thing that I might not have made very clear earlier is that this DB, and especially a single table, receives a very large number of UPDATEs. However, it seems to be well cached; I have shared_buffers = 3584MB and a view of pg_buffercache shows:
>
> https://gist.github.com/53c520571290cae14613

That means almost 100% of the buffers are used. But this is not a very interesting view - a buffer may be occupied and yet not used. Can you group the buffers by "isdirty" so we can see what portion of the buffers was modified (and needs to be written)?

Much more interesting is the view from the other side - how many requests were handled from shared buffers. For a database you can do that like this:

select datname, 100 * blks_hit / (blks_hit + blks_read + 1)
  from pg_stat_database;

The "+1" is there just to avoid division by zero, and you should evaluate this as the difference of two snapshots. You can compute the same thing (cache hit ratio) for tables/indexes. The problem here is that there's a separate cache (the OS page cache) that's not included here, so a read may be a "miss" and yet not require an actual I/O.

> Is it possible that we're just trying to handle too many UPDATEs and they are all trying to hit disk all at once - causing this I/O contention?

Not sure what you mean by "trying to hit disk all at once"? The updates are written to the WAL (which is mostly sequential I/O) and the actual pages are updated in memory (shared buffers). Those shared buffers are written out when a checkpoint happens, and that is a single write of the whole block, not many small writes. Sure, the clients need to grab a lock on a buffer when modifying it, and a lock on the WAL, but that wouldn't show up as I/O utilization. In short - I don't think this is what's happening here.

What actually matters here is that the dirty buffers are spread across the drive - that's where the random I/O comes from - and the fact that the backends need to flush dirty buffers on their own.
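Concretely, those two checks could look roughly like this (a sketch; it assumes superuser access via psql and the pg_buffercache view that is already in use):

# how much of shared_buffers is actually dirty (i.e. waiting to be written)?
psql -c "SELECT isdirty, count(*) AS buffers,
                pg_size_pretty(count(*) * 8192) AS size
           FROM pg_buffercache GROUP BY isdirty;"

# per-table cache hit ratio as PostgreSQL sees it
# (hits from the OS page cache still count as 'read' here)
psql -c "SELECT relname, heap_blks_read, heap_blks_hit,
                round(100.0 * heap_blks_hit / (heap_blks_hit + heap_blks_read + 1), 1) AS hit_pct
           FROM pg_statio_user_tables
          ORDER BY heap_blks_read DESC LIMIT 10;"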
> Here is a view of pg_stat_user_tables that shows the amount of live/dead tuples:
>
> https://gist.github.com/5ac1ae7d11facd72913f

What could happen due to the UPDATEs is table bloat (the table growing because of updates), but I don't think that's happening here. The number of dead tuples is very low - less than 1% in most cases. For example the largest table, "users", has less than 0.5% dead tuples, and most of the updates are handled by HOT. So I don't think this is an issue here.

kind regards
Tomas
On 14.11.2011 22:58, Cody Caughlan wrote:
> I ran bonnie++ on a slave node, doing active streaming replication but otherwise idle:
> http://batch-files-test.s3.amazonaws.com/sql03.prod.html
>
> bonnie++ on the master node:
> http://batch-files-test.s3.amazonaws.com/sql01.prod.html
>
> If I am reading this right (this is my first time using it), the numbers don't look too good.

I've done some benchmarks of my own (m1.xlarge instance), and the results are these (http://pastebin.com/T1LXHru0):

single drive
------------
dd writes:            62 MB/s
dd reads:            110 MB/s
bonnie seq. writes:   55 MB/s
bonnie seq. rewrite:  33 MB/s
bonnie seq. reads:    91 MB/s
bonnie seeks:        370/s

raid 0 (4 devices)
------------------
dd writes:           220 MB/s
dd reads:            380 MB/s
bonnie seq. writes:  130 MB/s
bonnie seq. rewrite: 114 MB/s
bonnie seq. reads:   280 MB/s
bonnie seeks:        570/s

raid 10 (4 devices)
-------------------
dd writes:            90 MB/s
dd reads:            200 MB/s
bonnie seq. writes:   49 MB/s
bonnie seq. rewrite:  56 MB/s
bonnie seq. reads:   160 MB/s
bonnie seeks:        590/s

So the results are rather different from yours (both master and slave). What surprises me a bit is the decrease in write performance between a single drive and RAID 10. I used bonnie++ 1.03e, so I'm wondering whether 1.96 would give different results ... All those benchmarks were performed with ext3.

Tomas
On Tue, Nov 15, 2011 at 5:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 14.11.2011 22:58, Cody Caughlan wrote:
>> I ran bonnie++ on a slave node, doing active streaming replication but otherwise idle:
>> http://batch-files-test.s3.amazonaws.com/sql03.prod.html
>>
>> bonnie++ on the master node:
>> http://batch-files-test.s3.amazonaws.com/sql01.prod.html
>>
>> If I am reading this right (this is my first time using it), the numbers don't look too good.
>
> I've done some benchmarks of my own (m1.xlarge instance), and the results are these (http://pastebin.com/T1LXHru0):
>
> single drive
> ------------
> dd writes:            62 MB/s
> dd reads:            110 MB/s
> bonnie seq. writes:   55 MB/s
> bonnie seq. rewrite:  33 MB/s
> bonnie seq. reads:    91 MB/s
> bonnie seeks:        370/s
>
> raid 0 (4 devices)
> ------------------
> dd writes:           220 MB/s
> dd reads:            380 MB/s
> bonnie seq. writes:  130 MB/s
> bonnie seq. rewrite: 114 MB/s
> bonnie seq. reads:   280 MB/s
> bonnie seeks:        570/s
>
> raid 10 (4 devices)
> -------------------
> dd writes:            90 MB/s
> dd reads:            200 MB/s
> bonnie seq. writes:   49 MB/s
> bonnie seq. rewrite:  56 MB/s
> bonnie seq. reads:   160 MB/s
> bonnie seeks:        590/s

Interesting. I spun up a new m1.xlarge and did the same RAID10 config (4 drives), except with a chunk size of 512K (instead of 256K), and the machine was completely idle. Bonnie:

http://batch-files-test.s3.amazonaws.com/idle-512k-chunk.html

Which has similar-ish performance to yours, except for worse seeks but a bit better seq. reads.

The other bonnies I sent over were NOT on idle systems. This one is the master, which receives a heavy stream of writes and some reads:

http://batch-files-test.s3.amazonaws.com/sql01.prod.html

And this is the slave, which is all writes and no reads:

http://batch-files-test.s3.amazonaws.com/sql03.prod.html

How did you build your RAID array? Maybe I have a fundamental flaw / misconfiguration. I am doing it via:

$ yes | mdadm --create /dev/md0 --level=10 -c256 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
$ pvcreate /dev/md0
$ vgcreate lvm-raid10 /dev/md0
$ lvcreate -l 215021 lvm-raid10 -n lvm0
$ blockdev --setra 65536 /dev/lvm-raid10/lvm0
$ mkfs.xfs -f /dev/lvm-raid10/lvm0
$ mkdir -p /data && mount -t xfs -o noatime /dev/lvm-raid10/lvm0 /data
On 16 November 2011, 2:21, Cody Caughlan wrote:
> How did you build your RAID array? Maybe I have a fundamental flaw / misconfiguration. I am doing it via:
>
> $ yes | mdadm --create /dev/md0 --level=10 -c256 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
> $ pvcreate /dev/md0
> $ vgcreate lvm-raid10 /dev/md0
> $ lvcreate -l 215021 lvm-raid10 -n lvm0
> $ blockdev --setra 65536 /dev/lvm-raid10/lvm0
> $ mkfs.xfs -f /dev/lvm-raid10/lvm0
> $ mkdir -p /data && mount -t xfs -o noatime /dev/lvm-raid10/lvm0 /data

I don't think you have a flaw there. The workload probably skews the results a bit on the master and slave, so it's difficult to compare them to results from an idle instance. The amount of data written seems small, but random I/O can saturate the devices quite easily.

I went with a very simple RAID config - no LVM, default stripe size (better seeks, worse sequential performance), default read-ahead (which could give better seq. performance).

Tomas
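For comparison, that plain setup would be roughly the following - a sketch reconstructed from the description above, not the exact commands that were run:

# plain md RAID10, default chunk size, no LVM, ext3 (as in the pastebin numbers)
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
mkfs.ext3 /dev/md0
mkdir -p /data && mount -o noatime /dev/md0 /data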
On 11/14/2011 01:16 PM, Cody Caughlan wrote:
> We're starting to see some slow queries, especially COMMITs, and they are happening more frequently. The slow queries are against seemingly well-indexed tables.
>
> Slow commits like:
>
> 2011-11-14 17:47:11 UTC pid:14366 (44/0-0) LOG: duration: 3062.784 ms statement: COMMIT
> 2011-11-14 17:47:11 UTC pid:14604 (48/0-0) LOG: duration: 2593.351 ms statement: COMMIT
>
> These slow COMMITs are against tables that receive a large number of UPDATEs and are growing fairly rapidly.

Linux will fill its write cache with all of the writes coming out of each checkpoint. With a 16GB instance, I would expect that 5% * 16GB ~= 800MB of writes are batched up when your system is slow. You should be able to confirm that by looking at the "Dirty:" line in /proc/meminfo.

With 800MB queued up and I/O that is lucky to get 50MB/s, the sync calls at the end of each checkpoint are sometimes blocking for multiple seconds:

> 2011-11-14 17:38:48 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 15121 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 8 recycled; write=270.101 s, sync=2.989 s, total=273.112 s; sync files=60, longest=1.484 s, average=0.049 s
> 2011-11-14 17:39:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
> 2011-11-14 17:43:49 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync files=82, longest=2.943 s, average=0.050 s

When an individual sync call gets stuck for that long, clients can easily get stuck behind it too. There are a couple of techniques that might help:

- Switch filesystems if you're running a slow one. ext3 has bad latency behavior here; XFS and ext4 are better.
- Lower the dirty_* tunables like dirty_background_ratio or its bytes-based variant. This will reduce average throughput, but can lower latency.
- Spread checkpoints out more so that fewer writes are happening on average.
- Decrease shared_buffers so less data is getting pushed out at checkpoint time.
- Reduce your reliability expectations and turn off synchronous_commit.

Your server is sometimes showing multi-second latency issues with bonnie++ too, which suggests this problem is not even specific to PostgreSQL. Linux is hard to tune for low latency under all circumstances; fighting latency down under a heavy update workload is hard to do even with good hardware to accelerate write performance. In an EC2 environment, it may not even be possible without making trade-offs like disabling synchronous writes. I can easily get transactions hung for 10 to 15 seconds on one of their servers if I try to make that problem bad; you're only seeing the middle range of latency issues so far.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
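To put numbers on the first two Linux-side suggestions, checking the writeback backlog and capping it could look roughly like this (a sketch; the 256MB value is purely illustrative and it assumes a kernel new enough to have the *_bytes variants of the dirty tunables):

# how much dirty data is Linux currently holding in its write cache?
grep -E '^(Dirty|Writeback):' /proc/meminfo

# cap the background writeback threshold in bytes instead of a percentage of RAM
# (takes effect immediately; persist via /etc/sysctl.conf if it helps)
sysctl -w vm.dirty_background_bytes=268435456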
On 16 November 2011, 5:27, Greg Smith wrote:
> On 11/14/2011 01:16 PM, Cody Caughlan wrote:
>> We're starting to see some slow queries, especially COMMITs, and they are happening more frequently. The slow queries are against seemingly well-indexed tables.
>>
>> Slow commits like:
>>
>> 2011-11-14 17:47:11 UTC pid:14366 (44/0-0) LOG: duration: 3062.784 ms statement: COMMIT
>> 2011-11-14 17:47:11 UTC pid:14604 (48/0-0) LOG: duration: 2593.351 ms statement: COMMIT
>>
>> These slow COMMITs are against tables that receive a large number of UPDATEs and are growing fairly rapidly.
>
> Linux will fill its write cache with all of the writes coming out of each checkpoint. With a 16GB instance, I would expect that 5% * 16GB ~= 800MB of writes are batched up when your system is slow. You should be able to confirm that by looking at the "Dirty:" line in /proc/meminfo.
>
> With 800MB queued up and I/O that is lucky to get 50MB/s, the sync calls at the end of each checkpoint are sometimes blocking for multiple seconds:
>
>> 2011-11-14 17:38:48 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 15121 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 8 recycled; write=270.101 s, sync=2.989 s, total=273.112 s; sync files=60, longest=1.484 s, average=0.049 s
>> 2011-11-14 17:39:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
>> 2011-11-14 17:43:49 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync files=82, longest=2.943 s, average=0.050 s
>
> When an individual sync call gets stuck for that long, clients can easily get stuck behind it too. There are a couple of techniques that might help:

The sync times I see there seem quite acceptable - 4.2s is not perfect, but I wouldn't rate it as terrible. What actually annoys me is the amount of data written: it's just around 17000 pages, i.e. about 130 MB per checkpoint, spread over 5 minutes. That's only about 0.5 MB/s.

> - Switch filesystems if you're running a slow one. ext3 has bad latency behavior here; XFS and ext4 are better.

He's using xfs, IIRC. That's one of the better behaved ones when it comes to sync.

> - Lower the dirty_* tunables like dirty_background_ratio or its bytes-based variant. This will reduce average throughput, but can lower latency.
> - Spread checkpoints out more so that fewer writes are happening on average.
> - Decrease shared_buffers so less data is getting pushed out at checkpoint time.
> - Reduce your reliability expectations and turn off synchronous_commit.

The question here probably is whether those high latencies are caused (or significantly influenced) by the checkpoints, or whether they are a "feature" of the storage - because if they are a feature, all this is a futile attempt to fix it.

I don't think he has a problem with checkpoints - he's complaining about regular queries being slow, even plain SELECTs, i.e. something that usually does not require a sync. No doubt this may be connected, but a regular SELECT usually does not perform a sync, right? It may need to fetch some data, and if the I/O is saturated by a checkpoint this may take time. But again - those bonnie results were collected on a running system, i.e. with checkpoints in progress and all of that.
And I'd expect most of the SELECT queries to be handled without actually touching the devices: combining https://gist.github.com/5ac1ae7d11facd72913f and https://gist.github.com/53c520571290cae14613, it seems that the largest table (users) is almost completely in shared buffers, while the two other large tables (external_user and facebook_friends) are cached at about 30%. I'd expect the rest of those tables to be in the page cache, so SELECT queries on them should be fast.

A commit obviously requires a sync on the WAL - I wonder if moving the WAL elsewhere would improve the performance here.

This is obviously based on an incomplete set of stats, and maybe I'm missing something.

> Your server is sometimes showing multi-second latency issues with bonnie++ too, which suggests this problem is not even specific to PostgreSQL. Linux is hard to tune for low latency under all

Don't forget those data were collected on a production system, i.e. it was actually under load. That probably skews the results a lot.

> circumstances; fighting latency down under a heavy update workload is hard to do even with good hardware to accelerate write performance. In an EC2 environment, it may not even be possible without making trade-offs like disabling synchronous writes. I can easily get transactions hung for 10 to 15 seconds on one of their servers if I try to make that problem bad; you're only seeing the middle range of latency issues so far.

Are you talking about EBS or ephemeral storage? Because all of this is about ephemeral (something like "virtualized local" storage).

Tomas
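If anyone wants to try the WAL-relocation idea, the usual approach is a symlink. A sketch only - the device name, mount point and data directory path are all made up, and it needs a brief shutdown:

pg_ctl -D /data/pgdata stop -m fast
mkfs.xfs /dev/xvdf && mkdir -p /wal && mount -o noatime /dev/xvdf /wal
mv /data/pgdata/pg_xlog /wal/
chown -R postgres:postgres /wal/pg_xlog
ln -s /wal/pg_xlog /data/pgdata/pg_xlog
pg_ctl -D /data/pgdata start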
On 16 November 2011, 2:21, Cody Caughlan wrote:
> How did you build your RAID array? Maybe I have a fundamental flaw / misconfiguration. I am doing it via:
>
> $ yes | mdadm --create /dev/md0 --level=10 -c256 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
> $ pvcreate /dev/md0
> $ vgcreate lvm-raid10 /dev/md0
> $ lvcreate -l 215021 lvm-raid10 -n lvm0
> $ blockdev --setra 65536 /dev/lvm-raid10/lvm0
> $ mkfs.xfs -f /dev/lvm-raid10/lvm0
> $ mkdir -p /data && mount -t xfs -o noatime /dev/lvm-raid10/lvm0 /data

I'm not using EC2 much, and those were my first attempts with ephemeral storage, so this may be a stupid question, but why are you building a RAID-10 array on ephemeral storage anyway?

You already have a standby, so if the primary instance fails you can easily fail over.

What are you going to do in case of a drive failure? With a physical server this is rather easy - just put in a new drive and you're done - but can you do that on EC2? I guess you can't while the instance is running, so you'll have to switch to the standby anyway, right? Have you ever tried this (how it affects performance etc.)?

So what additional protection does that give you? Wouldn't RAID-0 be a better use of the resources?

Tomas
On Nov 16, 2011, at 8:52 AM, Tomas Vondra wrote:

> On 16 November 2011, 2:21, Cody Caughlan wrote:
>> How did you build your RAID array? Maybe I have a fundamental flaw / misconfiguration. I am doing it via:
>>
>> $ yes | mdadm --create /dev/md0 --level=10 -c256 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
>> $ pvcreate /dev/md0
>> $ vgcreate lvm-raid10 /dev/md0
>> $ lvcreate -l 215021 lvm-raid10 -n lvm0
>> $ blockdev --setra 65536 /dev/lvm-raid10/lvm0
>> $ mkfs.xfs -f /dev/lvm-raid10/lvm0
>> $ mkdir -p /data && mount -t xfs -o noatime /dev/lvm-raid10/lvm0 /data
>
> I'm not using EC2 much, and those were my first attempts with ephemeral storage, so this may be a stupid question, but why are you building a RAID-10 array on ephemeral storage anyway?
>
> You already have a standby, so if the primary instance fails you can easily fail over.

Yes, the slave will become master if the master goes down. We have no plan to try and resurrect the master in the case of failure, hence the choice of ephemeral vs EBS.

We chose RAID10 over RAID0 to get the best combination of performance while minimizing the probability of a single drive failure bringing down the house. So, yes, RAID0 would ultimately deliver the best performance, with more risk.

> What are you going to do in case of a drive failure? With a physical server this is rather easy - just put in a new drive and you're done - but can you do that on EC2? I guess you can't while the instance is running, so you'll have to switch to the standby anyway, right? Have you ever tried this (how it affects performance etc.)?

As far as I know one cannot alter the ephemeral drives in a running instance, so yes, the whole instance would have to be written off.

> So what additional protection does that give you? Wouldn't RAID-0 be a better use of the resources?

Too much risk.
On 16 November 2011, 18:31, Cody Caughlan wrote:
> On Nov 16, 2011, at 8:52 AM, Tomas Vondra wrote:
>> I'm not using EC2 much, and those were my first attempts with ephemeral storage, so this may be a stupid question, but why are you building a RAID-10 array on ephemeral storage anyway?
>>
>> You already have a standby, so if the primary instance fails you can easily fail over.
>
> Yes, the slave will become master if the master goes down. We have no plan to try and resurrect the master in the case of failure, hence the choice of ephemeral vs EBS.
>
> We chose RAID10 over RAID0 to get the best combination of performance while minimizing the probability of a single drive failure bringing down the house. So, yes, RAID0 would ultimately deliver the best performance, with more risk.
>
>> What are you going to do in case of a drive failure? With a physical server this is rather easy - just put in a new drive and you're done - but can you do that on EC2? I guess you can't while the instance is running, so you'll have to switch to the standby anyway, right? Have you ever tried this (how it affects performance etc.)?
>
> As far as I know one cannot alter the ephemeral drives in a running instance, so yes, the whole instance would have to be written off.
>
>> So what additional protection does that give you? Wouldn't RAID-0 be a better use of the resources?
>
> Too much risk.

Why? If I understand it correctly, the only case where RAID-10 actually helps is when an ephemeral drive fails but the instance as a whole survives. Do you have any numbers on how often that happens, i.e. how often a drive fails without taking the instance down?

But you can't actually replace the failed drive, so the only option you have is to fail over to the standby - right? Sure, with async replication you could lose the not-yet-sent transactions. I see two possible solutions:

a) use sync rep, available in 9.1 (you already run 9.1.1)

b) place the WAL on an EBS volume, mounted as part of the failover

EBS volumes are not exactly fast, but it seems (e.g. http://www.mysqlperformanceblog.com/2009/08/06/ec2ebs-single-and-raid-volumes-io-bencmark/) the sequential performance might be acceptable. According to the stats you've posted, you've written about 5632 MB of WAL data per hour. That's about 1.5 MB/s on average, which an EBS volume might handle. If you have peaks where you need to write much more data, though, this is going to be a bottleneck.

Tomas
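For reference, option a) on 9.1 boils down to a couple of settings plus a check that the standby really shows up as synchronous. A sketch - the host and application names are illustrative:

# on the master, postgresql.conf:
#   synchronous_standby_names = 'sql03'
# on the standby, recovery.conf:
#   primary_conninfo = 'host=sql01 user=replicator application_name=sql03'
# then confirm the standby is reported as "sync":
psql -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"

Note that with a synchronous standby, commits on the master wait for the standby to acknowledge the WAL, so commit latency becomes bounded by the network and the standby's I/O as well.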