Discussion: Slow queries / commits, mis-configuration or hardware issues?
Hi, we are running Postgres 9.1.1 on an EC2 m1.xlarge instance. The machine is a dedicated master with 2 streaming replication nodes, and it has 16GB of RAM and 4 cores.

We're starting to see some slow queries, especially COMMITs, and they are happening more frequently. The slow queries are against seemingly well-indexed tables. I have log_min_duration_statement = 150ms.

Slow commits like:

2011-11-14 17:47:11 UTC pid:14366 (44/0-0) LOG: duration: 3062.784 ms statement: COMMIT
2011-11-14 17:47:11 UTC pid:14604 (48/0-0) LOG: duration: 2593.351 ms statement: COMMIT

These slow COMMITs are against tables that receive a large number of UPDATEs and are growing fairly rapidly.

And slow queries like:

2011-11-14 17:50:20 UTC pid:6519 (16/0-0) LOG: duration: 1694.456 ms statement: SELECT "facebook_wall_posts".* FROM "facebook_wall_posts" WHERE "facebook_wall_posts"."token" = '984c44e75975b224b38197cf8f8fc76a' LIMIT 1

query plan: http://explain.depesz.com/s/wbm

The # of rows in facebook_wall_posts is 5841. Structure of facebook_wall_posts:

                              Table "public.facebook_wall_posts"
       Column       |            Type             |                            Modifiers
--------------------+-----------------------------+------------------------------------------------------------------
 id                 | integer                     | not null default nextval('facebook_wall_posts_id_seq'::regclass)
 album_id           | integer                     | not null
 token              | character varying(32)       | not null
 fb_recipient_id    | character varying(64)       | not null
 post_id            | character varying(100)      | not null
 fb_post_created_at | timestamp without time zone |
 data               | text                        |
 created_at         | timestamp without time zone |
 updated_at         | timestamp without time zone |
 fb_post_deleted_at | timestamp without time zone |
Indexes:
    "facebook_wall_posts_pkey" PRIMARY KEY, btree (id)
    "index_facebook_wall_posts_on_token" UNIQUE, btree (token)
    "index_facebook_wall_posts_on_album_id" btree (album_id)

And another slow query:

2011-11-14 17:52:44 UTC pid:14912 (58/0-0) LOG: duration: 979.740 ms statement: SELECT "facebook_friends"."friend_id" FROM "facebook_friends" WHERE "facebook_friends"."user_id" = 9134671

Query plan: http://explain.depesz.com/s/x1q

The # of rows in facebook_friends is 27075088. Structure of facebook_friends:

                      Table "public.facebook_friends"
   Column   |            Type             |                           Modifiers
------------+-----------------------------+----------------------------------------------------------------
 id         | integer                     | not null default nextval('facebook_friends_id_seq'::regclass)
 user_id    | integer                     | not null
 friend_id  | integer                     | not null
 created_at | timestamp without time zone |
Indexes:
    "facebook_friends_pkey" PRIMARY KEY, btree (id)
    "index_facebook_friends_on_user_id_and_friend_id" UNIQUE, btree (user_id, friend_id)

We have autovacuum enabled and running, but yesterday I manually ran VACUUM on the database.

Autovacuum settings:

autovacuum                      | on
autovacuum_analyze_scale_factor | 0.1
autovacuum_analyze_threshold    | 50
autovacuum_freeze_max_age       | 200000000
autovacuum_max_workers          | 3
autovacuum_naptime              | 60
autovacuum_vacuum_cost_delay    | 20
autovacuum_vacuum_cost_limit    | -1
autovacuum_vacuum_scale_factor  | 0.2
autovacuum_vacuum_threshold     | 50

Other postgresql.conf settings:

shared_buffers = 3584MB
wal_buffers = 16MB
checkpoint_segments = 32
max_wal_senders = 10
checkpoint_completion_target = 0.9
wal_keep_segments = 1024
maintenance_work_mem = 256MB
work_mem = 88MB
shared_buffers = 3584MB
effective_cache_size = 10GB

The PGDATA dir is a RAID10 on 4 local ("ephemeral" in EC2 speak) drives.
I ran some dd tests and received the following output:

--- WRITING ---
root@sql03:/data# time sh -c "dd if=/dev/zero of=/data/tmp/bigfile bs=8k count=4000000 && sync"
4000000+0 records in
4000000+0 records out
32768000000 bytes (33 GB) copied, 670.663 s, 48.9 MB/s

real    11m52.199s
user    0m2.720s
sys     0m45.330s

--- READING ---
root@sql03:/data# time dd of=/dev/zero if=/data/tmp/bigfile bs=8k
4000000+0 records in
4000000+0 records out
32768000000 bytes (33 GB) copied, 155.429 s, 211 MB/s

real    2m35.434s
user    0m2.400s
sys     0m33.160s

I have enabled log_checkpoints and here is a recent sample from the log:

2011-11-14 17:38:48 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 15121 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 8 recycled; write=270.101 s, sync=2.989 s, total=273.112 s; sync files=60, longest=1.484 s, average=0.049 s
2011-11-14 17:39:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:43:49 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync files=82, longest=2.943 s, average=0.050 s
2011-11-14 17:44:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:48:47 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 14159 buffers (3.1%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=269.818 s, sync=2.119 s, total=271.948 s; sync files=71, longest=1.192 s, average=0.029 s
2011-11-14 17:49:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:53:47 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 11337 buffers (2.5%); 0 transaction log file(s) added, 0 removed, 7 recycled; write=269.901 s, sync=2.508 s, total=272.419 s; sync files=71, longest=1.867 s, average=0.035 s
2011-11-14 17:54:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:58:48 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 15706 buffers (3.4%); 0 transaction log file(s) added, 0 removed, 7 recycled; write=270.104 s, sync=3.612 s, total=273.727 s; sync files=67, longest=3.051 s, average=0.053 s
2011-11-14 17:59:15 UTC pid:3965 (-0) LOG: checkpoint starting: time

I've been collecting random samples from pg_stat_bgwriter:

https://gist.github.com/4faec2ca9a79ede281e1

So given all this information (if you need more just let me know), is there something fundamentally wrong or mis-configured? Do I have an I/O issue?

Thanks for any insight.
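For the record, one way to see whether the slow token lookup above is actually waiting on disk reads is to re-run it with buffer accounting. A sketch only - the database name is made up:

# hypothetical database name; the token value is taken from the log line above
psql -d app_production -c "
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM facebook_wall_posts
WHERE token = '984c44e75975b224b38197cf8f8fc76a'
LIMIT 1;"
# 'shared read=N' in the output means N blocks had to come from the OS / disk
# rather than from shared_buffers.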
On 14 November 2011, 19:16, Cody Caughlan wrote:
> shared_buffers = 3584MB
> wal_buffers = 16MB
> checkpoint_segments = 32
> max_wal_senders = 10
> checkpoint_completion_target = 0.9
> wal_keep_segments = 1024
> maintenance_work_mem = 256MB
> work_mem = 88MB
> shared_buffers = 3584MB
> effective_cache_size = 10GB

Seems reasonable, although I'd bump up the checkpoint_timeout (the 5m default is usually too low).

> The PGDATA dir is a RAID10 on 4 local ("ephemeral" in EC2 speak) drives. I ran some dd tests and received the following output:
>
> --- WRITING ---
> root@sql03:/data# time sh -c "dd if=/dev/zero of=/data/tmp/bigfile bs=8k count=4000000 && sync"
> 4000000+0 records in
> 4000000+0 records out
> 32768000000 bytes (33 GB) copied, 670.663 s, 48.9 MB/s
>
> real    11m52.199s
> user    0m2.720s
> sys     0m45.330s

This measures sequential write performance (and the same holds for the read test). We need to know the random I/O performance too - use bonnie++ or a similar tool.

Based on the AWS benchmarks I've seen so far, I'd expect about 90 MB/s for sequential reads/writes, and about twice that performance for a 4-drive RAID10. So while the reads (211 MB/s) seem perfectly OK, the writes (50MB/s) are rather slow. Have you measured this on an idle system, or while the db was running?

See for example these:

[1] http://victortrac.com/EC2_Ephemeral_Disks_vs_EBS_Volumes
[2] http://www.gabrielweinberg.com/blog/2011/05/raid0-ephemeral-storage-on-aws-ec2.html

> I have enabled log_checkpoints and here is a recent sample from the log:
> 2011-11-14 17:39:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
> 2011-11-14 17:43:49 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync files=82, longest=2.943 s, average=0.050 s

Nothing special here - this just says that the checkpoints were timed and finished on time (the default checkpoint_timeout is 5 minutes; with completion target 0.9 the expected write phase is about 270s). Probably not a checkpoint issue.

> I've been collecting random samples from pg_stat_bgwriter:
> https://gist.github.com/4faec2ca9a79ede281e1

It's a bit difficult to interpret this (please collect the data at regular intervals - e.g. every hour - and post the differences), but it seems reasonable.

> So given all this information (if you need more just let me know), is there something fundamentally wrong or mis-configured? Do I have an I/O issue?

Probably - the discrepancy between read and write performance is a bit suspicious. Try to watch the I/O performance when this happens, i.e. run "iostat -x" and watch the output (especially %util, r_await, w_await) and post several lines of it.

Tomas
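If a concrete invocation helps, something along these lines exercises random seeks as well as sequential throughput. A sketch - the directory and user are assumptions, and the test file size should be roughly twice RAM so the page cache can't absorb it:

# run as root on the RAID10 mount point, preferably while the DB is idle
bonnie++ -d /data/tmp -s 32000 -n 0 -u postgres
# -s 32000: ~32GB test file (about 2x RAM); -n 0: skip the small-file creation tests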
Thanks for your response. Please see below for answers to your questions.

On Mon, Nov 14, 2011 at 11:22 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> shared_buffers = 3584MB
>> wal_buffers = 16MB
>> checkpoint_segments = 32
>> max_wal_senders = 10
>> checkpoint_completion_target = 0.9
>> wal_keep_segments = 1024
>> maintenance_work_mem = 256MB
>> work_mem = 88MB
>> shared_buffers = 3584MB
>> effective_cache_size = 10GB
>
> Seems reasonable, although I'd bump up the checkpoint_timeout (the 5m default is usually too low).

Ok, will do.

>> The PGDATA dir is a RAID10 on 4 local ("ephemeral" in EC2 speak) drives. I ran some dd tests and received the following output:
>>
>> --- WRITING ---
>> root@sql03:/data# time sh -c "dd if=/dev/zero of=/data/tmp/bigfile bs=8k count=4000000 && sync"
>> 4000000+0 records in
>> 4000000+0 records out
>> 32768000000 bytes (33 GB) copied, 670.663 s, 48.9 MB/s
>>
>> real    11m52.199s
>> user    0m2.720s
>> sys     0m45.330s
>
> This measures sequential write performance (and the same holds for the read test). We need to know the random I/O performance too - use bonnie++ or a similar tool.
>
> Based on the AWS benchmarks I've seen so far, I'd expect about 90 MB/s for sequential reads/writes, and about twice that performance for a 4-drive RAID10. So while the reads (211 MB/s) seem perfectly OK, the writes (50MB/s) are rather slow. Have you measured this on an idle system, or while the db was running?

I ran bonnie++ on a slave node, doing active streaming replication but otherwise idle:
http://batch-files-test.s3.amazonaws.com/sql03.prod.html

bonnie++ on the master node:
http://batch-files-test.s3.amazonaws.com/sql01.prod.html

If I am reading this right (this is my first time using it), the numbers don't look too good.

> See for example these:
>
> [1] http://victortrac.com/EC2_Ephemeral_Disks_vs_EBS_Volumes
> [2] http://www.gabrielweinberg.com/blog/2011/05/raid0-ephemeral-storage-on-aws-ec2.html
>
>> I have enabled log_checkpoints and here is a recent sample from the log:
>> 2011-11-14 17:39:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
>> 2011-11-14 17:43:49 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync files=82, longest=2.943 s, average=0.050 s
>
> Nothing special here - this just says that the checkpoints were timed and finished on time (the default checkpoint_timeout is 5 minutes; with completion target 0.9 the expected write phase is about 270s). Probably not a checkpoint issue.
>
>> I've been collecting random samples from pg_stat_bgwriter:
>> https://gist.github.com/4faec2ca9a79ede281e1
>
> It's a bit difficult to interpret this (please collect the data at regular intervals - e.g. every hour - and post the differences), but it seems reasonable.

Ok, I have a cron running every hour to grab this data. I will post back in a few hours or tomorrow.

>> So given all this information (if you need more just let me know), is there something fundamentally wrong or mis-configured? Do I have an I/O issue?
>
> Probably - the discrepancy between read and write performance is a bit suspicious.
>
> Try to watch the I/O performance when this happens, i.e. run "iostat -x" and watch the output (especially %util, r_await, w_await) and post several lines of it.
Here's a gist of running "iostat -x 3" for a few minutes:

https://gist.github.com/f94d98f2ef498a522ac2

Indeed, the %iowait and await values can spike up drastically.
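(The hourly pg_stat_bgwriter collection mentioned above is nothing fancy - roughly a cron entry like the following sketch, with the output path made up:)

# /etc/cron.d/pg-stat-bgwriter - append a timestamped snapshot every hour,
# so the per-hour differences can be computed later
0 * * * * postgres psql -At -c "SELECT now(), * FROM pg_stat_bgwriter" >> /var/log/pg_stat_bgwriter.log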
On 14 November 2011, 22:58, Cody Caughlan wrote:
>> Seems reasonable, although I'd bump up the checkpoint_timeout (the 5m default is usually too low).
>
> Ok, will do.

Yes, but find out what that means and think about the possible impact first. It usually improves the checkpoint behaviour, but it increases the recovery time and you may need more checkpoint segments. And I doubt this will fix the issue you've described.

> I ran bonnie++ on a slave node, doing active streaming replication but otherwise idle:
> http://batch-files-test.s3.amazonaws.com/sql03.prod.html
>
> bonnie++ on the master node:
> http://batch-files-test.s3.amazonaws.com/sql01.prod.html
>
> If I am reading this right (this is my first time using it), the numbers don't look too good.

Are those instances equal, i.e. do they use the same RAID10 config etc.? It surprises me a bit that the slave performs much better than the master - for example the sequential reads are much faster (210MB/s vs. 60MB/s) and it handles about twice the number of seeks (345 vs. 170). But this may be skewed by the workload.

> Here's a gist of running "iostat -x 3" for a few minutes:
>
> https://gist.github.com/f94d98f2ef498a522ac2
>
> Indeed, the %iowait and await values can spike up drastically.

OK, so xvdb-xvde are the individual drives and dm-0 is the RAID10 device, right? According to the log_checkpoints info, you're writing about 15000 buffers (120MB) in 270s, i.e. about 440kB/s. But according to the iostat output you're writing up to 4MB/s, so it's not just the checkpoints. What else is going on there? How much WAL do you write?

Do you have iotop installed? That might give you a hint what processes are writing data etc. I'm a bit confused that the w/s don't add up while the r/s do.

Tomas
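One rough way to answer the "how much WAL" question on 9.1 (which has no pg_xlog_location_diff() yet) is to sample pg_current_xlog_location() twice. A sketch - the 4GB-per-logid arithmetic is a slight approximation on 9.1, but close enough for a rate estimate:

#!/bin/bash
# Estimate WAL volume written over one minute (approximate).
loc() { psql -At -c "SELECT pg_current_xlog_location()"; }
to_bytes() {                 # convert an 'X/Y' xlog location to an approximate byte offset
  local hi=${1%%/*} lo=${1##*/}
  echo $(( 16#$hi * 4294967296 + 16#$lo ))
}
a=$(loc); sleep 60; b=$(loc)
echo "~$(( ( $(to_bytes "$b") - $(to_bytes "$a") ) / 1048576 )) MB of WAL per minute"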
On Mon, Nov 14, 2011 at 2:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 14 November 2011, 22:58, Cody Caughlan wrote:
>>> Seems reasonable, although I'd bump up the checkpoint_timeout (the 5m default is usually too low).
>>
>> Ok, will do.
>
> Yes, but find out what that means and think about the possible impact first. It usually improves the checkpoint behaviour, but it increases the recovery time and you may need more checkpoint segments. And I doubt this will fix the issue you've described.

Ok, I understand the ramifications of increasing the checkpoint timeout. I will investigate more before I jump in.

>> I ran bonnie++ on a slave node, doing active streaming replication but otherwise idle:
>> http://batch-files-test.s3.amazonaws.com/sql03.prod.html
>>
>> bonnie++ on the master node:
>> http://batch-files-test.s3.amazonaws.com/sql01.prod.html
>>
>> If I am reading this right (this is my first time using it), the numbers don't look too good.
>
> Are those instances equal, i.e. do they use the same RAID10 config etc.? It surprises me a bit that the slave performs much better than the master - for example the sequential reads are much faster (210MB/s vs. 60MB/s) and it handles about twice the number of seeks (345 vs. 170). But this may be skewed by the workload.

Yes, these two machines are the same. sql01 is the master node and is quite busy; running bonnie++ on it during its normal workload spiked I/O for the duration. sql03 is a pure slave and is quite idle, save for receiving WAL segments.

>> Here's a gist of running "iostat -x 3" for a few minutes:
>>
>> https://gist.github.com/f94d98f2ef498a522ac2
>>
>> Indeed, the %iowait and await values can spike up drastically.
>
> OK, so xvdb-xvde are the individual drives and dm-0 is the RAID10 device, right? According to the log_checkpoints info, you're writing about 15000 buffers (120MB) in 270s, i.e. about 440kB/s. But according to the iostat output you're writing up to 4MB/s, so it's not just the checkpoints.
>
> What else is going on there? How much WAL do you write?

Yes, dm-0 is the RAID10 device. The WAL config is:

wal_buffers = 16MB
checkpoint_segments = 32
max_wal_senders = 10
checkpoint_completion_target = 0.9
checkpoint_timeout = 300
wal_keep_segments = 1024

> Do you have iotop installed? That might give you a hint what processes are writing data etc.

I do have iotop and have been watching it. The only I/O users are postgres and its backends; I don't see anything else consuming any I/O. By eyeballing iotop, the big consumers of disk writes are:

idle in transaction, SELECT, COMMIT

The first two are what I would think would be largely read operations (certainly the SELECT), so it's not clear why a SELECT consumes write time.

Here is the output of some pg_stat_bgwriter stats from the last couple of hours:

https://gist.github.com/41ee26caca01471a9b77

One thing that I might not have made very clear earlier is that this DB, and especially a single table, receives a very large number of UPDATEs. However, it seems to be well cached; I have shared_buffers = 3584MB and a view of pg_buffercache shows:

https://gist.github.com/53c520571290cae14613

Is it possible that we're just trying to handle too many UPDATEs and they are all trying to hit disk all at once - causing this I/O contention?

Here is a view of pg_stat_user_tables that shows the amount of live/dead tuples:

https://gist.github.com/5ac1ae7d11facd72913f
On 15.11.2011 01:13, Cody Caughlan wrote:
> The first two are what I would think would be largely read operations (certainly the SELECT), so it's not clear why a SELECT consumes write time.
>
> Here is the output of some pg_stat_bgwriter stats from the last couple of hours:
>
> https://gist.github.com/41ee26caca01471a9b77

Hmm, the difference between 22:00 and 23:00 seems a bit suspicious. There are 11 checkpoints (timed + requested), i.e. 352 segments, i.e. about 5632 MB of WAL data. The checkpoints had to write 504,135 buffers (i.e. about 4 GB of data) and the background writer an additional 10,494 buffers (about 100MB). But the backends themselves had to write 173,208 buffers. That's way too much, I guess, and it's probably one of the reasons why the queries take so long.

So your primary goal probably should be to write fewer buffers from the backends. Spread checkpoints are the most effective way, and the background writer is fine. Try to eliminate the requested checkpoints (increase the number of checkpoint segments), and to eliminate the buffers written by backends. Increase shared_buffers and watch whether that helps. If it does not, make the background writer a bit more aggressive: increase bgwriter_lru_maxpages / decrease bgwriter_delay, which should write the buffers more aggressively.

But if there was something extraordinary happening between 22:00 and 23:00 (e.g. a backup, a batch job), this may be complete nonsense.

> One thing that I might not have made very clear earlier is that this DB, and especially a single table, receives a very large number of UPDATEs. However, it seems to be well cached; I have shared_buffers = 3584MB and a view of pg_buffercache shows:
>
> https://gist.github.com/53c520571290cae14613

That means almost 100% of the buffers are used. But this is not a very interesting view - a buffer may be occupied and yet not used. Can you group the buffers by "isdirty" so we can see what portion of the buffers was modified (and needs to be written)?

Much more interesting is the view from the other side - how many requests were handled from shared buffers. For a database you can do that like this:

select datname, 100 * blks_hit / (blks_hit + blks_read + 1)
  from pg_stat_database;

The "+1" is there just to avoid division by zero, and you should evaluate this as the difference of two snapshots. You can compute the same thing (cache hit ratio) for tables/indexes. The problem here is that there's a separate cache (the OS page cache) that's not included here, so a read may be a "miss" and yet not require an actual I/O.

> Is it possible that we're just trying to handle too many UPDATEs and they are all trying to hit disk all at once - causing this I/O contention?

Not sure what you mean by "trying to hit disk all at once"? The updates are written to the WAL (which is mostly sequential I/O) and the actual pages are updated in memory (shared buffers). Those shared buffers are written out when a checkpoint happens, and that is a single write of the whole block, not many small writes. Sure, the clients need to grab a lock on a buffer when modifying it, and a lock on the WAL, but that wouldn't show up as I/O utilization. In short - I don't think this is what's happening here.

What actually matters here is that the dirty buffers are spread across the drive - that's where the random I/O comes from - and the fact that the backends need to flush dirty buffers on their own.
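Concretely, those two checks could look roughly like this (a sketch; it assumes superuser access via psql and the pg_buffercache view that is already in use):

# how much of shared_buffers is actually dirty (i.e. waiting to be written)?
psql -c "SELECT isdirty, count(*) AS buffers,
                pg_size_pretty(count(*) * 8192) AS size
           FROM pg_buffercache GROUP BY isdirty;"

# per-table cache hit ratio as PostgreSQL sees it
# (hits from the OS page cache still count as 'read' here)
psql -c "SELECT relname, heap_blks_read, heap_blks_hit,
                round(100.0 * heap_blks_hit / (heap_blks_hit + heap_blks_read + 1), 1) AS hit_pct
           FROM pg_statio_user_tables
          ORDER BY heap_blks_read DESC LIMIT 10;"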
> Here is a view of pg_stat_user_tables that shows the amount of live/dead tuples:
>
> https://gist.github.com/5ac1ae7d11facd72913f

What could happen due to the UPDATEs is table bloat (the table growing because of updates), but I don't think that's happening here. The number of dead tuples is very low - less than 1% in most cases. For example the largest table, "users", has less than 0.5% dead tuples, and most of the updates are handled by HOT. So I don't think this is an issue here.

kind regards
Tomas
On 14.11.2011 22:58, Cody Caughlan wrote:
> I ran bonnie++ on a slave node, doing active streaming replication but otherwise idle:
> http://batch-files-test.s3.amazonaws.com/sql03.prod.html
>
> bonnie++ on the master node:
> http://batch-files-test.s3.amazonaws.com/sql01.prod.html
>
> If I am reading this right (this is my first time using it), the numbers don't look too good.

I've done some benchmarks of my own (m1.xlarge instance), and the results are these (http://pastebin.com/T1LXHru0):

single drive
------------
dd writes:            62 MB/s
dd reads:            110 MB/s
bonnie seq. writes:   55 MB/s
bonnie seq. rewrite:  33 MB/s
bonnie seq. reads:    91 MB/s
bonnie seeks:        370/s

raid 0 (4 devices)
------------------
dd writes:           220 MB/s
dd reads:            380 MB/s
bonnie seq. writes:  130 MB/s
bonnie seq. rewrite: 114 MB/s
bonnie seq. reads:   280 MB/s
bonnie seeks:        570/s

raid 10 (4 devices)
-------------------
dd writes:            90 MB/s
dd reads:            200 MB/s
bonnie seq. writes:   49 MB/s
bonnie seq. rewrite:  56 MB/s
bonnie seq. reads:   160 MB/s
bonnie seeks:        590/s

So the results are rather different from yours (both master and slave). What surprises me a bit is the decrease in write performance between a single drive and RAID 10. I used bonnie++ 1.03e, so I'm wondering whether 1.96 would give different results ... All those benchmarks were performed with ext3.

Tomas
On Tue, Nov 15, 2011 at 5:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 14.11.2011 22:58, Cody Caughlan wrote:
>> I ran bonnie++ on a slave node, doing active streaming replication but otherwise idle:
>> http://batch-files-test.s3.amazonaws.com/sql03.prod.html
>>
>> bonnie++ on the master node:
>> http://batch-files-test.s3.amazonaws.com/sql01.prod.html
>>
>> If I am reading this right (this is my first time using it), the numbers don't look too good.
>
> I've done some benchmarks of my own (m1.xlarge instance), and the results are these (http://pastebin.com/T1LXHru0):
>
> single drive
> ------------
> dd writes:            62 MB/s
> dd reads:            110 MB/s
> bonnie seq. writes:   55 MB/s
> bonnie seq. rewrite:  33 MB/s
> bonnie seq. reads:    91 MB/s
> bonnie seeks:        370/s
>
> raid 0 (4 devices)
> ------------------
> dd writes:           220 MB/s
> dd reads:            380 MB/s
> bonnie seq. writes:  130 MB/s
> bonnie seq. rewrite: 114 MB/s
> bonnie seq. reads:   280 MB/s
> bonnie seeks:        570/s
>
> raid 10 (4 devices)
> -------------------
> dd writes:            90 MB/s
> dd reads:            200 MB/s
> bonnie seq. writes:   49 MB/s
> bonnie seq. rewrite:  56 MB/s
> bonnie seq. reads:   160 MB/s
> bonnie seeks:        590/s

Interesting. I spun up a new m1.xlarge and did the same RAID10 config (4 drives), except with a chunk size of 512K (instead of 256K), and the machine was completely idle. Bonnie:

http://batch-files-test.s3.amazonaws.com/idle-512k-chunk.html

Which has similar-ish performance to yours, except for worse seeks but a bit better seq. reads.

The other bonnies I sent over were NOT on idle systems. This one is the master, which receives a heavy stream of writes and some reads:

http://batch-files-test.s3.amazonaws.com/sql01.prod.html

And this is the slave, which is all writes and no reads:

http://batch-files-test.s3.amazonaws.com/sql03.prod.html

How did you build your RAID array? Maybe I have a fundamental flaw / misconfiguration. I am doing it via:

$ yes | mdadm --create /dev/md0 --level=10 -c256 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
$ pvcreate /dev/md0
$ vgcreate lvm-raid10 /dev/md0
$ lvcreate -l 215021 lvm-raid10 -n lvm0
$ blockdev --setra 65536 /dev/lvm-raid10/lvm0
$ mkfs.xfs -f /dev/lvm-raid10/lvm0
$ mkdir -p /data && mount -t xfs -o noatime /dev/lvm-raid10/lvm0 /data
On 16 November 2011, 2:21, Cody Caughlan wrote:
> How did you build your RAID array? Maybe I have a fundamental flaw / misconfiguration. I am doing it via:
>
> $ yes | mdadm --create /dev/md0 --level=10 -c256 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
> $ pvcreate /dev/md0
> $ vgcreate lvm-raid10 /dev/md0
> $ lvcreate -l 215021 lvm-raid10 -n lvm0
> $ blockdev --setra 65536 /dev/lvm-raid10/lvm0
> $ mkfs.xfs -f /dev/lvm-raid10/lvm0
> $ mkdir -p /data && mount -t xfs -o noatime /dev/lvm-raid10/lvm0 /data

I don't think you have a flaw there. The workload probably skews the results a bit on the master and slave, so it's difficult to compare them to results from an idle instance. The amount of data written seems small, but random I/O can saturate the devices quite easily.

I went with a very simple RAID config - no LVM, default stripe size (better seeks, worse sequential performance), default read-ahead (which could give better seq. performance).

Tomas
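For comparison, that plain setup would be roughly the following - a sketch reconstructed from the description above, not the exact commands that were run:

# plain md RAID10, default chunk size, no LVM, ext3 (as in the pastebin numbers)
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
mkfs.ext3 /dev/md0
mkdir -p /data && mount -o noatime /dev/md0 /data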
On 11/14/2011 01:16 PM, Cody Caughlan wrote:
> We're starting to see some slow queries, especially COMMITs, and they are happening more frequently. The slow queries are against seemingly well-indexed tables.
>
> Slow commits like:
>
> 2011-11-14 17:47:11 UTC pid:14366 (44/0-0) LOG: duration: 3062.784 ms statement: COMMIT
> 2011-11-14 17:47:11 UTC pid:14604 (48/0-0) LOG: duration: 2593.351 ms statement: COMMIT
>
> These slow COMMITs are against tables that receive a large number of UPDATEs and are growing fairly rapidly.

Linux will fill its write cache with all of the writes coming out of each checkpoint. With a 16GB instance, I would expect that 5% * 16GB ~= 800MB of writes are batched up when your system is slow. You should be able to confirm that by looking at the "Dirty:" line in /proc/meminfo.

With 800MB queued up and I/O that is lucky to get 50MB/s, the sync calls at the end of each checkpoint are sometimes blocking for multiple seconds:

> 2011-11-14 17:38:48 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 15121 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 8 recycled; write=270.101 s, sync=2.989 s, total=273.112 s; sync files=60, longest=1.484 s, average=0.049 s
> 2011-11-14 17:39:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
> 2011-11-14 17:43:49 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync files=82, longest=2.943 s, average=0.050 s

When an individual sync call gets stuck for that long, clients can easily get stuck behind it too. There are a couple of techniques that might help:

- Switch filesystems if you're running a slow one. ext3 has bad latency behavior here; XFS and ext4 are better.
- Lower the dirty_* tunables like dirty_background_ratio or its bytes-based variant. This will reduce average throughput, but can lower latency.
- Spread checkpoints out more so that fewer writes are happening on average.
- Decrease shared_buffers so less data is getting pushed out at checkpoint time.
- Reduce your reliability expectations and turn off synchronous_commit.

Your server is sometimes showing multi-second latency issues with bonnie++ too, which suggests this problem is not even specific to PostgreSQL. Linux is hard to tune for low latency under all circumstances; fighting latency down under a heavy update workload is hard to do even with good hardware to accelerate write performance. In an EC2 environment, it may not even be possible without making trade-offs like disabling synchronous writes. I can easily get transactions hung for 10 to 15 seconds on one of their servers if I try to make that problem bad; you're only seeing the middle range of latency issues so far.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
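To put numbers on the first two Linux-side suggestions, checking the writeback backlog and capping it could look roughly like this (a sketch; the 256MB value is purely illustrative and it assumes a kernel new enough to have the *_bytes variants of the dirty tunables):

# how much dirty data is Linux currently holding in its write cache?
grep -E '^(Dirty|Writeback):' /proc/meminfo

# cap the background writeback threshold in bytes instead of a percentage of RAM
# (takes effect immediately; persist via /etc/sysctl.conf if it helps)
sysctl -w vm.dirty_background_bytes=268435456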
On 16 November 2011, 5:27, Greg Smith wrote:
> On 11/14/2011 01:16 PM, Cody Caughlan wrote:
>> We're starting to see some slow queries, especially COMMITs, and they are happening more frequently. The slow queries are against seemingly well-indexed tables.
>>
>> Slow commits like:
>>
>> 2011-11-14 17:47:11 UTC pid:14366 (44/0-0) LOG: duration: 3062.784 ms statement: COMMIT
>> 2011-11-14 17:47:11 UTC pid:14604 (48/0-0) LOG: duration: 2593.351 ms statement: COMMIT
>>
>> These slow COMMITs are against tables that receive a large number of UPDATEs and are growing fairly rapidly.
>
> Linux will fill its write cache with all of the writes coming out of each checkpoint. With a 16GB instance, I would expect that 5% * 16GB ~= 800MB of writes are batched up when your system is slow. You should be able to confirm that by looking at the "Dirty:" line in /proc/meminfo.
>
> With 800MB queued up and I/O that is lucky to get 50MB/s, the sync calls at the end of each checkpoint are sometimes blocking for multiple seconds:
>
>> 2011-11-14 17:38:48 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 15121 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 8 recycled; write=270.101 s, sync=2.989 s, total=273.112 s; sync files=60, longest=1.484 s, average=0.049 s
>> 2011-11-14 17:39:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
>> 2011-11-14 17:43:49 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync files=82, longest=2.943 s, average=0.050 s
>
> When an individual sync call gets stuck for that long, clients can easily get stuck behind it too. There are a couple of techniques that might help:

The sync times I see there seem quite acceptable - 4.2s is not perfect, but I wouldn't rate it as terrible. What actually annoys me is the amount of data written: it's just around 17000 pages, i.e. about 130 MB per checkpoint, spread over 5 minutes. That's only about 0.5 MB/s.

> - Switch filesystems if you're running a slow one. ext3 has bad latency behavior here; XFS and ext4 are better.

He's using xfs, IIRC. That's one of the better behaved ones when it comes to sync.

> - Lower the dirty_* tunables like dirty_background_ratio or its bytes-based variant. This will reduce average throughput, but can lower latency.
> - Spread checkpoints out more so that fewer writes are happening on average.
> - Decrease shared_buffers so less data is getting pushed out at checkpoint time.
> - Reduce your reliability expectations and turn off synchronous_commit.

The question here probably is whether those high latencies are caused (or significantly influenced) by the checkpoints, or whether they are a "feature" of the storage - because if they are a feature, all this is a futile attempt to fix it.

I don't think he has a problem with checkpoints - he's complaining about regular queries being slow, even plain SELECTs, i.e. something that usually does not require a sync. No doubt this may be connected, but a regular SELECT usually does not perform a sync, right? It may need to fetch some data, and if the I/O is saturated by a checkpoint this may take time. But again - those bonnie results were collected on a running system, i.e. with checkpoints in progress and all of that.
And I'd expect most of the SELECT queries to be handled without actually touching the devices: combining https://gist.github.com/5ac1ae7d11facd72913f and https://gist.github.com/53c520571290cae14613, it seems that the largest table (users) is almost completely in shared buffers, while the two other large tables (external_user and facebook_friends) are cached at about 30%. I'd expect the rest of those tables to be in the page cache, so SELECT queries on them should be fast.

A commit obviously requires a sync on the WAL - I wonder if moving the WAL elsewhere would improve the performance here.

This is obviously based on an incomplete set of stats, and maybe I'm missing something.

> Your server is sometimes showing multi-second latency issues with bonnie++ too, which suggests this problem is not even specific to PostgreSQL. Linux is hard to tune for low latency under all

Don't forget those data were collected on a production system, i.e. it was actually under load. That probably skews the results a lot.

> circumstances; fighting latency down under a heavy update workload is hard to do even with good hardware to accelerate write performance. In an EC2 environment, it may not even be possible without making trade-offs like disabling synchronous writes. I can easily get transactions hung for 10 to 15 seconds on one of their servers if I try to make that problem bad; you're only seeing the middle range of latency issues so far.

Are you talking about EBS or ephemeral storage? Because all of this is about ephemeral (something like "virtualized local" storage).

Tomas
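If anyone wants to try the WAL-relocation idea, the usual approach is a symlink. A sketch only - the device name, mount point and data directory path are all made up, and it needs a brief shutdown:

pg_ctl -D /data/pgdata stop -m fast
mkfs.xfs /dev/xvdf && mkdir -p /wal && mount -o noatime /dev/xvdf /wal
mv /data/pgdata/pg_xlog /wal/
chown -R postgres:postgres /wal/pg_xlog
ln -s /wal/pg_xlog /data/pgdata/pg_xlog
pg_ctl -D /data/pgdata start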
On 16 November 2011, 2:21, Cody Caughlan wrote:
> How did you build your RAID array? Maybe I have a fundamental flaw / misconfiguration. I am doing it via:
>
> $ yes | mdadm --create /dev/md0 --level=10 -c256 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
> $ pvcreate /dev/md0
> $ vgcreate lvm-raid10 /dev/md0
> $ lvcreate -l 215021 lvm-raid10 -n lvm0
> $ blockdev --setra 65536 /dev/lvm-raid10/lvm0
> $ mkfs.xfs -f /dev/lvm-raid10/lvm0
> $ mkdir -p /data && mount -t xfs -o noatime /dev/lvm-raid10/lvm0 /data

I'm not using EC2 much, and those were my first attempts with ephemeral storage, so this may be a stupid question, but why are you building a RAID-10 array on ephemeral storage anyway?

You already have a standby, so if the primary instance fails you can easily fail over.

What are you going to do in case of a drive failure? With a physical server this is rather easy - just put in a new drive and you're done - but can you do that on EC2? I guess you can't while the instance is running, so you'll have to switch to the standby anyway, right? Have you ever tried this (how it affects performance etc.)?

So what additional protection does that give you? Wouldn't RAID-0 be a better use of the resources?

Tomas
On Nov 16, 2011, at 8:52 AM, Tomas Vondra wrote:

> On 16 November 2011, 2:21, Cody Caughlan wrote:
>> How did you build your RAID array? Maybe I have a fundamental flaw / misconfiguration. I am doing it via:
>>
>> $ yes | mdadm --create /dev/md0 --level=10 -c256 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
>> $ pvcreate /dev/md0
>> $ vgcreate lvm-raid10 /dev/md0
>> $ lvcreate -l 215021 lvm-raid10 -n lvm0
>> $ blockdev --setra 65536 /dev/lvm-raid10/lvm0
>> $ mkfs.xfs -f /dev/lvm-raid10/lvm0
>> $ mkdir -p /data && mount -t xfs -o noatime /dev/lvm-raid10/lvm0 /data
>
> I'm not using EC2 much, and those were my first attempts with ephemeral storage, so this may be a stupid question, but why are you building a RAID-10 array on ephemeral storage anyway?
>
> You already have a standby, so if the primary instance fails you can easily fail over.

Yes, the slave will become master if the master goes down. We have no plan to try and resurrect the master in the case of failure, hence the choice of ephemeral vs EBS.

We chose RAID10 over RAID0 to get the best combination of performance while minimizing the probability of a single drive failure bringing down the house. So, yes, RAID0 would ultimately deliver the best performance, with more risk.

> What are you going to do in case of a drive failure? With a physical server this is rather easy - just put in a new drive and you're done - but can you do that on EC2? I guess you can't while the instance is running, so you'll have to switch to the standby anyway, right? Have you ever tried this (how it affects performance etc.)?

As far as I know one cannot alter the ephemeral drives in a running instance, so yes, the whole instance would have to be written off.

> So what additional protection does that give you? Wouldn't RAID-0 be a better use of the resources?

Too much risk.
On 16 November 2011, 18:31, Cody Caughlan wrote:
> On Nov 16, 2011, at 8:52 AM, Tomas Vondra wrote:
>> I'm not using EC2 much, and those were my first attempts with ephemeral storage, so this may be a stupid question, but why are you building a RAID-10 array on ephemeral storage anyway?
>>
>> You already have a standby, so if the primary instance fails you can easily fail over.
>
> Yes, the slave will become master if the master goes down. We have no plan to try and resurrect the master in the case of failure, hence the choice of ephemeral vs EBS.
>
> We chose RAID10 over RAID0 to get the best combination of performance while minimizing the probability of a single drive failure bringing down the house. So, yes, RAID0 would ultimately deliver the best performance, with more risk.
>
>> What are you going to do in case of a drive failure? With a physical server this is rather easy - just put in a new drive and you're done - but can you do that on EC2? I guess you can't while the instance is running, so you'll have to switch to the standby anyway, right? Have you ever tried this (how it affects performance etc.)?
>
> As far as I know one cannot alter the ephemeral drives in a running instance, so yes, the whole instance would have to be written off.
>
>> So what additional protection does that give you? Wouldn't RAID-0 be a better use of the resources?
>
> Too much risk.

Why? If I understand it correctly, the only case where RAID-10 actually helps is when an ephemeral drive fails but the instance as a whole survives. Do you have any numbers on how often that happens, i.e. how often a drive fails without taking the instance down?

But you can't actually replace the failed drive, so the only option you have is to fail over to the standby - right? Sure, with async replication you could lose the not-yet-sent transactions. I see two possible solutions:

a) use sync rep, available in 9.1 (you already run 9.1.1)

b) place the WAL on an EBS volume, mounted as part of the failover

EBS volumes are not exactly fast, but it seems (e.g. http://www.mysqlperformanceblog.com/2009/08/06/ec2ebs-single-and-raid-volumes-io-bencmark/) the sequential performance might be acceptable. According to the stats you've posted, you've written about 5632 MB of WAL data per hour. That's about 1.5 MB/s on average, which an EBS volume might handle. If you have peaks where you need to write much more data, though, this is going to be a bottleneck.

Tomas
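For reference, option a) on 9.1 boils down to a couple of settings plus a check that the standby really shows up as synchronous. A sketch - the host and application names are illustrative:

# on the master, postgresql.conf:
#   synchronous_standby_names = 'sql03'
# on the standby, recovery.conf:
#   primary_conninfo = 'host=sql01 user=replicator application_name=sql03'
# then confirm the standby is reported as "sync":
psql -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"

Note that with a synchronous standby, commits on the master wait for the standby to acknowledge the WAL, so commit latency becomes bounded by the network and the standby's I/O as well.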