Discussion: sniff test on some PG 8.4 numbers

sniff test on some PG 8.4 numbers

From: Jon Nelson
Date:
I was hoping to just get a "gut reaction" on some pgbench numbers I have, to
see if I'm in the ballpark.

OS:  ScientificLinux 6.3, x86_64
Hardware: 4x real disks (not SSD) behind an LSI 9260 in raid10, Xeon E5-2680
with hyperthreading OFF, 128GB of RAM.
Setup: postgresql 8.4.13, ext4, barriers ON, disk write cache *off*, write-
back enabled on the LSI.
I initialized with pgbench scale factors of 100, 200, and 400.
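Roughly, the initialization for those sizes would look like this (the
database name and host here are just placeholders):

pgbench -i -s 100 -h BLAH pgbench
pgbench -i -s 200 -h BLAH pgbench
pgbench -i -s 400 -h BLAH pgbench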

I've done some tuning of the postgresql config, but mostly I'm just trying to
find out if I'm in the right ballpark.

I ran pgbench from another (similar) host:

pgbench -h BLAH -c 32 -M prepared -t 100000 -S
I get 95,000 to 100,000 tps.

pgbench -h BLAH -c 32 -M prepared -t 100000
seems to hover around 6,200 tps (size 100) to 13,700 (size 400)

Do these basically sniff right?
(NOTE: with barriers off, I get a slight increase - 10% - in the
read-write test, and a larger *decrease* - 15% - with the read-only
test @ 400. No change @ 100)
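For anyone reproducing the barrier comparison: toggling barriers on ext4
is just a remount, something like the following (the mount point is
hypothetical):

mount -o remount,nobarrier /var/lib/pgsql
mount -o remount,barrier /var/lib/pgsql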

--
Jon


Re: sniff test on some PG 8.4 numbers

From: Josh Berkus
Date:
> Do these basically sniff right?

Well, the read test seems reasonable.  I'm impressed by the speed of the
write test ... how large is the raid card cache?

And why 8.4?  Can you try 9.2?

> (NOTE: with barriers off, I get a slight increase - 10% - in the
> read-write test, and a larger *decrease* - 15% - with the read-only
> test @ 400. No change @ 100)

Oh, interesting.  Can you reproduce that? I wonder what would cause
read-only to drop without barriers ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: sniff test on some PG 8.4 numbers

From: Jon Nelson
Date:
On Tue, Mar 5, 2013 at 7:02 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
>> Do these basically sniff right?
>
> Well, the read test seems reasonable.  I'm impressed by the speed of the
> write test ... how large is the raid card cache?
>
> And why 8.4?  Can you try 9.2?

8.4 because it's what I've got, basically. I might be able to try 9.2
later, but I'm targeting 8.4 right now.
512MB of memory on the card.

>> (NOTE: with barriers off, I get a slight increase - 10% - in the
>> read-write test, and a larger *decrease* - 15% - with the read-only
>> test @ 400. No change @ 100)
>
> Oh, interesting.  Can you reproduce that? I wonder what would cause
> read-only to drop without barriers ...

I'll try to test again soon.
I know that if I use writethrough instead of writeback mode the
performance nosedives.
Does anybody have suggestions for stripe size? (remember: *4* disks)

--
Jon


Re: sniff test on some PG 8.4 numbers

From: Jon Nelson
Date:
On Tue, Mar 5, 2013 at 1:35 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>
> pgbench -h BLAH -c 32 -M prepared -t 100000 -S
> I get 95,000 to 100,000 tps.
>
> pgbench -h BLAH -c 32 -M prepared -t 100000
> seems to hover around 6,200 tps (size 100) to 13,700 (size 400)

Some followup:
The read test goes (up to) 133K tps, and the read-write test to 22k
tps when performed over localhost.

--
Jon


Re: sniff test on some PG 8.4 numbers

From: Greg Smith
Date:
On 3/5/13 10:00 PM, Jon Nelson wrote:
> On Tue, Mar 5, 2013 at 1:35 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>>
>> pgbench -h BLAH -c 32 -M prepared -t 100000 -S
>> I get 95,000 to 100,000 tps.
>>
>> pgbench -h BLAH -c 32 -M prepared -t 100000
>> seems to hover around 6,200 tps (size 100) to 13,700 (size 400)
>
> Some followup:
> The read test goes (up to) 133K tps, and the read-write test to 22k
> tps when performed over localhost.

All your write numbers are inflated because the test is too short.  This
hardware will be lucky to sustain 7500 TPS on writes.  But you're only
writing 100,000 transactions, which means the entire test run isn't even
hitting the database--only the WAL writes are.  When your test run is
finished, look at /proc/meminfo.  I'd wager a large sum you'll find
"Dirty:" has hundreds of megabytes, if not gigabytes, of unwritten
information.  Basically, 100,000 writes on this sort of server can all
be cached in Linux's write cache, and pgbench won't force them out of
there.  So you're not simulating sustained database writes, only how
fast of a burst the server can handle for a little bit.
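For example, something like:

grep -E 'Dirty|Writeback' /proc/meminfo

will show how much modified data is still sitting in the page cache
waiting to be written.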

For a write test, you must run for long enough to start and complete a
checkpoint before the numbers are of any use, and 2 checkpoints are even
better.  The minimum useful length is a 10 minute run, so "-T 600"
instead of using -t.  If you want something that does every trick
possible to make it hard to cheat at this, as well as letting you graph
size and client data, try my pgbench-tools:
https://github.com/gregs1104/pgbench-tools  (Note that there is a bug in
that program right now, it spawns vmstat and iostat processes but they
don't get killed at the end correctly.  "killall vmstat iostat" after
running is a good idea until I fix that).
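In other words, something along these lines (host placeholder as in your
original invocation):

pgbench -h BLAH -c 32 -M prepared -T 600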

Your read test numbers are similarly inflated, but read test errors
aren't as large.  Around 133K TPS on select-only is probably accurate.
For a read test, use "-T 30" to let it run for 30 seconds to get a more
accurate number.  The read bottleneck on your hardware is going to
be the pgbench client itself, which on 8.4 is running as a single
thread.  On 9.0+ you can have multiple pgbench workers.  It normally
takes 4 to 8 of them to saturate a larger server.
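So a more representative read test might look like the following; note
that the -j option only exists in the 9.0+ pgbench, so drop it on 8.4:

pgbench -h BLAH -c 32 -M prepared -T 30 -S
pgbench -h BLAH -j 8 -c 32 -M prepared -T 30 -S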

I hope you're not considering deploying a new application with 8.4.
Take a look at http://www.postgresql.org/support/versioning/ and you'll
see 8.4 only has a little over a year before it won't get bug fixes
anymore.  Also, your server would really appreciate the performance
gains added to 9.2.  If that's a bit too leading edge for you, I don't
recommend deploying at a version below 9.1 anymore.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: sniff test on some PG 8.4 numbers

From: Jon Nelson
Date:
On Sun, Mar 10, 2013 at 10:46 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 3/5/13 10:00 PM, Jon Nelson wrote:
>>
>> On Tue, Mar 5, 2013 at 1:35 PM, Jon Nelson <jnelson+pgsql@jamponi.net>
>> wrote:
>>>
>>>
>>> pgbench -h BLAH -c 32 -M prepared -t 100000 -S
>>> I get 95,000 to 100,000 tps.
>>>
>>> pgbench -h BLAH -c 32 -M prepared -t 100000
>>> seems to hover around 6,200 tps (size 100) to 13,700 (size 400)
>>
>>
>> Some followup:
>> The read test goes (up to) 133K tps, and the read-write test to 22k
>> tps when performed over localhost.
>
>
> All your write numbers are inflated because the test is too short.  This
> hardware will be lucky to sustain 7500 TPS on writes.  But you're only
> writing 100,000 transactions, which means the entire test run isn't even
> hitting the database--only the WAL writes are.  When your test run is
> finished, look at /proc/meminfo  I'd wager a large sum you'll find "Dirty:"
> has hundreds of megabytes, if not gigabytes, of unwritten information.
> Basically, 100,000 writes on this sort of server can all be cached in
> Linux's write cache, and pgbench won't force them out of there.  So you're
> not simulating sustained database writes, only how fast of a burst the
> server can handle for a little bit.
>
> For a write test, you must run for long enough to start and complete a
> checkpoint before the numbers are of any use, and 2 checkpoints are even
> better.  The minimum useful length is a 10 minute run, so "-T 600" instead
> of using -t.  If you want something that does every trick possible to make
> it hard to cheat at this, as well as letting you graph size and client data,
> try my pgbench-tools: https://github.com/gregs1104/pgbench-tools  (Note that
> there is a bug in that program right now, it spawns vmstat and iostat
> processes but they don't get killed at the end correctly.  "killall vmstat
> iostat" after running is a good idea until I fix that).

I (briefly!) acquired a machine identical to the last one, but this time
with an Areca controller instead of the LSI (4 drives).

The following is with ext4, nobarrier, and noatime. As noted in the
original post, I have done a fair bit of system tuning. I have the
dirty_bytes and dirty_background_bytes set to 3GB and 2GB,
respectively.
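For the record, those correspond to the vm.dirty_bytes and
vm.dirty_background_bytes sysctls, e.g.:

sysctl -w vm.dirty_bytes=3221225472             # 3GB
sysctl -w vm.dirty_background_bytes=2147483648  # 2GB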

I built 9.2; using it and the following pgbench invocation:

pgbench  -j 8  -c 32 -M prepared -T 600

transaction type: TPC-B (sort of)
scaling factor: 400
query mode: prepared
number of clients: 32
number of threads: 8
duration: 600 s
number of transactions actually processed: 16306693
tps = 27176.566608 (including connections establishing)
tps = 27178.518841 (excluding connections establishing)

> Your read test numbers are similarly inflated, but read test errors aren't
> as large.  Around 133K TPS on select-only is probably accurate. For a read
> test, use "-T 30" to let it run for 30 seconds to get a more accurate
> number.  The read bottleneck on your hardware is going to be the
> pgbench client itself, which on 8.4 is running as a single thread.  On 9.0+
> you can have multiple pgbench workers.  It normally takes 4 to 8 of them to
> saturate a larger server.

The 'select-only' test (same as above with '-S'):

starting vacuum...end.
transaction type: SELECT only
scaling factor: 400
query mode: prepared
number of clients: 32
number of threads: 8
duration: 600 s
number of transactions actually processed: 127513307
tps = 212514.337971 (including connections establishing)
tps = 212544.392278 (excluding connections establishing)

These are the *only* changes I've made to the config file:

shared_buffers = 32GB
wal_buffers = 16MB
checkpoint_segments = 1024

I can run either or both of these again with different options, but
mostly I'm looking for a sniff test.
However, I'm a bit confused now.

It seems as though you say the write numbers are not believable,
suggesting a value of 7,500 (roughly 1/4 what I'm getting). If I run
the read test for 30 seconds I get - highly variable - between 300K
and 400K tps. Why are these tps so high compared to your expectations?
Note: I did get better results with HT on vs. with HT off, so I've
left HT on for now.



--
Jon


Re: sniff test on some PG 8.4 numbers

From: Scott Marlowe
Date:
On Sun, Mar 10, 2013 at 7:18 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> It seems as though you say the write numbers are not believable,
> suggesting a value of 7,500 (roughly 1/4 what I'm getting). If I run
> the read test for 30 seconds I get - highly variable - between 300K
> and 400K tps. Why are these tps so high compared to your expectations?
> Note: I did get better results with HT on vs. with HT off, so I've
> left HT on for now.

Go back and re-read Greg's post. He explains why he thinks you'll
sustain less.  Basically it's caching effects: the data writes are being
absorbed by the write cache, and only the pg_xlog / WAL writing is
actually hitting disk.  Once you get a feel for how fast it is, run
the test for 30 minutes to several hours to see how it goes. Then, when
you have a weekend, just leave it running for a couple of days. Still
pretty good numbers so far.
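For example (the durations here are picked arbitrarily):

pgbench -j 8 -c 32 -M prepared -T 7200     # 2 hours
pgbench -j 8 -c 32 -M prepared -T 172800   # 2 days, for the weekend run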


Re: sniff test on some PG 8.4 numbers

From: Greg Smith
Date:
On 3/10/13 9:18 PM, Jon Nelson wrote:

> The following is with ext4, nobarrier, and noatime. As noted in the
> original post, I have done a fair bit of system tuning. I have the
> dirty_bytes and dirty_background_bytes set to 3GB and 2GB,
> respectively.

That's good, but be aware those values are still essentially unlimited
write caches.  A server with 4 good but regular hard drives might do as
little as 10MB/s of random writes on a real workload.  If 2GB of data
ends up dirty, the flushing that happens at the end of a database
checkpoint will need to clear all of that out of RAM.  When that
happens, you're looking at a 3-minute-long cache flush to push out 2GB.
It's not unusual for pgbench tests to pause for over a minute straight
when that happens.  With your setup, where checkpoints happen every 5
minutes, this only happens once per test run.  The disruption isn't
easily visible if you look at the average rate; it's outweighed by the
periods where writes happen very fast because the cache isn't full yet.
You have to get pgbench to plot latency over time to see the pauses and
then analyze that data.  This problem is the main reason I put together
the pgbench-tools set for running things, because once you get to
processing the latency files and making graphs from them by hand it
starts to be a pain to look at the results.
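If you want to see the pauses without pgbench-tools, pgbench's -l switch
writes per-transaction latency log files (pgbench_log.*) that you can
graph afterwards; for example:

pgbench -j 8 -c 32 -M prepared -T 600 -l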

> I built 9.2 and using 9.2 and the following pgbench invocation:
>
> pgbench  -j 8  -c 32 -M prepared -T 600
>
> transaction type: TPC-B (sort of)
> scaling factor: 400

I misread this completely in your message before; I thought you wrote
4000.  A scaling factor of 400 is making a database that's 6GB in size.
Your test is basically seeing how fast the system memory and the RAID
cache can move things around.  In that situation, your read and write
numbers are reasonable.  They aren't actually telling you anything
useful about the disks though, because they're barely involved here.
You've sniffed the CPU, memory, and RAID controller and they smell fine.
You'll need at least an order of magnitude increase in scale to get a
whiff of the disks.

pgbench scale numbers give approximately 16MB per scale factor.  You
don't actually stress the drives until that total number is at least 2X
as big as RAM.  We had to raise the limit on the pgbench scales recently
because it only goes up to ~20,000 on earlier versions, and that's not a
big enough scale to test many servers now.
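To put numbers on that for this particular box (128GB of RAM, ~16MB per
scale factor): 2 x 128GB / 16MB works out to a scale of at least roughly
16000, i.e. something like

pgbench -i -s 16000 pgbench

which builds a ~250GB database, so expect the initialization step alone
to take a good while.  (The database name is just a placeholder.)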

On the select-only tests, much of the increase from ~100K to ~200K is
probably going from 8.4 to 9.2.  There are two major and several minor
tuning changes that make it much more efficient at that specific task.

> These are the *only* changes I've made to the config file:
>
> shared_buffers = 32GB
> wal_buffers = 16MB
> checkpoint_segments = 1024

Note that these are the only changes that actually impact pgbench
results.  The test doesn't stress very many parts of the system, such as
the query optimizer.

Also be aware these values may not be practical to use in production.
You can expect bad latency issues due to having shared_buffers so large.
All that memory has to be reconciled and written to disk if it's been
modified at each checkpoint, and 32GB of such work is a lot.  I have
systems where we can't make shared_buffers any bigger than 4GB before
checkpoint pauses get too bad.

Similarly, setting checkpoint_segments to 1024 means that you might go
through 16GB of writes before a checkpoint happens.  That's great for
average performance...but when that checkpoint does hit, you're facing a
large random I/O backlog.
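Just to make the arithmetic concrete, a more conservative configuration
-- purely an illustration, not a tuned recommendation -- might look like:

shared_buffers = 4GB
checkpoint_segments = 64                # roughly 1GB of WAL between checkpoints
checkpoint_completion_target = 0.9      # spread the checkpoint writes out

which keeps the worst-case checkpoint backlog bounded.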

There's not much you can do about all this on the Linux side.  If you
drop the dirty_* parameters too much, maintenance operations like VACUUM
start to get slow.  Really all you can do is avoid setting
shared_buffers and checkpoint_segments too high, so the checkpoint
backlog never gets gigantic.  The tuning you've done is using higher
values than we normally recommend because it's not quite practical to
deploy like that.  That and the very small database are probably why
your numbers are so high.

> Note: I did get better results with HT on vs. with HT off, so I've
> left HT on for now.

pgbench select-only in particular does like hyper-threading.  We get
occasional reports of more memory-bound workloads actually slowing when
it's turned on.  I think it's a wash and leave it on.  Purchasing and
management people tend to get annoyed if they discover the core count of
the server is half what they thought they were buying.  The potential
downside of HT isn't so big that it's worth opening that can of worms,
unless you've run real application level tests to prove it hurts.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: sniff test on some PG 8.4 numbers

From: Jon Nelson
Date:
On Sun, Mar 10, 2013 at 11:28 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 3/10/13 9:18 PM, Jon Nelson wrote:
>
>> The following is with ext4, nobarrier, and noatime. As noted in the
>> original post, I have done a fair bit of system tuning. I have the
>> dirty_bytes and dirty_background_bytes set to 3GB and 2GB,
>> respectively.
>
>
> That's good, but be aware those values are still essentially unlimited write
> caches.  A server with 4 good but regular hard drives might do as little as
> 10MB/s of random writes on a real workload.  If 2GB of data ends up dirty,
> the flushing that happens at the end of a database checkpoint will need to
> clear all of that out of RAM.  When that happens, you're looking at a 3
> minute long cache flush to push out 2GB.  It's not unusual for pgbench tests
> to pause for over a minute straight when that happens.  With your setup,
> where checkpoints happen every 5 minutes, this is only happening once per
> test run.  The disruption isn't easily visible if you look at the average
> rate; it's outweighed by the periods where writes happen very fast because
> the cache isn't full yet.  You have to get pgbench to plot latency over time
> to see them and then analyze that data.  This problem is the main reason I
> put together the pgbench-tools set for running things, because once you get
> to processing the latency files and make graphs from them it starts to be a
> pain to look at the results.

I'll try to find time for this, but it may need to wait until the weekend again.

>> I built 9.2 and using 9.2 and the following pgbench invocation:
>>
>> pgbench  -j 8  -c 32 -M prepared -T 600
>>
>> transaction type: TPC-B (sort of)
>> scaling factor: 400
>
>
> I misread this completely in your message before; I thought you wrote 4000.
> A scaling factor of 400 is making a database that's 6GB in size.  Your test
> is basically seeing how fast the system memory and the RAID cache can move
> things around.  In that situation, your read and write numbers are
> reasonable.  They aren't actually telling you anything useful about the
> disks though, because they're barely involved here. You've sniffed the CPU,
> memory, and RAID controller and they smell fine.  You'll need at least an
> order of magnitude increase in scale to get a whiff of the disks.

LOL! Your phrasing is humorous and the information useful.

I ran for 8.0 hours and got this:

transaction type: TPC-B (sort of)
scaling factor: 400
query mode: prepared
number of clients: 32
number of threads: 8
duration: 28800 s
number of transactions actually processed: 609250619
tps = 21154.058025 (including connections establishing)
tps = 21154.075922 (excluding connections establishing)

> pgbench scale numbers give approximately 16MB per scale factor.  You don't
> actually stress the drives until that total number is at least 2X as big as
> RAM.  We had to raise the limit on the pgbench scales recently because it
> only goes up to ~20,000 on earlier versions, and that's not a big enough
> scale to test many servers now.
>
> On the select-only tests, much of the increase from ~100K to ~200K is
> probably going from 8.4 to 9.2.  There's two major and several minor tuning
> changes that make it much more efficient at that specific task.
>
>
>> These are the *only* changes I've made to the config file:
>>
>> shared_buffers = 32GB
>> wal_buffers = 16MB
>> checkpoint_segments = 1024
>
>
> Note that these are the only changes that actually impact pgbench results.
> The test doesn't stress very many parts of the system, such as the query
> optimizer.
>
> Also be aware these values may not be practical to use in production. You
> can expect bad latency issues due to having shared_buffers so large.  All
> that memory has to be reconciled and written to disk if it's been modified
> at each checkpoint, and 32GB of such work is a lot.  I have systems where we
> can't make shared_buffers any bigger than 4GB before checkpoint pauses get
> too bad.
>
> Similarly, setting checkpoint_segments to 1024 means that you might go
> through 16GB of writes before a checkpoint happens.  That's great for
> average performance...but when that checkpoint does hit, you're facing a
> large random I/O backlog.

I thought the bgwriter mitigated most of the problems here? Often I'll
see the actual checkpoints with 'sync' times typically below a few
seconds (when there is anything to do at all). I can't say I've seen
checkpoint pauses in my workloads.
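For anyone wanting to watch the same thing, setting

log_checkpoints = on

in postgresql.conf makes the server log the write/sync/total times for
each checkpoint.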

> There's not much you can do about all this on the Linux side.  If you drop
> the dirty_* parameters too much, maintenance operations like VACUUM start to
> get slow.  Really all you can do is avoid setting shared_buffers and
> checkpoint_segments too high, so the checkpoint backlog never gets gigantic.
> The tuning you've done is using higher values than we normally recommend
> because it's not quite practical to deploy like that.  That and the very
> small database are probably why your numbers are so high.

Mostly I do data warehouse type of workloads with very little (if any)
data modification after initial load time. Extensive benchmarking of
the actual applications involved has shown that - for me - a large
(but not too large) shared_buffers (32GB is right about the sweet spot
for me, perhaps a bit on the high side) works well. Additionally, the
large checkpoint_segments value really appears to help as well (again,
this is very workload dependent).

>> Note: I did get better results with HT on vs. with HT off, so I've
>> left HT on for now.
>
>
> pgbench select-only in particular does like hyper-threading.  We get
> occasional reports of more memory-bound workloads actually slowing when it's
> turned on.  I think it's a wash and leave it on.  Purchasing and management
> people tend to get annoyed if they discover the core count of the server is
> half what they thought they were buying.  The potential downside of HT isn't
> so big that its worth opening that can of worms, unless you've run real
> application level tests to prove it hurts.

Glad to get an "it's a wash" confirmation here.


--
Jon