write scalability

From: Robert Haas
Date:
I've long harbored a suspicion, based on some testing I did on my home
machine, that WALInsertLock is a big performance bottleneck.  But I
just did some benchmarking that doesn't entirely support that
contention.  This is on Nate Boley's 32-core machine, with the
following settings:

max_connections = 100
shared_buffers = 8GB
synchronous_commit = off
checkpoint_segments = 100
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9

I did 5-minute pgbench runs with unlogged tables and with permanent
tables, restarting the database server and reinitializing the tables
between each run.  The number at the beginning of each line is the
number of clients, while the p/u indicates which type of tables were
used:

01p tps = 628.626815 (including connections establishing)
01p tps = 635.394288 (including connections establishing)
01p tps = 634.972789 (including connections establishing)
08p tps = 3342.787325 (including connections establishing)
08p tps = 3883.876813 (including connections establishing)
08p tps = 3941.253567 (including connections establishing)
32p tps = 5597.472192 (including connections establishing)
32p tps = 5738.139895 (including connections establishing)
32p tps = 5794.490934 (including connections establishing)
80p tps = 4499.685286 (including connections establishing)
80p tps = 4917.060441 (including connections establishing)
80p tps = 5050.931933 (including connections establishing)

01u tps = 672.469142 (including connections establishing)
01u tps = 671.256686 (including connections establishing)
01u tps = 670.421003 (including connections establishing)
08u tps = 4087.749529 (including connections establishing)
08u tps = 3797.750851 (including connections establishing)
08u tps = 4181.393560 (including connections establishing)
32u tps = 8956.346905 (including connections establishing)
32u tps = 8898.442517 (including connections establishing)
32u tps = 8971.591569 (including connections establishing)
80u tps = 7287.550952 (including connections establishing)
80u tps = 7266.816989 (including connections establishing)
80u tps = 7255.968109 (including connections establishing)

The speed-up from using unlogged tables was not as large as I
expected.  Turning off synchronous_commit here removes commit rate as
the bottleneck, and I think perhaps the main benefit of unlogged
tables in that case is the avoidance of I/O, and apparently this
machine has enough I/O bandwidth, and just enough memory in general,
that that's not an issue.

With either type of tables, the 8 client results are about 6.1 times
the single core results - not great, but not terrible, either.  With
32 clients, there is some improvement: 13.4x vs. 9.1x, but even 13.4x
is a long way from linear.  vmstat reveals that CPU usage is
substantially less than 100%.  After some investigation, I found that
using unlogged tables wasn't actually getting rid of all the
write-ahead logging - the commit records were still being issued.  So
I hacked up RecordTransactionCommit() not to emit transaction commit
records in that case.  That doesn't actually completely eliminate the
WAL activity, because it still emits records for zeroing new SLRU
pages for CLOG, but it takes a big chunk out of it.  The upshot is
that this improved both raw performance and scalability, but not
dramatically.  Unlogged table results, with this change:

01h tps = 708.189337 (including connections establishing)
01h tps = 704.030175 (including connections establishing)
01h tps = 701.644199 (including connections establishing)
08h tps = 5196.615955 (including connections establishing)
08h tps = 5126.162200 (including connections establishing)
08h tps = 5067.568727 (including connections establishing)
32h tps = 10661.275636 (including connections establishing)
32h tps = 10621.085226 (including connections establishing)
32h tps = 10575.267197 (including connections establishing)
80h tps = 7557.965666 (including connections establishing)
80h tps = 7545.697547 (including connections establishing)
80h tps = 7556.379921 (including connections establishing)

Now the 32-client numbers have risen to 15.1x the single-client
numbers, but that's still not great.
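
(For the record, the change was conceptually along the following lines.  This
is a sketch rather than the actual diff; in particular, the guard on
XactLastRecEnd, i.e. "this transaction wrote no WAL of its own", is an
assumption about how one would express it, and skipping the commit record is
obviously not something you could ship.)

/*
 * Sketch of the experimental hack in RecordTransactionCommit(),
 * src/backend/access/transam/xact.c.  Illustrative only.
 */
bool        wrote_xlog = (XactLastRecEnd.xrecoff != 0);

if (markXidCommitted && wrote_xlog)
{
    /* normal path: assemble rdata[] and emit the commit record */
    (void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata);
}
else if (markXidCommitted)
{
    /*
     * Hack: the transaction touched only unlogged tables, so no WAL was
     * generated for its data changes; skip XLOG_XACT_COMMIT entirely and
     * rely only on the CLOG update that follows.
     */
}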

What does this mean?  Well, XLogInsert strikes me as an awfully
complex piece of code to be running with a hot LWLock held in
exclusive mode.  But even so, I think this certainly means that
WALInsertLock, at least on this workload, is not the whole problem...
in this test, I'm not only eliminating the overhead of inserting the
WAL, but also the overhead of writing it, flushing it, and generating
it.   So there is something, other than WAL insertion, which is taking
a big bite out of performance here.
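
(For context, XLogInsert() has roughly the following shape, heavily abridged
from src/backend/access/transam/xlog.c with error handling and the
buffer-advance details left out, which is why it looks like such an attractive
suspect: nearly everything interesting happens with WALInsertLock held
exclusively.)

/* Abridged shape of XLogInsert() as of this era; a sketch, not the real code. */
START_CRIT_SECTION();

/* compute the CRC over the rdata chain, decide which backup blocks to add */

LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);

/*
 * Re-check whether backup blocks are needed (RedoRecPtr may have moved),
 * reserve space, and copy the record into the shared WAL buffers.  If a
 * new buffer page is needed and the old one is dirty, this can even take
 * WALWriteLock and write WAL while still holding WALInsertLock.
 */

LWLockRelease(WALInsertLock);

END_CRIT_SECTION();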

As to what that something might be, I reran this last test with
LWLOCK_STATS enabled and here are the top things that are blocking:

lwlock 310: shacq 96846 exacq 108433 blk 16275
lwlock 3: shacq 64 exacq 2628381 blk 36027
lwlock 7: shacq 0 exacq 2628615 blk 85220
lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
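
(For readers trying to decode the numbers: the small fixed LWLock IDs are just
the order of the LWLockId enum in src/include/storage/lwlock.h.  The excerpt
below is abridged and reflects the ordering as of roughly this release from
memory; IDs beyond the fixed set are assigned dynamically, for buffer locks and
so on, which is why locks like 310 and 504 don't have handy names.)

/*
 * Excerpt (abridged) of the LWLockId enum, src/include/storage/lwlock.h.
 * The comments give the numeric IDs that LWLOCK_STATS reports.
 */
typedef enum LWLockId
{
    BufFreelistLock = 0,
    ShmemIndexLock,             /* 1  */
    OidGenLock,                 /* 2  */
    XidGenLock,                 /* 3  */
    ProcArrayLock,              /* 4  */
    SInvalReadLock,             /* 5  */
    SInvalWriteLock,            /* 6  */
    WALInsertLock,              /* 7  */
    WALWriteLock,               /* 8  */
    ControlFileLock,            /* 9  */
    CheckpointLock,             /* 10 */
    CLogControlLock,            /* 11 */
    /* ... more fixed locks, then dynamically assigned ones ... */
} LWLockId;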

During this 5-minute test run, an LWLock acquisition blocked 6180335
times.  As you can see from the above results, ProcArrayLock accounts
for 54% of that blocking, and CLogControlLock accounts for another
34%.  lwlock 7 is WALInsertLock, which manages to account for more
than 1% of the blocking despite the fact that WAL has been largely
eliminated in this test; lwlock 3 is XidGenLock.  I dunno what LWLock
310 is but it doesn't matter.  After that there is a long tail of
other stuff with gradually decreasing amounts of blocking.  Experience
with the read scalability stuff has taught me also to look at which
LWLocks have the most shared acquisitions, as that can cause spinlock
and cache line contention.  The top few culprits are:

lwlock 504: shacq 5315030 exacq 0 blk 0
lwlock 45: shacq 5967317 exacq 13284 blk 1722
lwlock 39: shacq 8219988 exacq 13342 blk 2291
lwlock 5: shacq 26306020 exacq 0 blk 0
lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
lwlock 11: shacq 84913908 exacq 4539551 blk 2119423

In all, there were 238777533 shared LWLock acquisitions during this
test: 35% CLogControlLock, 12% ProcArrayLock, 11% SInvalReadLock (soon
to be dealt with, as discussed elsewhere on-list), and then it gets
down into the lock manager locks and a few others.
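
(To make the cache-line point concrete: even an uncontended shared acquisition
takes the lock's internal spinlock and writes its counters, so the cache line
holding the LWLock ping-pongs between CPUs no matter how rarely anyone actually
blocks.  The snippet below paraphrases the shared-mode fast path of
LWLockAcquire() in src/backend/storage/lmgr/lwlock.c as of this era, with the
queue-and-sleep path left out.)

/* Shared-mode acquisition, heavily abridged (sketch of the 9.1-era code): */
volatile LWLock *lock = &(LWLockArray[lockid].lock);
bool        mustwait;

SpinLockAcquire(&lock->mutex);      /* per-LWLock spinlock */
if (lock->exclusive == 0)
{
    lock->shared++;                 /* every reader writes this cache line */
    mustwait = false;
}
else
    mustwait = true;                /* otherwise queue on the lock and sleep */
SpinLockRelease(&lock->mutex);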

I'm mulling over what to do about all of this, but thought I'd share
the raw numbers first, in case anyone's interested.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: write scalability

From: Merlin Moncure
Date:
On Mon, Jul 25, 2011 at 3:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> [...]
>
> Experience with the read scalability stuff has taught me also to look at which
> LWLocks have the most shared acquisitions, as that can cause spinlock
> and cache line contention.  The top few culprits are:
>
> lwlock 504: shacq 5315030 exacq 0 blk 0
> lwlock 45: shacq 5967317 exacq 13284 blk 1722
> lwlock 39: shacq 8219988 exacq 13342 blk 2291
> lwlock 5: shacq 26306020 exacq 0 blk 0
> lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
> lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
>
> In all, there were 238777533 shared LWLock acquisitions during this
> test: 35% CLogControlLock, 12% ProcArrayLock, 11% SInvalReadLock (soon
> to be dealt with, as discussed elsewhere on-list), and then it gets
> down into the lock manager locks and a few others.

hm, all the CLogControlLock acquisitions in clog.c appear to be
exclusive...or did you mean shared in some other sense?
TransactionIdGetStatus is taking an exclusive lock which is a red flag
and a good optimization target, I think.

merlin


Re: write scalability

From: Alvaro Herrera
Date:
Excerpts from Merlin Moncure's message of Mon Jul 25 17:19:58 -0400 2011:
> On Mon, Jul 25, 2011 at 3:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> > Experience
> > with the read scalability stuff has taught me also to look at which
> > LWLocks have the most shared acquisitions, as that can cause spinlock
> > and cache line contention.  The top few culprits are:
> >
> > lwlock 504: shacq 5315030 exacq 0 blk 0
> > lwlock 45: shacq 5967317 exacq 13284 blk 1722
> > lwlock 39: shacq 8219988 exacq 13342 blk 2291
> > lwlock 5: shacq 26306020 exacq 0 blk 0
> > lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
> > lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
> >
> > In all, there were 238777533 shared LWLock acquisitions during this
> > test: 35% CLogControlLock, 12% ProcArrayLock, 11% SInvalReadLock (soon
> > to be dealt with, as discussed elsewhere on-list), and then it gets
> > down into the lock manager locks and a few others.
> 
> hm, all the CLogControlLock acquisitions in clog.c appear to be
> exclusive...or did you mean shared in some other sense?

SLRU control locks are also acquired indirectly by slru.c; see
SimpleLruReadPage_ReadOnly.

> TransactionIdGetStatus is taking an exclusive lock which is a red flag
> and a good optimization target, I think.

In fact, if the page that TransactionIdGetStatus is looking for is in
the buffers, it'll only take a shared lock.
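
(From memory of the slru.c of this era, the read-only path looks roughly like
the following, with the buffer-scan details and error handling elided, so a
CLOG lookup that hits the SLRU buffers never needs CLogControlLock exclusively:)

int
SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
{
    SlruShared  shared = ctl->shared;
    int         slotno;

    /* Try to find the page while holding only a shared lock */
    LWLockAcquire(shared->ControlLock, LW_SHARED);
    for (slotno = 0; slotno < shared->num_slots; slotno++)
    {
        if (shared->page_number[slotno] == pageno &&
            shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
            shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS)
        {
            /* Found it: mark recently used, return with shared lock held */
            SlruRecentlyUsed(shared, slotno);
            return slotno;
        }
    }

    /* Not resident: fall back to the exclusive-lock path (may do I/O) */
    LWLockRelease(shared->ControlLock);
    LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);

    return SimpleLruReadPage(ctl, pageno, true, xid);
}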

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: write scalability

From: Robert Haas
Date:
On Mon, Jul 25, 2011 at 4:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> As to what that something might be, I reran this last test with
> LWLOCK_STATS enabled and here are the top things that are blocking:
>
> lwlock 310: shacq 96846 exacq 108433 blk 16275
> lwlock 3: shacq 64 exacq 2628381 blk 36027
> lwlock 7: shacq 0 exacq 2628615 blk 85220
> lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
> lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
>
> During this 5-minute test run, an LWLock acquisition blocked 6180335
> times.  As you can see from the above results, ProcArrayLock accounts
> for 54% of that blocking, and CLogControlLock accounts for another
> 34%.  lwlock 7 is WALInsertLock, which manages to account for more
> than 1% of the blocking despite the fact that WAL has been largely
> eliminated in this test...

I reran this test on master with permanent (not unlogged) tables, and
got the following stats for blocking:

lwlock 11: shacq 58376164 exacq 2122076 blk 1130974
lwlock 4: shacq 15168924 exacq 1367303 blk 1555799
lwlock 7: shacq 0 exacq 8440615 blk 1726896
grand total: shacq 144563929 exacq 41618873 blk 4886107

So that's 35% WALInsertLock, 32% ProcArrayLock, and 23%
CLogControlLock.  No other single lock accounted for more than 1% of
the blocking.  It's a good guess we're going to have to fix more than
one thing to really make this zippy.

Also note that CLogControlLock accounts for better than 40% of the
shared-lock acquisitions across all LWLocks.  The next-most-frequently
share-locked LWLock is ProcArrayLock, with 10% of the total
shared-lock acquisitions, followed by SInvalReadLock, at 9%.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: write scalability

From: Greg Smith
Date:
On 07/25/2011 04:07 PM, Robert Haas wrote:
> I did 5-minute pgbench runs with unlogged tables and with permanent
> tables, restarting the database server and reinitializing the tables
> between each run.

Database scale?  One or multiple pgbench worker threads?  A reminder on 
the amount of RAM in the server would be helpful for interpreting the 
results too.

The other thing I'd recommend if you're running more write-heavy tests 
is to turn off autovacuum.  Whether or not it kicks in depends on the 
test duration and the TPS rate, which adds a source of variability 
better avoided here. It also means that faster tests end up getting 
penalized by having it run near their end, which makes them no longer 
look like fast results.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



Re: write scalability

From: Robert Haas
Date:
On Mon, Jul 25, 2011 at 10:14 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 07/25/2011 04:07 PM, Robert Haas wrote:
>>
>> I did 5-minute pgbench runs with unlogged tables and with permanent
>> tables, restarting the database server and reinitializing the tables
>> between each run.
>
> Database scale?  One or multiple pgbench worker threads?  A reminder on the
> amount of RAM in the server would be helpful for interpreting the results
> too.

Ah, sorry.  scale = 100, so small.  pgbench invocation is:

pgbench -T $time -c $clients -j $clients

128GB RAM.

> The other thing I'd recommend if you're running more write-heavy tests is to
> turn off autovacuum.  Whether or not it kicks in depends on the test
> duration and the TPS rate, which adds a source of variability better avoided
> here. It also means that faster tests end up getting penalized by having it
> run near their end, which makes them no longer look like fast results.

Well, I kind of don't want to do that, because surely you wouldn't run
that way in production.  Anyway, it seems clear that nearly all the
contention is in CLogControlLock, ProcArrayLock, and WALInsertLock,
and I don't think shutting off autovacuum is going to change that.
The point of my benchmarking here may be a bit different from the
benchmarking you normally do: I'm not trying to figure out whether
I've got the best hardware, or how good it is, so I don't really care
about squeezing out every last tps.  I just need numbers that can be
reproduced quickly with different patches to see whether a given code
change helps and by how much.  I wasn't sure that 5-minute pgbench
runs would be good enough for that, but so far the results have been
stable enough that I'm fairly confident I'm seeing real effects rather
than random noise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: write scalability

From: Pavan Deolasee
Date:
On Tue, Jul 26, 2011 at 9:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jul 25, 2011 at 10:14 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> On 07/25/2011 04:07 PM, Robert Haas wrote:
>>>
>>> I did 5-minute pgbench runs with unlogged tables and with permanent
>>> tables, restarting the database server and reinitializing the tables
>>> between each run.
>>
>> Database scale?  One or multiple pgbench worker threads?  A reminder on the
>> amount of RAM in the server would be helpful for interpreting the results
>> too.
>
> Ah, sorry.  scale = 100, so small.  pgbench invocation is:
>

It might be worthwhile to test only with the accounts and history
tables, and also to increase the number of statements in a transaction.
Otherwise the tiny tables can quickly become a bottleneck.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: write scalability

From: Robert Haas
Date:
On Tue, Jul 26, 2011 at 11:40 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> [...]
>
> It might be worthwhile to test only with the accounts and history
> tables, and also to increase the number of statements in a transaction.
> Otherwise the tiny tables can quickly become a bottleneck.

What kind of bottleneck?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: write scalability

From: Pavan Deolasee
Date:
On Tue, Jul 26, 2011 at 12:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jul 26, 2011 at 11:40 AM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
>> [...]
>>
>> It might be worthwhile to test only with the accounts and history
>> tables, and also to increase the number of statements in a transaction.
>> Otherwise the tiny tables can quickly become a bottleneck.
>
> What kind of bottleneck?
>

So many transactions trying to update a small set of rows in a table.
Is that what we really want to measure?  My thinking is that we might
see different results if they are updating different parts of the table
and the transaction start/stop overhead is spread across a few
statements.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: write scalability

From: Pavan Deolasee
Date:
On Tue, Jul 26, 2011 at 12:29 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

>
> So many transactions trying to update a small set of rows in a table.
> Is that what we really want to measure?  My thinking is that we might
> see different results if they are updating different parts of the table
> and the transaction start/stop overhead is spread across a few
> statements.
>

I think what I am suggesting is that the default pgbench test setup
would probably not give you good scalability as the number of clients
is increased, and one reason could be contention on the small
tables.  So it might be a good idea to get rid of that and see if we get
much better scalability and understand the other bottlenecks.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: write scalability

From: Simon Riggs
Date:
On Tue, Jul 26, 2011 at 4:40 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> [...]
>
> It might be worthwhile to test only with the accounts and history
> tables, and also to increase the number of statements in a transaction.
> Otherwise the tiny tables can quickly become a bottleneck.

+1

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: write scalability

From: Greg Smith
Date:
On 07/26/2011 12:33 PM, Pavan Deolasee wrote:
> I think what I am suggesting is that the default pgbench test setup
> would probably not give you good scalability as the number of clients
> is increased, and one reason could be contention on the small
> tables.  So it might be a good idea to get rid of that and see if we get
> much better scalability and understand the other bottlenecks.

You can easily see this form of contention pulling down results when the 
database itself is really small and the overall transaction rate is very 
high.  With Robert using a scale=100, no more than 80 clients, and 
transaction rates peaking at <10K TPS on a 24 core box, I wouldn't 
expect this form of contention to be a large issue.  It may be dropping 
results a few percent, but I'd be surprised if it was any more than that.

It's easy enough to use "-N" to skip the updates to the smaller tellers 
and branches tables to pull that out of the way.  TPS will go up, because 
it's doing less per transaction.  That's not necessarily a better test 
case though, it's just a different one.  The regular case includes a lot 
of overwriting the same blocks in the hot branches and tellers tables.  
That effectively pushes a lot more I/O toward being likely to happen at 
checkpoint time.  Those tables rarely have any significant I/O outside 
of the checkpoint in the standard "TPC-B like" scenario, because their 
usage counts stay high most of the time.

Contention for small tables that are being heavily updated is still 
important to optimize too though.  Which type of test makes more sense 
depends on what aspect of performance you're trying to focus on.  I'll 
sometimes do a full pgbench-tools "sweep" (range of multiple scales and 
clients at each scale) in both regular write and "-N" write modes, just 
to collect the slightly different data each provides.  The form of 
checkpoint I/O spike you see at sync time is very different in the two 
cases, but both usage profiles are important to consider and optimize.  
That Robert has started with the regular case doesn't worry me too much 
now that I've seen the parameters he's using; he's not running it in a 
way where I'd expect branch/teller contention to be a major limiting 
factor on the results.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us