Thread: SSI slows down over time


SSI slows down over time

From
Ryan Johnson
Date:
Hi all,

Disclaimer: this question probably belongs on the hackers list, but the
instructions say you have to try somewhere else first... toss-up between
this list and a bug report; list seemed more appropriate as a starting
point. Happy to file a bug if that's more appropriate, though.

This is with pgsql-9.3.4, x86_64-linux, home-built with `./configure
--prefix=...' and gcc-4.7.
TPC-C courtesy of oltpbenchmark.com. 12WH TPC-C, 24 clients.

I get a strange behavior across repeated runs: each 100-second run is a
bit slower than the one preceding it, when run with SSI (SERIALIZABLE).
Switching to SI (REPEATABLE_READ) removes the problem, so it's
apparently not due to the database growing. The database is completely
shut down (pg_ctl stop) between runs, but the data lives in tmpfs, so
there's no I/O problem here. 64GB RAM, so no paging, either.

Note that this slowdown is in addition to the 30% performance penalty
from using SSI on my 24-core machine. I understand that the latter is a
known bottleneck; my question is why the bottleneck should get worse
over time:

With SI, I get ~4.4ktps, consistently.
With SSI, I get 3.9, 3.8, 3.4, 3.3, 3.1, 2.9, ...

So the question: what should I look for to diagnose/triage this problem?
I'm willing to do some legwork, but have no idea where to go next.

I've tried linux perf, but all it says is that lots of time is going to
LWLock (but callgraph tracing doesn't work in my not-bleeding-edge
kernel). Looking through the logs, the abort rates due to SSI aren't
changing in any obvious way. I've been hacking on SSI for over a month
now as part of a research project, and am fairly familiar with
predicate.c, but I don't see any obvious reason this behavior should
arise (in particular, SLRU storage seems to be re-initialized every time
the postmaster restarts, so there shouldn't be any particular memory
effect due to SIREAD locks). I'm also familiar with both Cahill's and
Ports/Grittner's published descriptions of SSI, but again, nothing
obvious jumps out.

In my experience this sort of behavior indicates the kind of bug whose
fix would have a large impact on performance (because the early
"damage" is done so quickly that even the very first run doesn't live up
to its true potential).

$ cat pgsql.conf
shared_buffers = 8GB
synchronous_commit = off
checkpoint_segments = 64
max_pred_locks_per_transaction = 2000
default_statistics_target = 100
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
effective_cache_size = 40GB
work_mem = 1920MB
wal_buffers = 16MB

Thanks,
Ryan



Re: SSI slows down over time

From
Heikki Linnakangas
Date:
On 04/06/2014 05:25 AM, Ryan Johnson wrote:
> I've tried linux perf, but all it says is that lots of time is going to
> LWLock (but callgraph tracing doesn't work in my not-bleeding-edge
> kernel).

Make sure you compile with the "-fno-omit-frame-pointer" flag.
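For example (the install prefix, -j value, and recording window below
are illustrative, not verified settings):

```shell
# Rebuild keeping frame pointers so perf can walk the stack; -g adds
# debug symbols so the report shows function names.
./configure --prefix=$HOME/pgsql CFLAGS='-O2 -g -fno-omit-frame-pointer'
make -j24 && make install

# Then record call graphs system-wide during a benchmark run:
perf record -g -a -- sleep 30
perf report
```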
- Heikki


Re: SSI slows down over time

From
Tom Lane
Date:
Ryan Johnson <ryan.johnson@cs.utoronto.ca> writes:
> I get a strange behavior across repeated runs: each 100-second run is a
> bit slower than the one preceding it, when run with SSI (SERIALIZABLE).
> ... So the question: what should I look for to diagnose/triage this problem?

In the past I've seen behaviors like this that traced to the range of
"interesting" transaction IDs getting wider as time went on, so that
more pages of pg_clog were hot, leading to more I/O traffic in the
clog SLRU buffers.  Possibly this is some effect like that.

> I've tried linux perf, but all it says is that lots of time is going to
> LWLock (but callgraph tracing doesn't work in my not-bleeding-edge
> kernel).

You could recompile with -DLWLOCK_STATS to enable gathering stats on
which LWLocks are heavily contended.  That would be a starting point
for localizing the cause.
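A sketch of that rebuild (the prefix is illustrative; the stats go to
each backend's stderr at process exit, so they land in the server log
when the logging collector is on):

```shell
# LWLOCK_STATS is a compile-time option, so a rebuild is required.
./configure --prefix=$HOME/pgsql-lwstats CPPFLAGS='-DLWLOCK_STATS'
make -j24 && make install

# After a benchmark run plus a clean shutdown, look for the per-backend
# "lwlock" counter lines in the server log:
grep 'lwlock' $PGDATA/pg_log/*.log | head
```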

            regards, tom lane


Re: SSI slows down over time

From
Ryan Johnson
Date:
On 06/04/2014 4:30 AM, Heikki Linnakangas wrote:
> On 04/06/2014 05:25 AM, Ryan Johnson wrote:
>> I've tried linux perf, but all it says is that lots of time is going to
>> LWLock (but callgraph tracing doesn't work in my not-bleeding-edge
>> kernel).
>
> Make sure you compile with the "-fno-omit-frame-pointer" flag.
Oh, right. Forgot about that. With -g in place, it shows 5% of time
going to LWLockAcquire, virtually all of that in PredicateLockPage.
However, top reports only 50% utilization, so I'd *really* like to have
stack traces, with counts, for the places the worker threads spend their
time blocked in. I don't know a way to get that out of perf, though.
It's near-trivial to do in dtrace, but unfortunately I don't have a
Solaris or BSD machine handy.

Does pgsql give a way to trace latencies due to LWLock?

Thanks,
Ryan





Re: SSI slows down over time

From
Ryan Johnson
Date:
On 06/04/2014 10:55 AM, Tom Lane wrote:
> Ryan Johnson <ryan.johnson@cs.utoronto.ca> writes:
>> I get a strange behavior across repeated runs: each 100-second run is a
>> bit slower than the one preceding it, when run with SSI (SERIALIZABLE).
>> ... So the question: what should I look for to diagnose/triage this problem?
> In the past I've seen behaviors like this that traced to the range of
> "interesting" transaction IDs getting wider as time went on, so that
> more pages of pg_clog were hot, leading to more I/O traffic in the
> clog SLRU buffers.  Possibly this is some effect like that.
The effect disappears completely if I run under SI instead of SSI,
though. That makes me suspect strongly that the problem lurks in
SSI-specific infrastructure.

However, I did notice that the SLRU buffer that holds "old" SSI
transactions sometimes spikes from hundreds to millions of entries (by
annotating the source to ereport a warning whenever the difference
between head and tail is at least 10% higher than the previous record).
Not sure if that's related, though: I'm pretty sure SSI never scans the
SLRU; it's only used for random lookups.

>> I've tried linux perf, but all it says is that lots of time is going to
>> LWLock (but callgraph tracing doesn't work in my not-bleeding-edge
>> kernel).
> You could recompile with -DLWLOCK_STATS to enable gathering stats on
> which LWLocks are heavily contended.  That would be a starting point
> for localizing the cause.

Here are the offenders (100-second run, 24 clients, ~2.2ktps):
lwlock        7 shacq         0 exacq 7002810 blk  896235 spindelay  213
lwlock       28 shacq  94984166 exacq   3938085 blk 572762 spindelay  163
lwlock       65 shacq   3347000 exacq   2933440 blk 255927 spindelay   90
lwlock       79 shacq   1742574 exacq   3009663 blk 216178 spindelay   41
lwlock       76 shacq   2293386 exacq   2892242 blk 205199 spindelay   70
lwlock       66 shacq    498909 exacq   2987485 blk 171508 spindelay   48
lwlock       80 shacq    512107 exacq   3181753 blk 165772 spindelay   43
lwlock       71 shacq    815733 exacq   3088157 blk 165579 spindelay   48
lwlock       74 shacq    603321 exacq   3065391 blk 159953 spindelay   56
lwlock       67 shacq    695465 exacq   2918970 blk 149339 spindelay   28
lwlock       69 shacq    411203 exacq   3044007 blk 148655 spindelay   34
lwlock       72 shacq    515260 exacq   2973321 blk 147533 spindelay   43
lwlock       30 shacq  41628636 exacq      8799 blk 143889 spindelay  186
lwlock       75 shacq    409472 exacq   2987227 blk 143196 spindelay   38
lwlock       77 shacq    409401 exacq   2946972 blk 139507 spindelay   34
lwlock       73 shacq    402544 exacq   2943467 blk 139380 spindelay   43
lwlock       78 shacq    404220 exacq   2912665 blk 137625 spindelay   21
lwlock       70 shacq    603643 exacq   2816730 blk 135851 spindelay   37
lwlock       68 shacq    403533 exacq   2862017 blk 131946 spindelay   30
lwlock       29 shacq         0 exacq    255302 blk 75838 spindelay    1
lwlock        0 shacq         0 exacq    561508 blk 12445 spindelay    3
lwlock       11 shacq   1245499 exacq    219717 blk 5501 spindelay   10
lwlock        4 shacq    381211 exacq    209146 blk 1273 spindelay    4
lwlock        3 shacq        16 exacq    209081 blk 522 spindelay    0
lwlock        8 shacq         0 exacq    137961 blk 50 spindelay    0
lwlock  2097366 shacq         0 exacq    384586 blk 1 spindelay    0
lwlock  2097365 shacq         0 exacq    370176 blk 1 spindelay    0
lwlock  2097359 shacq         0 exacq    363845 blk 1 spindelay    0

The above aggregates the per-lock stats from all processes, filters out
locks with fewer than 10000 accesses (shared+exclusive) or with zero
blk, then sorts by highest blk first.
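One possible shape for that aggregation step (a sketch; it assumes the
per-process counters have already been normalized to exactly the
`lwlock <id> shacq <n> exacq <n> blk <n> spindelay <n>` form shown
above and collected into a hypothetical lwstats.txt):

```shell
# Sum the per-process LWLock stats by lock id, drop locks with fewer
# than 10000 total acquisitions or zero blocks, sort by blk descending.
grep 'lwlock' lwstats.txt |
awk '{ id = $2
       sh[id] += $4; ex[id] += $6; blk[id] += $8; spin[id] += $10 }
     END {
       for (id in sh)
         if (sh[id] + ex[id] >= 10000 && blk[id] > 0)
           printf "lwlock %8d shacq %9d exacq %9d blk %7d spindelay %4d\n",
                  id, sh[id], ex[id], blk[id], spin[id]
     }' |
sort -k8,8nr
```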

According to [1], locks {28, 29, 30} are {SerializableXactHashLock,
SerializableFinishedListLock, SerializablePredicateLockListLock}, all
SSI-related; locks 65-80 are the sixteen PredicateLockMgrLocks that the
post mentions. Looking in lwlock.h, lock 7 (which tops the list) is the
WALInsertLock. That lock was *not* mentioned in the pgsql-hackers post.

Re-running the same analysis for SI instead of SSI gives 4.6ktps and a
much shorter list:
lwlock        7 shacq         0 exacq 14050121 blk  3429384 spindelay  347
lwlock       11 shacq   3133994 exacq    450325 blk 23456 spindelay   29
lwlock        0 shacq         0 exacq    684775 blk 19158 spindelay    3
lwlock        4 shacq    780846 exacq    428771 blk 4539 spindelay    6
lwlock        3 shacq        19 exacq    428705 blk 1147 spindelay    0
lwlock       59 shacq         0 exacq    125943 blk 203 spindelay    0
lwlock        8 shacq         0 exacq    287629 blk 116 spindelay    0
lwlock  2097358 shacq         0 exacq    752361 blk 1 spindelay    0
lwlock  2097355 shacq         0 exacq    755086 blk 1 spindelay    0
lwlock  2097352 shacq         0 exacq    760232 blk 1 spindelay    0

However, all of this only confirms that SSI has a lock bottleneck; it
doesn't say why the bottleneck gets worse over time.

[1] http://www.postgresql.org/message-id/CA+TgmoYAiSM2jWEndReY5PL0sKbhgg7dbDH6r=oXKYzi9B7KJA@mail.gmail.com

Thoughts?
Ryan

Re: SSI slows down over time

From
Ryan Johnson
Date:
On 05/04/2014 10:25 PM, Ryan Johnson wrote:
> Hi all,
>
> Disclaimer: this question probably belongs on the hackers list, but
> the instructions say you have to try somewhere else first... toss-up
> between this list and a bug report; list seemed more appropriate as a
> starting point. Happy to file a bug if that's more appropriate, though.
>
> This is with pgsql-9.3.4, x86_64-linux, home-built with `./configure
> --prefix=...' and gcc-4.7.
> TPC-C courtesy of oltpbenchmark.com. 12WH TPC-C, 24 clients.
>
> I get a strange behavior across repeated runs: each 100-second run is
> a bit slower than the one preceding it, when run with SSI
> (SERIALIZABLE). Switching to SI (REPEATABLE_READ) removes the problem,
> so it's apparently not due to the database growing. The database is
> completely shut down (pg_ctl stop) between runs, but the data lives in
> tmpfs, so there's no I/O problem here. 64GB RAM, so no paging, either.

The plot thickens...

I just had a run die with an out of (tmpfs) disk space error; the
pg_serial directory occupies 16GB, or 64825 segments (just under the 65k
limit for SLRU). A bit of source diving confirms that this is the
backing store for the OldSerXid SLRU that SSI uses. I'm not sure what
would prevent SLRU space from being reclaimed, though, given that a
complete, clean database shutdown happens between every run. In
theory, all SSI info can be forgotten any time there are no serializable
transactions in the system.
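As a sanity check on those numbers (assuming a default build, where an
SLRU segment is 32 pages of BLCKSZ = 8192 bytes, i.e. 256kB per
pg_serial segment file):

```shell
# 64825 segment files at 256kB each should total roughly the 16GB that
# filled the tmpfs partition. The ~65k cap corresponds to 4-hex-digit
# SLRU segment file names (0x10000 = 65536).
segments=64825
seg_bytes=$((32 * 8192))                  # 262144 bytes per SLRU segment
total_mb=$((segments * seg_bytes / 1024 / 1024))
echo "${total_mb} MB"                     # prints 16206 MB, i.e. ~16GB
```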

I nuked the pgsql data directory and started over, and started firing
off 30-second runs (with pg_ctl start/stop in between each). On about
the sixth run, throughput dropped to ~200tps and the benchmark harness
terminated with an assertion error. I didn't see anything interesting in
the server logs (the database shut down normally), but the pg_serial
directory had ballooned from ~100kB to 8GB.

I tried to repro, and a series of 30-second runs gave the following
throughputs (tps):
*4615
3155 3149 3115 3206 3162 3069 3005 2978 2953 **308
2871 2876 2838 2853 2817 2768 2736 2782 2732 2833
2749 2675 2771 2700 2675 2682 2647 2572 2626 2567
*4394

That ** entry was the 8GB blow-up again. All files in the directory had
been created at the same time (= not during a previous run), and
persisted through the runs that followed. There was also a run where
abort rates jumped through the roof (~40k aborts rather than the usual
2000 or so), with a huge number of "out of shared memory" errors;
apparently max_pred_locks_per_transaction=2000 wasn't high enough.

The two * entries were produced by runs under SI, and confirm that the
rest of the system has not been slowing down nearly as much as SSI. SI
throughput dropped by 5% as the database quadrupled in size. SSI
throughput dropped by 23% during the same interval. And this was
actually one of the better sets of runs; I had a few last week that
dipped below 1ktps.

I'm not sure what to make of this, thoughts?

Ryan



Re: SSI slows down over time

From
Bruce Momjian
Date:
On Mon, Apr  7, 2014 at 10:38:52AM -0400, Ryan Johnson wrote:
> The two * entries were produced by runs under SI, and confirm that
> the rest of the system has not been slowing down nearly as much as
> SSI. SI throughput dropped by 5% as the database quadrupled in size.
> SSI throughput dropped by 23% during the same interval. And this was
> actually one of the better sets of runs; I had a few last week that
> dipped below 1ktps.
>
> I'm not sure what to make of this, thoughts?

Seems it is time to ask on hackers.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +


Re: SSI slows down over time

From
Ryan Johnson
Date:
On 09/04/2014 5:21 PM, Bruce Momjian wrote:
> On Mon, Apr  7, 2014 at 10:38:52AM -0400, Ryan Johnson wrote:
>> The two * entries were produced by runs under SI, and confirm that
>> the rest of the system has not been slowing down nearly as much as
>> SSI. SI throughput dropped by 5% as the database quadrupled in size.
>> SSI throughput dropped by 23% during the same interval. And this was
>> actually one of the better sets of runs; I had a few last week that
>> dipped below 1ktps.
>>
>> I'm not sure what to make of this, thoughts?
> Seems it is time to ask on hackers.
Ack.