Discussion: SSI slows down over time
Hi all,

Disclaimer: this question probably belongs on the hackers list, but the instructions say you have to try somewhere else first... toss-up between this list and a bug report; the list seemed more appropriate as a starting point. Happy to file a bug if that's more appropriate, though.

This is with pgsql-9.3.4, x86_64-linux, home-built with `./configure --prefix=...' and gcc-4.7. TPC-C courtesy of oltpbenchmark.com. 12WH TPC-C, 24 clients.

I get a strange behavior across repeated runs: each 100-second run is a bit slower than the one preceding it, when run with SSI (SERIALIZABLE). Switching to SI (REPEATABLE_READ) removes the problem, so it's apparently not due to the database growing. The database is completely shut down (pg_ctl stop) between runs, but the data lives in tmpfs, so there's no I/O problem here. 64GB RAM, so no paging, either.

Note that this slowdown is in addition to the 30% performance hit from using SSI on my 24-core machine. I understand that the latter is a known bottleneck; my question is why the bottleneck should get worse over time: with SI, I get ~4.4ktps, consistently. With SSI, I get 3.9, 3.8, 3.4, 3.3, 3.1, 2.9, ...

So the question: what should I look for to diagnose/triage this problem? I'm willing to do some legwork, but have no idea where to go next. I've tried linux perf, but all it says is that lots of time is going to LWLock (but callgraph tracing doesn't work in my not-bleeding-edge kernel). Looking through the logs, the abort rates due to SSI aren't changing in any obvious way.

I've been hacking on SSI for over a month now as part of a research project, and am fairly familiar with predicate.c, but I don't see any obvious reason this behavior should arise (in particular, SLRU storage seems to be re-initialized every time the postmaster restarts, so there shouldn't be any particular memory effect due to SIREAD locks). I'm also familiar with both Cahill's and Ports/Grittner's published descriptions of SSI, but again, nothing obvious jumps out. In my experience this sort of behavior indicates the kind of bug where fixing it would have a large impact on performance (because the early "damage" is done so quickly that even the very first run doesn't live up to its true potential).

$ cat pgsql.conf
shared_buffers = 8GB
synchronous_commit = off
checkpoint_segments = 64
max_pred_locks_per_transaction = 2000
default_statistics_target = 100
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
effective_cache_size = 40GB
work_mem = 1920MB
wal_buffers = 16MB

Thanks,
Ryan
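(For reference: "SSI" and "SI" above are just PostgreSQL's SERIALIZABLE and REPEATABLE READ isolation levels. A minimal sketch of switching between them cluster-wide, assuming $PGDATA points at the data directory; the benchmark harness may instead set the level on its own connections:)

# SSI run: make SERIALIZABLE the default (use 'repeatable read' for SI).
echo "default_transaction_isolation = 'serializable'" >> "$PGDATA/postgresql.conf"
pg_ctl -D "$PGDATA" restart
psql -c "SHOW default_transaction_isolation;"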
On 04/06/2014 05:25 AM, Ryan Johnson wrote:
> I've tried linux perf, but all it says is that lots of time is going to
> LWLock (but callgraph tracing doesn't work in my not-bleeding-edge
> kernel).

Make sure you compile with the "-fno-omit-frame-pointer" flag.

- Heikki
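(Concretely, that amounts to something along these lines -- a sketch only; adjust the prefix and PID to the actual setup:)

# Rebuild with frame pointers (and symbols) so perf can walk call stacks.
./configure --prefix=$HOME/pgsql CFLAGS="-O2 -g -fno-omit-frame-pointer"
make -j"$(nproc)" && make install

# Then profile one busy backend (PID of a postgres worker) for ~30 seconds:
perf record -g -p "$PID" -- sleep 30
perf report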
Ryan Johnson <ryan.johnson@cs.utoronto.ca> writes:
> I get a strange behavior across repeated runs: each 100-second run is a
> bit slower than the one preceding it, when run with SSI (SERIALIZABLE).
> ... So the question: what should I look for to diagnose/triage this problem?

In the past I've seen behaviors like this that traced to the range of "interesting" transaction IDs getting wider as time went on, so that more pages of pg_clog were hot, leading to more I/O traffic in the clog SLRU buffers. Possibly this is some effect like that.

> I've tried linux perf, but all it says is that lots of time is going to
> LWLock (but callgraph tracing doesn't work in my not-bleeding-edge
> kernel).

You could recompile with -DLWLOCK_STATS to enable gathering stats on which LWLocks are heavily contended. That would be a starting point for localizing the cause.

regards, tom lane
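(A sketch of one way to get that instrumentation compiled in; with LWLOCK_STATS the counters are printed to each backend's stderr when the process exits, so they end up wherever stderr is captured -- the log file name below is just an example:)

# Build with the lock-statistics instrumentation compiled in.
./configure --prefix=$HOME/pgsql CPPFLAGS="-DLWLOCK_STATS" CFLAGS="-O2 -g"
make -j"$(nproc)" && make install

# After a run and a clean shutdown, look at the captured stderr:
grep lwlock postmaster.log | head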
On 06/04/2014 4:30 AM, Heikki Linnakangas wrote:
> On 04/06/2014 05:25 AM, Ryan Johnson wrote:
>> I've tried linux perf, but all it says is that lots of time is going to
>> LWLock (but callgraph tracing doesn't work in my not-bleeding-edge
>> kernel).
>
> Make sure you compile with the "-fno-omit-frame-pointer" flag.

Oh, right. Forgot about that. With -g in place, it shows 5% of time going to LWLockAcquire, virtually all of that in PredicateLockPage. However, top reports only 50% utilization, so I'd *really* like to see the stack traces where the worker thread spends its time blocked, and how often each one occurs. I don't know a way to get that out of perf, though. It's near-trivial to do in dtrace, but unfortunately I don't have a Solaris or BSD machine handy. Does pgsql give a way to trace latencies due to LWLock?

Thanks,
Ryan
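(One possible half-measure with perf, assuming the scheduler tracepoints are accessible -- usually root only: sample a call chain at every context switch of a backend. That shows *where* it goes to sleep, though counts are per switch, not time spent blocked:)

# Grab a call chain each time the backend is switched out (blocked or preempted).
perf record -g -e sched:sched_switch -p "$PID" -- sleep 30
perf report --sort symbol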
On 06/04/2014 10:55 AM, Tom Lane wrote:
> Ryan Johnson <ryan.johnson@cs.utoronto.ca> writes:
>> I get a strange behavior across repeated runs: each 100-second run is a
>> bit slower than the one preceding it, when run with SSI (SERIALIZABLE).
>> ... So the question: what should I look for to diagnose/triage this problem?
> In the past I've seen behaviors like this that traced to the range of
> "interesting" transaction IDs getting wider as time went on, so that
> more pages of pg_clog were hot, leading to more I/O traffic in the
> clog SLRU buffers. Possibly this is some effect like that.

The effect disappears completely if I run under SI instead of SSI, though. That makes me suspect strongly that the problem lurks in SSI-specific infrastructure.
However, I did notice that the SLRU buffer that holds "old" SSI transactions sometimes spikes from hundreds to millions of entries (by annotating the source to ereport a warning whenever the difference between head and tail is at least 10% higher than the previous record). Not sure if that's related, though: I'm pretty sure SSI never scans SLRU, it's only used for random lookups.
>> I've tried linux perf, but all it says is that lots of time is going to
>> LWLock (but callgraph tracing doesn't work in my not-bleeding-edge
>> kernel).
> You could recompile with -DLWLOCK_STATS to enable gathering stats on
> which LWLocks are heavily contended. That would be a starting point
> for localizing the cause.
Here are the offenders (100-second run, 24 clients, ~2.2ktps):
lwlock 7 shacq 0 exacq 7002810 blk 896235 spindelay 213
lwlock 28 shacq 94984166 exacq 3938085 blk 572762 spindelay 163
lwlock 65 shacq 3347000 exacq 2933440 blk 255927 spindelay 90
lwlock 79 shacq 1742574 exacq 3009663 blk 216178 spindelay 41
lwlock 76 shacq 2293386 exacq 2892242 blk 205199 spindelay 70
lwlock 66 shacq 498909 exacq 2987485 blk 171508 spindelay 48
lwlock 80 shacq 512107 exacq 3181753 blk 165772 spindelay 43
lwlock 71 shacq 815733 exacq 3088157 blk 165579 spindelay 48
lwlock 74 shacq 603321 exacq 3065391 blk 159953 spindelay 56
lwlock 67 shacq 695465 exacq 2918970 blk 149339 spindelay 28
lwlock 69 shacq 411203 exacq 3044007 blk 148655 spindelay 34
lwlock 72 shacq 515260 exacq 2973321 blk 147533 spindelay 43
lwlock 30 shacq 41628636 exacq 8799 blk 143889 spindelay 186
lwlock 75 shacq 409472 exacq 2987227 blk 143196 spindelay 38
lwlock 77 shacq 409401 exacq 2946972 blk 139507 spindelay 34
lwlock 73 shacq 402544 exacq 2943467 blk 139380 spindelay 43
lwlock 78 shacq 404220 exacq 2912665 blk 137625 spindelay 21
lwlock 70 shacq 603643 exacq 2816730 blk 135851 spindelay 37
lwlock 68 shacq 403533 exacq 2862017 blk 131946 spindelay 30
lwlock 29 shacq 0 exacq 255302 blk 75838 spindelay 1
lwlock 0 shacq 0 exacq 561508 blk 12445 spindelay 3
lwlock 11 shacq 1245499 exacq 219717 blk 5501 spindelay 10
lwlock 4 shacq 381211 exacq 209146 blk 1273 spindelay 4
lwlock 3 shacq 16 exacq 209081 blk 522 spindelay 0
lwlock 8 shacq 0 exacq 137961 blk 50 spindelay 0
lwlock 2097366 shacq 0 exacq 384586 blk 1 spindelay 0
lwlock 2097365 shacq 0 exacq 370176 blk 1 spindelay 0
lwlock 2097359 shacq 0 exacq 363845 blk 1 spindelay 0
The above aggregates the per-lock stats from all processes, filters out locks with fewer than 10000 accesses (shared+exclusive) or with zero blk, then sorts by highest blk first.
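(Roughly the kind of pipeline that does that aggregation -- a sketch, assuming LWLOCK_STATS lines of the form "PID <pid> lwlock <id>: shacq <n> exacq <n> blk <n> spindelay <n>" in the captured stderr; the log file name is an example and field positions may need adjusting:)

grep 'lwlock' postmaster.log | awk '
  { id = $4; sub(":", "", id)                       # lock id, colon stripped
    sh[id] += $6; ex[id] += $8; blk[id] += $10; sd[id] += $12 }
  END {
    for (id in blk)
      if (sh[id] + ex[id] >= 10000 && blk[id] > 0)  # drop cold / never-blocked locks
        printf "lwlock %s shacq %d exacq %d blk %d spindelay %d\n", id, sh[id], ex[id], blk[id], sd[id]
  }' | sort -k8,8nr                                 # sort by blk, descending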
According to [1], locks {28, 29, 30} are {SerializableXactHashLock, SerializableFinishedListLock, SerializablePredicateLockListLock}, all SSI-related; locks 65-80 are the sixteen PredicateLockMgrLocks that the post mentions. Looking in lwlock.h, lock 7 (which tops the list) is the WALInsertLock. That lock was *not* mentioned in the pgsql-hackers post.
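(For the other hot IDs above, my reading of 9.3's lwlock.h is roughly the mapping below -- worth double-checking against the header itself, since the IDs are just declaration order in the LWLockId enum:)

#   0  BufFreelistLock
#   3  XidGenLock
#   4  ProcArrayLock
#   7  WALInsertLock
#   8  WALWriteLock
#   11 CLogControlLock
#   28 SerializableXactHashLock
#   29 SerializableFinishedListLock
#   30 SerializablePredicateLockListLock
#   65-80 PredicateLockMgrLock partitions
sed -n '/typedef enum LWLockId/,/} LWLockId;/p' src/include/storage/lwlock.h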
Re-running the same analysis for SI instead of SSI gives 4.6ktps and a much shorter list:
lwlock 7 shacq 0 exacq 14050121 blk 3429384 spindelay 347
lwlock 11 shacq 3133994 exacq 450325 blk 23456 spindelay 29
lwlock 0 shacq 0 exacq 684775 blk 19158 spindelay 3
lwlock 4 shacq 780846 exacq 428771 blk 4539 spindelay 6
lwlock 3 shacq 19 exacq 428705 blk 1147 spindelay 0
lwlock 59 shacq 0 exacq 125943 blk 203 spindelay 0
lwlock 8 shacq 0 exacq 287629 blk 116 spindelay 0
lwlock 2097358 shacq 0 exacq 752361 blk 1 spindelay 0
lwlock 2097355 shacq 0 exacq 755086 blk 1 spindelay 0
lwlock 2097352 shacq 0 exacq 760232 blk 1 spindelay 0
However, all of this only confirms that SSI has a lock bottleneck; it doesn't say why the bottleneck gets worse over time.
[1] http://www.postgresql.org/message-id/CA+TgmoYAiSM2jWEndReY5PL0sKbhgg7dbDH6r=oXKYzi9B7KJA@mail.gmail.com
Thoughts?
Ryan
On 05/04/2014 10:25 PM, Ryan Johnson wrote:
> Hi all,
>
> Disclaimer: this question probably belongs on the hackers list, but
> the instructions say you have to try somewhere else first... toss-up
> between this list and a bug report; list seemed more appropriate as a
> starting point. Happy to file a bug if that's more appropriate, though.
>
> This is with pgsql-9.3.4, x86_64-linux, home-built with `./configure
> --prefix=...' and gcc-4.7.
> TPC-C courtesy of oltpbenchmark.com. 12WH TPC-C, 24 clients.
>
> I get a strange behavior across repeated runs: each 100-second run is
> a bit slower than the one preceding it, when run with SSI
> (SERIALIZABLE). Switching to SI (REPEATABLE_READ) removes the problem,
> so it's apparently not due to the database growing. The database is
> completely shut down (pg_ctl stop) between runs, but the data lives in
> tmpfs, so there's no I/O problem here. 64GB RAM, so no paging, either.

The plot thickens... I just had a run die with an out of (tmpfs) disk space error; the pg_serial directory occupies 16GB, or 64825 segments (just under the 65k limit for SLRU). A bit of source diving confirms that this is the backing store for the OldSerXid SLRU that SSI uses. I'm not sure what would prevent SLRU space from being reclaimed, though, given that a complete, clean database shut-down happens between every run. In theory, all SSI info can be forgotten any time there are no serializable transactions in the system.

I nuked the pgsql data directory and started over, and started firing off 30-second runs (with pg_ctl start/stop in between each). On about the sixth run, throughput dropped to ~200tps and the benchmark harness terminated with an assertion error. I didn't see anything interesting in the server logs (the database shut down normally), but the pg_serial directory had ballooned from ~100kB to 8GB.

I tried to repro, and a series of 30-second runs gave the following throughputs (tps):

*4615 3155 3149 3115 3206 3162 3069 3005 2978 2953 **308 2871 2876 2838 2853 2817 2768 2736 2782 2732 2833 2749 2675 2771 2700 2675 2682 2647 2572 2626 2567 *4394

That ** entry was the 8GB blow-up again. All files in the directory had been created at the same time (= not during a previous run), and persisted through the runs that followed. There was also a run where abort rates jumped through the roof (~40k aborts rather than the usual 2000 or so), with a huge number of "out of shared memory" errors; apparently max_pred_locks_per_transaction = 2000 wasn't high enough.

The two * entries were produced by runs under SI, and confirm that the rest of the system has not been slowing down nearly as much as SSI: SI throughput dropped by 5% as the database quadrupled in size, while SSI throughput dropped by 23% during the same interval. And this was actually one of the better sets of runs; I had a few last week that dipped below 1ktps.

I'm not sure what to make of this, thoughts?

Ryan
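(A trivial check worth running between runs to catch this early -- assumes $PGDATA points at the data directory:)

# Size and segment count of the OldSerXid SLRU backing store.
du -sh "$PGDATA/pg_serial"
ls "$PGDATA/pg_serial" | wc -l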
On Mon, Apr 7, 2014 at 10:38:52AM -0400, Ryan Johnson wrote:
> The two * entries were produced by runs under SI, and confirm that
> the rest of the system has not been slowing down nearly as much as
> SSI. SI throughput dropped by 5% as the database quadrupled in size.
> SSI throughput dropped by 23% during the same interval. And this was
> actually one of the better sets of runs; I had a few last week that
> dipped below 1ktps.
>
> I'm not sure what to make of this, thoughts?

Seems it is time to ask on hackers.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
On 09/04/2014 5:21 PM, Bruce Momjian wrote:
> On Mon, Apr 7, 2014 at 10:38:52AM -0400, Ryan Johnson wrote:
>> The two * entries were produced by runs under SI, and confirm that
>> the rest of the system has not been slowing down nearly as much as
>> SSI. SI throughput dropped by 5% as the database quadrupled in size.
>> SSI throughput dropped by 23% during the same interval. And this was
>> actually one of the better sets of runs; I had a few last week that
>> dipped below 1ktps.
>>
>> I'm not sure what to make of this, thoughts?
> Seems it is time to ask on hackers.

Ack.