mosbench revisited
From | Robert Haas
---|---
Subject | mosbench revisited
Date |
Msg-id | CA+TgmoZWdo9XrH=TN59GX8rJM9FgiezpAA-B57ZEVOGof49FVA@mail.gmail.com
Replies | Re: mosbench revisited (Martijn van Oosterhout <kleptog@svana.org>), Re: mosbench revisited (Jim Nasby <jim@nasby.net>), Re: mosbench revisited (Jeff Janes <jeff.janes@gmail.com>), Re: mosbench revisited (Dimitri Fontaine <dimitri@2ndQuadrant.fr>), Re: mosbench revisited (Greg Stark <stark@mit.edu>)
List | pgsql-hackers
About nine months ago, we had a discussion of some benchmarking that was done by the mosbench folks at MIT:

http://archives.postgresql.org/pgsql-hackers/2010-10/msg00160.php

Although the authors used PostgreSQL as a test harness for driving load, it's pretty clear from reading the paper that their primary goal was to stress the Linux kernel, so the applicability of the paper to real-world PostgreSQL performance improvement is less than it might be. Still, having now actually investigated in some detail many of the same performance issues that they were struggling with, I have a much clearer understanding of what's really going on here. In PostgreSQL terms, here are the bottlenecks they ran into:

1. "We configure PostgreSQL to use a 2 Gbyte application-level cache because PostgreSQL protects its free-list with a single lock and thus scales poorly with smaller caches." This is a complaint about the BufFreeList lock which, in fact, I've seen as a huge point of contention on some workloads. In fact, on read-only workloads, with my lazy vxid lock patch applied, this is, I believe, the only remaining unpartitioned LWLock that is ever taken in exclusive mode; or at least the only one that's taken anywhere near often enough to matter. I think we're going to do something about this, although I don't have a specific idea in mind at the moment.

2. "PostgreSQL implements row- and table-level locks atop user-level mutexes; as a result, even a non-conflicting row- or table-level lock acquisition requires exclusively locking one of only 16 global mutexes." I think that the reference to row-level locks here is a red herring; or at least, I haven't seen any evidence that row-level locking is a meaningful source of contention on any workload I've tested. Table-level locks clearly are, and this is the problem that the now-committed fastlock patch addressed. So, fixed!

3. "Our workload creates one PostgreSQL connection per server core and sends queries (selects or updates) in batches of 256, aggregating successive read-only transactions into single transactions. This workload is intended to minimize application-level contention within PostgreSQL in order to maximize the stress PostgreSQL places on the kernel." I had no idea what this was talking about at the time, but it's now obvious in retrospect that they were working around the overhead imposed by acquiring and releasing relation and virtualxid locks. My pending "lazy vxids" patch will address the remaining issue here.

4. "With modified PostgreSQL on stock Linux, throughput for both workloads collapses at 36 cores ... The main reason is the kernel's lseek implementation." With the fastlock, sinval-hasmessages, and lazy-vxid patches applied (the first two are committed now), it's now much easier to run headlong into this bottleneck. Prior to those patches, for this to be an issue, you would need to batch your queries together in big groups to avoid getting whacked by the lock manager and/or sinval overhead first. With those problems and the recently discovered bottleneck in glibc's random() implementation fixed, good old pgbench -S is enough to hit this problem if you have enough clients and enough cores. And it turns out that the word "collapse" is not an exaggeration. On a 64-core Intel box running RHEL 6.1, performance ramped up from 24k TPS at 4 clients to 175k TPS at 32 clients and then to 207k TPS at 44 clients. After that it fell off a cliff, dropping to 93k TPS at 52 clients and 26k TPS at 64 clients, consuming truly horrifying amounts of system time in the process. A somewhat tedious investigation revealed that the problem is, in fact, contention on the inode mutex caused by lseek(). Results are much better with -M prepared (310k TPS at 48 clients, 294k TPS at 64 clients). All of these were one-minute tests with scale factor 100, fitting inside 8GB of shared_buffers (clearly not enough for serious benchmarking, but enough to demonstrate this issue). The relation-size probe that generates all of those lseek() calls is sketched, in simplified form, at the end of this message.

It would be nice if the Linux guys would fix this problem for us, but I'm not sure whether they will. For those who may be curious, the problem is in generic_file_llseek() in fs/read_write.c. On a platform with 8-byte atomic reads, it seems like it ought to be very possible to read inode->i_size without taking a spinlock (a toy illustration of that idea is also at the end of this message). A little Googling around suggests that some patches along these lines have been proposed and - for reasons that I don't fully understand - rejected. That now seems unfortunate. Barring a kernel-level fix, we could try to implement our own cache to work around this problem. However, any such cache would need to be darn cheap to check and update (since we can't assume that relation extension is an infrequent event) and must somehow avoid having the same sort of mutex contention that's killing the kernel in this workload. A rough sketch of the shape such a cache might take is at the end as well.

5. With all of the above problems fixed or worked around, the authors write, "PostgreSQL's overall scalability is primarily limited by contention for the spinlock protecting the buffer cache page for the root of the table index". This is the only problem on their list that I haven't yet encountered in testing. I'm kind of interested by the result, actually, as I had feared that the spinlock protecting ProcArrayLock was going to be a bigger problem sooner. But maybe not. I'm also concerned about the spinlock protecting the buffer mapping lock that covers the root index page. I'll investigate further if and when I come up with a way to dodge the lseek() contention problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
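Sketch of where the lseek() traffic comes from. This is an illustrative reconstruction rather than actual PostgreSQL source: BLCKSZ matches the server's default block size, but blocks_in_file() and the small driver around it are invented names. The point is simply that each relation-size probe boils down to one lseek(SEEK_END) system call on the relation's data file, and with enough backends those calls all pile up on the same inode mutex.

    /*
     * Illustrative sketch only, not PostgreSQL source code: how a relation's
     * current length in blocks is learned, one system call per probe.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BLCKSZ 8192                 /* PostgreSQL's default block size */

    /* Return the file's length in BLCKSZ-sized blocks, or -1 on error. */
    static long
    blocks_in_file(int fd)
    {
        off_t   eof = lseek(fd, 0, SEEK_END);   /* one syscall per probe */

        if (eof < 0)
            return -1;
        return (long) (eof / BLCKSZ);
    }

    int
    main(int argc, char **argv)
    {
        int     fd;

        if (argc != 2)
            return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
            return 1;
        printf("%ld blocks\n", blocks_in_file(fd));
        close(fd);
        return 0;
    }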
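Toy illustration of the inode->i_size point above. This is userspace C, not kernel code, and toy_inode is an invented stand-in; it only contrasts the "take a mutex in order to read a 64-bit size" pattern with a single atomic 8-byte load, which is the sort of change being wished for in generic_file_llseek().

    /* Toy userspace model, not kernel code: two ways to read a 64-bit size. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>

    /* Invented stand-in for the inode fields of interest. */
    struct toy_inode
    {
        pthread_mutex_t  lock;      /* what SEEK_END callers contend on */
        _Atomic int64_t  size;      /* what they actually need to read */
    };

    /* The contended pattern: every reader funnels through one mutex. */
    static int64_t
    read_size_locked(struct toy_inode *ino)
    {
        int64_t sz;

        pthread_mutex_lock(&ino->lock);
        sz = atomic_load_explicit(&ino->size, memory_order_relaxed);
        pthread_mutex_unlock(&ino->lock);
        return sz;
    }

    /* The alternative: one atomic 8-byte load, no lock at all. */
    static int64_t
    read_size_lockfree(struct toy_inode *ino)
    {
        return atomic_load_explicit(&ino->size, memory_order_acquire);
    }

    int
    main(void)
    {
        struct toy_inode ino;

        pthread_mutex_init(&ino.lock, NULL);
        atomic_store(&ino.size, (int64_t) 8192 * 100);
        return (read_size_locked(&ino) == read_size_lockfree(&ino)) ? 0 : 1;
    }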
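Rough sketch of the shape a backend-side relation-size cache might take. Every name here is invented, and the hard part is deliberately waved away: rel_size_generation stands in for whatever cheap, shared hint would let a backend notice that a relation somewhere has been extended or truncated, which is exactly the "cheap to check and update" constraint mentioned above.

    /* Invented sketch of a backend-local relation-size cache. */
    #include <stdint.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /*
     * Stand-in for whatever cheap, shared "something changed" hint a real
     * implementation would keep in shared memory and bump on every relation
     * extension or truncation.
     */
    static volatile uint64_t rel_size_generation = 0;

    typedef struct CachedRelSize
    {
        uint64_t    generation;     /* generation at which nblocks was valid */
        long        nblocks;        /* cached length in blocks, -1 if unknown */
    } CachedRelSize;

    /* The uncached path: the lseek(SEEK_END) probe from the first sketch. */
    static long
    probe_relation_size(int fd)
    {
        off_t   eof = lseek(fd, 0, SEEK_END);

        return (eof < 0) ? -1 : (long) (eof / BLCKSZ);
    }

    /*
     * The cheap check: if nothing has changed since we last looked, trust
     * the cached answer; otherwise pay for one lseek() and remember it.
     */
    static long
    cached_relation_size(CachedRelSize *cache, int fd)
    {
        uint64_t    gen = rel_size_generation;

        if (cache->nblocks >= 0 && cache->generation == gen)
            return cache->nblocks;

        cache->nblocks = probe_relation_size(fd);
        cache->generation = gen;
        return cache->nblocks;
    }

    int
    main(void)
    {
        CachedRelSize cache = { 0, -1 };

        /* First call pays for an lseek(); the second call is answered from
         * the cache whenever stdin is a seekable file. */
        long    a = cached_relation_size(&cache, STDIN_FILENO);
        long    b = cached_relation_size(&cache, STDIN_FILENO);

        return (a == b) ? 0 : 1;
    }

The fast path here costs one shared-memory read and one comparison; whether the generation counter itself can be maintained without reintroducing the very contention being avoided is the open question.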