Обсуждение: 8.4 open item: copy performance regression?
Any objections if I add: http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php to the (currently empty) list of open items for 8.4? A 25-30% performance regression in our main bulk loading mechanism should at least be explained before the release... Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > Any objections if I add: > http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php > to the (currently empty) list of open items for 8.4? I am unable to duplicate any slowdown on this test case. AFAICT 8.4 and 8.3 branch tip are about the same speed; if anything 8.4 is faster. Testing on x86_64 Fedora 10 ... regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> Any objections if I add: >> http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php >> to the (currently empty) list of open items for 8.4? > > I am unable to duplicate any slowdown on this test case. AFAICT > 8.4 and 8.3 branch tip are about the same speed; if anything 8.4 > is faster. Testing on x86_64 Fedora 10 ... I just ran the specified test on: Linux ATHENA 2.6.16.60-0.31-smp #1 SMP Tue Oct 7 16:16:29 UTC 2008 x86_64 x86_64 x86_64 GNU/Linux Linux version 2.6.16.60-0.31-smp (geeko@buildhost) (gcc version 4.1.2 20070115 (SUSE Linux)) #1 SMP Tue Oct 7 16:16:29 UTC 2008 SUSE Linux Enterprise Server 10 (x86_64) VERSION = 10 PATCHLEVEL = 2 I did configure with a prefix and these options:--enable-integer-datetimes --enable-debug --disable-nls --with-libxml I did initdb with --no-locale 8.3.7 real 0m24.249s real 0m24.054s real 0m24.361s 8.4rc1 real 0m33.503s real 0m34.198s real 0m33.931s -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
>> I am unable to duplicate any slowdown on this test case.
> [ Kevin can ]
It'd be useful first off to figure out if it's a CPU or I/O issue.
Is there any visible difference in vmstat output?  Also, try turning
off autovacuum in both cases, just to see if that's related.
        regards, tom lane
			
		Tom Lane <tgl@sss.pgh.pa.us> wrote: > It'd be useful first off to figure out if it's a CPU or I/O issue. > Is there any visible difference in vmstat output? Also, try turning > off autovacuum in both cases, just to see if that's related. Both took slightly longer with autovacuum off, but probably just within the noise. It's hard to interpret the vmstat output, largely because I chose to run this on one of our biggest servers, which is not currently serving an application, per se, but as a replication target, and this being "off hours" is busy running the sync process to the source machines. This involves generating md5 sums on both sides for blocks of rows, which is pretty CPU-intensive. There is very little disk output from that right now, pretty light on the disk reads, but keeping a few CPUs pretty busy generating those md5 sums. I've got to go keep an appointment, but I'll come back and see if I can do more. For now, here's the raw vmstat, in case someone can pick out info I'm missing: -Kevin
Вложения
On Thu, Jun 18, 2009 at 05:20:08PM -0400, Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > > Any objections if I add: > > http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php > > to the (currently empty) list of open items for 8.4? > > I am unable to duplicate any slowdown on this test case. AFAICT > 8.4 and 8.3 branch tip are about the same speed; if anything 8.4 > is faster. Testing on x86_64 Fedora 10 ... > > regards, tom lane > What is not clear from Stefen's function listing is how the 8.4 server could issue 33% more XLogInsert() and CopyReadLine() calls than the 8.3.7 server using the same input file. That would account for the slow down but now why it is happening. Cheers, Ken
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > I've got to go keep an appointment Sorry about that. Back now. Anything else I can do to help with this? -Kevin
Kenneth Marshall <ktm@rice.edu> wrote: > What is not clear from Stefen's function listing is how the 8.4 > server could issue 33% more XLogInsert() and CopyReadLine() > calls than the 8.3.7 server using the same input file. I thought those were profiling numbers -- the number of times a timer checked what was executing and found it in that method. Which suggests that those two methods are probably slower now than in 8.3.7, at least in some environments. -Kevin
			
				It doesn't look like it's related to autovacuum.  I re-ran the test against the two solaris boxes with autovacuum turned off and the results look about the same.
8.3.7 - Solaris 10 11/06 s10x_u3wos_10 X86
real 0m43.662s
user 0m0.001s
sys 0m0.003s
real 0m43.565s
user 0m0.001s
sys 0m0.003s
real 0m43.742s
user 0m0.001s
sys 0m0.003s
8.4rc1 - Solaris 10 11/06 s10x_u3wos_10 X86
real 0m59.304s
user 0m0.001s
sys 0m0.003s
real 0m58.653s
user 0m0.001s
sys 0m0.003s
real 1m0.253s
user 0m0.001s
sys 0m0.003s
8.3.7 - Solaris 10 8/07 s10x_u4wos_12b X86
real 0m38.981s
user 0m0.002s
sys 0m0.004s
real 0m39.879s
user 0m0.002s
sys 0m0.004s
real 0m39.111s
user 0m0.002s
sys 0m0.004s
8.4rc1 - Solaris 10 8/07 s10x_u4wos_12b X86
real 0m50.647s
user 0m0.002s
sys 0m0.004s
real 0m49.453s
user 0m0.002s
sys 0m0.004s
real 0m49.725s
user 0m0.002s
sys 0m0.004s
Alan
			
		
		
	8.3.7 - Solaris 10 11/06 s10x_u3wos_10 X86
real 0m43.662s
user 0m0.001s
sys 0m0.003s
real 0m43.565s
user 0m0.001s
sys 0m0.003s
real 0m43.742s
user 0m0.001s
sys 0m0.003s
8.4rc1 - Solaris 10 11/06 s10x_u3wos_10 X86
real 0m59.304s
user 0m0.001s
sys 0m0.003s
real 0m58.653s
user 0m0.001s
sys 0m0.003s
real 1m0.253s
user 0m0.001s
sys 0m0.003s
8.3.7 - Solaris 10 8/07 s10x_u4wos_12b X86
real 0m38.981s
user 0m0.002s
sys 0m0.004s
real 0m39.879s
user 0m0.002s
sys 0m0.004s
real 0m39.111s
user 0m0.002s
sys 0m0.004s
8.4rc1 - Solaris 10 8/07 s10x_u4wos_12b X86
real 0m50.647s
user 0m0.002s
sys 0m0.004s
real 0m49.453s
user 0m0.002s
sys 0m0.004s
real 0m49.725s
user 0m0.002s
sys 0m0.004s
Alan
 On Thu, Jun 18, 2009 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I am unable to duplicate any slowdown on this test case.> [ Kevin can ]
It'd be useful first off to figure out if it's a CPU or I/O issue.
Is there any visible difference in vmstat output? Also, try turning
off autovacuum in both cases, just to see if that's related.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Yes, you are right. I thought that they were absolute function counts. The data makes more sense now. Regards, Ken On Thu, Jun 18, 2009 at 07:03:34PM -0500, Kevin Grittner wrote: > Kenneth Marshall <ktm@rice.edu> wrote: > > > What is not clear from Stefen's function listing is how the 8.4 > > server could issue 33% more XLogInsert() and CopyReadLine() > > calls than the 8.3.7 server using the same input file. > > I thought those were profiling numbers -- the number of times a timer > checked what was executing and found it in that method. Which > suggests that those two methods are probably slower now than in 8.3.7, > at least in some environments. > > -Kevin >
Kevin Grittner wrote: > > 8.3.7 > real 0m24.249s > real 0m24.054s > real 0m24.361s > > 8.4rc1 > real 0m33.503s > real 0m34.198s > real 0m33.931s > > > Ugh. This looks like a poster child case for a benchfarm ... Is there any chance you guys could triangulate this a bit? Good initial triangulation points might be the end of each commitfest. (I have a vested interest in making sure COPY performance doesn't regress, since it will affect parallel restore's speed in spades.) cheers andrew
On 6/19/09, Andrew Dunstan <andrew@dunslane.net> wrote: > Kevin Grittner wrote: > > 8.3.7 > > real 0m24.249s > > real 0m24.054s > > real 0m24.361s > > 8.4rc1 > > real 0m33.503s > > real 0m34.198s > > real 0m33.931s > > Ugh. This looks like a poster child case for a benchfarm ... > > Is there any chance you guys could triangulate this a bit? Good initial > triangulation points might be the end of each commitfest. (I have a vested > interest in making sure COPY performance doesn't regress, since it will > affect parallel restore's speed in spades.) git bisect? -- marko
Andrew Dunstan wrote: > > > Kevin Grittner wrote: >> >> 8.3.7 >> real 0m24.249s >> real 0m24.054s >> real 0m24.361s >> >> 8.4rc1 >> real 0m33.503s >> real 0m34.198s >> real 0m33.931s >> >> >> > > Ugh. This looks like a poster child case for a benchfarm ... indeed... > > Is there any chance you guys could triangulate this a bit? Good initial > triangulation points might be the end of each commitfest. (I have a > vested interest in making sure COPY performance doesn't regress, since > it will affect parallel restore's speed in spades.) Maybe parallel restore is the issue why we haven't noticed this earlier. The case that regressed this way is WAL logged COPY, COPY that can bypass WAL (which typically happens in parallel restore now) is actually a bit faster in my testing in 8.4. I will try and see if I can figure out what caused the regression... Stefan
Just eyeing the code ... another thing we changed since 8.3 is to enable
posix_fadvise() calls for WAL.  Any of the complaints want to try diking
out this bit of code (near line 2580 in src/backend/access/transam/xlog.c)?
#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)if (!XLogArchivingActive() &&
(get_sync_bit(sync_method)& PG_O_DIRECT) == 0)    (void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
 
#endif
        regards, tom lane
			
		Tom Lane wrote: > Just eyeing the code ... another thing we changed since 8.3 is to enable > posix_fadvise() calls for WAL. Any of the complaints want to try diking > out this bit of code (near line 2580 in src/backend/access/transam/xlog.c)? > > #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED) > if (!XLogArchivingActive() && > (get_sync_bit(sync_method) & PG_O_DIRECT) == 0) > (void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED); > #endif doesn't seem to cause any noticable difference for me... Stefan
Tom Lane wrote: > Just eyeing the code ... another thing we changed since 8.3 is to enable > posix_fadvise() calls for WAL. Any of the complaints want to try diking > out this bit of code (near line 2580 in src/backend/access/transam/xlog.c)? > > #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED) > if (!XLogArchivingActive() && > (get_sync_bit(sync_method) & PG_O_DIRECT) == 0) > (void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED); > #endif ok after a bit of bisecting I'm happy to announce the winner of the contest: http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php this patch causes a 25-30% performance regression for WAL logged copy, however in the WAL bypass case (maybe that was what got tested?) it results in a 20% performance increase. the raw numbers using the upthread posted minimal postgresql.conf are: post patch/wal logged: 4min10s/4min19/4min12 post patch/wal bypass: 1m55s/1m58s/2m00 prepatch/wal logged: 2m55s/3min00/2m59 prepatch/wal bypass: 2m22s/2m18s/2m20s Stefan
On Fri, Jun 19, 2009 at 07:49:31PM +0200, Stefan Kaltenbrunner wrote: > Tom Lane wrote: >> Just eyeing the code ... another thing we changed since 8.3 is to enable >> posix_fadvise() calls for WAL. Any of the complaints want to try diking >> out this bit of code (near line 2580 in >> src/backend/access/transam/xlog.c)? >> #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED) >> if (!XLogArchivingActive() && >> (get_sync_bit(sync_method) & PG_O_DIRECT) == 0) >> (void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED); >> #endif > > ok after a bit of bisecting I'm happy to announce the winner of the > contest: > > http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php > > this patch causes a 25-30% performance regression for WAL logged copy, > however in the WAL bypass case (maybe that was what got tested?) it results > in a 20% performance increase. > > the raw numbers using the upthread posted minimal postgresql.conf are: > > post patch/wal logged: 4min10s/4min19/4min12 > post patch/wal bypass: 1m55s/1m58s/2m00 > prepatch/wal logged: 2m55s/3min00/2m59 > prepatch/wal bypass: 2m22s/2m18s/2m20s > > > Stefan > Great! Maybe just increasing the size of the BULKWRITE ring, possibly as a function of the shared_memory is all that is needed. 256kB is the currently coded ring_size in storage/buffer/freelist.c Ken
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > ok after a bit of bisecting I'm happy to announce the winner of the contest: > http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php > this patch causes a 25-30% performance regression for WAL logged copy, > however in the WAL bypass case (maybe that was what got tested?) it > results in a 20% performance increase. Hmm. What that patch actually changes is that it prevents a bulk insert (ie COPY in) from trashing the entire shared-buffers arena. I think the reason for the WAL correlation is that once it's filled the ring buffer, creating new pages requires writing out old ones, and the WAL-before-data rule means that the copy process has to block waiting for WAL to go down to disk before it can write. When it's allowed to use the whole arena there is more chance for some of that writing to be done by the walwriter or bgwriter. But the details are going to depend on the platform's CPU vs I/O balance, which no doubt explains why some of us don't see it. I don't think we want to revert that patch --- not trashing the whole buffer arena seems like a Good Thing from a system-wide point of view, even if it makes individual COPY operations go slower. However, we could maybe play around with the tradeoffs a bit. In particular it seems like it would be useful to experiment with different ring buffer sizes. Could you try increasing the ring size allowed in src/backend/storage/buffer/freelist.c for the BULKWRITE case *************** *** 384,389 **** --- 384,392 ---- case BAS_BULKREAD: ring_size = 256 * 1024 / BLCKSZ; break; + case BAS_BULKWRITE: + ring_size = 256 * 1024 / BLCKSZ; + break; case BAS_VACUUM: ring_size = 256 * 1024 / BLCKSZ; break; and see if maybe we can buy back most of the loss with not too much of a ring size increase? regards, tom lane
Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> ok after a bit of bisecting I'm happy to announce the winner of the contest: >> http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php > >> this patch causes a 25-30% performance regression for WAL logged copy, >> however in the WAL bypass case (maybe that was what got tested?) it >> results in a 20% performance increase. > > Hmm. What that patch actually changes is that it prevents a bulk insert > (ie COPY in) from trashing the entire shared-buffers arena. I think the > reason for the WAL correlation is that once it's filled the ring buffer, > creating new pages requires writing out old ones, and the > WAL-before-data rule means that the copy process has to block waiting > for WAL to go down to disk before it can write. When it's allowed to > use the whole arena there is more chance for some of that writing to be > done by the walwriter or bgwriter. But the details are going to depend > on the platform's CPU vs I/O balance, which no doubt explains why some > of us don't see it. hmm - In my case both the CPU (an Intel E5530 Nehalem) and the IO subsystem (8GB Fiberchannel connected NetApp with 4GB cache) are pretty fast. and even with say fsync=off 8.4RC1 is only slightly faster than 8.3 with the same config and fsync=on so maybe there is a secondary effect at play too. > > I don't think we want to revert that patch --- not trashing the whole > buffer arena seems like a Good Thing from a system-wide point of view, > even if it makes individual COPY operations go slower. However, we > could maybe play around with the tradeoffs a bit. In particular it > seems like it would be useful to experiment with different ring buffer > sizes. Could you try increasing the ring size allowed in > src/backend/storage/buffer/freelist.c for the BULKWRITE case > > *************** > *** 384,389 **** > --- 384,392 ---- > case BAS_BULKREAD: > ring_size = 256 * 1024 / BLCKSZ; > break; > + case BAS_BULKWRITE: > + ring_size = 256 * 1024 / BLCKSZ; > + break; > case BAS_VACUUM: > ring_size = 256 * 1024 / BLCKSZ; > break; > > > and see if maybe we can buy back most of the loss with not too much > of a ring size increase? already started testing that once I found the offending commit. 256 * 1024 / BLCKSZ 4min10s/4min19/4min12 512 * 1024 / BLCKSZ 3min27s/3min32s 1024 * 1024 / BLCKSZ 3min14s/3min12s 2048 * 1024 / BLCKSZ 3min02/3min02 4096 * 1024 / BLCKSZ 2m59/2m58s 8192 * 1024 / BLCKSZ 2m59/2m59s so 4096 * 1024 / BLCKSZ seems to be the sweet spot and also results in more or less the same performance that 8.3 had. Stefan
> so 4096 * 1024 / BLCKSZ seems to be the sweet spot and also results in > more or less the same performance that 8.3 had. Can some folks test this with different size COPYs? That's both larger/smaller tables, and larger/smaller rows. We should also test copy with large blob data. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
On Fri, 2009-06-19 at 14:11 -0400, Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > > ok after a bit of bisecting I'm happy to announce the winner of the contest: > > http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php > > > this patch causes a 25-30% performance regression for WAL logged copy, > > however in the WAL bypass case (maybe that was what got tested?) it > > results in a 20% performance increase. > > Hmm. What that patch actually changes is that it prevents a bulk insert > (ie COPY in) from trashing the entire shared-buffers arena. I think the > reason for the WAL correlation is that once it's filled the ring buffer, > creating new pages requires writing out old ones, and the > WAL-before-data rule means that the copy process has to block waiting > for WAL to go down to disk before it can write. When it's allowed to > use the whole arena there is more chance for some of that writing to be > done by the walwriter or bgwriter. But the details are going to depend > on the platform's CPU vs I/O balance, which no doubt explains why some > of us don't see it. > > I don't think we want to revert that patch --- not trashing the whole > buffer arena seems like a Good Thing from a system-wide point of view, > even if it makes individual COPY operations go slower. However, we > could maybe play around with the tradeoffs a bit. In particular it > seems like it would be useful to experiment with different ring buffer > sizes. Could you try increasing the ring size allowed in > src/backend/storage/buffer/freelist.c for the BULKWRITE case Yes, that's definitely the cause. The ring buffer technique was prototyped in 8.3 and a ring of 32 blocks was found not to be sufficient, which was one reason we didn't try to commit that then. At the time, I also proposed a "filled buffer list" change to bufmgr to allow bgwriter to preferentially target COPY's filled blocks, which would also help with this effect. You'll get better benefit from decreasing wal_writer_delay also. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Fri, 19 Jun 2009, Stefan Kaltenbrunner wrote: > In my case both the CPU (an Intel E5530 Nehalem) and the IO subsystem > (8GB Fiberchannel connected NetApp with 4GB cache) are pretty fast. The server Alan identified as "Solaris 10 8/07 s10x_u4wos_12b X86" has a Xeon E5320 (1.86GHz) and a single boring SAS drive in it (Sun X4150). The filesystem involved in that particular case is UFS, which I am suspicious of as being part of why the problem is so pronounced there--the default UFS tuning is pretty lightweight in terms of how much caching it does. Not sure if Alan ran any tests against the big ZFS volume on the other sever, I think all the results he posted were from the UFS boot drive there too. > so 4096 * 1024 / BLCKSZ seems to be the sweet spot and also results in more > or less the same performance that 8.3 had. It looks like it's a touch higher on our 8/07 system, it levels out at 8192 * (haven't checked the other one yet). I'm seeing this, using Alan's original test set size (to make sure I was running exactly the same test) and just grabbing the low/high from a set of 3 runs: 8.3.7: 0m39.266s 0m43.269s (alan: 36.2 - 39.2) 256: 0m50.435s 0m51.944s (alan: 48.1 - 50.6) 1024: 0m47.299s 0m49.346s 4096: 0m43.725s 0m46.116s 8192: 0m40.715s 0m42.480s 16384: 0m41.318s 0m42.118s 65536: 0m41.675s 0m42.955s I collected some iostat data here as well for some of the runs (the vmstat data was harder to read, this being Solaris, and didn't seem to add anything). I'm seeing lines like this with the default ring buffer of 256 *: tty sd1 sd2 nfs1 cpu tin tout kps tps serv kps tps serv kps tps serv us sy wtid 0 322 12 1 0 41371 2754 0 0 0 0 12 11 0 78 0 166 0 0 0 46246 3380 0 0 0 0 14 10 0 76 0 164 0 0 0 44874 3068 1 0 0 0 13 9 0 78 Obviously sd2 is where the database and source file are at. Basically, about one core (out of four) tied up with a pretty even split of user/system time. Using the highest ring size I tried, 65536 *, gives lines that look like this: tty sd1 sd2 nfs1 cpu tin tout kps tps serv kps tps serv kps tps serv us sy wtid 0 163 0 0 0 56696 4291 0 0 0 0 20 12 0 68 0 166 0 0 0 58554 4542 0 0 0 0 21 12 0 67 0 168 0 0 0 56057 4308 0 0 0 0 21 12 0 67 So it seems like increasing the ring size helps saturate the disks better, went from ~45MB/s to 57MB/s. What's kind of interesting is to compare this against the 8.3.7 run, which is the fastest of them all, which I was expecting to find had the highest write rate of them all; tty sd1 sd2 nfs1 cpu tin tout kps tps serv kps tps serv kps tps serv us sy wtid 0 83 0 0 0 47654 2121 0 0 0 0 23 8 0 69 0 240 0 0 0 44198 2150 1 0 0 0 19 8 0 73 0 83 0 0 0 37750 1110 1 0 0 0 21 6 0 72 That's actually doing less I/O per capita, which is why it's also got less waiting for I/O%, but it's completing the most work. This makes me wonder if in addition to the ring buffering issue, there isn't just plain more writing per average completed transaction in 8.4 with this type of COPY. This might explain why even with the expanded ring buffer, both Stephan and my test runs still showed a bit of a regression against 8.3. I'm guessing we have a second, smaller shooter here involved as well. In any case, a bump of the ring multiplier to either 4096 or 8192 eliminates the worst of the regression here, good improvement so far. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Sat, 20 Jun 2009, Simon Riggs wrote: > At the time, I also proposed a "filled buffer list" change to bufmgr to > allow bgwriter to preferentially target COPY's filled blocks, which > would also help with this effect. One of the things I keep meaning to investigate is whether there's any benefit to pushing buffers the background writer cleans onto the free list, to speed up their subsequent allocation to needy backends. Both this and the sequential scan ring buffer implementation might both benefit from an approach where buffers are similarly pushed onto the free list when they're no longer needed by the process that pulled them in. Buffers could move from allocated->used->filled buffer list->free list in the COPY buffer case, and allocated->used->free list when executing a sequential scan. That would seem to me to be a more robust general approach for solving this class of problem than the whole ring buffer idea, which is a great start but bound to run into situations where the size of the buffer just isn't right anymore a few hardware generations down the road. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Sat, 2009-06-20 at 02:53 -0400, Greg Smith wrote: > On Sat, 20 Jun 2009, Simon Riggs wrote: > > > At the time, I also proposed a "filled buffer list" change to bufmgr to > > allow bgwriter to preferentially target COPY's filled blocks, which > > would also help with this effect. > > One of the things I keep meaning to investigate is whether there's any > benefit to pushing buffers the background writer cleans onto the free > list, to speed up their subsequent allocation to needy backends. Both > this and the sequential scan ring buffer implementation might both benefit > from an approach where buffers are similarly pushed onto the free list > when they're no longer needed by the process that pulled them in. > Buffers could move from allocated->used->filled buffer list->free list in > the COPY buffer case, and allocated->used->free list when executing a > sequential scan. The reason for not doing that would be that we don't know that the blocks are free to use; we know very little about them. The longer we leave them the more likely they are to be reused, so putting buffers onto the freelist when they aren't actually free would likely make the cache less effective, ISTM. With filled buffers from COPY we have a very good clue that the buffers will no longer be needed and can treat them differently from others. Also, if we can get the bgwriter to do some of the work then we would have the COPY process, bgwriter and WALwriter all working together on the data loading. (We need to make WALwriter a little smarter also, so that it can respond to changing WAL write workloads). > That would seem to me to be a more robust general approach for solving > this class of problem than the whole ring buffer idea, which is a great > start but bound to run into situations where the size of the buffer just > isn't right anymore a few hardware generations down the road. The ring buffer optimises L2 cache, not the buffer cache in general. If we put buffers back on the freelist that is the same as having one global ring buffer, which would then spoil the benefit for L2 on multi-CPU systems. We don't see any L2 benefit with COPY yet, but it is potentially there if we can overcome the stronger effect of the WAL costs. When the ring buffer expands to minimise WAL overheads we also minimise benefit for L2. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Fri, 2009-06-19 at 22:03 -0400, Greg Smith wrote: > This makes me wonder if in addition to the ring buffering issue, there > isn't just plain more writing per average completed transaction in 8.4 > with this type of COPY. I would suggest that we check how much WAL has been written. There may be a secondary effect or a different regression hidden in these results. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Sat, Jun 20, 2009 at 9:22 AM, Simon Riggs<simon@2ndquadrant.com> wrote: >> That would seem to me to be a more robust general approach for solving >> this class of problem than the whole ring buffer idea, which is a great >> start but bound to run into situations where the size of the buffer just >> isn't right anymore a few hardware generations down the road. > > The ring buffer optimises L2 cache, not the buffer cache in general. If > we put buffers back on the freelist that is the same as having one > global ring buffer, which would then spoil the benefit for L2 on > multi-CPU systems. Incidentally a "ring buffer" is something else. We're talking about a clock-sweep algorithm. The whole point of the clock-sweep algorithm is that it's an approximated LRU but with no single point of contention like a list pointer. Doing this would be undermining that advantage. I don't understand what you mean by "size of the buffer" either. The only real parameter in the clock sweep algorithm is how many distinct counter values there are. The more values the closer to an LRU the result is. it isn't really tied to hardware in any way, just the usage pattern. -- greg http://mit.edu/~gsstark/resume.pdf
Greg Smith wrote: > On Fri, 19 Jun 2009, Stefan Kaltenbrunner wrote: > >> In my case both the CPU (an Intel E5530 Nehalem) and the IO subsystem >> (8GB Fiberchannel connected NetApp with 4GB cache) are pretty fast. > > The server Alan identified as "Solaris 10 8/07 s10x_u4wos_12b X86" has a > Xeon E5320 (1.86GHz) and a single boring SAS drive in it (Sun X4150). > The filesystem involved in that particular case is UFS, which I am > suspicious of as being part of why the problem is so pronounced > there--the default UFS tuning is pretty lightweight in terms of how much > caching it does. Not sure if Alan ran any tests against the big ZFS > volume on the other sever, I think all the results he posted were from > the UFS boot drive there too. > >> so 4096 * 1024 / BLCKSZ seems to be the sweet spot and also results in >> more or less the same performance that 8.3 had. > > It looks like it's a touch higher on our 8/07 system, it levels out at > 8192 * (haven't checked the other one yet). I'm seeing this, using > Alan's original test set size (to make sure I was running exactly the > same test) and just grabbing the low/high from a set of 3 runs: > > 8.3.7: 0m39.266s 0m43.269s (alan: 36.2 - 39.2) > > 256: 0m50.435s 0m51.944s (alan: 48.1 - 50.6) > 1024: 0m47.299s 0m49.346s > 4096: 0m43.725s 0m46.116s > 8192: 0m40.715s 0m42.480s > 16384: 0m41.318s 0m42.118s > 65536: 0m41.675s 0m42.955s hmm interesting - I just did a bunch of runs using the lineitem table from the DBT3 tests (loading 60M rows in each run) and the same config Alan used. 8.4(postpatch - not RC1 but that one seems to behave exactly the same way) lineitem1 256 9min38s 512 9min20s 1024 7m44.667s/7m45.342s 2048 7m15.500s/7m17.910s 4096 7m11.424s/7m13.276s 8192 6m43.203s/6m48.293s 16384 6m24.980s/6m24.116s 32768 6m20.753s/6m22.083s 65536 6m22.913s/6m22.449s 1048576 6m23.765s/6m24.645s 8.3 6m45.650s/6m44.781s so on this workload the sweetspot seems to be much higher than on the one with the narrower rows. [...] > That's actually doing less I/O per capita, which is why it's also got > less waiting for I/O%, but it's completing the most work. This makes me > wonder if in addition to the ring buffering issue, there isn't just > plain more writing per average completed transaction in 8.4 with this > type of COPY. This might explain why even with the expanded ring buffer, > both Stephan and my test runs still showed a bit of a regression against > 8.3. I'm guessing we have a second, smaller shooter here involved as well. well yes I also suspect that there is some secondary effect at play here and I believe I have seen the "more IO with 8.4"thing here too but I have not actually paid enough attention yet to be sure. > > In any case, a bump of the ring multiplier to either 4096 or 8192 > eliminates the worst of the regression here, good improvement so far. yeah with the above numbers I would say that 8192 should remove most if not all of the regression. However it seems that we might have to make this more dynamic in the future since the behaviour seems to depend on a number of variables... Stefan
On Sat, Jun 20, 2009 at 12:10 PM, Greg Stark<gsstark@mit.edu> wrote: > > I don't understand what you mean by "size of the buffer" either. Ok, having gone back and read the whole thread I understand the context for that statement. Nevermind. -- greg http://mit.edu/~gsstark/resume.pdf
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>    and I believe I have seen the "more IO with 8.4" thing here too but I 
> have not actually paid enough attention yet to be sure.
FSM/VM overhead maybe?  I think COPY IN is setting the SKIP_FSM bit,
but I wonder if there's some vestigial overhead anyway.
        regards, tom lane
			
		On Sat, 2009-06-20 at 13:15 +0200, Stefan Kaltenbrunner wrote: > > > > 8.3.7: 0m39.266s 0m43.269s (alan: 36.2 - 39.2) > > > > 8192: 0m40.715s 0m42.480s > > 16384: 0m41.318s 0m42.118s > > 65536: 0m41.675s 0m42.955s > > hmm interesting - I just did a bunch of runs using the lineitem table > from the DBT3 tests (loading 60M rows in each run) and the same config > Alan used. > > 8.4(postpatch - not RC1 but that one seems to behave exactly the same way) > > lineitem1 > 8192 6m43.203s/6m48.293s > 16384 6m24.980s/6m24.116s > 32768 6m20.753s/6m22.083s > 65536 6m22.913s/6m22.449s > 1048576 6m23.765s/6m24.645s > 8.3 > > 6m45.650s/6m44.781s > so on this workload the sweetspot seems to be much higher than on the > one with the narrower rows. The rest of the patch should have had a greater effect on tables with thinner rows. Your results match my expectations, though I read from them that we should use 16384, since that provides some gain, not just a cancellation of the regression. I would suggest that we leave it as a tunable parameter in this release and remove it again once we have clear evidence of how to set it. We are unlikely to cover conclusively how to do this before we release 8.4. > > > > In any case, a bump of the ring multiplier to either 4096 or 8192 > > eliminates the worst of the regression here, good improvement so far. > > yeah with the above numbers I would say that 8192 should remove most if > not all of the regression. However it seems that we might have to make > this more dynamic in the future since the behaviour seems to depend on a > number of variables... I would be inclined to repeat this with multiple concurrent COPYs. We may be able to improve on the numbers there, as well as get a perspective on how to set parameter in real world. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Sat, 20 Jun 2009, Simon Riggs wrote: > The reason for not doing that would be that we don't know that the > blocks are free to use; we know very little about them. The longer we > leave them the more likely they are to be reused, so putting buffers > onto the freelist when they aren't actually free would likely make the > cache less effective, ISTM. This is the prevailing wisdom. The current design of the background writer tries to make a reasonable guess as to how many buffers are going to be allocated in the next bg_writer_delay period, and it locates/generates clean buffers to fill that expected need. The idea would be to put those buffers, which have a good chance of being allocated by somebody else and therefore invalided shortly, onto the free list. There's a certain amount of overhead the backends themselves wait for in order to advance the clock sweep position to find the buffers they need, and that adds some latency to them. I would guess there's some potential for reducing that latency if the freelist was sometimes populated, which right now it rarely is. While that would add contention onto the free list, I wonder if it might still be less than the contention on advancing the current buffer pool strategy point. I don't want to go through the whole thing in this thread, just pointing out some similarity with an item I'd already proposed onto the TODO list--and that a good solution there might get rid of this whole "how big do I make the ring buffer?" situation. Obviously actual measurements here would trump any theorizing as to what works better, it's hard to get any intuition about low-level optimizing given how complicated CPU caches are nowadays. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Sat, 2009-06-20 at 13:15 +0200, Stefan Kaltenbrunner wrote:
>> 8192 6m43.203s/6m48.293s
>> 16384 6m24.980s/6m24.116s
>> 32768 6m20.753s/6m22.083s
>> 65536 6m22.913s/6m22.449s
>> 1048576 6m23.765s/6m24.645s
> The rest of the patch should have had a greater effect on tables with
> thinner rows. Your results match my expectations, though I read from
> them that we should use 16384, since that provides some gain, not just a
> cancellation of the regression. 
+1 for using 16384 (ie, max ring buffer size 16MB).  Maybe even more.
It seems likely that other cases might have an even bigger issue than
is exhibited in the couple of test cases we have here, so we should
leave some margin for error.  Also, there's code in there to limit the
ring buffer to 1/8th of shared buffers, so we don't have to worry about
trashing the whole buffer arena in small configurations.  Any limitation
at all is still a step forward over previous releases as far as not
trashing the arena is concerned.
> I would suggest that we leave it as a tunable parameter in this release
> and remove it again once we have clear evidence of how to set it.
It's not going to become an exposed tunable.  There is not evidence to
justify that, and we are down to the point of the release cycle where
any more than one-liner changes have to be looked at with considerable
reluctance.
        regards, tom lane
			
		On Sat, 20 Jun 2009, Simon Riggs wrote: > I would suggest that we check how much WAL has been written. There may > be a secondary effect or a different regression hidden in these results. What's the easiest way to do that? My first thought was to issue a checkpoint before the test (which is a good idea to make each test consistent anyway), save the output from pg_controldata, test, checkpoint, and look at the control data again. This seems kind of clunky though, but still better than trolling through the OS statistics for the data. Any clever ideas for a better way to measure bytes of WAL written during a particular chunk of code? We may need some sort of checkpoint/sync after the test to get correct results, because I've noticed that the tests I run sometimes continue writing out buffers for a few seconds after the test time is finished. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Sat, 2009-06-20 at 13:15 +0200, Stefan Kaltenbrunner wrote: >>> 8192 6m43.203s/6m48.293s >>> 16384 6m24.980s/6m24.116s >>> 32768 6m20.753s/6m22.083s >>> 65536 6m22.913s/6m22.449s >>> 1048576 6m23.765s/6m24.645s > >> The rest of the patch should have had a greater effect on tables with >> thinner rows. Your results match my expectations, though I read from >> them that we should use 16384, since that provides some gain, not just a >> cancellation of the regression. > > +1 for using 16384 (ie, max ring buffer size 16MB). Maybe even more. > It seems likely that other cases might have an even bigger issue than > is exhibited in the couple of test cases we have here, so we should > leave some margin for error. Also, there's code in there to limit the > ring buffer to 1/8th of shared buffers, so we don't have to worry about > trashing the whole buffer arena in small configurations. Any limitation > at all is still a step forward over previous releases as far as not > trashing the arena is concerned. +1. You might get away with a smaller ring with narrow tables, where writing 16MB of data produces more than 16MB of WAL, but I don't think it can ever be the other way round. Leaving a little bit of room for error doesn't seem like a bad idea, though. IIRC we experimented with an auto-tuning ring size when we worked on the original ring buffer patch. The idea is that you start with a small ring, and enlarge it in StrategyRejectBuffer. But that seems too risky for 8.4. I wonder if using the small ring showed any benefit when the COPY is not WAL-logged? In that scenario block-on-WAL-flush behavior doesn't happen, so the small ring might have some L2 cache benefits. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Tom Lane wrote: >> Simon Riggs <simon@2ndQuadrant.com> writes: >>> On Sat, 2009-06-20 at 13:15 +0200, Stefan Kaltenbrunner wrote: >>>> 8192 6m43.203s/6m48.293s >>>> 16384 6m24.980s/6m24.116s >>>> 32768 6m20.753s/6m22.083s >>>> 65536 6m22.913s/6m22.449s >>>> 1048576 6m23.765s/6m24.645s >> >>> The rest of the patch should have had a greater effect on tables with >>> thinner rows. Your results match my expectations, though I read from >>> them that we should use 16384, since that provides some gain, not just a >>> cancellation of the regression. >> >> +1 for using 16384 (ie, max ring buffer size 16MB). Maybe even more. >> It seems likely that other cases might have an even bigger issue than >> is exhibited in the couple of test cases we have here, so we should >> leave some margin for error. Also, there's code in there to limit the >> ring buffer to 1/8th of shared buffers, so we don't have to worry about >> trashing the whole buffer arena in small configurations. Any limitation >> at all is still a step forward over previous releases as far as not >> trashing the arena is concerned. > > +1. You might get away with a smaller ring with narrow tables, where > writing 16MB of data produces more than 16MB of WAL, but I don't think > it can ever be the other way round. Leaving a little bit of room for > error doesn't seem like a bad idea, though. yeah 16MB seems like the best choice given the available data and how far we are into the release cycle. > > IIRC we experimented with an auto-tuning ring size when we worked on the > original ring buffer patch. The idea is that you start with a small > ring, and enlarge it in StrategyRejectBuffer. But that seems too risky > for 8.4. agreed. > > I wonder if using the small ring showed any benefit when the COPY is not > WAL-logged? In that scenario block-on-WAL-flush behavior doesn't happen, > so the small ring might have some L2 cache benefits. I did some limited testing on that but I was unable to measure any significant effect - especially since the difference between wal-logged and not is rather small for a non-parallel COPY (ie in the above example you get around 6m20s runtime for wal-logged and ~5m40s in the other case). Stefan
On Sun, 2009-06-21 at 10:28 +0200, Stefan Kaltenbrunner wrote: > I did some limited testing on that but I was unable to measure any > significant effect - especially since the difference between > wal-logged and not is rather small for a non-parallel COPY (ie in the > above example you get around 6m20s runtime for wal-logged and ~5m40s > in the other case). This is a common confusion for small tests. Non-WAL logged case causes all buffers to be written to disk at end of COPY. This is roughly the same size as the volume of WAL written. In logged case we do not write data blocks, they get written at next checkpoint. So the reduction in I/O is not apparent, since during the period of the test the I/O is about the same in both cases and less I/O in the non-WAL logged case. On longer tests the difference shows more clearly because the data blocks start to migrate out of shared buffers while the COPY is still running, effecting the test results. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Sun, 2009-06-21 at 02:45 -0400, Greg Smith wrote: > On Sat, 20 Jun 2009, Simon Riggs wrote: > > > I would suggest that we check how much WAL has been written. There may > > be a secondary effect or a different regression hidden in these results. > > What's the easiest way to do that? pg_current_xlog_insert_location() -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs wrote: > On Sun, 2009-06-21 at 10:28 +0200, Stefan Kaltenbrunner wrote: >> I did some limited testing on that but I was unable to measure any >> significant effect - especially since the difference between >> wal-logged and not is rather small for a non-parallel COPY (ie in the >> above example you get around 6m20s runtime for wal-logged and ~5m40s >> in the other case). > > This is a common confusion for small tests. > > Non-WAL logged case causes all buffers to be written to disk at end of > COPY. This is roughly the same size as the volume of WAL written. In > logged case we do not write data blocks, they get written at next > checkpoint. So the reduction in I/O is not apparent, since during the > period of the test the I/O is about the same in both cases and less I/O > in the non-WAL logged case. On longer tests the difference shows more > clearly because the data blocks start to migrate out of shared buffers > while the COPY is still running, effecting the test results. I was actually testing with and without explicit CHECKPOINTing before/after the load(and also with longer runs) too - the difference is negligible especially with only one process involved. I think the difference is simply not that large because we are still mostly CPU bound within COPY on reasonably fast IO-subsystems. Stefan
Stefan Kaltenbrunner wrote: > Simon Riggs wrote: >> On Sun, 2009-06-21 at 10:28 +0200, Stefan Kaltenbrunner wrote: >>> I did some limited testing on that but I was unable to measure any >>> significant effect - especially since the difference between >>> wal-logged and not is rather small for a non-parallel COPY (ie in the >>> above example you get around 6m20s runtime for wal-logged and ~5m40s >>> in the other case). >> >> This is a common confusion for small tests. >> >> Non-WAL logged case causes all buffers to be written to disk at end of >> COPY. This is roughly the same size as the volume of WAL written. In >> logged case we do not write data blocks, they get written at next >> checkpoint. So the reduction in I/O is not apparent, since during the >> period of the test the I/O is about the same in both cases and less I/O >> in the non-WAL logged case. On longer tests the difference shows more >> clearly because the data blocks start to migrate out of shared buffers >> while the COPY is still running, effecting the test results. > > I was actually testing with and without explicit CHECKPOINTing > before/after the load(and also with longer runs) too - the difference is > negligible especially with only one process involved. > I think the difference is simply not that large because we are still > mostly CPU bound within COPY on reasonably fast IO-subsystems. hmm to further demonstrate that I just did some testing(same config as before and the 16MB for the buffer) by loading those 60M rows into a 20GB ramdisk instead of the SAN(with a CHECKPOINT before and after). this results in the following "improvements": 16384: wal bypass: 5min40s -> 5min10s (~9%) wal logged: 6min20s -> 6min8s (~3%) vmstat 5 output shows that the system is in fact CPU bound (ie using ~6% which is more or less a full core on a 16 core box) and not doing anything IO-wise. r b swpd free buff cache si so bi bo in cs us sy id wa st 1 0 0 19010444 118648 15415684 0 0 0 6 1068 2151 6 0 94 0 0 1 0 0 18870448 118652 15555204 0 0 0 6 1069 2142 6 0 94 0 0 1 0 0 18730568 118684 15694872 0 0 0 185 1080 2151 6 0 94 0 0 1 0 0 18591236 118692 15834516 0 0 0 19 1072 2144 6 0 94 0 0 1 0 0 18451472 118696 15973532 0 0 0 46 1073 2152 6 0 94 0 0 1 0 0 18311720 118704 16113204 0 0 0 7 1059 2136 6 0 94 0 0 1 0 0 18171968 118704 16252944 0 0 0 0 1077 2171 6 0 94 0 0 1 0 0 18032088 118712 16392300 0 0 0 54 1062 2138 6 0 94 0 0 1 0 0 17891716 118720 16532060 0 0 0 8 1078 2176 6 0 94 0 0 So I do think that IO is in fact not too significant for this kind of testing and we still have ways to go in terms of CPU efficiency in COPY. Stefan
On Sun, Jun 21, 2009 at 6:48 AM, Stefan Kaltenbrunner<stefan@kaltenbrunner.cc> wrote: > So I do think that IO is in fact not too significant for this kind of > testing and we still have ways to go in terms of CPU efficiency in COPY. It would be interesting to see some gprof or oprofile output from that test. I went back and dug up the results that I got when I profiled this patch during initial development, and my version of the patch applied, the profile looked like this on my system: % cumulative self self totaltime seconds seconds calls s/call s/call name14.48 0.85 0.85 1 0.85 5.47 DoCopy10.05 1.44 0.59 10000001 0.00 0.00 CopyReadLine 5.62 1.77 0.33 10000039 0.00 0.00 PageAddItem 5.11 2.07 0.30 10400378 0.00 0.00 LWLockRelease4.68 2.35 0.28 10000013 0.00 0.00 heap_insert 4.34 2.60 0.26 10000012 0.00 0.00 heap_formtuple 3.83 2.83 0.23 10356158 0.00 0.00 LWLockAcquire 3.83 3.05 0.23 10000054 0.00 0.00 MarkBufferDirty 3.32 3.25 0.20 10000013 0.00 0.00 RelationGetBufferForTuple3.07 3.43 0.18 10000005 0.00 0.00 pg_verify_mbstr_len 2.90 3.60 0.1710000002 0.00 0.00 CopyGetData 2.73 3.76 0.16 20000030 0.00 0.00 enlargeStringInfo 2.73 3.92 0.16 20000014 0.00 0.00 pq_getbytes 2.04 4.04 0.12 10000000 0.00 0.00 InputFunctionCall ...but this might not be very representative, since I think I may have tested it on a single-column table. It would be interesting to see some other results. Simon had the idea of further improving performance by keeping the current buffer locked (this patch just kept it pinned, but not locked), but I didn't see an obvious clean design for that. Heikki also had a patch for speeding up copy, but it got dropped for 8.4 due to time constraints. ...Robert
Robert Haas <robertmhaas@gmail.com> writes:
> It would be interesting to see some gprof or oprofile output from that
> test.   I went back and dug up the results that I got when I profiled
> this patch during initial development, and my version of the patch
> applied, the profile looked like this on my system:
Were you testing with a temp table?  The lack of XLogInsert in your
profile is striking.  Stefan's results upthread had it at the top,
and I got more or less the same thing here (didn't keep my numbers
unfortunately).
> Simon had the idea of further improving performance by keeping the
> current buffer locked (this patch just kept it pinned, but not
> locked), but I didn't see an obvious clean design for that.
The potential for deadlock seems sufficient reason not to do that.
        regards, tom lane
			
		Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> It would be interesting to see some gprof or oprofile output from that >> test. I went back and dug up the results that I got when I profiled >> this patch during initial development, and my version of the patch >> applied, the profile looked like this on my system: > > Were you testing with a temp table? The lack of XLogInsert in your > profile is striking. Stefan's results upthread had it at the top, > and I got more or less the same thing here (didn't keep my numbers > unfortunately). I guess that profile was for the wal bypass case and it looks fairly similiar to what I get here(lineitem table into tmpfs - though only 30M rows this time to keep VM pressure low): samples % symbol name 286197 17.1997 DoCopy 232958 14.0002 CopyReadLine 99762 5.9954 DecodeNumber 92751 5.5741 heap_fill_tuple 84439 5.0746 pg_verify_mbstr_len 65421 3.9316 InputFunctionCall 62502 3.7562 DecodeDate 53565 3.2191 heap_form_tuple 47731 2.8685 ParseDateTime 41206 2.4764 DecodeDateTime 39936 2.4001 pg_next_dst_boundary 36093 2.1691 AllocSetAlloc 33967 2.0413 heap_compute_data_size 29921 1.7982 float4in 27227 1.6363 DetermineTimeZoneOffset 25622 1.5398 pg_atoi 24703 1.4846 pg_mblen 24495 1.4721 .plt 23912 1.4371 pg_mbstrlen_with_len 23448 1.4092 bpchar_input 20033 1.2039 date2j 16331 0.9815 date_in 15684 0.9426 bpcharin 14819 0.8906 PageAddItem 14261 0.8571 ValidateDate Stefan
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I wonder if using the small ring showed any benefit when the COPY is not 
> WAL-logged? In that scenario block-on-WAL-flush behavior doesn't happen, 
> so the small ring might have some L2 cache benefits.
I think the notion that we might get a cache win from a smaller ring
is an illusion.  We're not expecting to go back and re-read from a
previously filled page in this scenario.  In any case, all of the
profiling results so far show that the CPU bottlenecks are elsewhere.
Until we can squeeze an order of magnitude out of COPY's data parsing
and/or XLogInsert, any possible cache effects will be down in the noise.
So to my mind, the only question left to answer (at least for the 8.4
cycle) is "is 16MB enough, or do we want to make the ring even bigger?".
Right at the moment I'd be satisfied with 16, but I wonder whether there
are scenarios where 32MB would show a significant advantage.
        regards, tom lane
			
		On Sun, Jun 21, 2009 at 11:31 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> It would be interesting to see some gprof or oprofile output from that >> test. I went back and dug up the results that I got when I profiled >> this patch during initial development, and my version of the patch >> applied, the profile looked like this on my system: > > Were you testing with a temp table? No. > The lack of XLogInsert in your > profile is striking. Stefan's results upthread had it at the top, > and I got more or less the same thing here (didn't keep my numbers > unfortunately). As Stephen guessed, I created the table in the same transaction that I inserted into it... >> Simon had the idea of further improving performance by keeping the >> current buffer locked (this patch just kept it pinned, but not >> locked), but I didn't see an obvious clean design for that. > > The potential for deadlock seems sufficient reason not to do that. Yep, that was the problem. *thinks* I think we had the idea of buffering up enough tuples to fill a page (estimating conservatively so as to make sure we actually fill it), and then inserting them all at once. But I'm not sure how much trouble that causes in terms of the timing of inserting index entries and firing of after row insert triggers. If the command ID doesn't change, it seems like it might be OK. Or at worst, even if the optimization would only work in cases where there are no after row triggers, that would still be useful to a lot of people, I think. I haven't really spent much time on this angle of attack and completely confess to not understanding all of the issues... ...Robert
On Sun, Jun 21, 2009 at 11:52 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> I wonder if using the small ring showed any benefit when the COPY is not >> WAL-logged? In that scenario block-on-WAL-flush behavior doesn't happen, >> so the small ring might have some L2 cache benefits. > > I think the notion that we might get a cache win from a smaller ring > is an illusion. We're not expecting to go back and re-read from a > previously filled page in this scenario. In any case, all of the > profiling results so far show that the CPU bottlenecks are elsewhere. > Until we can squeeze an order of magnitude out of COPY's data parsing > and/or XLogInsert, any possible cache effects will be down in the noise. > > So to my mind, the only question left to answer (at least for the 8.4 > cycle) is "is 16MB enough, or do we want to make the ring even bigger?". > Right at the moment I'd be satisfied with 16, but I wonder whether there > are scenarios where 32MB would show a significant advantage. Even 32MB is not that much. It seems to me that in any realistic production scenario you're going to have at least half a gig of shared buffers, so we're really talking about at most one-sixteenth of the shared buffer arena, and possibly quite a bit less. I think that's pretty conservative. ...Robert
On Sun, Jun 21, 2009 at 5:07 PM, Robert Haas<robertmhaas@gmail.com> wrote: > > I think we had the idea of buffering up enough tuples to fill a page > (estimating conservatively so as to make sure we actually fill it), > and then inserting them all at once. But I'm not sure how much > trouble that causes in terms of the timing of inserting index entries > and firing of after row insert triggers. If the command ID doesn't > change, it seems like it might be OK. Or at worst, even if the > optimization would only work in cases where there are no after row > triggers, that would still be useful to a lot of people, I think. > > I haven't really spent much time on this angle of attack and > completely confess to not understanding all of the issues... There was some discussion of doing this in general for all inserts inside the indexam. The btree indexam could buffer up any inserts done within the transaction and keep them in an in-memory btree. Any lookups done within the transaction first look up in the in-memory tree then the disk. If the in-memory buffer fills up then we flush them to the index. The reason this is tempting is that we could then insert them all in a single index-merge operation which would often be more efficient than retail inserts. -- greg http://mit.edu/~gsstark/resume.pdf
Greg Stark <gsstark@mit.edu> writes:
> There was some discussion of doing this in general for all inserts
> inside the indexam. The btree indexam could buffer up any inserts done
> within the transaction and keep them in an in-memory btree. Any
> lookups done within the transaction first look up in the in-memory
> tree then the disk. If the in-memory buffer fills up then we flush
> them to the index.
> The reason this is tempting is that we could then insert them all in a
> single index-merge operation which would often be more efficient than
> retail inserts.
That's not gonna work for a unique index, which unfortunately is a
pretty common case ...
        regards, tom lane
			
		Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> I wonder if using the small ring showed any benefit when the COPY is not >> WAL-logged? In that scenario block-on-WAL-flush behavior doesn't happen, >> so the small ring might have some L2 cache benefits. > > I think the notion that we might get a cache win from a smaller ring > is an illusion. We're not expecting to go back and re-read from a > previously filled page in this scenario. In any case, all of the > profiling results so far show that the CPU bottlenecks are elsewhere. > Until we can squeeze an order of magnitude out of COPY's data parsing > and/or XLogInsert, any possible cache effects will be down in the noise. we also need to take a serious look at our locking overhead - WAL logged COPY is already taking a significant performance hit with just a second process running in parallel(into a seperate table). I just did some testing using those 16MB buffer, the upthread mentioned postgresql.conf and a 20GB tmpfs. The following copying 3M rows(each) into a seperate table of the same database. processes total time(s) rows/s rows/s - per core 1 17.5 171428.57 171428.57 2 20.8 288461.54 144230.77 4 25.5 470588.24 117647.06 6 31.1 578778.14 96463.02 8 41.4 579710.14 72463.77 10 63 476190.48 47619.05 12 89 404494.38 33707.87 14 116 362068.97 25862.07 16 151 317880.79 19867.55 the higher the process count the more erratic the box behaves - it will show a very high context switch rate (between 300000 and 400000/s) a large amount of idle time (>60%!). example vmstat 5 output for the 12 process test: 7 0 0 21654500 45436 12932516 0 0 0 3 1079 336941 34 7 59 0 0 6 0 0 21354044 45444 13232444 0 0 0 52 1068 341836 35 7 59 0 0 4 0 0 21053832 45452 13531472 0 0 0 23 1082 341672 35 7 59 0 0 9 0 0 20751136 45460 13833336 0 0 0 41 1063 344117 35 7 59 0 0 6 0 0 20443856 45468 14138116 0 0 0 14 1079 349398 35 7 58 0 0 8 0 0 20136592 45476 14444644 0 0 0 8 1060 351569 35 7 58 0 0 10 0 0 19836600 45484 14743320 0 0 0 144 1086 341533 35 7 58 0 0 7 0 0 19540472 45492 15039616 0 0 0 94 1067 337731 36 7 58 0 0 2 0 0 19258244 45500 15321156 0 0 0 15 1079 311394 34 6 60 0 0 Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> The following copying 3M rows(each) into a seperate table of the same 
> database.
Is this with WAL, or bypassing WAL?  Given what we've already seen,
a lot of contention for WALInsertLock wouldn't surprise me much here.
It should be possible to bypass that though.
        regards, tom lane
			
		Robert Haas wrote: > On Sun, Jun 21, 2009 at 11:52 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> So to my mind, the only question left to answer (at least for the 8.4 >> cycle) is "is 16MB enough, or do we want to make the ring even bigger?". >> Right at the moment I'd be satisfied with 16, but I wonder whether there >> are scenarios where 32MB would show a significant advantage. > > Even 32MB is not that much. It seems to me that in any realistic > production scenario you're going to have at least half a gig of shared > buffers, so we're really talking about at most one-sixteenth of the > shared buffer arena, and possibly quite a bit less. I think that's > pretty conservative. I was going to say that since we flush the WAL every 16MB anyway (at every XLOG file switch), you shouldn't see any benefit with larger ring buffers, since to fill 16MB of data you're not going to write more than 16MB WAL. But then I realized that that's not true if you have an unusually low fillfactor. If you only fill each page say 50% full, you're going to use 32MB worth of data pages but only write 16MB of WAL. And maybe you could have a situation like that with very wide rows as well, with wasted space on each page that's not enough to store one more row. Could you test that scenario? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I was going to say that since we flush the WAL every 16MB anyway (at 
> every XLOG file switch), you shouldn't see any benefit with larger ring 
> buffers, since to fill 16MB of data you're not going to write more than 
> 16MB WAL.
I'm not convinced that WAL segment boundaries are particularly relevant
to this.  The unit of flushing is an 8K page, not a segment.
I wonder though whether the wal_buffers setting interacts with the
ring size.  Has everyone who's tested this used the same 16MB
wal_buffers setting as in Alan's original scenario?
        regards, tom lane
			
		Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> The following copying 3M rows(each) into a seperate table of the same >> database. > > Is this with WAL, or bypassing WAL? Given what we've already seen, > a lot of contention for WALInsertLock wouldn't surprise me much here. > It should be possible to bypass that though. this was with WAL. here are the numbers for bypass: processes total time(s) rows/s rows/s - per core 1 15 200000 200000 2 15.1 397350.99 198675.5 4 15.2 789473.68 197368.42 6 15.3 1176470.59 196078.43 8 16.2 1481481.48 185185.19 10 21.9 1369863.01 136986.3 12 22.7 1585903.08 132158.59 14 25.2 1666666.67 119047.62 16 27.9 1720430.11 107526.88 runtimes grew very short here but the numbers still seem sane and if you compare them to what I did on real storage(though without the 16MB ringbuffer fix!): http://www.kaltenbrunner.cc/blog/index.php?/archives/27-Benchmarking-8.4-Chapter-2bulk-loading.html you will see that for a single core there is almost no performance difference between ramdisk and real disk, at 8 cores there is the largest gap at around 45% but on 16 cores we are down to a mere 20% difference. All in all it seems that we have a big locking issue with WALInsertLock and even with that removed we are mostly CPU limited and not IO limited for COPY. Stefan
On Sun, 2009-06-21 at 12:38 -0400, Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: > > There was some discussion of doing this in general for all inserts > > inside the indexam. The btree indexam could buffer up any inserts done > > within the transaction and keep them in an in-memory btree. Any > > lookups done within the transaction first look up in the in-memory > > tree then the disk. If the in-memory buffer fills up then we flush > > them to the index. > > > The reason this is tempting is that we could then insert them all in a > > single index-merge operation which would often be more efficient than > > retail inserts. > > That's not gonna work for a unique index, which unfortunately is a > pretty common case ... I think it can. If we fail on a unique index we fail. We aren't expecting that, else we wouldn't be using COPY. So I reckon its acceptable to load a whole block of rows and then load a whole blocks's worth of index entries. The worst thing that can happen is we insert a few extra heap rows that get aborted, which is small in comparison to the potential gains from buffering. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Sun, 2009-06-21 at 20:37 +0300, Heikki Linnakangas wrote: > Robert Haas wrote: > > On Sun, Jun 21, 2009 at 11:52 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > >> So to my mind, the only question left to answer (at least for the 8.4 > >> cycle) is "is 16MB enough, or do we want to make the ring even bigger?". > >> Right at the moment I'd be satisfied with 16, but I wonder whether there > >> are scenarios where 32MB would show a significant advantage. > > > > Even 32MB is not that much. It seems to me that in any realistic > > production scenario you're going to have at least half a gig of shared > > buffers, so we're really talking about at most one-sixteenth of the > > shared buffer arena, and possibly quite a bit less. I think that's > > pretty conservative. > > I was going to say that since we flush the WAL every 16MB anyway (at > every XLOG file switch), you shouldn't see any benefit with larger ring > buffers, since to fill 16MB of data you're not going to write more than > 16MB WAL. But then I realized that that's not true if you have an > unusually low fillfactor. If you only fill each page say 50% full, > you're going to use 32MB worth of data pages but only write 16MB of WAL. > And maybe you could have a situation like that with very wide rows as > well, with wasted space on each page that's not enough to store one more > row. If walwriter is working correctly then it should be writing and fsyncing WAL, while the COPY process just inserts WAL. I don't see that as an argument to limit us to 16MB. But I take your point as being an argument in favour of that as a consensus value for us to choose. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Hi Tom,<br /><br />How much concern is there for the contention for use cases where the WAL can't be bypassed?<br /><br />Thanks,Alan<br /><br /><div class="gmail_quote">On Sun, Jun 21, 2009 at 10:00 AM, Tom Lane <span dir="ltr"><<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>></span>wrote:<br /><blockquote class="gmail_quote" style="border-left:1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="im">Stefan Kaltenbrunner<stefan@kaltenbrunner.cc> writes:<br /></div><div class="im">> The following copying 3M rows(each)into a seperate table of the same<br /> > database.<br /><br /></div>Is this with WAL, or bypassing WAL? Givenwhat we've already seen,<br /> a lot of contention for WALInsertLock wouldn't surprise me much here.<br /> It shouldbe possible to bypass that though.<br /><br /> regards, tom lane<br /><br /><br /></blockquote></div><br/>
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> I was going to say that since we flush the WAL every 16MB anyway (at >> every XLOG file switch), you shouldn't see any benefit with larger ring >> buffers, since to fill 16MB of data you're not going to write more than >> 16MB WAL. > > I'm not convinced that WAL segment boundaries are particularly relevant > to this. The unit of flushing is an 8K page, not a segment. We fsync() the old WAL segment every time we switch to a new WAL segment. That's what I meant by "flush". If the walwriter is keeping up, it will fsync() the WAL more often, but 16MB is the maximum distance between fsync()s. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, 2009-06-22 at 10:52 +0300, Heikki Linnakangas wrote: > Tom Lane wrote: > > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > >> I was going to say that since we flush the WAL every 16MB anyway (at > >> every XLOG file switch), you shouldn't see any benefit with larger ring > >> buffers, since to fill 16MB of data you're not going to write more than > >> 16MB WAL. > > > > I'm not convinced that WAL segment boundaries are particularly relevant > > to this. The unit of flushing is an 8K page, not a segment. > > We fsync() the old WAL segment every time we switch to a new WAL > segment. That's what I meant by "flush". > > If the walwriter is keeping up, it will fsync() the WAL more often, but > 16MB is the maximum distance between fsync()s. Yes, but the fsync is performed by the process that writes the WAL, not necessarily by the process that inserts the WAL. In perfect balance, an inserter-of-WAL could insert an infinite amount of WAL and never need to fsync the WAL. So the question is are we in balance between WALWriter and COPY? -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Alan Li <ali@truviso.com> writes:
> How much concern is there for the contention for use cases where the WAL
> can't be bypassed?
If you mean "is something going to be done about it in 8.4", the
answer is "no".  This is a pre-existing issue that there is no simple
fix for.
        regards, tom lane
			
		Tom Lane wrote: > Alan Li <ali@truviso.com> writes: > >> How much concern is there for the contention for use cases where the WAL >> can't be bypassed? >> > > If you mean "is something going to be done about it in 8.4", the > answer is "no". This is a pre-existing issue that there is no simple > fix for. > > > I thought he was asking if we intend to provide for WAL bypass on a table by table basis in future. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> Alan Li <ali@truviso.com> writes:
>>> How much concern is there for the contention for use cases where the WAL
>>> can't be bypassed?
>> 
>> If you mean "is something going to be done about it in 8.4", the
>> answer is "no".  This is a pre-existing issue that there is no simple
>> fix for.
> I thought he was asking if we intend to provide for WAL bypass on a 
> table by table basis in future.
I thought he was asking for a solution to the problem of WALInsertLock
contention.  In any case, we have "WAL bypass on a table by table basis"
now, don't we?
        regards, tom lane
			
		Tom Lane wrote: > I thought he was asking for a solution to the problem of WALInsertLock > contention. In any case, we have "WAL bypass on a table by table basis" > now, don't we? > > > If we do I'm ignorant of it ;-) How do we say "Never WAL this table"? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> I thought he was asking for a solution to the problem of WALInsertLock
>> contention.  In any case, we have "WAL bypass on a table by table basis"
>> now, don't we?
> If we do I'm ignorant of it ;-) How do we say "Never WAL this table"?
Make it a temporary table.
The more useful case for data load is "create or truncate it in the
same transaction", of course.
        regards, tom lane
			
		* Andrew Dunstan <andrew@dunslane.net> [090622 10:47]: > If we do I'm ignorant of it ;-) How do we say "Never WAL this table"? CREATE TEMPORARY TABLE ... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> The more useful case for data load is "create or truncate it in the
> same transaction", of course.
Unfortunately, WAL bypass also requires not being in archive mode with
no way to turn that off w/o a server restart, aiui.
Thanks,
    Stephen
			
		Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> The more useful case for data load is "create or truncate it in the
>> same transaction", of course.
> Unfortunately, WAL bypass also requires not being in archive mode with
> no way to turn that off w/o a server restart, aiui.
Well, if you're trying to archive then you certainly wouldn't want WAL
off, so I'm failing to see where this thread is going ...
        regards, tom lane
			
		* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > Unfortunately, WAL bypass also requires not being in archive mode with
> > no way to turn that off w/o a server restart, aiui.
>
> Well, if you're trying to archive then you certainly wouldn't want WAL
> off, so I'm failing to see where this thread is going ...
I disagree.  I'd love to be able to say "please bypass WAL logging for
this bulk load" because I know that I'll pick up the data during my next
full dump and I can reload it from original if I get disrupted before
then.  This is especially true when you're doing bulk loads of static or
reference data from another data source.
Thanks,
    Stephen
			
		On Mon, 2009-06-22 at 11:14 -0400, Tom Lane wrote: > Stephen Frost <sfrost@snowman.net> writes: > > * Tom Lane (tgl@sss.pgh.pa.us) wrote: > >> The more useful case for data load is "create or truncate it in the > >> same transaction", of course. > > > Unfortunately, WAL bypass also requires not being in archive mode with > > no way to turn that off w/o a server restart, aiui. > > Well, if you're trying to archive then you certainly wouldn't want WAL > off, so I'm failing to see where this thread is going ... I was thinking it might be beneficial to be able to defer writing WAL until COPY is complete, so heap_sync would either fsync the whole heap file or copy the whole file to WAL. That would avoid writing WAL piecemeal because we could just backup the whole block, plus we wouldn't write anything at all if the COPY failed. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
> I was thinking it might be beneficial to be able to defer writing WAL
> until COPY is complete, so heap_sync would either fsync the whole heap
> file or copy the whole file to WAL.
What about indexes?
        regards, tom lane
			
		On Mon, 2009-06-22 at 11:24 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > I was thinking it might be beneficial to be able to defer writing WAL > > until COPY is complete, so heap_sync would either fsync the whole heap > > file or copy the whole file to WAL. > > What about indexes? I was thinking we could do exactly as stated for the cases that would be WAL-bypass now, but need to write WAL because XLogArchivingActive(). i.e. improve the exact case we are measuring here. Yes, it is more complex than that for loading to existing tables. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Tom Lane wrote: >> >>> I thought he was asking for a solution to the problem of WALInsertLock >>> contention. In any case, we have "WAL bypass on a table by table basis" >>> now, don't we? >>> > > >> If we do I'm ignorant of it ;-) How do we say "Never WAL this table"? >> > > Make it a temporary table. > > > That doesn't help if you need the data visible in multiple sessions. But we're digressing from the original topic. Sorry. cheers andrew
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> I'm not convinced that WAL segment boundaries are particularly relevant
>> to this.  The unit of flushing is an 8K page, not a segment.
> We fsync() the old WAL segment every time we switch to a new WAL 
> segment. That's what I meant by "flush".
> If the walwriter is keeping up, it will fsync() the WAL more often, but 
> 16MB is the maximum distance between fsync()s.
I'm still not convinced --- to my mind the issue is not whether fsyncs
happen but whether the COPY process has to wait for 'em, and I don't
think that segment boundaries directly affect that.  I'd still be
interested to see similar measurements done with different wal_buffer
settings.
However, in the interests of getting this resolved in time for 8.4.0,
I propose that we just settle on 16MB as the bulkwrite ring buffer size.
There doesn't seem to be any evidence that a larger size will make for
a significant improvement, and we shouldn't allow COPY to trash a bigger
fraction of the arena than it really has to.
        regards, tom lane
			
		Le 22 juin 2009 à 17:24, Tom Lane <tgl@sss.pgh.pa.us> a écrit : > Simon Riggs <simon@2ndQuadrant.com> writes: >> I was thinking it might be beneficial to be able to defer writing WAL >> until COPY is complete, so heap_sync would either fsync the whole >> heap >> file or copy the whole file to WAL. > > What about indexes? Skip this optimisation if there are any. It's already Common practise to create them only after copy succeeded when possible for better bulk loading perfs. Then there's also the copy + Insert ... Select ... technique. -- dim
Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wonder though whether the wal_buffers setting interacts with the > ring size. Has everyone who's tested this used the same 16MB > wal_buffers setting as in Alan's original scenario? I had been using his postgresql.conf file, then added autovacuum = off. When I tried with setting the ring size to 16MB, I accidentally left off the step to copy the postgresql.conf file, and got better performance. I alternated between the postgresql.conf file from earlier tests and the default file left there by the initdb, and got this: 8.4rc1 with 16MB ring, default postgresql.conf 0m23.223s 0m23.489s 0m23.921s 8.4rc1 with 16MB ring, Alan's postgresql.conf 0m28.678s 0m26.171s 0m27.513s default postgresql.conf (comments stripped) max_connections = 100 shared_buffers = 32MB datestyle = 'iso, mdy' lc_messages = 'C' lc_monetary = 'C' lc_numeric = 'C' lc_time = 'C' default_text_search_config = 'pg_catalog.english' Alan's postgresql.conf (comments stripped) shared_buffers = 256MB wal_buffers = 16MB checkpoint_segments = 100 autovacuum = off I'm not going to claim I know why, but I thought I should report it. Oh, and the 8.3.7 numbers and pre-patch numbers were averaging the same under the day-time load as the replication sync mode. So, with the ring size at 16MB this load is faster under 8.4 than 8.3. -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
>> I wonder though whether the wal_buffers setting interacts with the
>> ring size.  Has everyone who's tested this used the same 16MB
>> wal_buffers setting as in Alan's original scenario?
> I had been using his postgresql.conf file, then added autovacuum =
> off.  When I tried with setting the ring size to 16MB, I accidentally
> left off the step to copy the postgresql.conf file, and got better
> performance.
Huh, that's bizarre.  I can see that increasing shared_buffers should
make no difference in this test case (we're not using them all anyway).
But why should increasing wal_buffers make it slower?  I forget the
walwriter's control algorithm at the moment ... maybe it works harder
when wal buffers are full?
BTW, I committed the change to use 16MB; that will be in RC2.
        regards, tom lane
			
		On Mon, 22 Jun 2009, Kevin Grittner wrote: > When I tried with setting the ring size to 16MB, I accidentally left off > the step to copy the postgresql.conf file, and got better performance. Do you have happen to have a build with assertions turned on? That is one common cause of performance going down via increased shared_buffers that people tend to run into during beta. You can check via psql with show debug_assertions; -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> wrote: > Do you have happen to have a build with assertions turned on? Nope. I showed my ./configure options upthread, but can confirm with pg_config: BINDIR = /usr/local/pgsql-8.4rc1/bin DOCDIR = /usr/local/pgsql-8.4rc1/share/doc HTMLDIR = /usr/local/pgsql-8.4rc1/share/doc INCLUDEDIR = /usr/local/pgsql-8.4rc1/include PKGINCLUDEDIR = /usr/local/pgsql-8.4rc1/include INCLUDEDIR-SERVER = /usr/local/pgsql-8.4rc1/include/server LIBDIR = /usr/local/pgsql-8.4rc1/lib PKGLIBDIR = /usr/local/pgsql-8.4rc1/lib LOCALEDIR = /usr/local/pgsql-8.4rc1/share/locale MANDIR = /usr/local/pgsql-8.4rc1/share/man SHAREDIR = /usr/local/pgsql-8.4rc1/share SYSCONFDIR = /usr/local/pgsql-8.4rc1/etc PGXS = /usr/local/pgsql-8.4rc1/lib/pgxs/src/makefiles/pgxs.mk CONFIGURE = '--prefix=/usr/local/pgsql-8.4rc1' '--enable-integer-datetimes' '--enable-debug' '--disable-nls' '--with-libxml' CC = gcc CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2 CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -fwrapv -g CFLAGS_SL = -fpic LDFLAGS = -Wl,--as-needed -Wl,-rpath,'/usr/local/pgsql-8.4rc1/lib' LDFLAGS_SL = LIBS = -lpgport -lxml2 -lz -lreadline -lcrypt -ldl -lm VERSION = PostgreSQL 8.4rc1 > You can check via psql with > > show debug_assertions; OK, we'll do it your way. :-) kgrittn@ATHENA:~/postgresql-8.4rc1> psql postgres psql (8.4rc1) Type "help" for help. postgres=# show debug_assertions;debug_assertions ------------------off (1 row) -Kevin
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > A 25-30% performance regression in our main bulk loading mechanism > should at least be explained before the release... I think a performance regression of that magnitude merits holding up a release to resolve. Note that in a follow-up post showing the slow 8.4 copying on Linux, the copy was 11.7% to 19.3% *faster* on 8.4 when the WAL writing was suppressed: http://archives.postgresql.org/pgsql-performance/2009-06/msg00219.php Extracting from that post: # I can reproduce that on Linux(CentoS 5.3/x86_64, Nehalem Xeon E5530) # on 8.4 I get: # # 3m59/4m01/3m56s runtime and a profile of # # samples % symbol name # 636302 19.6577 XLogInsert # 415510 12.8366 CopyReadLine # on 8.3.7 I get 2m58s,2m54s,2m55s # # and a profile of: # # samples % symbol name # 460966 16.2924 XLogInsert # 307386 10.8643 CopyReadLine # If I do the same test utilizing WAL bypass the picture changes: # # 8.3 runtimes:2m16,2min14s,2min22s # 8.4 runtime: 2m1s,2m,1m59s Is there a reason to believe that the XLogInsert part of this *only* affects bulk loads? -Kevin
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Huh, that's bizarre. I can see that increasing shared_buffers > should make no difference in this test case (we're not using them > all anyway). But why should increasing wal_buffers make it slower? > I forget the walwriter's control algorithm at the moment ... maybe > it works harder when wal buffers are full? I created a postgresql.conf file with the options from the default file, and then tried that by itself again, and with each of three other options: <none> 0m24.540s 0m24.630s 0m23.778s checkpoint_segments = 100 0m30.251s 0m29.474s 0m26.604s wal_buffers = 16MB 0m24.487s 0m23.939s 0m23.557s shared_buffers = 256MB 0m25.885s 0m25.654s 0m24.025s So the big hit seems to come from boosting checkpoint_segments, although boosting shared_buffers seems to cause a slight slowdown. Boosting wal_buffers seemed to help a little. Both of these last two, though, are within the noise, so low confidence on those without a lot more tests. The checkpoint_segments seems dramatic enough to be real. I wonder if the test is short enough that it never got around to re-using any of them, so it was doing extra writes for the initial creation during the test? -Kevin
I wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > >> A 25-30% performance regression in our main bulk loading mechanism >> should at least be explained before the release... > > I think a performance regression of that magnitude merits holding > up a release to resolve. Wow. That sure took a long time to come through. I posted that days ago.... -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> The checkpoint_segments seems dramatic enough to be real.  I wonder if
> the test is short enough that it never got around to re-using any of
> them, so it was doing extra writes for the initial creation during the
> test?
That's exactly what I was about to suggest.  Are you starting each run
from a fresh initdb?  If so, try running the load long enough that the
number of WAL files stabilizes (should happen at 2x checkpoint_segments)
and then start the test measurement.
        regards, tom lane
			
		<div class="gmail_quote">On Mon, Jun 22, 2009 at 7:16 AM, Tom Lane <span dir="ltr"><<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>></span>wrote:<br /><blockquote class="gmail_quote" style="border-left:1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="im">Alan Li <<ahref="mailto:ali@truviso.com">ali@truviso.com</a>> writes:<br /> > How much concern is there for the contentionfor use cases where the WAL<br /> > can't be bypassed?<br /><br /></div>If you mean "is something going to bedone about it in 8.4", the<br /> answer is "no". This is a pre-existing issue that there is no simple<br /> fix for.<br/><br /> regards, tom lane<br /><br /><br /></blockquote></div>No no, I am certainly not implyinganything for the 8.4 timeframe.<br /><br />Moving forward, I imagine this being more of a problem for data warehouseapplications, where bulk inserts occur on existing fact tables. In this case, the WAL cannot be bypassed (unlessthe bulk insert occurs on a newly created partition). And since COPY is cpu-bound, it would perhaps be advantageousto do parallel COPY's on the same table on multi-core systems, which won't work with WAL bypassing either.<br/><br />Thanks, Alan<br />
On Mon, 2009-06-22 at 15:18 -0500, Kevin Grittner wrote: > Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > default postgresql.conf (comments stripped) > max_connections = 100 > shared_buffers = 32MB This forces ring size to be 4MB, since min(32MB/8, ringsize). Please re-run tests with your config, ring size 4MB, which should give same answer -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> The checkpoint_segments seems dramatic enough to be real. I wonder >> if the test is short enough that it never got around to re-using >> any of them, so it was doing extra writes for the initial creation >> during the test? > > That's exactly what I was about to suggest. Are you starting each > run from a fresh initdb? If so, try running the load long enough > that the number of WAL files stabilizes (should happen at 2x > checkpoint_segments) and then start the test measurement. default conf (xlogs not populated) real 3m49.604s real 3m47.225s real 3m45.831s default conf (xlogs populated) real 3m45.603s real 3m45.284s real 3m45.906s default conf + checkpoint_segments = 100 (xlogs not populated) real 4m27.629s real 4m24.496s real 4m22.832s default conf + checkpoint_segments = 100 (xlogs populated) real 3m52.746s real 3m52.619s real 3m50.418s I used ten times the number of rows, to get more meaningful results. To get the "populated" times, I just dropped the target table and created it again; otherwise identical runs. Clearly, pre-populating the xlog files reduces run time, especially for a large number of xlog files; however, I still got better performance with a smaller set of xlog files. Regarding the fact that even with the xlog files pre-populated, the smaller set of xlog files is faster: I'm only guessing, but I suspect the battery backed RAID controller is what's defeating conventional wisdom here. By writing to the same, relatively small, set of xlog files repeatedly, some of the actual disk writes probably evaporate in the BBU cache. More frequent checkpoints from the smaller number of xlog files might also have caused data to start streaming to the disk a little sooner, minimizing write gluts later. I've often seen similar benefits to the BBU cache which cause some of the frequently-given advice here to have no discernible affect or be counter-productive in our environment. (I know that some doubted that my aggressive background writer settings didn't increase disk writes, but I couldn't even measure a difference there in the writes from OS cache to the controller cache, much less anything which indicated it actually increased physical disk writes.) By the way, the number of xlog files seemed to always go to two above 2x checkpoint_segments. -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Regarding the fact that even with the xlog files pre-populated, the
> smaller set of xlog files is faster: I'm only guessing, but I suspect
> the battery backed RAID controller is what's defeating conventional
> wisdom here.  By writing to the same, relatively small, set of xlog
> files repeatedly, some of the actual disk writes probably evaporate in
> the BBU cache.
Yeah, sounds plausible.  How big is your BBU cache?
        regards, tom lane
			
		On Fri, June 26, 2009 4:13 pm, Kevin Grittner wrote: > By the way, the number of xlog files seemed to always go to two above > 2x checkpoint_segments. The docs say: "There will always be at least one WAL segment file, and will normally not be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 files." cheers andrew
Tom Lane <tgl@sss.pgh.pa.us> wrote: > How big is your BBU cache? On this machine, I guess it is 512MB. (Possibly 1GB, but I'm having trouble finding the right incantation to check it at the moment, so I'm going by what the hardware tech remembers.) -Kevin