Discussion: Seq scans status update
Attached is a new version of Simon's "scan-resistant buffer manager" patch.
It's not ready for committing yet because of a small issue I found this
morning (* see bottom), but here's a status update.

To recap, the basic idea is to use a small ring of buffers for large scans
like VACUUM, COPY and seq scans. Changes to the original patch:

- a different sized ring is used for VACUUM, seq scans, and COPY. VACUUM and
seq scans use a ring of 32 buffers, and COPY uses a ring of 4096 buffers in
the default configuration (32 buffers is 256 KB of 8 KB pages; 4096 buffers
is 32 MB, twice the default WAL segment size). See the README changes in the
patch for the rationale.

- for queries with large seq scans, the buffer ring is only used for reads
issued by the seq scan itself, not for any other reads in the query. A
typical scenario where this matters is a large seq scan with a nested loop
join to a smaller table: you don't want to use the buffer ring for the index
lookups inside the nested loop.

- for seq scans, buffers that would need a WAL flush to reuse are dropped
from the ring. That makes bulk updates behave roughly like they do without
the patch, instead of having to do a WAL flush every 32 pages.

I've spent a lot of time thinking of solutions to the last point. The
obvious solution would be to not use the buffer ring for updating scans. The
difficulty with that is that we don't know whether a scan is read-only in
heapam.c, where the hint to use the buffer ring is set.

I've completed a set of performance tests on a test server. The server has
4 GB of RAM, of which 1 GB is used for shared_buffers.

Results for a 10 GB table:

 head-copy-bigtable | 00:10:09.07016
 head-copy-bigtable | 00:10:20.507357
 head-copy-bigtable | 00:10:21.857677
 head-copy_nowal-bigtable | 00:05:18.232956
 head-copy_nowal-bigtable | 00:03:24.109047
 head-copy_nowal-bigtable | 00:05:31.019643
 head-select-bigtable | 00:03:47.102731
 head-select-bigtable | 00:01:08.314719
 head-select-bigtable | 00:01:08.238509
 head-select-bigtable | 00:01:08.208563
 head-select-bigtable | 00:01:08.28347
 head-select-bigtable | 00:01:08.308671
 head-vacuum_clean-bigtable | 00:01:04.227832
 head-vacuum_clean-bigtable | 00:01:04.232258
 head-vacuum_clean-bigtable | 00:01:04.294621
 head-vacuum_clean-bigtable | 00:01:04.280677
 head-vacuum_hintbits-bigtable | 00:04:01.123924
 head-vacuum_hintbits-bigtable | 00:03:58.253175
 head-vacuum_hintbits-bigtable | 00:04:26.318159
 head-vacuum_hintbits-bigtable | 00:04:37.512965
 patched-copy-bigtable | 00:09:52.776754
 patched-copy-bigtable | 00:10:18.185826
 patched-copy-bigtable | 00:10:16.975482
 patched-copy_nowal-bigtable | 00:03:14.882366
 patched-copy_nowal-bigtable | 00:04:01.04648
 patched-copy_nowal-bigtable | 00:03:56.062272
 patched-select-bigtable | 00:03:47.704154
 patched-select-bigtable | 00:01:08.460326
 patched-select-bigtable | 00:01:10.441544
 patched-select-bigtable | 00:01:11.916221
 patched-select-bigtable | 00:01:13.848038
 patched-select-bigtable | 00:01:10.956133
 patched-vacuum_clean-bigtable | 00:01:10.315439
 patched-vacuum_clean-bigtable | 00:01:12.210537
 patched-vacuum_clean-bigtable | 00:01:15.202114
 patched-vacuum_clean-bigtable | 00:01:10.712235
 patched-vacuum_hintbits-bigtable | 00:03:42.279201
 patched-vacuum_hintbits-bigtable | 00:04:02.057778
 patched-vacuum_hintbits-bigtable | 00:04:26.805822
 patched-vacuum_hintbits-bigtable | 00:04:28.911184

In other words, the patch has no significant effect here, as expected. The
select times did go up by a couple of seconds, which I didn't expect,
though. One theory is that unused shared_buffers are swapped out during the
tests, and bgwriter pulls them back in.
I'll set swappiness to 0 and try again at some point.

Results for a 2 GB table:

 copy-medsize-unpatched | 00:02:18.23246
 copy-medsize-unpatched | 00:02:22.347194
 copy-medsize-unpatched | 00:02:23.875874
 copy_nowal-medsize-unpatched | 00:01:27.606334
 copy_nowal-medsize-unpatched | 00:01:17.491243
 copy_nowal-medsize-unpatched | 00:01:31.902719
 select-medsize-unpatched | 00:00:03.786031
 select-medsize-unpatched | 00:00:02.678069
 select-medsize-unpatched | 00:00:02.666103
 select-medsize-unpatched | 00:00:02.673494
 select-medsize-unpatched | 00:00:02.669645
 select-medsize-unpatched | 00:00:02.666278
 vacuum_clean-medsize-unpatched | 00:00:01.091356
 vacuum_clean-medsize-unpatched | 00:00:01.923138
 vacuum_clean-medsize-unpatched | 00:00:01.917213
 vacuum_clean-medsize-unpatched | 00:00:01.917333
 vacuum_hintbits-medsize-unpatched | 00:00:01.683718
 vacuum_hintbits-medsize-unpatched | 00:00:01.864003
 vacuum_hintbits-medsize-unpatched | 00:00:03.186596
 vacuum_hintbits-medsize-unpatched | 00:00:02.16494
 copy-medsize-patched | 00:02:35.113501
 copy-medsize-patched | 00:02:25.269866
 copy-medsize-patched | 00:02:31.881089
 copy_nowal-medsize-patched | 00:01:00.254633
 copy_nowal-medsize-patched | 00:01:04.630687
 copy_nowal-medsize-patched | 00:01:03.729128
 select-medsize-patched | 00:00:03.201837
 select-medsize-patched | 00:00:01.332975
 select-medsize-patched | 00:00:01.33014
 select-medsize-patched | 00:00:01.332392
 select-medsize-patched | 00:00:01.333498
 select-medsize-patched | 00:00:01.332692
 vacuum_clean-medsize-patched | 00:00:01.140189
 vacuum_clean-medsize-patched | 00:00:01.062762
 vacuum_clean-medsize-patched | 00:00:01.062402
 vacuum_clean-medsize-patched | 00:00:01.07113
 vacuum_hintbits-medsize-patched | 00:00:17.865446
 vacuum_hintbits-medsize-patched | 00:00:15.162064
 vacuum_hintbits-medsize-patched | 00:00:01.704651
 vacuum_hintbits-medsize-patched | 00:00:02.671651

This looks good to me, except for some glitch in the last vacuum_hintbits
tests. Selects and vacuums benefit significantly, as does non-WAL-logged
COPY.

Not shown here, but I ran tests earlier with VACUUM on a table that actually
had dead tuples to remove. In that test the patched version really shone,
reducing the runtime to ~1/6th. That was the original motivation of this
patch: not having to do a WAL flush on every page in the 2nd phase of
vacuum.

Test script attached. To use it:

1. Edit testscript.sh. Change BIGTABLESIZE.
2. Start postmaster.
3. Run the script, giving a test label as argument. For example:
   "./testscript.sh bigtable-patched"

Attached is also the patch I used for the tests.

I would appreciate it if people would download the patch and the script and
repeat the tests on different hardware. I'm particularly interested in
testing on a box with good I/O hardware where selects on unpatched
PostgreSQL are bottlenecked by CPU.

Barring any surprises, I'm going to fix the remaining issue and submit a
final patch, probably over the weekend.

(*) The issue with this patch is that if the buffer cache is completely
filled with dirty buffers that need a WAL flush to evict, the buffer ring
code will get into an infinite loop trying to find one that doesn't need a
WAL flush. Should be simple to fix.
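To make the core idea concrete without reading the whole patch, here's a
condensed, self-contained sketch of the ring reuse logic. The types and the
clock-sweep stub are simplified stand-ins rather than the real bufmgr
structures; the actual code splits this into GetBufferFromRing() and
PutBufferToRing() in freelist.c and also skips ring buffers that are pinned
or recently used by someone else, which is omitted here.

#include <stdio.h>

#define RING_SIZE      32      /* 256 KB of 8 KB pages, as used for seq scans and VACUUM */
#define BUF_ID_NOT_SET (-1)

static int ring[RING_SIZE];
static int ring_cur = -1;

/* Stand-in for the shared clock sweep: just hands out increasing buffer ids. */
static int
clock_sweep_victim(void)
{
	static int	next_buf = 0;

	return next_buf++;
}

/*
 * Hand out a buffer for a bulk scan.  On the first lap each slot is filled
 * from the normal clock sweep; on later laps the scan keeps reusing its own
 * small set of buffers instead of evicting the rest of the cache.
 */
static int
ring_get_buffer(void)
{
	if (++ring_cur >= RING_SIZE)
		ring_cur = 0;

	if (ring[ring_cur] == BUF_ID_NOT_SET)
		ring[ring_cur] = clock_sweep_victim();

	return ring[ring_cur];
}

int
main(void)
{
	int			i;

	for (i = 0; i < RING_SIZE; i++)
		ring[i] = BUF_ID_NOT_SET;

	/* "Scan" 100 pages: only 32 distinct buffers are ever touched. */
	for (i = 0; i < 100; i++)
		printf("page %3d -> buffer %d\n", i, ring_get_buffer());

	return 0;
}

The interesting part is what the sketch leaves out: in BULKREAD mode a ring
buffer that would need a WAL flush to reuse is dropped from the ring
(StrategyRejectBuffer) and replaced through the normal strategy, which is
what keeps bulk updates from flushing WAL every 32 pages.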
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com Index: src/backend/access/heap/heapam.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v retrieving revision 1.232 diff -c -r1.232 heapam.c *** src/backend/access/heap/heapam.c 8 Apr 2007 01:26:27 -0000 1.232 --- src/backend/access/heap/heapam.c 16 May 2007 11:35:14 -0000 *************** *** 83,88 **** --- 83,96 ---- */ scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd); + /* A scan on a table smaller than shared_buffers is treated like random + * access, but bigger scans should use the bulk read replacement policy. + */ + if (scan->rs_nblocks > NBuffers) + scan->rs_accesspattern = AP_BULKREAD; + else + scan->rs_accesspattern = AP_NORMAL; + scan->rs_inited = false; scan->rs_ctup.t_data = NULL; ItemPointerSetInvalid(&scan->rs_ctup.t_self); *************** *** 123,133 **** --- 131,146 ---- Assert(page < scan->rs_nblocks); + /* Read the page with the right strategy */ + SetAccessPattern(scan->rs_accesspattern); + scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf, scan->rs_rd, page); scan->rs_cblock = page; + SetAccessPattern(AP_NORMAL); + if (!scan->rs_pageatatime) return; Index: src/backend/access/transam/xlog.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v retrieving revision 1.268 diff -c -r1.268 xlog.c *** src/backend/access/transam/xlog.c 30 Apr 2007 21:01:52 -0000 1.268 --- src/backend/access/transam/xlog.c 15 May 2007 16:23:30 -0000 *************** *** 1668,1673 **** --- 1668,1700 ---- } /* + * Returns true if 'record' hasn't been flushed to disk yet. + */ + bool + XLogNeedsFlush(XLogRecPtr record) + { + /* Quick exit if already known flushed */ + if (XLByteLE(record, LogwrtResult.Flush)) + return false; + + /* read LogwrtResult and update local state */ + { + /* use volatile pointer to prevent code rearrangement */ + volatile XLogCtlData *xlogctl = XLogCtl; + + SpinLockAcquire(&xlogctl->info_lck); + LogwrtResult = xlogctl->LogwrtResult; + SpinLockRelease(&xlogctl->info_lck); + } + + /* check again */ + if (XLByteLE(record, LogwrtResult.Flush)) + return false; + + return true; + } + + /* * Ensure that all XLOG data through the given position is flushed to disk. * * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not Index: src/backend/commands/copy.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v retrieving revision 1.283 diff -c -r1.283 copy.c *** src/backend/commands/copy.c 27 Apr 2007 22:05:46 -0000 1.283 --- src/backend/commands/copy.c 15 May 2007 17:05:29 -0000 *************** *** 1876,1881 **** --- 1876,1888 ---- nfields = file_has_oids ? (attr_count + 1) : attr_count; field_strings = (char **) palloc(nfields * sizeof(char *)); + /* Use the special COPY buffer replacement strategy if WAL-logging + * is enabled. If it's not, the pages we're writing are dirty but + * don't need a WAL flush to write out, so the BULKREAD strategy + * is more suitable. + */ + SetAccessPattern(use_wal ? 
AP_COPY : AP_BULKREAD); + /* Initialize state variables */ cstate->fe_eof = false; cstate->eol_type = EOL_UNKNOWN; *************** *** 2161,2166 **** --- 2168,2176 ---- cstate->filename))); } + /* Reset buffer replacement strategy */ + SetAccessPattern(AP_NORMAL); + /* * If we skipped writing WAL, then we need to sync the heap (but not * indexes since those use WAL anyway) Index: src/backend/commands/vacuum.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/vacuum.c,v retrieving revision 1.350 diff -c -r1.350 vacuum.c *** src/backend/commands/vacuum.c 16 Apr 2007 18:29:50 -0000 1.350 --- src/backend/commands/vacuum.c 15 May 2007 17:06:18 -0000 *************** *** 421,431 **** * Tell the buffer replacement strategy that vacuum is causing * the IO */ ! StrategyHintVacuum(true); analyze_rel(relid, vacstmt); ! StrategyHintVacuum(false); if (use_own_xacts) CommitTransactionCommand(); --- 421,431 ---- * Tell the buffer replacement strategy that vacuum is causing * the IO */ ! SetAccessPattern(AP_VACUUM); analyze_rel(relid, vacstmt); ! SetAccessPattern(AP_NORMAL); if (use_own_xacts) CommitTransactionCommand(); *************** *** 442,448 **** /* Make sure cost accounting is turned off after error */ VacuumCostActive = false; /* And reset buffer replacement strategy, too */ ! StrategyHintVacuum(false); PG_RE_THROW(); } PG_END_TRY(); --- 442,448 ---- /* Make sure cost accounting is turned off after error */ VacuumCostActive = false; /* And reset buffer replacement strategy, too */ ! SetAccessPattern(AP_NORMAL); PG_RE_THROW(); } PG_END_TRY(); *************** *** 1088,1094 **** * Tell the cache replacement strategy that vacuum is causing all * following IO */ ! StrategyHintVacuum(true); /* * Do the actual work --- either FULL or "lazy" vacuum --- 1088,1094 ---- * Tell the cache replacement strategy that vacuum is causing all * following IO */ ! SetAccessPattern(AP_VACUUM); /* * Do the actual work --- either FULL or "lazy" vacuum *************** *** 1098,1104 **** else lazy_vacuum_rel(onerel, vacstmt); ! StrategyHintVacuum(false); /* all done with this class, but hold lock until commit */ relation_close(onerel, NoLock); --- 1098,1104 ---- else lazy_vacuum_rel(onerel, vacstmt); ! SetAccessPattern(AP_NORMAL); /* all done with this class, but hold lock until commit */ relation_close(onerel, NoLock); Index: src/backend/storage/buffer/README =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/README,v retrieving revision 1.11 diff -c -r1.11 README *** src/backend/storage/buffer/README 23 Jul 2006 03:07:58 -0000 1.11 --- src/backend/storage/buffer/README 16 May 2007 11:43:11 -0000 *************** *** 152,159 **** a field to show which backend is doing its I/O). ! Buffer replacement strategy ! --------------------------- There is a "free list" of buffers that are prime candidates for replacement. In particular, buffers that are completely free (contain no valid page) are --- 152,159 ---- a field to show which backend is doing its I/O). ! Normal buffer replacement strategy ! ---------------------------------- There is a "free list" of buffers that are prime candidates for replacement. In particular, buffers that are completely free (contain no valid page) are *************** *** 199,221 **** have to give up and try another buffer. This however is not a concern of the basic select-a-victim-buffer algorithm.) 
- A special provision is that while running VACUUM, a backend does not - increment the usage count on buffers it accesses. In fact, if ReleaseBuffer - sees that it is dropping the pin count to zero and the usage count is zero, - then it appends the buffer to the tail of the free list. (This implies that - VACUUM, but only VACUUM, must take the BufFreelistLock during ReleaseBuffer; - this shouldn't create much of a contention problem.) This provision - encourages VACUUM to work in a relatively small number of buffers rather - than blowing out the entire buffer cache. It is reasonable since a page - that has been touched only by VACUUM is unlikely to be needed again soon. - - Since VACUUM usually requests many pages very fast, the effect of this is that - it will get back the very buffers it filled and possibly modified on the next - call and will therefore do its work in a few shared memory buffers, while - being able to use whatever it finds in the cache already. This also implies - that most of the write traffic caused by a VACUUM will be done by the VACUUM - itself and not pushed off onto other processes. Background writer's processing ------------------------------ --- 199,243 ---- have to give up and try another buffer. This however is not a concern of the basic select-a-victim-buffer algorithm.) + Buffer ring replacement strategy + --------------------------------- + + When running a query that needs to access a large number of pages, like VACUUM, + COPY, or a large sequential scan, a different strategy is used. A page that + has been touched only by such a scan is unlikely to be needed again soon, so + instead of running the normal clock sweep algorithm and blowing out the entire + buffer cache, a small ring of buffers is allocated using the normal clock sweep + algorithm and those buffers are reused for the whole scan. This also implies + that most of the write traffic caused by such a statement will be done by the + backend itself and not pushed off onto other processes. + + The size of the ring used depends on the kind of scan: + + For sequential scans, a small 256 KB ring is used. That's small enough to fit + in L2 cache, which makes transferring pages from OS cache to shared buffer + cache efficient. Even less would often be enough, but the ring must be big + enough to accommodate all pages in the scan that are pinned concurrently. + 256 KB should also be enough to leave a small cache trail for other backends to + join in a synchronized seq scan. If a buffer is dirtied and LSN set, the buffer + is removed from the ring and a replacement buffer is chosen using the normal + replacement strategy. In a scan that modifies every page in the scan, like a + bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and the + ring strategy effectively degrades to the normal strategy. + + VACUUM uses a 256 KB ring like sequential scans, but dirty pages are not + removed from the ring. WAL is flushed instead to allow reuse of the buffers. + Before introducing the buffer ring strategy in 8.3, buffers were put to the + freelist, which was effectively a buffer ring of 1 buffer. + + COPY behaves like VACUUM, but a much larger ring is used. The ring size is + chosen to be twice the WAL segment size. 
This avoids polluting the buffer cache + like the clock sweep would do, and using a ring larger than WAL segment size + avoids having to do any extra WAL flushes, since a WAL segment will always be + filled, forcing a WAL flush, before looping through the buffer ring and bumping + into a buffer that would force a WAL flush. However, for non-WAL-logged COPY + operations the smaller 256 KB ring is used because WAL flushes are not needed + to write the buffers. Background writer's processing ------------------------------ Index: src/backend/storage/buffer/bufmgr.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v retrieving revision 1.218 diff -c -r1.218 bufmgr.c *** src/backend/storage/buffer/bufmgr.c 2 May 2007 23:34:48 -0000 1.218 --- src/backend/storage/buffer/bufmgr.c 16 May 2007 12:34:10 -0000 *************** *** 419,431 **** /* Loop here in case we have to try another victim buffer */ for (;;) { /* * Select a victim buffer. The buffer is returned with its header * spinlock still held! Also the BufFreelistLock is still held, since * it would be bad to hold the spinlock while possibly waking up other * processes. */ ! buf = StrategyGetBuffer(); Assert(buf->refcount == 0); --- 419,433 ---- /* Loop here in case we have to try another victim buffer */ for (;;) { + bool lock_held; + /* * Select a victim buffer. The buffer is returned with its header * spinlock still held! Also the BufFreelistLock is still held, since * it would be bad to hold the spinlock while possibly waking up other * processes. */ ! buf = StrategyGetBuffer(&lock_held); Assert(buf->refcount == 0); *************** *** 436,442 **** PinBuffer_Locked(buf); /* Now it's safe to release the freelist lock */ ! LWLockRelease(BufFreelistLock); /* * If the buffer was dirty, try to write it out. There is a race --- 438,445 ---- PinBuffer_Locked(buf); /* Now it's safe to release the freelist lock */ ! if (lock_held) ! LWLockRelease(BufFreelistLock); /* * If the buffer was dirty, try to write it out. There is a race *************** *** 464,469 **** --- 467,489 ---- */ if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED)) { + /* In BULKREAD-mode, check if a WAL flush would be needed to + * evict this buffer. If so, ask the replacement strategy if + * we should go ahead and do it or choose another victim. + */ + if (active_access_pattern == AP_BULKREAD) + { + if (XLogNeedsFlush(BufferGetLSN(buf))) + { + if (StrategyRejectBuffer(buf)) + { + LWLockRelease(buf->content_lock); + UnpinBuffer(buf, true, false); + continue; + } + } + } + FlushBuffer(buf, NULL); LWLockRelease(buf->content_lock); } *************** *** 925,932 **** PrivateRefCount[b]--; if (PrivateRefCount[b] == 0) { - bool immed_free_buffer = false; - /* I'd better not still hold any locks on the buffer */ Assert(!LWLockHeldByMe(buf->content_lock)); Assert(!LWLockHeldByMe(buf->io_in_progress_lock)); --- 945,950 ---- *************** *** 940,956 **** /* Update buffer usage info, unless this is an internal access */ if (normalAccess) { ! if (!strategy_hint_vacuum) { ! if (buf->usage_count < BM_MAX_USAGE_COUNT) ! buf->usage_count++; } else ! { ! /* VACUUM accesses don't bump usage count, instead... */ ! if (buf->refcount == 0 && buf->usage_count == 0) ! immed_free_buffer = true; ! } } if ((buf->flags & BM_PIN_COUNT_WAITER) && --- 958,975 ---- /* Update buffer usage info, unless this is an internal access */ if (normalAccess) { ! if (active_access_pattern != AP_NORMAL) { ! 
/* We don't want large one-off scans like vacuum to inflate ! * the usage_count. We do want to set it to 1, though, to keep ! * other backends from hijacking it from the buffer ring. ! */ ! if (buf->usage_count == 0) ! buf->usage_count = 1; } else ! if (buf->usage_count < BM_MAX_USAGE_COUNT) ! buf->usage_count++; } if ((buf->flags & BM_PIN_COUNT_WAITER) && *************** *** 965,978 **** } else UnlockBufHdr(buf); - - /* - * If VACUUM is releasing an otherwise-unused buffer, send it to the - * freelist for near-term reuse. We put it at the tail so that it - * won't be used before any invalid buffers that may exist. - */ - if (immed_free_buffer) - StrategyFreeBuffer(buf, false); } } --- 984,989 ---- Index: src/backend/storage/buffer/freelist.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/freelist.c,v retrieving revision 1.58 diff -c -r1.58 freelist.c *** src/backend/storage/buffer/freelist.c 5 Jan 2007 22:19:37 -0000 1.58 --- src/backend/storage/buffer/freelist.c 17 May 2007 16:12:56 -0000 *************** *** 18,23 **** --- 18,25 ---- #include "storage/buf_internals.h" #include "storage/bufmgr.h" + #include "utils/memutils.h" + /* * The shared freelist control information. *************** *** 39,47 **** /* Pointers to shared state */ static BufferStrategyControl *StrategyControl = NULL; ! /* Backend-local state about whether currently vacuuming */ ! bool strategy_hint_vacuum = false; /* * StrategyGetBuffer --- 41,53 ---- /* Pointers to shared state */ static BufferStrategyControl *StrategyControl = NULL; ! /* Currently active access pattern hint. */ ! AccessPattern active_access_pattern = AP_NORMAL; + /* prototypes for internal functions */ + static volatile BufferDesc *GetBufferFromRing(void); + static void PutBufferToRing(volatile BufferDesc *buf); + static void InitRing(void); /* * StrategyGetBuffer *************** *** 51,67 **** * the selected buffer must not currently be pinned by anyone. * * To ensure that no one else can pin the buffer before we do, we must ! * return the buffer with the buffer header spinlock still held. That ! * means that we return with the BufFreelistLock still held, as well; ! * the caller must release that lock once the spinlock is dropped. */ volatile BufferDesc * ! StrategyGetBuffer(void) { volatile BufferDesc *buf; int trycounter; LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); /* * Try to get a buffer from the freelist. Note that the freeNext fields --- 57,89 ---- * the selected buffer must not currently be pinned by anyone. * * To ensure that no one else can pin the buffer before we do, we must ! * return the buffer with the buffer header spinlock still held. If ! * *lock_held is set at return, we return with the BufFreelistLock still ! * held, as well; the caller must release that lock once the spinlock is ! * dropped. */ volatile BufferDesc * ! StrategyGetBuffer(bool *lock_held) { volatile BufferDesc *buf; int trycounter; + /* Get a buffer from the ring if we're doing a bulk scan */ + if (active_access_pattern != AP_NORMAL) + { + buf = GetBufferFromRing(); + if (buf != NULL) + { + *lock_held = false; + return buf; + } + } + + /* + * If our selected buffer wasn't available, pick another... + */ LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); + *lock_held = true; /* * Try to get a buffer from the freelist. 
Note that the freeNext fields *************** *** 86,96 **** */ LockBufHdr(buf); if (buf->refcount == 0 && buf->usage_count == 0) return buf; UnlockBufHdr(buf); } ! /* Nothing on the freelist, so run the "clock sweep" algorithm */ trycounter = NBuffers; for (;;) { --- 108,122 ---- */ LockBufHdr(buf); if (buf->refcount == 0 && buf->usage_count == 0) + { + if (active_access_pattern != AP_NORMAL) + PutBufferToRing(buf); return buf; + } UnlockBufHdr(buf); } ! /* Nothing on the freelist, so run the shared "clock sweep" algorithm */ trycounter = NBuffers; for (;;) { *************** *** 105,111 **** --- 131,141 ---- */ LockBufHdr(buf); if (buf->refcount == 0 && buf->usage_count == 0) + { + if (active_access_pattern != AP_NORMAL) + PutBufferToRing(buf); return buf; + } if (buf->usage_count > 0) { buf->usage_count--; *************** *** 191,204 **** } /* ! * StrategyHintVacuum -- tell us whether VACUUM is active */ void ! StrategyHintVacuum(bool vacuum_active) { ! strategy_hint_vacuum = vacuum_active; ! } /* * StrategyShmemSize --- 221,245 ---- } /* ! * SetAccessPattern -- Sets the active access pattern hint ! * ! * Caller is responsible for resetting the hint to AP_NORMAL after the bulk ! * operation is done. It's ok to switch repeatedly between AP_NORMAL and one of ! * the other strategies, for example in a query with one large sequential scan ! * nested loop joined to an index scan. Index tuples should be fetched with the ! * normal strategy and the pages from the seq scan should be read in with the ! * AP_BULKREAD strategy. The ring won't be affected by such switching, however ! * switching to an access pattern with different ring size will invalidate the ! * old ring. */ void ! SetAccessPattern(AccessPattern new_pattern) { ! active_access_pattern = new_pattern; + if (active_access_pattern != AP_NORMAL) + InitRing(); + } /* * StrategyShmemSize *************** *** 274,276 **** --- 315,498 ---- else Assert(!init); } + + /* ---------------------------------------------------------------- + * Backend-private buffer ring management + * ---------------------------------------------------------------- + */ + + /* + * Ring sizes for different access patterns. See README for the rationale + * of these. + */ + #define BULKREAD_RING_SIZE 256 * 1024 / BLCKSZ + #define VACUUM_RING_SIZE 256 * 1024 / BLCKSZ + #define COPY_RING_SIZE Min(NBuffers / 8, (XLOG_SEG_SIZE / BLCKSZ) * 2) + + /* + * BufferRing is an array of buffer ids, and RingSize it's size in number of + * elements. It's allocated in TopMemoryContext the first time it's needed. + */ + static int *BufferRing = NULL; + static int RingSize = 0; + + /* Index of the "current" slot in the ring. It's advanced every time a buffer + * is handed out from the ring with GetBufferFromRing and it points to the + * last buffer returned from the ring. RingCurSlot + 1 is the next victim + * GetBufferRing will hand out. + */ + static int RingCurSlot = 0; + + /* magic value to mark empty slots in the ring */ + #define BUF_ID_NOT_SET -1 + + + /* + * GetBufferFromRing -- returns a buffer from the ring, or NULL if the + * ring is empty. + * + * The bufhdr spin lock is held on the returned buffer. 
+ */ + static volatile BufferDesc * + GetBufferFromRing(void) + { + volatile BufferDesc *buf; + + /* ring should be initialized by now */ + Assert(RingSize > 0 && BufferRing != NULL); + + /* Run private "clock cycle" */ + if (++RingCurSlot >= RingSize) + RingCurSlot = 0; + + /* + * If that slot hasn't been filled yet, tell the caller to allocate + * a new buffer with the normal allocation strategy. He will then + * fill this slot by calling PutBufferToRing with the new buffer. + */ + if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET) + return NULL; + + buf = &BufferDescriptors[BufferRing[RingCurSlot]]; + + /* + * If the buffer is pinned we cannot use it under any circumstances. + * If usage_count == 0 then the buffer is fair game. + * + * We also choose this buffer if usage_count == 1. Strictly, this + * might sometimes be the wrong thing to do, but we rely on the high + * probability that it was this process that last touched the buffer. + * If it wasn't, we'll choose a suboptimal victim, but it shouldn't + * make any difference in the big scheme of things. + * + */ + LockBufHdr(buf); + if (buf->refcount == 0 && buf->usage_count <= 1) + return buf; + UnlockBufHdr(buf); + + return NULL; + } + + /* + * PutBufferToRing -- adds a buffer to the buffer ring + * + * Caller must hold the buffer header spinlock on the buffer. + */ + static void + PutBufferToRing(volatile BufferDesc *buf) + { + /* ring should be initialized by now */ + Assert(RingSize > 0 && BufferRing != NULL); + + if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET) + BufferRing[RingCurSlot] = buf->buf_id; + } + + /* + * Initializes a ring buffer with correct size for the currently + * active strategy. Does nothing if the ring already has the right size. + */ + static void + InitRing(void) + { + int new_size; + int old_size = RingSize; + int i; + MemoryContext oldcxt; + + /* Determine new size */ + + switch(active_access_pattern) + { + case AP_BULKREAD: + new_size = BULKREAD_RING_SIZE; + break; + case AP_COPY: + new_size = COPY_RING_SIZE; + break; + case AP_VACUUM: + new_size = VACUUM_RING_SIZE; + break; + default: + elog(ERROR, "unexpected buffer cache strategy %d", + active_access_pattern); + return; /* keep compile happy */ + } + + /* + * Seq scans set and reset the strategy on every page, so we better exit + * quickly if no change in size is needed. + */ + if (new_size == old_size) + return; + + /* Allocate array */ + + oldcxt = MemoryContextSwitchTo(TopMemoryContext); + + if (old_size == 0) + { + Assert(BufferRing == NULL); + BufferRing = palloc(new_size * sizeof(int)); + } + else + BufferRing = repalloc(BufferRing, new_size * sizeof(int)); + + MemoryContextSwitchTo(oldcxt); + + for(i = 0; i < new_size; i++) + BufferRing[i] = BUF_ID_NOT_SET; + + RingCurSlot = 0; + RingSize = new_size; + } + + /* + * Buffer manager calls this function in AP_BULKREAD mode when the + * buffer handed to it turns out to need a WAL flush to write out. This + * gives the strategy a second chance to choose another victim. + * + * Returns true if buffer manager should ask for a new victim, and false + * if WAL should be flushed and this buffer used. + */ + bool + StrategyRejectBuffer(volatile BufferDesc *buf) + { + Assert(RingSize > 0); + + if (BufferRing[RingCurSlot] == buf->buf_id) + { + BufferRing[RingCurSlot] = BUF_ID_NOT_SET; + return true; + } + else + { + /* Apparently the buffer didn't come from the ring. 
We don't want to + * mess with how the clock sweep works; in worst case there's no + * buffers in the buffer cache that can be reused without a WAL flush, + * and we'd get into an endless loop trying. + */ + return false; + } + } Index: src/include/access/relscan.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v retrieving revision 1.52 diff -c -r1.52 relscan.h *** src/include/access/relscan.h 20 Jan 2007 18:43:35 -0000 1.52 --- src/include/access/relscan.h 15 May 2007 17:01:31 -0000 *************** *** 28,33 **** --- 28,34 ---- ScanKey rs_key; /* array of scan key descriptors */ BlockNumber rs_nblocks; /* number of blocks to scan */ bool rs_pageatatime; /* verify visibility page-at-a-time? */ + AccessPattern rs_accesspattern; /* access pattern to use for reads */ /* scan current state */ bool rs_inited; /* false = scan not init'd yet */ Index: src/include/access/xlog.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v retrieving revision 1.76 diff -c -r1.76 xlog.h *** src/include/access/xlog.h 5 Jan 2007 22:19:51 -0000 1.76 --- src/include/access/xlog.h 14 May 2007 21:22:40 -0000 *************** *** 151,156 **** --- 151,157 ---- extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata); extern void XLogFlush(XLogRecPtr RecPtr); + extern bool XLogNeedsFlush(XLogRecPtr RecPtr); extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record); extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec); Index: src/include/storage/buf_internals.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v retrieving revision 1.89 diff -c -r1.89 buf_internals.h *** src/include/storage/buf_internals.h 5 Jan 2007 22:19:57 -0000 1.89 --- src/include/storage/buf_internals.h 15 May 2007 17:07:59 -0000 *************** *** 16,21 **** --- 16,22 ---- #define BUFMGR_INTERNALS_H #include "storage/buf.h" + #include "storage/bufmgr.h" #include "storage/lwlock.h" #include "storage/shmem.h" #include "storage/spin.h" *************** *** 168,174 **** extern BufferDesc *LocalBufferDescriptors; /* in freelist.c */ ! extern bool strategy_hint_vacuum; /* event counters in buf_init.c */ extern long int ReadBufferCount; --- 169,175 ---- extern BufferDesc *LocalBufferDescriptors; /* in freelist.c */ ! extern AccessPattern active_access_pattern; /* event counters in buf_init.c */ extern long int ReadBufferCount; *************** *** 184,195 **** */ /* freelist.c */ ! extern volatile BufferDesc *StrategyGetBuffer(void); extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head); extern int StrategySyncStart(void); extern Size StrategyShmemSize(void); extern void StrategyInitialize(bool init); /* buf_table.c */ extern Size BufTableShmemSize(int size); extern void InitBufTable(int size); --- 185,198 ---- */ /* freelist.c */ ! 
extern volatile BufferDesc *StrategyGetBuffer(bool *lock_held); extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head); extern int StrategySyncStart(void); extern Size StrategyShmemSize(void); extern void StrategyInitialize(bool init); + extern bool StrategyRejectBuffer(volatile BufferDesc *buf); + /* buf_table.c */ extern Size BufTableShmemSize(int size); extern void InitBufTable(int size); Index: src/include/storage/bufmgr.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v retrieving revision 1.103 diff -c -r1.103 bufmgr.h *** src/include/storage/bufmgr.h 2 May 2007 23:18:03 -0000 1.103 --- src/include/storage/bufmgr.h 15 May 2007 17:07:02 -0000 *************** *** 48,53 **** --- 48,61 ---- #define BUFFER_LOCK_SHARE 1 #define BUFFER_LOCK_EXCLUSIVE 2 + typedef enum AccessPattern + { + AP_NORMAL, /* Normal random access */ + AP_BULKREAD, /* Large read-only scan (hint bit updates are ok) */ + AP_COPY, /* Large updating scan, like COPY with WAL enabled */ + AP_VACUUM, /* VACUUM */ + } AccessPattern; + /* * These routines are beaten on quite heavily, hence the macroization. */ *************** *** 157,162 **** extern void AtProcExit_LocalBuffers(void); /* in freelist.c */ ! extern void StrategyHintVacuum(bool vacuum_active); #endif --- 165,170 ---- extern void AtProcExit_LocalBuffers(void); /* in freelist.c */ ! extern void SetAccessPattern(AccessPattern new_pattern); #endif
Attachments
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I've completed a set of performance tests on a test server. The server
> has 4 GB of RAM, of which 1 GB is used for shared_buffers.

Perhaps I'm misreading it, but these tests seem to show no improvement
worth spending any effort on --- some of the tests improve a bit but many
get worse, and that's for tests allegedly designed to highlight the
improvement; there's been no attempt to measure the distributed cost of the
additional logic in scenarios where it's not helpful. To say nothing of the
likelihood that it will be flat-out counterproductive in yet other
scenarios.

regression=# select id,sum(rt),count(*) from times group by id;
     id     |       sum       | count
------------+-----------------+-------
 10G        | 01:15:53.497114 |    20
 10Gpatched | 01:12:51.749906 |    20
 2G         | 00:11:54.343741 |    20
 2Gpatched  | 00:11:32.482733 |    20
(4 rows)

Should we not just reject the patch and move on to something more useful?

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> I've completed a set of performance tests on a test server. The server
>> has 4 GB of RAM, of which 1 GB is used for shared_buffers.
>
> Perhaps I'm misreading it, but these tests seem to show no improvement
> worth spending any effort on --- some of the tests improve a bit but
> many get worse, and that's for tests allegedly designed to highlight the
> improvement; there's been no attempt to measure the distributed cost of
> the additional logic in scenarios where it's not helpful. To say
> nothing of the likelihood that it will be flat-out counterproductive
> in yet other scenarios.

Yeah, possibly. I've been thinking hard about scenarios where this would be
counterproductive; bulk delete is one that the original patch hurt, but I
think I have all the interesting scenarios covered now.

> Should we not just reject the patch and move on to something more
> useful?

If Luke is right, the L2 cache effect that's visible in these in-memory
tests:

 select-medsize-patched | 00:00:01.332975
 select-medsize-patched | 00:00:01.33014
 select-medsize-patched | 00:00:01.332392
 select-medsize-patched | 00:00:01.333498
 select-medsize-patched | 00:00:01.332692
 select-medsize-unpatched | 00:00:02.678069
 select-medsize-unpatched | 00:00:02.666103
 select-medsize-unpatched | 00:00:02.673494
 select-medsize-unpatched | 00:00:02.669645
 select-medsize-unpatched | 00:00:02.666278

is also visible on larger scans that don't fit in cache with bigger I/O
hardware, and this patch would increase the max. I/O throughput that we can
handle on such hardware. I don't have such hardware available; I hope
someone else will try that.

In addition to that, this should reduce the cache-spoiling effect of big
scans and COPY. That's harder to quantify. I've been planning to run a TPC-C
test with a large SELECT COUNT(*) running in the background to measure that,
but the server and I have been busy with those other tests.

In any case, I do want this for VACUUMs to fix the "WAL flush for every
dirty page" problem. Maybe we should indeed drop the other aspects of the
patch and move on; I'm getting tired of this as well.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> In any case, I do want this for VACUUMs to fix the "WAL flush for every
> dirty page" problem. Maybe we should indeed drop the other aspects of
> the patch and move on, I'm getting tired of this as well.

Can we devise a small patch that fixes that issue without creating uncertain
impacts on other behavior?

The thing that has made me uncomfortable with this set of patches right
along is the presumption that it's a good idea to tweak cache behavior to
improve a small set of cases. We are using cache behavior (now clock-sweep,
formerly LRU) that is well tested and proven across a large body of
experience; my gut feeling is that deviating from that behavior is likely to
be a net loss when the big picture is considered. I certainly have not seen
any test results that try to prove that there's no such generalized
disadvantage to these patches.

One could argue that the real bug here is that VACUUM deviates from the
standard behavior by forcing immediate recycling of pages it releases, and
that getting rid of that wart is the correct answer rather than piling more
warts atop it --- especially warts that will change the behavior for
anything besides VACUUM.

			regards, tom lane
Hi Heikki,

On 5/17/07 10:28 AM, "Heikki Linnakangas" <heikki@enterprisedb.com> wrote:

> is also visible on larger scans that don't fit in cache with bigger I/O
> hardware, and this patch would increase the max. I/O throughput that we
> can handle on such hardware. I don't have such hardware available, I
> hope someone else will try that.

Yes, this is absolutely the case, in addition to the benefits of not
polluting the bufcache with seq scans (as discussed in detail previously).

We've adopted this (see CK's patch) with excellent benefits. We can try your
version on a machine with fast I/O and get back to you with a comparison of
this and CK's version.

- Luke
CK Tan wrote:
> If it is convenient for you, could you run my patch against the same
> hardware and data to get some numbers on select for comparison? Although
> we don't address updates, copy, or inserts, we are definitely getting
> at least 20% improvement in scans here without poisoning the bufpool for
> large tables.

Sure, here you are:

 copy-cktan-big | 00:10:08.843126
 copy-cktan-big | 00:10:17.767606
 copy-cktan-big | 00:09:37.797059
 copy_nowal-cktan-big | 00:05:12.305025
 copy_nowal-cktan-big | 00:05:10.518417
 copy_nowal-cktan-big | 00:05:03.472939
 select-cktan-big | 00:03:27.655982
 select-cktan-big | 00:01:55.496408
 select-cktan-big | 00:01:31.693856
 select-cktan-big | 00:01:12.705268
 select-cktan-big | 00:01:12.478247
 select-cktan-big | 00:01:10.866484
 vacuum_clean-cktan-big | 00:03:05.340875
 vacuum_clean-cktan-big | 00:01:12.428317
 vacuum_clean-cktan-big | 00:01:13.179957
 vacuum_clean-cktan-big | 00:01:10.438888
 vacuum_hintbits-cktan-big | 00:03:58.78208
 vacuum_hintbits-cktan-big | 00:04:02.515778
 vacuum_hintbits-cktan-big | 00:04:19.083402
 vacuum_hintbits-cktan-big | 00:04:11.170831
 copy-cktan-med | 00:02:19.413484
 copy-cktan-med | 00:02:22.270497
 copy-cktan-med | 00:02:22.297946
 copy_nowal-cktan-med | 00:01:31.192601
 copy_nowal-cktan-med | 00:01:17.736356
 copy_nowal-cktan-med | 00:01:32.272778
 select-cktan-med | 00:00:03.774974
 select-cktan-med | 00:00:01.279276
 select-cktan-med | 00:00:01.297703
 select-cktan-med | 00:00:01.304129
 select-cktan-med | 00:00:01.297048
 select-cktan-med | 00:00:01.306073
 vacuum_clean-cktan-med | 00:00:01.820733
 vacuum_clean-cktan-med | 00:00:01.755684
 vacuum_clean-cktan-med | 00:00:01.755659
 vacuum_clean-cktan-med | 00:00:01.750972
 vacuum_hintbits-cktan-med | 00:00:01.58744
 vacuum_hintbits-cktan-med | 00:00:06.216038
 vacuum_hintbits-cktan-med | 00:00:02.201789
 vacuum_hintbits-cktan-med | 00:00:36.494427

Select performance looks the same as with Simon's/my patch. That 20% gain
must be hardware-dependent. My interpretation is that the CPU savings from
more efficient cache usage only matter when the I/O bandwidth is high enough
that the CPU becomes the bottleneck.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> In any case, I do want this for VACUUMs to fix the "WAL flush for every
>> dirty page" problem. Maybe we should indeed drop the other aspects of
>> the patch and move on, I'm getting tired of this as well.
>
> Can we devise a small patch that fixes that issue without creating
> uncertain impacts on other behavior?

Yeah, certainly; that's my backup plan.

> The thing that has made me uncomfortable with this set of patches right
> along is the presumption that it's a good idea to tweak cache behavior
> to improve a small set of cases. We are using cache behavior (now
> clock-sweep, formerly LRU) that is well tested and proven across a large
> body of experience; my gut feeling is that deviating from that behavior
> is likely to be a net loss when the big picture is considered. I
> certainly have not seen any test results that try to prove that there's
> no such generalized disadvantage to these patches.

That was my initial reaction as well. However, people claimed that by using
a small number of buffers you can achieve higher raw throughput. Avoiding
the cache spoiling sounds plausible as well: if you're going to do a seq
scan of a table larger than shared_buffers, you know those pages are not
going to be needed in the near future; you're going to replace them yourself
with pages from the same scan.

> One could argue that the real bug here is that VACUUM deviates from the
> standard behavior by forcing immediate recycling of pages it releases,
> and that getting rid of that wart is the correct answer rather than
> piling more warts atop it --- especially warts that will change the
> behavior for anything besides VACUUM.

Maybe a better approach would be switching to an algorithm that's more
resistant to sequential scans by nature. That's definitely not going to
happen for 8.3, though. It's also possible that whatever algorithm we choose
doesn't make much difference in practice, because in typical configurations
the OS cache is bigger and more significant anyway. It's also possible that
there are some counterproductive interactions between our cache management
and the OS's.

In any case, I'd like to see more test results before we make a decision.
I'm running tests with DBT-2 and a seq scan running in the background to see
if the cache-spoiling effect shows up. I'm also trying to get hold of some
bigger hardware to run on. Running these tests takes some calendar time, but
the hard work has already been done.

I'm going to start reviewing Jeff's synchronized scans patch now.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas wrote:
> In any case, I'd like to see more test results before we make a
> decision. I'm running tests with DBT-2 and a seq scan running in the
> background to see if the cache-spoiling effect shows up. I'm also trying
> to get hold of some bigger hardware to run on. Running these tests takes
> some calendar time, but the hard work has already been done. I'm going
> to start reviewing Jeff's synchronized scans patch now.

Here are the results of the DBT-2 tests:

http://community.enterprisedb.com/seqscan/imola/

In each of these tests, at the end of rampup a script is started that issues
a "SELECT COUNT(*) FROM stock" in a loop, with a 2 minute delay between the
end of the previous query and the start of the next one.

The patch makes the seq scans go significantly faster. In the 1 hour test
period, the patched tests perform roughly 30-100% more selects than the
unpatched tests.

With 100 and 105 warehouses, it also significantly reduces the impact of the
seq scan on other queries; response times are lower with the patch. With 120
warehouses the reduction of impact is not as clear, but when you plot the
response times it's still there (the plots on the "response times charts"
page are useless because they're overwhelmed by the checkpoint spike).

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
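For reference, the background load described above amounts to roughly the
following. This is an illustrative libpq sketch, not the actual script used
in the tests; the connection string is a placeholder.

#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
	/* placeholder connection string; the real runs used a DBT-2 database */
	PGconn	   *conn = PQconnectdb("dbname=dbt2");

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	for (;;)
	{
		/* force a full seq scan of the large stock table */
		PGresult   *res = PQexec(conn, "SELECT COUNT(*) FROM stock");

		if (PQresultStatus(res) == PGRES_TUPLES_OK)
			printf("count = %s\n", PQgetvalue(res, 0, 0));
		else
			fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
		PQclear(res);

		sleep(120);				/* 2 minute delay before the next scan */
	}

	PQfinish(conn);				/* not reached */
	return 0;
}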
I forgot to attach the program used to generate the test data. Here it is.

Heikki Linnakangas wrote:
> Attached is a new version of Simon's "scan-resistant buffer manager"
> patch. It's not ready for committing yet because of a small issue I
> found this morning (* see bottom), but here's a status update.
> [...]
> + */ > + if (active_access_pattern == AP_BULKREAD) > + { > + if (XLogNeedsFlush(BufferGetLSN(buf))) > + { > + if (StrategyRejectBuffer(buf)) > + { > + LWLockRelease(buf->content_lock); > + UnpinBuffer(buf, true, false); > + continue; > + } > + } > + } > + > FlushBuffer(buf, NULL); > LWLockRelease(buf->content_lock); > } > *************** > *** 925,932 **** > PrivateRefCount[b]--; > if (PrivateRefCount[b] == 0) > { > - bool immed_free_buffer = false; > - > /* I'd better not still hold any locks on the buffer */ > Assert(!LWLockHeldByMe(buf->content_lock)); > Assert(!LWLockHeldByMe(buf->io_in_progress_lock)); > --- 945,950 ---- > *************** > *** 940,956 **** > /* Update buffer usage info, unless this is an internal access */ > if (normalAccess) > { > ! if (!strategy_hint_vacuum) > { > ! if (buf->usage_count < BM_MAX_USAGE_COUNT) > ! buf->usage_count++; > } > else > ! { > ! /* VACUUM accesses don't bump usage count, instead... */ > ! if (buf->refcount == 0 && buf->usage_count == 0) > ! immed_free_buffer = true; > ! } > } > > if ((buf->flags & BM_PIN_COUNT_WAITER) && > --- 958,975 ---- > /* Update buffer usage info, unless this is an internal access */ > if (normalAccess) > { > ! if (active_access_pattern != AP_NORMAL) > { > ! /* We don't want large one-off scans like vacuum to inflate > ! * the usage_count. We do want to set it to 1, though, to keep > ! * other backends from hijacking it from the buffer ring. > ! */ > ! if (buf->usage_count == 0) > ! buf->usage_count = 1; > } > else > ! if (buf->usage_count < BM_MAX_USAGE_COUNT) > ! buf->usage_count++; > } > > if ((buf->flags & BM_PIN_COUNT_WAITER) && > *************** > *** 965,978 **** > } > else > UnlockBufHdr(buf); > - > - /* > - * If VACUUM is releasing an otherwise-unused buffer, send it to the > - * freelist for near-term reuse. We put it at the tail so that it > - * won't be used before any invalid buffers that may exist. > - */ > - if (immed_free_buffer) > - StrategyFreeBuffer(buf, false); > } > } > > --- 984,989 ---- > Index: src/backend/storage/buffer/freelist.c > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/freelist.c,v > retrieving revision 1.58 > diff -c -r1.58 freelist.c > *** src/backend/storage/buffer/freelist.c 5 Jan 2007 22:19:37 -0000 1.58 > --- src/backend/storage/buffer/freelist.c 17 May 2007 16:12:56 -0000 > *************** > *** 18,23 **** > --- 18,25 ---- > #include "storage/buf_internals.h" > #include "storage/bufmgr.h" > > + #include "utils/memutils.h" > + > > /* > * The shared freelist control information. > *************** > *** 39,47 **** > /* Pointers to shared state */ > static BufferStrategyControl *StrategyControl = NULL; > > ! /* Backend-local state about whether currently vacuuming */ > ! bool strategy_hint_vacuum = false; > > > /* > * StrategyGetBuffer > --- 41,53 ---- > /* Pointers to shared state */ > static BufferStrategyControl *StrategyControl = NULL; > > ! /* Currently active access pattern hint. */ > ! AccessPattern active_access_pattern = AP_NORMAL; > > + /* prototypes for internal functions */ > + static volatile BufferDesc *GetBufferFromRing(void); > + static void PutBufferToRing(volatile BufferDesc *buf); > + static void InitRing(void); > > /* > * StrategyGetBuffer > *************** > *** 51,67 **** > * the selected buffer must not currently be pinned by anyone. > * > * To ensure that no one else can pin the buffer before we do, we must > ! 
* return the buffer with the buffer header spinlock still held. That > ! * means that we return with the BufFreelistLock still held, as well; > ! * the caller must release that lock once the spinlock is dropped. > */ > volatile BufferDesc * > ! StrategyGetBuffer(void) > { > volatile BufferDesc *buf; > int trycounter; > > LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); > > /* > * Try to get a buffer from the freelist. Note that the freeNext fields > --- 57,89 ---- > * the selected buffer must not currently be pinned by anyone. > * > * To ensure that no one else can pin the buffer before we do, we must > ! * return the buffer with the buffer header spinlock still held. If > ! * *lock_held is set at return, we return with the BufFreelistLock still > ! * held, as well; the caller must release that lock once the spinlock is > ! * dropped. > */ > volatile BufferDesc * > ! StrategyGetBuffer(bool *lock_held) > { > volatile BufferDesc *buf; > int trycounter; > > + /* Get a buffer from the ring if we're doing a bulk scan */ > + if (active_access_pattern != AP_NORMAL) > + { > + buf = GetBufferFromRing(); > + if (buf != NULL) > + { > + *lock_held = false; > + return buf; > + } > + } > + > + /* > + * If our selected buffer wasn't available, pick another... > + */ > LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); > + *lock_held = true; > > /* > * Try to get a buffer from the freelist. Note that the freeNext fields > *************** > *** 86,96 **** > */ > LockBufHdr(buf); > if (buf->refcount == 0 && buf->usage_count == 0) > return buf; > UnlockBufHdr(buf); > } > > ! /* Nothing on the freelist, so run the "clock sweep" algorithm */ > trycounter = NBuffers; > for (;;) > { > --- 108,122 ---- > */ > LockBufHdr(buf); > if (buf->refcount == 0 && buf->usage_count == 0) > + { > + if (active_access_pattern != AP_NORMAL) > + PutBufferToRing(buf); > return buf; > + } > UnlockBufHdr(buf); > } > > ! /* Nothing on the freelist, so run the shared "clock sweep" algorithm */ > trycounter = NBuffers; > for (;;) > { > *************** > *** 105,111 **** > --- 131,141 ---- > */ > LockBufHdr(buf); > if (buf->refcount == 0 && buf->usage_count == 0) > + { > + if (active_access_pattern != AP_NORMAL) > + PutBufferToRing(buf); > return buf; > + } > if (buf->usage_count > 0) > { > buf->usage_count--; > *************** > *** 191,204 **** > } > > /* > ! * StrategyHintVacuum -- tell us whether VACUUM is active > */ > void > ! StrategyHintVacuum(bool vacuum_active) > { > ! strategy_hint_vacuum = vacuum_active; > ! } > > > /* > * StrategyShmemSize > --- 221,245 ---- > } > > /* > ! * SetAccessPattern -- Sets the active access pattern hint > ! * > ! * Caller is responsible for resetting the hint to AP_NORMAL after the bulk > ! * operation is done. It's ok to switch repeatedly between AP_NORMAL and one of > ! * the other strategies, for example in a query with one large sequential scan > ! * nested loop joined to an index scan. Index tuples should be fetched with the > ! * normal strategy and the pages from the seq scan should be read in with the > ! * AP_BULKREAD strategy. The ring won't be affected by such switching, however > ! * switching to an access pattern with different ring size will invalidate the > ! * old ring. > */ > void > ! SetAccessPattern(AccessPattern new_pattern) > { > ! 
active_access_pattern = new_pattern; > > + if (active_access_pattern != AP_NORMAL) > + InitRing(); > + } > > /* > * StrategyShmemSize > *************** > *** 274,276 **** > --- 315,498 ---- > else > Assert(!init); > } > + > + /* ---------------------------------------------------------------- > + * Backend-private buffer ring management > + * ---------------------------------------------------------------- > + */ > + > + /* > + * Ring sizes for different access patterns. See README for the rationale > + * of these. > + */ > + #define BULKREAD_RING_SIZE 256 * 1024 / BLCKSZ > + #define VACUUM_RING_SIZE 256 * 1024 / BLCKSZ > + #define COPY_RING_SIZE Min(NBuffers / 8, (XLOG_SEG_SIZE / BLCKSZ) * 2) > + > + /* > + * BufferRing is an array of buffer ids, and RingSize it's size in number of > + * elements. It's allocated in TopMemoryContext the first time it's needed. > + */ > + static int *BufferRing = NULL; > + static int RingSize = 0; > + > + /* Index of the "current" slot in the ring. It's advanced every time a buffer > + * is handed out from the ring with GetBufferFromRing and it points to the > + * last buffer returned from the ring. RingCurSlot + 1 is the next victim > + * GetBufferRing will hand out. > + */ > + static int RingCurSlot = 0; > + > + /* magic value to mark empty slots in the ring */ > + #define BUF_ID_NOT_SET -1 > + > + > + /* > + * GetBufferFromRing -- returns a buffer from the ring, or NULL if the > + * ring is empty. > + * > + * The bufhdr spin lock is held on the returned buffer. > + */ > + static volatile BufferDesc * > + GetBufferFromRing(void) > + { > + volatile BufferDesc *buf; > + > + /* ring should be initialized by now */ > + Assert(RingSize > 0 && BufferRing != NULL); > + > + /* Run private "clock cycle" */ > + if (++RingCurSlot >= RingSize) > + RingCurSlot = 0; > + > + /* > + * If that slot hasn't been filled yet, tell the caller to allocate > + * a new buffer with the normal allocation strategy. He will then > + * fill this slot by calling PutBufferToRing with the new buffer. > + */ > + if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET) > + return NULL; > + > + buf = &BufferDescriptors[BufferRing[RingCurSlot]]; > + > + /* > + * If the buffer is pinned we cannot use it under any circumstances. > + * If usage_count == 0 then the buffer is fair game. > + * > + * We also choose this buffer if usage_count == 1. Strictly, this > + * might sometimes be the wrong thing to do, but we rely on the high > + * probability that it was this process that last touched the buffer. > + * If it wasn't, we'll choose a suboptimal victim, but it shouldn't > + * make any difference in the big scheme of things. > + * > + */ > + LockBufHdr(buf); > + if (buf->refcount == 0 && buf->usage_count <= 1) > + return buf; > + UnlockBufHdr(buf); > + > + return NULL; > + } > + > + /* > + * PutBufferToRing -- adds a buffer to the buffer ring > + * > + * Caller must hold the buffer header spinlock on the buffer. > + */ > + static void > + PutBufferToRing(volatile BufferDesc *buf) > + { > + /* ring should be initialized by now */ > + Assert(RingSize > 0 && BufferRing != NULL); > + > + if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET) > + BufferRing[RingCurSlot] = buf->buf_id; > + } > + > + /* > + * Initializes a ring buffer with correct size for the currently > + * active strategy. Does nothing if the ring already has the right size. 
> + */ > + static void > + InitRing(void) > + { > + int new_size; > + int old_size = RingSize; > + int i; > + MemoryContext oldcxt; > + > + /* Determine new size */ > + > + switch(active_access_pattern) > + { > + case AP_BULKREAD: > + new_size = BULKREAD_RING_SIZE; > + break; > + case AP_COPY: > + new_size = COPY_RING_SIZE; > + break; > + case AP_VACUUM: > + new_size = VACUUM_RING_SIZE; > + break; > + default: > + elog(ERROR, "unexpected buffer cache strategy %d", > + active_access_pattern); > + return; /* keep compile happy */ > + } > + > + /* > + * Seq scans set and reset the strategy on every page, so we better exit > + * quickly if no change in size is needed. > + */ > + if (new_size == old_size) > + return; > + > + /* Allocate array */ > + > + oldcxt = MemoryContextSwitchTo(TopMemoryContext); > + > + if (old_size == 0) > + { > + Assert(BufferRing == NULL); > + BufferRing = palloc(new_size * sizeof(int)); > + } > + else > + BufferRing = repalloc(BufferRing, new_size * sizeof(int)); > + > + MemoryContextSwitchTo(oldcxt); > + > + for(i = 0; i < new_size; i++) > + BufferRing[i] = BUF_ID_NOT_SET; > + > + RingCurSlot = 0; > + RingSize = new_size; > + } > + > + /* > + * Buffer manager calls this function in AP_BULKREAD mode when the > + * buffer handed to it turns out to need a WAL flush to write out. This > + * gives the strategy a second chance to choose another victim. > + * > + * Returns true if buffer manager should ask for a new victim, and false > + * if WAL should be flushed and this buffer used. > + */ > + bool > + StrategyRejectBuffer(volatile BufferDesc *buf) > + { > + Assert(RingSize > 0); > + > + if (BufferRing[RingCurSlot] == buf->buf_id) > + { > + BufferRing[RingCurSlot] = BUF_ID_NOT_SET; > + return true; > + } > + else > + { > + /* Apparently the buffer didn't come from the ring. We don't want to > + * mess with how the clock sweep works; in worst case there's no > + * buffers in the buffer cache that can be reused without a WAL flush, > + * and we'd get into an endless loop trying. > + */ > + return false; > + } > + } > Index: src/include/access/relscan.h > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v > retrieving revision 1.52 > diff -c -r1.52 relscan.h > *** src/include/access/relscan.h 20 Jan 2007 18:43:35 -0000 1.52 > --- src/include/access/relscan.h 15 May 2007 17:01:31 -0000 > *************** > *** 28,33 **** > --- 28,34 ---- > ScanKey rs_key; /* array of scan key descriptors */ > BlockNumber rs_nblocks; /* number of blocks to scan */ > bool rs_pageatatime; /* verify visibility page-at-a-time? 
*/ > + AccessPattern rs_accesspattern; /* access pattern to use for reads */ > > /* scan current state */ > bool rs_inited; /* false = scan not init'd yet */ > Index: src/include/access/xlog.h > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v > retrieving revision 1.76 > diff -c -r1.76 xlog.h > *** src/include/access/xlog.h 5 Jan 2007 22:19:51 -0000 1.76 > --- src/include/access/xlog.h 14 May 2007 21:22:40 -0000 > *************** > *** 151,156 **** > --- 151,157 ---- > > extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata); > extern void XLogFlush(XLogRecPtr RecPtr); > + extern bool XLogNeedsFlush(XLogRecPtr RecPtr); > > extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record); > extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec); > Index: src/include/storage/buf_internals.h > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v > retrieving revision 1.89 > diff -c -r1.89 buf_internals.h > *** src/include/storage/buf_internals.h 5 Jan 2007 22:19:57 -0000 1.89 > --- src/include/storage/buf_internals.h 15 May 2007 17:07:59 -0000 > *************** > *** 16,21 **** > --- 16,22 ---- > #define BUFMGR_INTERNALS_H > > #include "storage/buf.h" > + #include "storage/bufmgr.h" > #include "storage/lwlock.h" > #include "storage/shmem.h" > #include "storage/spin.h" > *************** > *** 168,174 **** > extern BufferDesc *LocalBufferDescriptors; > > /* in freelist.c */ > ! extern bool strategy_hint_vacuum; > > /* event counters in buf_init.c */ > extern long int ReadBufferCount; > --- 169,175 ---- > extern BufferDesc *LocalBufferDescriptors; > > /* in freelist.c */ > ! extern AccessPattern active_access_pattern; > > /* event counters in buf_init.c */ > extern long int ReadBufferCount; > *************** > *** 184,195 **** > */ > > /* freelist.c */ > ! extern volatile BufferDesc *StrategyGetBuffer(void); > extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head); > extern int StrategySyncStart(void); > extern Size StrategyShmemSize(void); > extern void StrategyInitialize(bool init); > > /* buf_table.c */ > extern Size BufTableShmemSize(int size); > extern void InitBufTable(int size); > --- 185,198 ---- > */ > > /* freelist.c */ > ! 
extern volatile BufferDesc *StrategyGetBuffer(bool *lock_held); > extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head); > extern int StrategySyncStart(void); > extern Size StrategyShmemSize(void); > extern void StrategyInitialize(bool init); > > + extern bool StrategyRejectBuffer(volatile BufferDesc *buf); > + > /* buf_table.c */ > extern Size BufTableShmemSize(int size); > extern void InitBufTable(int size); > Index: src/include/storage/bufmgr.h > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v > retrieving revision 1.103 > diff -c -r1.103 bufmgr.h > *** src/include/storage/bufmgr.h 2 May 2007 23:18:03 -0000 1.103 > --- src/include/storage/bufmgr.h 15 May 2007 17:07:02 -0000 > *************** > *** 48,53 **** > --- 48,61 ---- > #define BUFFER_LOCK_SHARE 1 > #define BUFFER_LOCK_EXCLUSIVE 2 > > + typedef enum AccessPattern > + { > + AP_NORMAL, /* Normal random access */ > + AP_BULKREAD, /* Large read-only scan (hint bit updates are ok) */ > + AP_COPY, /* Large updating scan, like COPY with WAL enabled */ > + AP_VACUUM, /* VACUUM */ > + } AccessPattern; > + > /* > * These routines are beaten on quite heavily, hence the macroization. > */ > *************** > *** 157,162 **** > extern void AtProcExit_LocalBuffers(void); > > /* in freelist.c */ > ! extern void StrategyHintVacuum(bool vacuum_active); > > #endif > --- 165,170 ---- > extern void AtProcExit_LocalBuffers(void); > > /* in freelist.c */ > ! extern void SetAccessPattern(AccessPattern new_pattern); > > #endif -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com #include <stdio.h> #include <stdlib.h> #include <string.h> #define TUPLES_PER_PAGE 15 int main(int argc, char **argv) { int tablesize; int lines; char buf[1000]; int i; if (argc != 2) { exit(1); } memset(buf, 'a', 500); buf[500] = '\0'; tablesize = atoi(argv[1]); lines = tablesize * 1024 * 1024 / 8192 * TUPLES_PER_PAGE; for(i = 1; i <= lines; i++) printf("%d\t%s\n", i, buf); }
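As a quick cross-check of the ring sizes defined in the patch above (BULKREAD_RING_SIZE, VACUUM_RING_SIZE, COPY_RING_SIZE), the standalone program below works through the same arithmetic for a stock build. The 8 kB BLCKSZ and 16 MB XLOG_SEG_SIZE are the PostgreSQL defaults, and the 1 GB shared_buffers figure simply matches the test setup, so the Min() clamp on the COPY ring does not kick in. This is only an illustration of the defines, not part of the patch.

#include <stdio.h>

/* Build-time constants assumed here: 8 kB pages, 16 MB WAL segments. */
#define BLCKSZ          8192
#define XLOG_SEG_SIZE   (16 * 1024 * 1024)
#define Min(a, b)       ((a) < (b) ? (a) : (b))

int
main(void)
{
    /* 1 GB of shared_buffers, as in the test setup, expressed in pages */
    int NBuffers = (1024 * 1024 * 1024) / BLCKSZ;

    int bulkread_ring = 256 * 1024 / BLCKSZ;                            /* seq scans */
    int vacuum_ring   = 256 * 1024 / BLCKSZ;                            /* VACUUM    */
    int copy_ring     = Min(NBuffers / 8, (XLOG_SEG_SIZE / BLCKSZ) * 2); /* COPY      */

    printf("BULKREAD ring: %d buffers\n", bulkread_ring);   /* 32   */
    printf("VACUUM   ring: %d buffers\n", vacuum_ring);     /* 32   */
    printf("COPY     ring: %d buffers\n", copy_ring);       /* 4096 */
    return 0;
}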
Great. Based on this, do you have a patch that is ready to apply. --------------------------------------------------------------------------- Heikki Linnakangas wrote: > Heikki Linnakangas wrote: > > In any case, I'd like to see more test results before we make a > > decision. I'm running tests with DBT-2 and a seq scan running in the > > background to see if the cache-spoiling effect shows up. I'm also trying > > to get hold of some bigger hardware to run on. Running these tests takes > > some calendar time, but the hard work has already been done. I'm going > > to start reviewing Jeff's synchronized scans patch now. > > Here are the results of the DBT-2 tests: > > http://community.enterprisedb.com/seqscan/imola/ > > In each of these tests, at the end of rampup a script is started that > issues a "SELECT COUNT(*) FROM stock" in a loop, with 2 minute delay > between end of previous query and start of next one. > > The patch makes the seq scans go significantly faster. In the 1 hour > test period, the patched tests perform roughly 30-100% as many selects > as unpatched tests. > > With 100 and 105 warehouses, it also significantly reduces the impact of > the seq scan on other queries; response times are lower with the patch. > With 120 warehouses the reduction of impact is not as clear, but when > you plot the response times it's still there (the plots on the "response > times charts"-page are useless because they're overwhelmed by the > checkpoint spike). > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Not yet, there's still one issue that needs fixing. Bruce Momjian wrote: > Great. Based on this, do you have a patch that is ready to apply. > > --------------------------------------------------------------------------- > > Heikki Linnakangas wrote: >> Heikki Linnakangas wrote: >>> In any case, I'd like to see more test results before we make a >>> decision. I'm running tests with DBT-2 and a seq scan running in the >>> background to see if the cache-spoiling effect shows up. I'm also trying >>> to get hold of some bigger hardware to run on. Running these tests takes >>> some calendar time, but the hard work has already been done. I'm going >>> to start reviewing Jeff's synchronized scans patch now. >> Here are the results of the DBT-2 tests: >> >> http://community.enterprisedb.com/seqscan/imola/ >> >> In each of these tests, at the end of rampup a script is started that >> issues a "SELECT COUNT(*) FROM stock" in a loop, with 2 minute delay >> between end of previous query and start of next one. >> >> The patch makes the seq scans go significantly faster. In the 1 hour >> test period, the patched tests perform roughly 30-100% as many selects >> as unpatched tests. >> >> With 100 and 105 warehouses, it also significantly reduces the impact of >> the seq scan on other queries; response times are lower with the patch. >> With 120 warehouses the reduction of impact is not as clear, but when >> you plot the response times it's still there (the plots on the "response >> times charts"-page are useless because they're overwhelmed by the >> checkpoint spike). >> >> -- >> Heikki Linnakangas >> EnterpriseDB http://www.enterprisedb.com >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 5: don't forget to increase your free space map settings > -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
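The remaining issue is the infinite loop flagged at the top of the earlier mail: in AP_BULKREAD mode, if every candidate buffer would need a WAL flush before it could be written out, rejecting each one and retrying never terminates. The version posted below handles this by remembering whether the candidate came from the backend-private ring and only allowing ring buffers to be rejected; once the search falls back to the shared clock sweep, the backend just flushes WAL and moves on. The toy program here merely restates that decision in isolation — the real StrategyRejectBuffer() in the patch below operates on BufferDesc pointers and the actual ring state.

#include <stdbool.h>
#include <stdio.h>

#define BUF_ID_NOT_SET  -1

/* Cut-down model of the backend-private ring state from the patch below. */
static int  BufferRing[32];
static int  RingCurSlot = 0;
static bool lastBufferCameFromRing = false;   /* set by StrategyGetBuffer() */

/*
 * Returns true if the buffer manager should pick another victim, false if
 * it should flush WAL and reuse the buffer it already has.
 */
static bool
strategy_reject_buffer(void)
{
    /*
     * Never reject a buffer that came from the shared clock sweep: in the
     * worst case every buffer in the cache needs a WAL flush, and refusing
     * them all is exactly the endless loop being fixed.
     */
    if (!lastBufferCameFromRing)
        return false;

    /* Ring buffer: drop it from the ring and ask for a new victim. */
    BufferRing[RingCurSlot] = BUF_ID_NOT_SET;
    return true;
}

int
main(void)
{
    lastBufferCameFromRing = true;
    printf("ring buffer rejected?   %s\n", strategy_reject_buffer() ? "yes" : "no");

    lastBufferCameFromRing = false;
    printf("shared buffer rejected? %s\n", strategy_reject_buffer() ? "yes" : "no");
    return 0;
}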
Here's a new version, all known issues are now fixed. I'm now happy with this patch. Next, I'll start looking at the latest version of Jeff's synchronized scans patch. Bruce Momjian wrote: > Great. Based on this, do you have a patch that is ready to apply. > > --------------------------------------------------------------------------- > > Heikki Linnakangas wrote: >> Heikki Linnakangas wrote: >>> In any case, I'd like to see more test results before we make a >>> decision. I'm running tests with DBT-2 and a seq scan running in the >>> background to see if the cache-spoiling effect shows up. I'm also trying >>> to get hold of some bigger hardware to run on. Running these tests takes >>> some calendar time, but the hard work has already been done. I'm going >>> to start reviewing Jeff's synchronized scans patch now. >> Here are the results of the DBT-2 tests: >> >> http://community.enterprisedb.com/seqscan/imola/ >> >> In each of these tests, at the end of rampup a script is started that >> issues a "SELECT COUNT(*) FROM stock" in a loop, with 2 minute delay >> between end of previous query and start of next one. >> >> The patch makes the seq scans go significantly faster. In the 1 hour >> test period, the patched tests perform roughly 30-100% as many selects >> as unpatched tests. >> >> With 100 and 105 warehouses, it also significantly reduces the impact of >> the seq scan on other queries; response times are lower with the patch. >> With 120 warehouses the reduction of impact is not as clear, but when >> you plot the response times it's still there (the plots on the "response >> times charts"-page are useless because they're overwhelmed by the >> checkpoint spike). >> >> -- >> Heikki Linnakangas >> EnterpriseDB http://www.enterprisedb.com >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 5: don't forget to increase your free space map settings > -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com Index: src/backend/access/heap/heapam.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v retrieving revision 1.232 diff -c -r1.232 heapam.c *** src/backend/access/heap/heapam.c 8 Apr 2007 01:26:27 -0000 1.232 --- src/backend/access/heap/heapam.c 25 May 2007 19:22:33 -0000 *************** *** 83,88 **** --- 83,96 ---- */ scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd); + /* A scan on a table smaller than shared_buffers is treated like random + * access, but bigger scans use the bulk read page replacement policy. 
+ */ + if (scan->rs_nblocks > NBuffers) + scan->rs_accesspattern = AP_BULKREAD; + else + scan->rs_accesspattern = AP_NORMAL; + scan->rs_inited = false; scan->rs_ctup.t_data = NULL; ItemPointerSetInvalid(&scan->rs_ctup.t_self); *************** *** 123,133 **** --- 131,146 ---- Assert(page < scan->rs_nblocks); + /* Read the page with the right strategy */ + SetAccessPattern(scan->rs_accesspattern); + scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf, scan->rs_rd, page); scan->rs_cblock = page; + SetAccessPattern(AP_NORMAL); + if (!scan->rs_pageatatime) return; Index: src/backend/access/transam/xlog.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v retrieving revision 1.268 diff -c -r1.268 xlog.c *** src/backend/access/transam/xlog.c 30 Apr 2007 21:01:52 -0000 1.268 --- src/backend/access/transam/xlog.c 15 May 2007 16:23:30 -0000 *************** *** 1668,1673 **** --- 1668,1700 ---- } /* + * Returns true if 'record' hasn't been flushed to disk yet. + */ + bool + XLogNeedsFlush(XLogRecPtr record) + { + /* Quick exit if already known flushed */ + if (XLByteLE(record, LogwrtResult.Flush)) + return false; + + /* read LogwrtResult and update local state */ + { + /* use volatile pointer to prevent code rearrangement */ + volatile XLogCtlData *xlogctl = XLogCtl; + + SpinLockAcquire(&xlogctl->info_lck); + LogwrtResult = xlogctl->LogwrtResult; + SpinLockRelease(&xlogctl->info_lck); + } + + /* check again */ + if (XLByteLE(record, LogwrtResult.Flush)) + return false; + + return true; + } + + /* * Ensure that all XLOG data through the given position is flushed to disk. * * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not Index: src/backend/commands/copy.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v retrieving revision 1.283 diff -c -r1.283 copy.c *** src/backend/commands/copy.c 27 Apr 2007 22:05:46 -0000 1.283 --- src/backend/commands/copy.c 15 May 2007 17:05:29 -0000 *************** *** 1876,1881 **** --- 1876,1888 ---- nfields = file_has_oids ? (attr_count + 1) : attr_count; field_strings = (char **) palloc(nfields * sizeof(char *)); + /* Use the special COPY buffer replacement strategy if WAL-logging + * is enabled. If it's not, the pages we're writing are dirty but + * don't need a WAL flush to write out, so the BULKREAD strategy + * is more suitable. + */ + SetAccessPattern(use_wal ? AP_COPY : AP_BULKREAD); + /* Initialize state variables */ cstate->fe_eof = false; cstate->eol_type = EOL_UNKNOWN; *************** *** 2161,2166 **** --- 2168,2176 ---- cstate->filename))); } + /* Reset buffer replacement strategy */ + SetAccessPattern(AP_NORMAL); + /* * If we skipped writing WAL, then we need to sync the heap (but not * indexes since those use WAL anyway) Index: src/backend/commands/vacuum.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/vacuum.c,v retrieving revision 1.350 diff -c -r1.350 vacuum.c *** src/backend/commands/vacuum.c 16 Apr 2007 18:29:50 -0000 1.350 --- src/backend/commands/vacuum.c 15 May 2007 17:06:18 -0000 *************** *** 421,431 **** * Tell the buffer replacement strategy that vacuum is causing * the IO */ ! StrategyHintVacuum(true); analyze_rel(relid, vacstmt); ! 
StrategyHintVacuum(false); if (use_own_xacts) CommitTransactionCommand(); --- 421,431 ---- * Tell the buffer replacement strategy that vacuum is causing * the IO */ ! SetAccessPattern(AP_VACUUM); analyze_rel(relid, vacstmt); ! SetAccessPattern(AP_NORMAL); if (use_own_xacts) CommitTransactionCommand(); *************** *** 442,448 **** /* Make sure cost accounting is turned off after error */ VacuumCostActive = false; /* And reset buffer replacement strategy, too */ ! StrategyHintVacuum(false); PG_RE_THROW(); } PG_END_TRY(); --- 442,448 ---- /* Make sure cost accounting is turned off after error */ VacuumCostActive = false; /* And reset buffer replacement strategy, too */ ! SetAccessPattern(AP_NORMAL); PG_RE_THROW(); } PG_END_TRY(); *************** *** 1088,1094 **** * Tell the cache replacement strategy that vacuum is causing all * following IO */ ! StrategyHintVacuum(true); /* * Do the actual work --- either FULL or "lazy" vacuum --- 1088,1094 ---- * Tell the cache replacement strategy that vacuum is causing all * following IO */ ! SetAccessPattern(AP_VACUUM); /* * Do the actual work --- either FULL or "lazy" vacuum *************** *** 1098,1104 **** else lazy_vacuum_rel(onerel, vacstmt); ! StrategyHintVacuum(false); /* all done with this class, but hold lock until commit */ relation_close(onerel, NoLock); --- 1098,1104 ---- else lazy_vacuum_rel(onerel, vacstmt); ! SetAccessPattern(AP_NORMAL); /* all done with this class, but hold lock until commit */ relation_close(onerel, NoLock); Index: src/backend/storage/buffer/README =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/README,v retrieving revision 1.11 diff -c -r1.11 README *** src/backend/storage/buffer/README 23 Jul 2006 03:07:58 -0000 1.11 --- src/backend/storage/buffer/README 16 May 2007 11:43:11 -0000 *************** *** 152,159 **** a field to show which backend is doing its I/O). ! Buffer replacement strategy ! --------------------------- There is a "free list" of buffers that are prime candidates for replacement. In particular, buffers that are completely free (contain no valid page) are --- 152,159 ---- a field to show which backend is doing its I/O). ! Normal buffer replacement strategy ! ---------------------------------- There is a "free list" of buffers that are prime candidates for replacement. In particular, buffers that are completely free (contain no valid page) are *************** *** 199,221 **** have to give up and try another buffer. This however is not a concern of the basic select-a-victim-buffer algorithm.) - A special provision is that while running VACUUM, a backend does not - increment the usage count on buffers it accesses. In fact, if ReleaseBuffer - sees that it is dropping the pin count to zero and the usage count is zero, - then it appends the buffer to the tail of the free list. (This implies that - VACUUM, but only VACUUM, must take the BufFreelistLock during ReleaseBuffer; - this shouldn't create much of a contention problem.) This provision - encourages VACUUM to work in a relatively small number of buffers rather - than blowing out the entire buffer cache. It is reasonable since a page - that has been touched only by VACUUM is unlikely to be needed again soon. 
- - Since VACUUM usually requests many pages very fast, the effect of this is that - it will get back the very buffers it filled and possibly modified on the next - call and will therefore do its work in a few shared memory buffers, while - being able to use whatever it finds in the cache already. This also implies - that most of the write traffic caused by a VACUUM will be done by the VACUUM - itself and not pushed off onto other processes. Background writer's processing ------------------------------ --- 199,243 ---- have to give up and try another buffer. This however is not a concern of the basic select-a-victim-buffer algorithm.) + Buffer ring replacement strategy + --------------------------------- + + When running a query that needs to access a large number of pages, like VACUUM, + COPY, or a large sequential scan, a different strategy is used. A page that + has been touched only by such a scan is unlikely to be needed again soon, so + instead of running the normal clock sweep algorithm and blowing out the entire + buffer cache, a small ring of buffers is allocated using the normal clock sweep + algorithm and those buffers are reused for the whole scan. This also implies + that most of the write traffic caused by such a statement will be done by the + backend itself and not pushed off onto other processes. + + The size of the ring used depends on the kind of scan: + + For sequential scans, a small 256 KB ring is used. That's small enough to fit + in L2 cache, which makes transferring pages from OS cache to shared buffer + cache efficient. Even less would often be enough, but the ring must be big + enough to accommodate all pages in the scan that are pinned concurrently. + 256 KB should also be enough to leave a small cache trail for other backends to + join in a synchronized seq scan. If a buffer is dirtied and LSN set, the buffer + is removed from the ring and a replacement buffer is chosen using the normal + replacement strategy. In a scan that modifies every page in the scan, like a + bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and the + ring strategy effectively degrades to the normal strategy. + + VACUUM uses a 256 KB ring like sequential scans, but dirty pages are not + removed from the ring. WAL is flushed instead to allow reuse of the buffers. + Before introducing the buffer ring strategy in 8.3, buffers were put to the + freelist, which was effectively a buffer ring of 1 buffer. + + COPY behaves like VACUUM, but a much larger ring is used. The ring size is + chosen to be twice the WAL segment size. This avoids polluting the buffer cache + like the clock sweep would do, and using a ring larger than WAL segment size + avoids having to do any extra WAL flushes, since a WAL segment will always be + filled, forcing a WAL flush, before looping through the buffer ring and bumping + into a buffer that would force a WAL flush. However, for non-WAL-logged COPY + operations the smaller 256 KB ring is used because WAL flushes are not needed + to write the buffers. 
Background writer's processing ------------------------------ Index: src/backend/storage/buffer/bufmgr.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v retrieving revision 1.218 diff -c -r1.218 bufmgr.c *** src/backend/storage/buffer/bufmgr.c 2 May 2007 23:34:48 -0000 1.218 --- src/backend/storage/buffer/bufmgr.c 16 May 2007 12:34:10 -0000 *************** *** 419,431 **** /* Loop here in case we have to try another victim buffer */ for (;;) { /* * Select a victim buffer. The buffer is returned with its header * spinlock still held! Also the BufFreelistLock is still held, since * it would be bad to hold the spinlock while possibly waking up other * processes. */ ! buf = StrategyGetBuffer(); Assert(buf->refcount == 0); --- 419,433 ---- /* Loop here in case we have to try another victim buffer */ for (;;) { + bool lock_held; + /* * Select a victim buffer. The buffer is returned with its header * spinlock still held! Also the BufFreelistLock is still held, since * it would be bad to hold the spinlock while possibly waking up other * processes. */ ! buf = StrategyGetBuffer(&lock_held); Assert(buf->refcount == 0); *************** *** 436,442 **** PinBuffer_Locked(buf); /* Now it's safe to release the freelist lock */ ! LWLockRelease(BufFreelistLock); /* * If the buffer was dirty, try to write it out. There is a race --- 438,445 ---- PinBuffer_Locked(buf); /* Now it's safe to release the freelist lock */ ! if (lock_held) ! LWLockRelease(BufFreelistLock); /* * If the buffer was dirty, try to write it out. There is a race *************** *** 464,469 **** --- 467,489 ---- */ if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED)) { + /* In BULKREAD-mode, check if a WAL flush would be needed to + * evict this buffer. If so, ask the replacement strategy if + * we should go ahead and do it or choose another victim. + */ + if (active_access_pattern == AP_BULKREAD) + { + if (XLogNeedsFlush(BufferGetLSN(buf))) + { + if (StrategyRejectBuffer(buf)) + { + LWLockRelease(buf->content_lock); + UnpinBuffer(buf, true, false); + continue; + } + } + } + FlushBuffer(buf, NULL); LWLockRelease(buf->content_lock); } *************** *** 925,932 **** PrivateRefCount[b]--; if (PrivateRefCount[b] == 0) { - bool immed_free_buffer = false; - /* I'd better not still hold any locks on the buffer */ Assert(!LWLockHeldByMe(buf->content_lock)); Assert(!LWLockHeldByMe(buf->io_in_progress_lock)); --- 945,950 ---- *************** *** 940,956 **** /* Update buffer usage info, unless this is an internal access */ if (normalAccess) { ! if (!strategy_hint_vacuum) { ! if (buf->usage_count < BM_MAX_USAGE_COUNT) ! buf->usage_count++; } else ! { ! /* VACUUM accesses don't bump usage count, instead... */ ! if (buf->refcount == 0 && buf->usage_count == 0) ! immed_free_buffer = true; ! } } if ((buf->flags & BM_PIN_COUNT_WAITER) && --- 958,975 ---- /* Update buffer usage info, unless this is an internal access */ if (normalAccess) { ! if (active_access_pattern != AP_NORMAL) { ! /* We don't want large one-off scans like vacuum to inflate ! * the usage_count. We do want to set it to 1, though, to keep ! * other backends from hijacking it from the buffer ring. ! */ ! if (buf->usage_count == 0) ! buf->usage_count = 1; } else ! if (buf->usage_count < BM_MAX_USAGE_COUNT) ! 
buf->usage_count++; } if ((buf->flags & BM_PIN_COUNT_WAITER) && *************** *** 965,978 **** } else UnlockBufHdr(buf); - - /* - * If VACUUM is releasing an otherwise-unused buffer, send it to the - * freelist for near-term reuse. We put it at the tail so that it - * won't be used before any invalid buffers that may exist. - */ - if (immed_free_buffer) - StrategyFreeBuffer(buf, false); } } --- 984,989 ---- Index: src/backend/storage/buffer/freelist.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/freelist.c,v retrieving revision 1.58 diff -c -r1.58 freelist.c *** src/backend/storage/buffer/freelist.c 5 Jan 2007 22:19:37 -0000 1.58 --- src/backend/storage/buffer/freelist.c 25 May 2007 19:25:16 -0000 *************** *** 18,23 **** --- 18,25 ---- #include "storage/buf_internals.h" #include "storage/bufmgr.h" + #include "utils/memutils.h" + /* * The shared freelist control information. *************** *** 39,47 **** /* Pointers to shared state */ static BufferStrategyControl *StrategyControl = NULL; ! /* Backend-local state about whether currently vacuuming */ ! bool strategy_hint_vacuum = false; /* * StrategyGetBuffer --- 41,59 ---- /* Pointers to shared state */ static BufferStrategyControl *StrategyControl = NULL; ! /* Currently active access pattern hint. */ ! AccessPattern active_access_pattern = AP_NORMAL; + /* prototypes for internal functions */ + static volatile BufferDesc *GetBufferFromRing(void); + static void PutBufferToRing(volatile BufferDesc *buf); + static void InitRing(void); + + /* Did the last buffer returned by StrategyGetBuffer come from the buffer + * ring or from freelist/clock sweep? StrategyRejectBuffer needs to know + * that, see comments there. + */ + static bool lastBufferCameFromRing = false; /* * StrategyGetBuffer *************** *** 51,67 **** * the selected buffer must not currently be pinned by anyone. * * To ensure that no one else can pin the buffer before we do, we must ! * return the buffer with the buffer header spinlock still held. That ! * means that we return with the BufFreelistLock still held, as well; ! * the caller must release that lock once the spinlock is dropped. */ volatile BufferDesc * ! StrategyGetBuffer(void) { volatile BufferDesc *buf; int trycounter; LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); /* * Try to get a buffer from the freelist. Note that the freeNext fields --- 63,98 ---- * the selected buffer must not currently be pinned by anyone. * * To ensure that no one else can pin the buffer before we do, we must ! * return the buffer with the buffer header spinlock still held. If ! * *lock_held is set at return, we return with the BufFreelistLock still ! * held, as well; the caller must release that lock once the spinlock is ! * dropped. */ volatile BufferDesc * ! StrategyGetBuffer(bool *lock_held) { volatile BufferDesc *buf; int trycounter; + /* Get a buffer from the ring if we're doing a bulk scan */ + if (active_access_pattern != AP_NORMAL) + { + buf = GetBufferFromRing(); + if (buf != NULL) + { + *lock_held = false; + lastBufferCameFromRing = true; + return buf; + } + } + + lastBufferCameFromRing = false; + + /* + * If our selected buffer wasn't available, pick another... + */ LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); + *lock_held = true; /* * Try to get a buffer from the freelist. 
Note that the freeNext fields *************** *** 86,96 **** */ LockBufHdr(buf); if (buf->refcount == 0 && buf->usage_count == 0) return buf; UnlockBufHdr(buf); } ! /* Nothing on the freelist, so run the "clock sweep" algorithm */ trycounter = NBuffers; for (;;) { --- 117,131 ---- */ LockBufHdr(buf); if (buf->refcount == 0 && buf->usage_count == 0) + { + if (active_access_pattern != AP_NORMAL) + PutBufferToRing(buf); return buf; + } UnlockBufHdr(buf); } ! /* Nothing on the freelist, so run the shared "clock sweep" algorithm */ trycounter = NBuffers; for (;;) { *************** *** 105,111 **** --- 140,150 ---- */ LockBufHdr(buf); if (buf->refcount == 0 && buf->usage_count == 0) + { + if (active_access_pattern != AP_NORMAL) + PutBufferToRing(buf); return buf; + } if (buf->usage_count > 0) { buf->usage_count--; *************** *** 191,204 **** } /* ! * StrategyHintVacuum -- tell us whether VACUUM is active */ void ! StrategyHintVacuum(bool vacuum_active) { ! strategy_hint_vacuum = vacuum_active; ! } /* * StrategyShmemSize --- 230,254 ---- } /* ! * SetAccessPattern -- Sets the active access pattern hint ! * ! * Caller is responsible for resetting the hint to AP_NORMAL after the bulk ! * operation is done. It's ok to switch repeatedly between AP_NORMAL and one of ! * the other strategies, for example in a query with one large sequential scan ! * nested loop joined to an index scan. Index tuples should be fetched with the ! * normal strategy and the pages from the seq scan should be read in with the ! * AP_BULKREAD strategy. The ring won't be affected by such switching, however ! * switching to an access pattern with different ring size will invalidate the ! * old ring. */ void ! SetAccessPattern(AccessPattern new_pattern) { ! active_access_pattern = new_pattern; + if (active_access_pattern != AP_NORMAL) + InitRing(); + } /* * StrategyShmemSize *************** *** 274,276 **** --- 324,504 ---- else Assert(!init); } + + /* ---------------------------------------------------------------- + * Backend-private buffer ring management + * ---------------------------------------------------------------- + */ + + /* + * Ring sizes for different access patterns. See README for the rationale + * of these. + */ + #define BULKREAD_RING_SIZE 256 * 1024 / BLCKSZ + #define VACUUM_RING_SIZE 256 * 1024 / BLCKSZ + #define COPY_RING_SIZE Min(NBuffers / 8, (XLOG_SEG_SIZE / BLCKSZ) * 2) + + /* + * BufferRing is an array of buffer ids, and RingSize it's size in number of + * elements. It's allocated in TopMemoryContext the first time it's needed. + */ + static int *BufferRing = NULL; + static int RingSize = 0; + + /* Index of the "current" slot in the ring. It's advanced every time a buffer + * is handed out from the ring with GetBufferFromRing and it points to the + * last buffer returned from the ring. RingCurSlot + 1 is the next victim + * GetBufferRing will hand out. + */ + static int RingCurSlot = 0; + + /* magic value to mark empty slots in the ring */ + #define BUF_ID_NOT_SET -1 + + + /* + * GetBufferFromRing -- returns a buffer from the ring, or NULL if the + * ring is empty. + * + * The bufhdr spin lock is held on the returned buffer. 
+ */ + static volatile BufferDesc * + GetBufferFromRing(void) + { + volatile BufferDesc *buf; + + /* ring should be initialized by now */ + Assert(RingSize > 0 && BufferRing != NULL); + + /* Run private "clock cycle" */ + if (++RingCurSlot >= RingSize) + RingCurSlot = 0; + + /* + * If that slot hasn't been filled yet, tell the caller to allocate + * a new buffer with the normal allocation strategy. He will then + * fill this slot by calling PutBufferToRing with the new buffer. + */ + if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET) + return NULL; + + buf = &BufferDescriptors[BufferRing[RingCurSlot]]; + + /* + * If the buffer is pinned we cannot use it under any circumstances. + * If usage_count == 0 then the buffer is fair game. + * + * We also choose this buffer if usage_count == 1. Strictly, this + * might sometimes be the wrong thing to do, but we rely on the high + * probability that it was this process that last touched the buffer. + * If it wasn't, we'll choose a suboptimal victim, but it shouldn't + * make any difference in the big scheme of things. + * + */ + LockBufHdr(buf); + if (buf->refcount == 0 && buf->usage_count <= 1) + return buf; + UnlockBufHdr(buf); + + return NULL; + } + + /* + * PutBufferToRing -- adds a buffer to the buffer ring + * + * Caller must hold the buffer header spinlock on the buffer. + */ + static void + PutBufferToRing(volatile BufferDesc *buf) + { + /* ring should be initialized by now */ + Assert(RingSize > 0 && BufferRing != NULL); + + if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET) + BufferRing[RingCurSlot] = buf->buf_id; + + } + + /* + * Initializes a ring buffer with correct size for the currently + * active strategy. Does nothing if the ring already has the right size. + */ + static void + InitRing(void) + { + int new_size; + int old_size = RingSize; + int i; + MemoryContext oldcxt; + + /* Determine new size */ + + switch(active_access_pattern) + { + case AP_BULKREAD: + new_size = BULKREAD_RING_SIZE; + break; + case AP_COPY: + new_size = COPY_RING_SIZE; + break; + case AP_VACUUM: + new_size = VACUUM_RING_SIZE; + break; + default: + elog(ERROR, "unexpected buffer cache strategy %d", + active_access_pattern); + return; /* keep compile happy */ + } + + /* + * Seq scans set and reset the strategy on every page, so we better exit + * quickly if no change in size is needed. + */ + if (new_size == old_size) + return; + + /* Allocate array */ + + oldcxt = MemoryContextSwitchTo(TopMemoryContext); + + if (old_size == 0) + { + Assert(BufferRing == NULL); + BufferRing = palloc(new_size * sizeof(int)); + } + else + BufferRing = repalloc(BufferRing, new_size * sizeof(int)); + + MemoryContextSwitchTo(oldcxt); + + for(i = 0; i < new_size; i++) + BufferRing[i] = BUF_ID_NOT_SET; + + RingCurSlot = 0; + RingSize = new_size; + } + + /* + * In AP_BULKREAD mode, the buffer manager calls this function when it turns + * out that the buffer handed to it by StrategyGetBuffer needs a WAL flush to + * write out. That gives us a second chance to choose another victim. + * + * Returns true if buffer manager should ask for a new victim, and false + * if WAL should be flushed and this buffer used. + */ + bool + StrategyRejectBuffer(volatile BufferDesc *buf) + { + Assert(RingSize > 0); + + /* If the buffer didn't come from the ring, we tell the caller to do the + * WAL flush and use the buffer. 
We don't want to mess with how the clock + * sweep works; in worst case there's no buffers in the buffer cache that + * can be reused without a WAL flush, and we'd get into an endless loop + * trying. + */ + if (lastBufferCameFromRing) + BufferRing[RingCurSlot] = BUF_ID_NOT_SET; + + return lastBufferCameFromRing; + } Index: src/include/access/relscan.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v retrieving revision 1.52 diff -c -r1.52 relscan.h *** src/include/access/relscan.h 20 Jan 2007 18:43:35 -0000 1.52 --- src/include/access/relscan.h 15 May 2007 17:01:31 -0000 *************** *** 28,33 **** --- 28,34 ---- ScanKey rs_key; /* array of scan key descriptors */ BlockNumber rs_nblocks; /* number of blocks to scan */ bool rs_pageatatime; /* verify visibility page-at-a-time? */ + AccessPattern rs_accesspattern; /* access pattern to use for reads */ /* scan current state */ bool rs_inited; /* false = scan not init'd yet */ Index: src/include/access/xlog.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v retrieving revision 1.76 diff -c -r1.76 xlog.h *** src/include/access/xlog.h 5 Jan 2007 22:19:51 -0000 1.76 --- src/include/access/xlog.h 14 May 2007 21:22:40 -0000 *************** *** 151,156 **** --- 151,157 ---- extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata); extern void XLogFlush(XLogRecPtr RecPtr); + extern bool XLogNeedsFlush(XLogRecPtr RecPtr); extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record); extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec); Index: src/include/storage/buf_internals.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v retrieving revision 1.89 diff -c -r1.89 buf_internals.h *** src/include/storage/buf_internals.h 5 Jan 2007 22:19:57 -0000 1.89 --- src/include/storage/buf_internals.h 15 May 2007 17:07:59 -0000 *************** *** 16,21 **** --- 16,22 ---- #define BUFMGR_INTERNALS_H #include "storage/buf.h" + #include "storage/bufmgr.h" #include "storage/lwlock.h" #include "storage/shmem.h" #include "storage/spin.h" *************** *** 168,174 **** extern BufferDesc *LocalBufferDescriptors; /* in freelist.c */ ! extern bool strategy_hint_vacuum; /* event counters in buf_init.c */ extern long int ReadBufferCount; --- 169,175 ---- extern BufferDesc *LocalBufferDescriptors; /* in freelist.c */ ! extern AccessPattern active_access_pattern; /* event counters in buf_init.c */ extern long int ReadBufferCount; *************** *** 184,195 **** */ /* freelist.c */ ! extern volatile BufferDesc *StrategyGetBuffer(void); extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head); extern int StrategySyncStart(void); extern Size StrategyShmemSize(void); extern void StrategyInitialize(bool init); /* buf_table.c */ extern Size BufTableShmemSize(int size); extern void InitBufTable(int size); --- 185,198 ---- */ /* freelist.c */ ! 
extern volatile BufferDesc *StrategyGetBuffer(bool *lock_held); extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head); extern int StrategySyncStart(void); extern Size StrategyShmemSize(void); extern void StrategyInitialize(bool init); + extern bool StrategyRejectBuffer(volatile BufferDesc *buf); + /* buf_table.c */ extern Size BufTableShmemSize(int size); extern void InitBufTable(int size); Index: src/include/storage/bufmgr.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v retrieving revision 1.103 diff -c -r1.103 bufmgr.h *** src/include/storage/bufmgr.h 2 May 2007 23:18:03 -0000 1.103 --- src/include/storage/bufmgr.h 25 May 2007 19:23:37 -0000 *************** *** 49,54 **** --- 49,67 ---- #define BUFFER_LOCK_EXCLUSIVE 2 /* + * For better cache efficiency, we use different buffer replacement strategies + * for different kinds of access patterns. Use SetAccessPattern to hint the + * buffer manager which kind of access we're doing at the moment. + */ + typedef enum AccessPattern + { + AP_NORMAL, /* Normal random access */ + AP_BULKREAD, /* Large read-only scan (hint bit updates are ok) */ + AP_COPY, /* Large updating scan, like COPY with WAL enabled */ + AP_VACUUM, /* VACUUM */ + } AccessPattern; + + /* * These routines are beaten on quite heavily, hence the macroization. */ *************** *** 157,162 **** extern void AtProcExit_LocalBuffers(void); /* in freelist.c */ ! extern void StrategyHintVacuum(bool vacuum_active); #endif --- 170,175 ---- extern void AtProcExit_LocalBuffers(void); /* in freelist.c */ ! extern void SetAccessPattern(AccessPattern new_pattern); #endif
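Because the access-pattern hint is a backend-global variable, every caller of SetAccessPattern() is responsible for putting it back to AP_NORMAL, including when an error aborts the bulk operation partway through; the vacuum.c hunks above do this in their PG_CATCH block. The sketch below only illustrates that calling convention: my_bulk_operation() and its worker function are made up, and only SetAccessPattern() and the AP_* values come from the patch.

#include "postgres.h"
#include "storage/bufmgr.h"     /* SetAccessPattern(), AP_* (per the patch) */
#include "utils/rel.h"

static void do_the_actual_scan(Relation rel);   /* hypothetical worker */

static void
my_bulk_operation(Relation rel)
{
    /* Hint the buffer manager that a large read-only scan is starting. */
    SetAccessPattern(AP_BULKREAD);

    PG_TRY();
    {
        do_the_actual_scan(rel);
    }
    PG_CATCH();
    {
        /* Make sure the hint doesn't leak past an error. */
        SetAccessPattern(AP_NORMAL);
        PG_RE_THROW();
    }
    PG_END_TRY();

    SetAccessPattern(AP_NORMAL);
}

The heapam.c hunk takes the other extreme and flips the hint around every page read; the comment on InitRing() notes that SetAccessPattern() is meant to be cheap when the ring size does not change, so that pattern is acceptable too.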
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Here's a new version, all known issues are now fixed. I'm now happy with > this patch. > Next, I'll start looking at the latest version of Jeff's synchronized > scans patch. I'm a bit confused --- weren't you intending to review these in parallel because of the possible interactions? Do you think this should be applied now, or does it need to wait to see what happens with Jeff's patch? regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> Here's a new version, all known issues are now fixed. I'm now happy with >> this patch. >> Next, I'll start looking at the latest version of Jeff's synchronized >> scans patch. > > I'm a bit confused --- weren't you intending to review these in parallel > because of the possible interactions? Do you think this should be > applied now, or does it need to wait to see what happens with Jeff's > patch? I think it should be applied now. I've briefly looked at Jeff's patch and I don't see any problems looming there. Jeff performed tests with Simon's original patch and his patch, and I think the results from those tests are still valid since the basic behavior hasn't been changed. I'll repeat those tests myself, and run some more to see if the added CPU overhead shows up in tests, but I'm pretty confident the patches will work together as advertised. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Here's a new version, all known issues are now fixed. I'm now happy with > this patch. I'm looking this over and finding it fairly ugly from a system-structural point of view. In particular, this has pushed the single-global-variable StrategyHintVacuum() concept well past the breaking point. That API was sort of tolerable for vacuum, since vacuum isn't re-entrant and can't coexist with any other behavior within the same backend; though I never liked the fact that it affected vacuum's system-catalog accesses along with the intended table fetches. But now you've got bits of code hacking the access pattern in contexts that are far less predictable than vacuum ... and not guaranteeing to reset the access pattern on failure, either. I think we've got to get rid of the global variable and make the access pattern be a parameter to StrategyGetBuffer, instead. Which in turn suggests a variant ReadBuffer, maybe ReadBufferWithStrategy(), at the next level up. This would serve directly for the heapscan case, and for the main heap accesses in vacuum/analyze, but it might be a bit of a pain to make index vacuuming pay attention to the strategy. The other case that the patch addresses is COPY TO REL, which we could handle if we were willing to pass a strategy parameter down through heap_insert and RelationGetBufferForTuple ... is it worth the trouble? I don't recall anyone having presented measurements to show COPY TO REL being helped by this patch. I don't especially care for the ring data structure being global inside freelist.c, either. What I'm inclined to do is have the "access strategy parameter" actually be a pointer to some struct type, with NULL representing the default strategy. At the beginning of a bulk operation we'd create an access strategy object, which would be internally the current Ring data structure, but it'd be opaque to code outside freelist.c. This'd mean that a query using two large seqscans would use two rings not one, but that's not necessarily bad; it would in any case make it a lot easier to generalize the strategy concept in future to support more than one kind of non-default policy being active in different parts of a query. A point I have not figured out how to deal with is that in the patch as given, UnpinBuffer needs to know the strategy; and getting it that info would make the changes far more invasive. But the patch's behavior here is pretty risky anyway, since the strategy global at the time of unpin might have nothing to do with how it was set when the buffer was acquired. What I'm tempted to do is remove the special case there and adjust buffer acquisition instead (maybe make it decrement the usage_count when re-using a buffer from the ring). Comments? I'm still playing with the ideas at this point... regards, tom lane
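For concreteness, the kind of interface being floated here might look something like the following declarations-only sketch. The names GetAccessStrategy, FreeAccessStrategy and ReadBufferWithStrategy are illustrative rather than a settled API; the other types are the usual PostgreSQL ones plus the AccessPattern enum from the patch above.

/* Opaque outside freelist.c; internally it would hold the current Ring. */
typedef struct BufferAccessStrategyData *BufferAccessStrategy;

/* Create/destroy a ring-based strategy at the start/end of a bulk operation;
 * NULL means the ordinary clock-sweep replacement policy. */
extern BufferAccessStrategy GetAccessStrategy(AccessPattern pattern);
extern void FreeAccessStrategy(BufferAccessStrategy strategy);

/* ReadBuffer variant that routes replacement decisions through the
 * caller-supplied strategy instead of a backend-global flag. */
extern Buffer ReadBufferWithStrategy(Relation reln, BlockNumber blockNum,
									 BufferAccessStrategy strategy);

/* StrategyGetBuffer would take the strategy as an argument rather than
 * consulting a global access-pattern variable. */
extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
											  bool *lock_held);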
Tom Lane wrote: > ... and not guaranteeing to reset the access pattern on failure, either. Good catch, I thought I had that covered but apparently not. > I think we've got to get rid of the global variable and make the access > pattern be a parameter to StrategyGetBuffer, instead. Which in turn > suggests a variant ReadBuffer, maybe ReadBufferWithStrategy(), at the > next level up. This would serve directly for the heapscan case, and > for the main heap accesses in vacuum/analyze, but it might be a bit > of a pain to make index vacuuming pay attention to the strategy. I thought of that as well, but settled on the global variable for that same reason, the UnpinBuffer problem, and the fact that we'd end up with quite a few different ReadBuffer variants: ReadBuffer, ReadOrZeroBuffer, ReadBufferWithStrategy, ReleaseAndReadBuffer, ReleaseAndReadBufferWithStrategy (if we care about that). Index vacuum code for all the indexams could be changed to use ReadBufferWithStrategy if we took that approach, though there are quite a lot of those calls in total. We could use a hybrid approach, where you could use ReadBufferWithStrategy to give the strategy on a page-by-page basis, or use SetStrategyHintVacuum/SetAccessPattern to set it for all subsequent ReadBuffer calls. > The other case that the patch addresses is COPY TO REL, which we could > handle if we were willing to pass a strategy parameter down through > heap_insert and RelationGetBufferForTuple ... is it worth the trouble? > I don't recall anyone having presented measurements to show COPY TO REL > being helped by this patch. The theory was that it'll reduce the cache-spoiling of COPY, just like it does for selects. I can run tests on that. > I don't especially care for the ring data structure being global inside > freelist.c, either. What I'm inclined to do is have the "access > strategy parameter" actually be a pointer to some struct type, with NULL > representing the default strategy. At the beginning of a bulk operation > we'd create an access strategy object, which would be internally the > current Ring data structure, but it'd be opaque to code outside > freelist.c. This'd mean that a query using two large seqscans would use > two rings not one, but that's not necessarily bad; it would in any case > make it a lot easier to generalize the strategy concept in future to > support more than one kind of non-default policy being active in > different parts of a query. I thought about the two large scans scenario but came to the conclusion that it's not possible to have two large scans active at the same time in a single backend. You could have a query like "SELECT * FROM bigtable_1, bigtable_2" with a Cartesian product, or something with a function that scans a large table, but even then one of the scans would progress very slowly compared to the other scan, and using more than one ring buffer would make no difference. I agree on the generalization argument, though. > A point I have not figured out how to deal with is that in the patch as > given, UnpinBuffer needs to know the strategy; and getting it that info > would make the changes far more invasive. But the patch's behavior here > is pretty risky anyway, since the strategy global at the time of unpin > might have nothing to do with how it was set when the buffer was > acquired. What I'm tempted to do is remove the special case there and > adjust buffer acquisition instead (maybe make it decrement the > usage_count when re-using a buffer from the ring).
It's assumed that the same strategy is used when unpinning, which is true for the current usage (and apparently needs to be documented). Normally buffers that are in the ring are recycled quickly, in which case the usage_count will always be increased from 0->1 in UnpinBuffer, regardless of the check. The check is there to avoid inflating the usage_count on buffers that happened to be already in shared_buffers. It might not be that important for the selects, but that's what keeps COPY from bumping the usage_count all the way up to 5. If we figure out another way to deal with the COPY usage_count, maybe we could remove the check altogether. I've been thinking of changing COPY anyway so that it always kept the last page it inserted to pinned, to avoid the traffic to buffer manager on each tuple. > Comments? I'm still playing with the ideas at this point... Let me know if you want me to make changes and submit a new version. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
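As a toy illustration of the unpin-time check being described (a simplification inferred from the discussion, not the patch's actual UnpinBuffer code): under a bulk access pattern the usage count is only allowed to rise from 0 to 1, so a buffer that was already resident in shared_buffers doesn't get its count inflated toward the maximum by a COPY or seq scan touching it repeatedly.

#define BM_MAX_USAGE_COUNT	5

typedef enum { AP_NORMAL, AP_BULKREAD, AP_COPY, AP_VACUUM } AccessPattern;

static void
bump_usage_on_unpin(unsigned *usage_count, AccessPattern active_pattern)
{
	if (active_pattern == AP_NORMAL)
	{
		/* normal behaviour: the count climbs toward the cap on every unpin */
		if (*usage_count < BM_MAX_USAGE_COUNT)
			(*usage_count)++;
	}
	else if (*usage_count == 0)
	{
		/* bulk scan: keep the buffer from being reclaimed immediately,
		 * but don't reward repeated bulk touches */
		*usage_count = 1;
	}
}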
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> A point I have not figured out how to deal with is that in the patch as >> given, UnpinBuffer needs to know the strategy; and getting it that info >> would make the changes far more invasive. But the patch's behavior here >> is pretty risky anyway, since the strategy global at the time of unpin >> might have nothing to do with how it was set when the buffer was >> acquired. > It's assumed that the same strategy is used when unpinning, which is > true for the current usage (and apparently needs to be documented). I don't believe that for a moment. Even in the trivial heapscan case, the last pin is typically dropped during a tupleslot clear operation sometime after the scan itself has moved on to the next page. In a more general context (such as parallel heapscan and indexscan on a rel) it's certainly too risky to assume it. > Normally buffers that are in the ring are recycled quickly, in which > case the usage_count will always be increased from 0->1 in UnpinBuffer, > regardless of the check. The check is there to avoid inflating the > usage_count on buffers that happened to be already in shared_buffers. A heapscan would pin the buffer only once and hence bump its count at most once, so I don't see a big problem here. Also, I'd argue that buffers that had a positive usage_count shouldn't get sucked into the ring to begin with. > If we figure out another way to deal with the COPY usage_count, maybe we > could remove the check altogether. I've been thinking of changing COPY > anyway so that it always kept the last page it inserted to pinned, to > avoid the traffic to buffer manager on each tuple. This is getting fairly invasive for a part of the patch you've not justified at all yet... > Let me know if you want me to make changes and submit a new version. I can work on this stuff; please do the tests to show whether it's worth handling COPY TO REL, and keep on with Jeff's patch. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> It's assumed that the same strategy is used when unpinning, which is >> true for the current usage (and apparently needs to be documented). > > I don't believe that for a moment. Even in the trivial heapscan case, > the last pin is typically dropped during a tupleslot clear operation > sometime after the scan itself has moved on to the next page. In a more > general context (such as parallel heapscan and indexscan on a rel) it's > certainly too risky to assume it. Hmm, I guess you're right. :( One idea is to keep track of which pins are taken using the bulk strategy. It's a bit tricky when a buffer is pinned multiple times since we don't know which ReleaseBuffer corresponds to which ReadBuffer, but perhaps we could get away with just a flag per pinned buffer. Set the flag when a buffer is pinned with bulk strategy and it wasn't pinned by us before, and clear it when it's pinned with another strategy. I'm thinking we steal one bit from PrivateRefCount for this. >> Normally buffers that are in the ring are recycled quickly, in which >> case the usage_count will always be increased from 0->1 in UnpinBuffer, >> regardless of the check. The check is there to avoid inflating the >> usage_count on buffers that happened to be already in shared_buffers. > > A heapscan would pin the buffer only once and hence bump its count at > most once, so I don't see a big problem here. Also, I'd argue that > buffers that had a positive usage_count shouldn't get sucked into the > ring to begin with. True, except that with the synchronized scans patch two synchronized scans will pin the buffer twice. >> If we figure out another way to deal with the COPY usage_count, maybe we >> could remove the check altogether. I've been thinking of changing COPY >> anyway so that it always kept the last page it inserted to pinned, to >> avoid the traffic to buffer manager on each tuple. > > This is getting fairly invasive for a part of the patch you've not > justified at all yet... > >> Let me know if you want me to make changes and submit a new version. > > I can work on this stuff; please do the tests to show whether it's worth > handling COPY TO REL, and keep on with Jeff's patch. Ok. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > One idea is to keep track of which pins are taken using the bulk strategy. > It's a bit tricky when a buffer is pinned multiple times since we don't > know which ReleaseBuffer corresponds to which ReadBuffer, but perhaps we > could get away with just a flag per pinned buffer. Set the flag when a > buffer is pinned with bulk strategy and it wasn't pinned by us before, > and clear it when it's pinned with another strategy. I'm thinking we > steal one bit from PrivateRefCount for this. Seems like a mess. Why don't we just fix it so there's no need for different behavior at Unpin time? The facts on the ground are that the current patch's change in UnpinBuffer is a no-op anyway, because of the tupletableslot interference. The behavior I'm imagining is just that when we try to take a buffer from the ring, if its usage count exceeds 1 then drop it from the ring and get another buffer. 1 would be the expected case if no one had touched it since we last used it. >> A heapscan would pin the buffer only once and hence bump its count at >> most once, so I don't see a big problem here. Also, I'd argue that >> buffers that had a positive usage_count shouldn't get sucked into the >> ring to begin with. > True, except that with the synchronized scans patch two synchronized > scans will pin the buffer twice. Hmm. But we probably don't want the same buffer in two different backends' rings, either. You *sure* the sync-scan patch has no interaction with this one? One other question: I see the patch sets the threshold for switching from normal to ring-buffer heapscans at table size = NBuffers. Why so high? I'd have expected maybe at most NBuffers/4 or NBuffers/10. If you don't want a seqscan blowing out your buffer cache, you surely don't want it blowing out 90% of the cache either. regards, tom lane
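Sketched out, the alternative described above looks roughly like this (a simplified stand-in using a toy buffer descriptor, not actual bufmgr code):

typedef struct
{
	int			buf_id;
	unsigned	refcount;		/* pin count */
	unsigned	usage_count;
} ToyBufferDesc;

/*
 * Try to reuse the ring's current slot.  A usage_count of 0 or 1 is what we
 * expect if nobody has touched the page since our last pass; anything higher
 * (or a pin held by someone else) means the buffer has become interesting to
 * other backends, so kick it out of the ring and let the caller fall back to
 * the normal clock sweep.
 */
static ToyBufferDesc *
get_buffer_from_ring(ToyBufferDesc **ring, int slot)
{
	ToyBufferDesc *buf = ring[slot];

	if (buf == NULL)
		return NULL;			/* slot not populated yet */

	if (buf->refcount == 0 && buf->usage_count <= 1)
		return buf;				/* reuse it */

	ring[slot] = NULL;			/* drop it from the ring */
	return NULL;
}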
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > A point I have not figured out how to deal with is that in the patch as > given, UnpinBuffer needs to know the strategy; and getting it that info > would make the changes far more invasive. But the patch's behavior here > is pretty risky anyway, since the strategy global at the time of unpin > might have nothing to do with how it was set when the buffer was > acquired. What I'm tempted to do is remove the special case there and > adjust buffer acquisition instead (maybe make it decrement the > usage_count when re-using a buffer from the ring). Is there a reason UnpinBuffer has to be the one to increment the usage count anyways? Why can't ReadBuffer handle incrementing the count and just trust that it won't be decremented until the buffer is unpinned anyways? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > Is there a reason UnpinBuffer has to be the one to increment the usage count > anyways? Why can't ReadBuffer handle incrementing the count and just trust > that it won't be decremented until the buffer is unpinned anyways? That's a good question. I think the idea was that if we hold a buffer pinned for awhile (long enough that the bgwriter's clock sweep passes over it one or more times), we want the usage count decrementing to start when we release the pin, not when we acquire it. But maybe that could be fixed if the clock sweep doesn't touch the usage_count of a pinned buffer. Which in fact it may not do already --- didn't look. regards, tom lane
On Mon, 28 May 2007, Tom Lane wrote: > But maybe that could be fixed if the clock sweep doesn't touch the > usage_count of a pinned buffer. Which in fact it may not do already --- > didn't look. StrategyGetBuffer doesn't care whether the buffer is pinned or not; it decrements the usage_count regardless. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Tom Lane wrote: > Gregory Stark <stark@enterprisedb.com> writes: > > Is there a reason UnpinBuffer has to be the one to increment the usage count > > anyways? Why can't ReadBuffer handle incrementing the count and just trust > > that it won't be decremented until the buffer is unpinned anyways? > > That's a good question. I think the idea was that if we hold a buffer > pinned for awhile (long enough that the bgwriter's clock sweep passes > over it one or more times), we want the usage count decrementing to > start when we release the pin, not when we acquire it. But maybe that > could be fixed if the clock sweep doesn't touch the usage_count of a > pinned buffer. Which in fact it may not do already --- didn't look. It does -- in BgBufferSync the "all" scan calls SyncOneBuffer with skip_pinned=false. The "lru" scan does skip pinned buffers. -- Alvaro Herrera Developer, http://www.PostgreSQL.org/ "World domination is proceeding according to plan" (Andrew Morton)
Alvaro Herrera wrote: > Tom Lane wrote: >> Gregory Stark <stark@enterprisedb.com> writes: >>> Is there a reason UnpinBuffer has to be the one to increment the usage count >>> anyways? Why can't ReadBuffer handle incrementing the count and just trust >>> that it won't be decremented until the buffer is unpinned anyways? >> That's a good question. I think the idea was that if we hold a buffer >> pinned for awhile (long enough that the bgwriter's clock sweep passes >> over it one or more times), we want the usage count decrementing to >> start when we release the pin, not when we acquire it. But maybe that >> could be fixed if the clock sweep doesn't touch the usage_count of a >> pinned buffer. Which in fact it may not do already --- didn't look. > > It does -- in BgBufferSync the "all" scan calls SyncOneBuffer with > skip_pinned=false. The "lru" scan does skip pinned buffers. You're looking at the wrong place. StrategyGetBuffer drives the clock sweep, and it always decreases the usage_count, IOW it doesn't skip pinned buffers. SyncOneBuffer and BgBufferSync don't decrease the usage_count in any case. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
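For context, the sweep in question looks roughly like this (heavily condensed from the 8.2-era StrategyGetBuffer, with header locking and the freelist omitted); note that a pinned buffer still has its usage_count decremented, it just can't be returned as the victim:

typedef struct
{
	unsigned	refcount;		/* pin count */
	unsigned	usage_count;
} ToyBufferDesc;

static ToyBufferDesc *
clock_sweep(ToyBufferDesc *buffers, int nbuffers, int *next_victim)
{
	int			trycounter = nbuffers;

	for (;;)
	{
		ToyBufferDesc *buf = &buffers[*next_victim];

		*next_victim = (*next_victim + 1) % nbuffers;

		if (buf->refcount == 0 && buf->usage_count == 0)
			return buf;					/* usable victim */

		if (buf->usage_count > 0)
		{
			buf->usage_count--;			/* decremented whether pinned or not */
			trycounter = nbuffers;
		}
		else if (--trycounter == 0)
			return NULL;				/* everything is pinned */
	}
}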
On Mon, 2007-05-28 at 17:36 -0400, Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: > > One idea is to keep track which pins are taken using the bulk strategy. > > It's a bit tricky when a buffer is pinned multiple times since we don't > > know which ReleaseBuffer corresponds which ReadBuffer, but perhaps we > > could get away with just a flag per pinned buffer. Set the flag when a > > buffer is pinned with bulk strategy and it wasn't pinned by us before, > > and clear it when it's pinned with another strategy. I'm thinking we > > steal one bit from PrivateRefCount for this. > > Seems like a mess. Why don't we just fix it so there's no need for > different behavior at Unpin time? The facts on the ground are that > the current patch's change in UnpinBuffer is a no-op anyway, because > of the tupletableslot interference. > > The behavior I'm imagining is just that when we try to take a buffer > from the ring, if its usage count exceeds 1 then drop it from the ring > and get another buffer. 1 would be the expected case if no one had > touched it since we last used it. > > >> A heapscan would pin the buffer only once and hence bump its count at > >> most once, so I don't see a big problem here. Also, I'd argue that > >> buffers that had a positive usage_count shouldn't get sucked into the > >> ring to begin with. > > > True, except that with the synchronized scans patch two synchronized > > scans will pin the buffer twice. > > Hmm. But we probably don't want the same buffer in two different > backends' rings, either. You *sure* the sync-scan patch has no > interaction with this one? > I will run some tests again tonight, I think the interaction needs more testing than I did originally. Also, I'm not sure that the hardware I have is sufficient to test those cases. It looks like the case to worry about is when there are a large number of scans on the same table and the I/O system is fast enough that it causes lock contention on the buffers in the rings. Is this the case you're worried about? Also, keep in mind that I have added a SyncScanLock after I ran those tests. That could have an effect. Regards, Jeff Davis
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >>> A heapscan would pin the buffer only once and hence bump its count at >>> most once, so I don't see a big problem here. Also, I'd argue that >>> buffers that had a positive usage_count shouldn't get sucked into the >>> ring to begin with. > >> True, except that with the synchronized scans patch two synchronized >> scans will pin the buffer twice. > > Hmm. But we probably don't want the same buffer in two different > backends' rings, either. You *sure* the sync-scan patch has no > interaction with this one? A buffer is only put into the ring when it's requested with StrategyGetBuffer. If a page is requested with ReadBuffer, and it's in cache already, it won't be put in the ring. When multiple scanners are scanning synchronously, they will all use their own, distinct rings when reading buffers into the cache, and "peek" into the buffers in other backends' rings for pages that others read into the buffer cache. I'm going to test the interactions in more detail... > One other question: I see the patch sets the threshold for switching > from normal to ring-buffer heapscans at table size = NBuffers. Why > so high? I'd have expected maybe at most NBuffers/4 or NBuffers/10. > If you don't want a seqscan blowing out your buffer cache, you surely > don't want it blowing out 90% of the cache either. NBuffers is the maximum value that makes sense; if you're scanning more than NBuffers, the scan is definitely not going to fit in shared_buffers. Anything less than that and we might be causing harm to some use cases, so I chose that for the time being. Simon argued for a GUC variable, and Jeff's patch as it stands introduces one. I'm not sure we want it but if we do, we should use the same variable to control both the sync scan and cache replacement policy. It's essentially "how large a scan do you expect to fit in shared_buffers?" -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2007-05-29 at 17:43 -0700, Jeff Davis wrote: > > Hmm. But we probably don't want the same buffer in two different > > backends' rings, either. You *sure* the sync-scan patch has no > > interaction with this one? > > > > I will run some tests again tonight, I think the interaction needs more > testing than I did originally. Also, I'm not sure that the hardware I > have is sufficient to test those cases. > I ran some sanity tests last night with cvs head, plus my syncscan20-cvshead.patch, plus scan_recycle_buffers.v3.patch. It passed the sanity tests at least. I did see that there was more interference with sync_seqscan_threshold=0 (always on) and scan_recycle_buffers=0 (off) than I had previously seen with 8.2.4, so I will test again against 8.2.4 to see why that might be. The interference that I saw was still quite small, a scan moving concurrently with 9 other scans was about 10% slower than a scan running alone -- which is still very good compared with plain cvs head and no sync scan -- it's just not ideal. However, turning scan_recycle_buffers between 0 (off), 16, 32, and 128 didn't have much effect. At 32 it appeared to be about 1% worse during 10 scans, but that may have been noise. The other values I tried didn't have any difference that I could see. This was really just a quick sanity test, I think more hard data would be useful. Regards, Jeff Davis
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> One other question: I see the patch sets the threshold for switching >> from normal to ring-buffer heapscans at table size = NBuffers. Why >> so high? I'd have expected maybe at most NBuffers/4 or NBuffers/10. >> If you don't want a seqscan blowing out your buffer cache, you surely >> don't want it blowing out 90% of the cache either. > NBuffers is the maximum value that makes sense; if you're scanning more > than NBuffers, the scan is definitely not going to fit in > shared_buffers. Anything less than that and we might be causing harm to > some use cases, so I chose that for the time being. But the flip side of that is you're failing to provide the benefit of the patch in quite a lot of use-cases where it's clearly beneficial. I just don't believe that there are very many cases where people will want a heapscan to eat 90% of their cache. > Simon argued for a GUC variable, and Jeff's patch as it stands > introduces one. I'm not sure we want it but if we do, we should use the > same variable to control both the sync scan and cache replacement > policy. It's essentially "how large a scan do you expect to fit in > shared_buffers?" Well, let's do some experiments and see if there's really any point in varying the cutover. regards, tom lane
Jeff Davis wrote: > On Tue, 2007-05-29 at 17:43 -0700, Jeff Davis wrote: >>> Hmm. But we probably don't want the same buffer in two different >>> backends' rings, either. You *sure* the sync-scan patch has no >>> interaction with this one? >>> >> I will run some tests again tonight, I think the interaction needs more >> testing than I did originally. Also, I'm not sure that the hardware I >> have is sufficient to test those cases. >> > > I ran some sanity tests last night with cvs head, plus my syncscan20-cvshead.patch, plus scan_recycle_buffers.v3.patch. > > It passed the sanity tests at least. > > I did see that there was more interference with sync_seqscan_threshold=0 > (always on) and scan_recycle_buffers=0 (off) than I had previously seen > with 8.2.4, so I will test again against 8.2.4 to see why that might be. > The interference that I saw was still quite small, a scan moving > concurrently with 9 other scans was about 10% slower than a scan running > alone -- which is still very good compared with plain cvs head and no > sync scan -- it's just not ideal. > > However, turning scan_recycle_buffers between 0 (off), 16, 32, and 128 > didn't have much effect. At 32 it appeared to be about 1% worse during > 10 scans, but that may have been noise. The other values I tried didn't > have any difference that I could see. > > This was really just a quick sanity test, I think more hard data would > be useful. The interesting question is whether the small buffer ring is big enough to let all synchronized scans process a page before it's recycled. Keep an eye on pg_buffercache to see if it gets filled with pages from the table you're querying. I just ran a quick test with 4 concurrent scans on a dual-core system, and it looks like we do "leak" buffers from the rings because they're pinned at the time they would be recycled. A full scan of a 30GB table took just under 7 minutes, and starting after a postmaster restart it took ~4-5 minutes until all of the 320MB of shared_buffers were used. That means we're leaking a buffer from the ring very roughly on every 20-40th ReadBuffer call, but I'll put in some proper instrumentation and test with different configurations to get a better picture. The synchronized scans patch gives such a big benefit when it's applicable that I think some cache-spoiling is acceptable and in fact unavoidable in some scenarios. It's much better than 8.2 behavior anyway. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > I just ran a quick test with 4 concurrent scans on a dual-core system, > and it looks like we do "leak" buffers from the rings because they're > pinned at the time they would be recycled. Yeah, I noticed the same in some tests here. I think there's not a lot we can do about that; we don't have enough visibility into why someone else has the buffer pinned. Using a larger ring would help, by making it less probable that any other sync-scanning backend is so far behind as to still have the oldest element of our ring pinned. But if we do that we have the L2-cache-size effect to worry about. Is there any actual data backing up that it's useful to keep the ring fitting in L2, or is that just guesswork? In the sync-scan case the idea seems pretty bogus anyway, because the actual working set will be N backends' rings not just one. regards, tom lane
On Wed, 2007-05-30 at 15:56 -0400, Tom Lane wrote: > In > the sync-scan case the idea seems pretty bogus anyway, because the > actual working set will be N backends' rings not just one. I don't follow. Ideally, in the sync-scan case, the sets of buffers in the ring of different scans on the same relation will overlap completely, right? We might not be at the ideal, but the sets of buffers in the rings shouldn't be disjoint, should they? Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Wed, 2007-05-30 at 15:56 -0400, Tom Lane wrote: >> In the sync-scan case the idea seems pretty bogus anyway, because the >> actual working set will be N backends' rings not just one. > I don't follow. Ideally, in the sync-scan case, the sets of buffers in > the ring of different scans on the same relation will overlap > completely, right? > We might not be at the ideal, but the sets of buffers in the rings > shouldn't be disjoint, should they? According to Heikki's explanation here http://archives.postgresql.org/pgsql-patches/2007-05/msg00498.php each backend doing a heapscan would collect its own ring of buffers. You might have a few backends that are always followers, never leaders, and so never actually fetch any pages --- but for each backend that actually did any I/O there would be a separate ring. In practice I'd expect the lead would "change hands" pretty often and so you'd see all the backends accumulating their own rings. regards, tom lane
On Wed, 2007-05-30 at 17:45 -0400, Tom Lane wrote: > According to Heikki's explanation here > http://archives.postgresql.org/pgsql-patches/2007-05/msg00498.php > each backend doing a heapscan would collect its own ring of buffers. > You might have a few backends that are always followers, never leaders, > and so never actually fetch any pages --- but for each backend that > actually did any I/O there would be a separate ring. In practice I'd > expect the lead would "change hands" pretty often and so you'd see all > the backends accumulating their own rings. > Oh, I see what you mean. The rings will actually become sparsely populated with many concurrent scans, and therefore extend outside of the L2 cache. Regards, Jeff Davis
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> I just ran a quick test with 4 concurrent scans on a dual-core system, >> and it looks like we do "leak" buffers from the rings because they're >> pinned at the time they would be recycled. > > Yeah, I noticed the same in some tests here. I think there's not a lot > we can do about that; we don't have enough visibility into why someone > else has the buffer pinned. We could stash pinned buffers to some other list etc. and try them again later. But that gets a lot more complex. > Using a larger ring would help, by making it less probable that any > other sync-scanning backend is so far behind as to still have the oldest > element of our ring pinned. But if we do that we have the L2-cache-size > effect to worry about. Is there any actual data backing up that it's > useful to keep the ring fitting in L2, or is that just guesswork? In > the sync-scan case the idea seems pretty bogus anyway, because the > actual working set will be N backends' rings not just one. Yes, I tested different ring sizes here: http://archives.postgresql.org/pgsql-hackers/2007-05/msg00469.php The tests above showed the effect when reading a table from OS cache. I haven't seen direct evidence supporting Luke's claim that the ring makes scans of tables bigger than RAM go faster with bigger I/O hardware, because I don't have such hardware at hand. We did repeat the tests on different hardware however, and monitored the CPU usage with vmstat at the same time. The CPU usage was significantly lower with the patch, so I believe that with better I/O hardware the test would become limited by CPU and the patch would therefore make it go faster. BTW, we've been talking about the "L2 cache effect" but we don't really know for sure if the effect has anything to do with the L2 cache. But whatever it is, it's real. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hi All, On 5/31/07 12:40 AM, "Heikki Linnakangas" <heikki@enterprisedb.com> wrote: > BTW, we've been talking about the "L2 cache effect" but we don't really > know for sure if the effect has anything to do with the L2 cache. But > whatever it is, it's real. The mailing list archives contain ample evidence that: - it's definitely an L2 cache effect - on fast I/O hardware, tests show large benefits from keeping the ring in L2 I see no reason to re-open the discussion about these; can we accept these as fact and continue? - Luke
Luke Lonergan <llonergan@greenplum.com> writes: > The mailing list archives contain the ample evidence of: > - it's definitely an L2 cache effect > - on fast I/O hardware tests show large benefits of keeping the ring in L2 Please provide some specific pointers, because I don't remember that. regards, tom lane