Обсуждение: Vacuum thoughts
The more I think about this vacuum i/o problem, the more I think we have it wrong. The added i/o from vacuum really ought not be any worse than a single full table scan. And there are probably the occasional query doing full table scans already in those systems. For the folks having this issue, if you run "select count(*) from bigtable" is there as big a hit in transaction performance? On the other hand, does the vacuum performance hit kick in right away? Or only after it's been running for a bit? I think the other factor mentioned is actually the main problem: cache. The vacuum basically kills the kernel buffer cache by reading in every block of every table in the system. The difference between vacuum and a single "select count(*)" is that it does all the tables one after each other eventually overrunning the total cache available. If it's just a matter of all the read i/o from vacuum then we're best off sleeping for a few milliseconds every few kilobytes. If it's the cache then we're probably better off reading a few megabytes and then sleeping for several seconds to allow the other buffers to get touched and pushed back to the front of the LRU. Hm, I wonder if the amount of data to read between sleeps should be, something like 25% of the effective_cache_size, for example. -- greg
On Fri, 2003-10-17 at 16:22, Greg Stark wrote: > If it's just a matter of all the read i/o from vacuum then we're best off > sleeping for a few milliseconds every few kilobytes. If it's the cache then > we're probably better off reading a few megabytes and then sleeping for > several seconds to allow the other buffers to get touched and pushed back to > the front of the LRU. Uh, no -- if it is the cache, we're better off fixing the buffer replacement policy, not trying to hack around it. Replacement policies that don't suffer from sequential flooding are well known. -Neil
Neil Conway <neilc@samurai.com> writes:
> Uh, no -- if it is the cache, we're better off fixing the buffer
> replacement policy, not trying to hack around it.
If we can.  As long as we are largely depending on the kernel's buffer
cache, we may not be able to "just fix it" ...
        regards, tom lane
			
		Greg Stark wrote: > The more I think about this vacuum i/o problem, the more I think we have it > wrong. The added i/o from vacuum really ought not be any worse than a single > full table scan. And there are probably the occasional query doing full table > scans already in those systems. > > For the folks having this issue, if you run "select count(*) from bigtable" is > there as big a hit in transaction performance? On the other hand, does the > vacuum performance hit kick in right away? Or only after it's been running for > a bit? The vacuum cost is the same of a full scan table ( select count(*) ) ? Why not do a sort of "vacuum" if a scan table happen ( during a simple select that invole a full scan table for example )? Regards Gaetano Mendola
Gaetano Mendola wrote: > The vacuum cost is the same of a full scan table ( select count(*) ) ? > Why not do a sort of "vacuum" if a scan table happen ( during a simple > select that invole a full scan table for example )? I was thinking about it. How about vacuuming a page when it is been pushed out of postgresql buffer cache? It is is memory so not much IO is involved. Could it be an approach? Shridhar
Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes:
> I was thinking about it. How about vacuuming a page when it is been
> pushed out of postgresql buffer cache? It is is memory so not much IO
> is involved.
You keep ignoring the problem of removing index entries.  To vacuum an
individual page, you need to be willing to read in (and update) all
index pages that reference the tuples-to-be-deleted.  This is hardly
tenable when the reason for pushing the page out of buffer cache was so
that you could read in something else instead --- you don't have spare
buffer slots, and you don't want to do all that I/O (and the associated
WAL log entries) before you can read in the page you originally wanted.
The latter point is really the crux of the problem.  The point of having
the VACUUM process is to keep maintenance work out of the critical path
of foreground queries.  Anything that moves even part of that
maintenance work into the critical path is going to be a net loss.
        regards, tom lane
			
		Tom Lane wrote: > Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes: > >>I was thinking about it. How about vacuuming a page when it is been >>pushed out of postgresql buffer cache? It is is memory so not much IO >>is involved. > > > You keep ignoring the problem of removing index entries. To vacuum an > individual page, you need to be willing to read in (and update) all > index pages that reference the tuples-to-be-deleted. This is hardly > tenable when the reason for pushing the page out of buffer cache was so > that you could read in something else instead --- you don't have spare > buffer slots, and you don't want to do all that I/O (and the associated > WAL log entries) before you can read in the page you originally wanted. > > The latter point is really the crux of the problem. The point of having > the VACUUM process is to keep maintenance work out of the critical path > of foreground queries. Anything that moves even part of that > maintenance work into the critical path is going to be a net loss. So the problem is an index tuple does not store transaction id information like a heap tuple does and it can not deduce that an index tuple is dead unless it points to a dead heap tuple. Is that right? If an index tuple had transaction information duplicated along with heap tuple, two types of tuples can be removed, independent of each other? Would above scheme of vacuum-on-page-expiry work in that case? Shridhar
Tom Lane <tgl@sss.pgh.pa.us> writes: > You keep ignoring the problem of removing index entries. To vacuum an > individual page, you need to be willing to read in (and update) all > index pages that reference the tuples-to-be-deleted. Hm. If the visibility information were stored in the index then that could be postponed and done when the index page was read or when it was flushed. So when you do a sequential scan it marks any tuples it finds as dead as free in the free space map. And when you do an index scan it marks any index entries it finds dead as free in the free space map. That wouldn't eliminate the need for vacuum though. It would be pretty easy to delete chunks of tuples and never look at those index values again. Especially since deleting them might not even bring the index pages in at all. That said, I guess if you never look at the pages again, the wasted space in them isn't having any performance penalty anyways. > The latter point is really the crux of the problem. The point of having > the VACUUM process is to keep maintenance work out of the critical path > of foreground queries. Anything that moves even part of that > maintenance work into the critical path is going to be a net loss. I could envision a system where the process is guaranteed a certain number of empty pages. Whenever it grabs one the last page on the LRU is queued to be cleaned. It doesn't have to actually get cleaned and flushed in the critical path unless the transaction starves of ready-to-use pages. If it does it's probably a batch transaction anyways and should switch to reusing the same pages over and over to avoid flushing everything from the shared pages. -- greg
Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes:
> If an index tuple had transaction information duplicated along with
> heap tuple, two types of tuples can be removed, independent of each
> other?
Possibly ... but I think we have already considered and rejected that
proposal, more than once.
        regards, tom lane
			
		Greg Stark <gsstark@MIT.EDU> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: > > > You keep ignoring the problem of removing index entries. To vacuum an > > individual page, you need to be willing to read in (and update) all > > index pages that reference the tuples-to-be-deleted. > > Hm. If the visibility information were stored in the index then that could be > postponed and done when the index page was read or when it was flushed. Or possibly if we could reuse the space in the table without updating the index and be able to recognize that the tuple no longer corresponded to the index entry when we next saw the index entry. I suspect that would take as much space to keep track of as the visibility information though. -- greg
The world rejoiced as tgl@sss.pgh.pa.us (Tom Lane) wrote:
> The latter point is really the crux of the problem.  The point of having
> the VACUUM process is to keep maintenance work out of the critical path
> of foreground queries.  Anything that moves even part of that
> maintenance work into the critical path is going to be a net loss.
How about a "flip side" for this...
VACUUM CACHE;
This new operation vacuums only those pages of relations that are in
cache.  Tuples that have recently been trashed are quite likely to be
in cache because they had to be drawn in in order to mark them as
obsolete.  So if there is (say) a "balance table" where entries get
updated regularly, the dead tuples should be sitting in cache.
This should go "blip - walk through 10K pages of cache in memory" and
be done.  (Modulo a need to pull in some index pages to mark index
entries as dead...)
-- 
output = reverse("gro.mca" "@" "enworbbc")
http://www.ntlug.org/~cbbrowne/linux.html
Build a man a fire and he will be warm for a day.
Set a man on fire and he will be warm for the rest of his life.
			
		Christopher Browne <cbbrowne@acm.org> writes:
> How about a "flip side" for this...
> VACUUM CACHE;
> This new operation vacuums only those pages of relations that are in
> cache.
This might conceivably be a useful behavior (modulo the problem of
fixing index entries) ... but I think we'd not want to expose it as
an actual VACUUM option.  If we were going to do something like that,
it should likely be done by an invisible automatic vacuum daemon,
more like checkpoints ...
        regards, tom lane
			
		To add some medium-hard data to the discussion, I hacked a PG 7.3.4 a little. The system I am talking about below run's an artificial application that very well resembles the behaviour of a TPC-C benchmark implementation. Without vacuuming the database, it can just so sustain a factor 5 scaled database running with 50 simulated terminals. To free some bandwidth, the system is configured with scaling 4 and runs with 40 simulated terminals. In this configuration it can satisfy the responsetime requirements for 100% of all transactions when not vacuuming ... no surprise. The test driver takes 10 minute intervals and reports the percentage of transactions which qualify. If the database now is vacuumed simultaneously, the response time for transactions changes dramatically. A 10 minute interval hit by vacuum drops down from 100% to anything below 90%, I've seen it down to 75%. The system load given by a Linux 2.4 kernel jumps up from under 1.0 to anything between 5 and 8. So far, that is exactly what most DBA's are complaining about. A system that runs smoothly otherwise get's literally bogged down by any vacuum. Now I changed the cache policy. While a backend is running vacuum, a global flag is set. If this flag is set and a block is not found in the cache but must be read, it's buffer is marked BM_READ_BY_VACUUM.When the global flag is set, AddBufferToFreelist() inserts buffers so marked at the head of the freelist instead of adding them to the tail. In any case, the buffers BM_READ_BY_VACUUM flag is cleared. The effect of this simple hack is somewhat surprising. Not only can the system keep satisfying 97% or more of all transactions within time limits and the system load stays well below 2.0 (I've only seen 1.6 once), but very surprisingly VACUUM finishes about 20% faster too. I'm not a friend of jumping to conclusions, OTOH I have to try to make some sense out of it. So I would like the following be taken with a reasonable amount of salt. I think that the common theory, vacuum is similar to a sequential scan, just does not hold true for any table that is actually updated randomly. What happens instead is that vacuum not only evicts the whole buffer cache by forcing all blocks of said table and its indexes in, it also dirties a substantial amount of that and leaves the dirt to be cleaned up by all the other backends. The changes I've done above cause vacuum to work with as few shared buffers as possible for the data not already found in the cache. This avoids imposing unnecessary additional write overhead for regular backends, and causes the vacuum process to stay inside of a few virtual memory pages instead of running all over the place. I don't know how much the latter impacts the efficiency of the MMU, it might not be significant here. It is well possible that there is some other side effect in the buffer cache that impacts the behaviour of many backends doing few writes compared to one backend doing them en-gros. However, the test indicates that there is some low hanging fruit in the cache algorithm, and that it's not just a few little raspberries. Jan Tom Lane wrote: > Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes: >> I was thinking about it. How about vacuuming a page when it is been >> pushed out of postgresql buffer cache? It is is memory so not much IO >> is involved. > > You keep ignoring the problem of removing index entries. To vacuum an > individual page, you need to be willing to read in (and update) all > index pages that reference the tuples-to-be-deleted. This is hardly > tenable when the reason for pushing the page out of buffer cache was so > that you could read in something else instead --- you don't have spare > buffer slots, and you don't want to do all that I/O (and the associated > WAL log entries) before you can read in the page you originally wanted. > > The latter point is really the crux of the problem. The point of having > the VACUUM process is to keep maintenance work out of the critical path > of foreground queries. Anything that moves even part of that > maintenance work into the critical path is going to be a net loss. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes:
> What happens instead is that vacuum not only evicts the whole buffer 
> cache by forcing all blocks of said table and its indexes in, it also 
> dirties a substantial amount of that and leaves the dirt to be cleaned 
> up by all the other backends.
[ thinks about that... ]  Yeah, I believe you're right, because (plain)
vacuum just does WriteBuffer() for any page that it modifies, which only
marks the page dirty in buffer cache.  It never does anything to force
those pages to be written out to the kernel.  So, if you have a large
buffer cache, a lot of write work will be left over to be picked up by
other backends.
I think that pre-WAL the system used to handle this stuff differently,
in a way that made it more likely that VACUUM would issue its own
writes.  But optimizations intended to improve the behavior for
non-VACUUM cases have made this not so good for VACUUM.
I like your idea of penalizing VACUUM-read blocks when they go back into
the freelist.  This seems only a partial solution though, since it
doesn't directly ensure that VACUUM rather than some other process will
issue the write kernel call for the dirtied page.  Maybe we should
resurrect a version of WriteBuffer() that forces an immediate kernel
write, and use that in VACUUM.
Also, we probably need something similar for seqscan-read blocks, but
with an intermediate priority (can we insert them to the middle of the
freelist?)
        regards, tom lane
			
		Tom Lane wrote: > Jan Wieck <JanWieck@Yahoo.com> writes: >> What happens instead is that vacuum not only evicts the whole buffer >> cache by forcing all blocks of said table and its indexes in, it also >> dirties a substantial amount of that and leaves the dirt to be cleaned >> up by all the other backends. > > [ thinks about that... ] Yeah, I believe you're right, because (plain) > vacuum just does WriteBuffer() for any page that it modifies, which only > marks the page dirty in buffer cache. It never does anything to force > those pages to be written out to the kernel. So, if you have a large > buffer cache, a lot of write work will be left over to be picked up by > other backends. > > I think that pre-WAL the system used to handle this stuff differently, > in a way that made it more likely that VACUUM would issue its own > writes. But optimizations intended to improve the behavior for > non-VACUUM cases have made this not so good for VACUUM. > > I like your idea of penalizing VACUUM-read blocks when they go back into > the freelist. This seems only a partial solution though, since it > doesn't directly ensure that VACUUM rather than some other process will > issue the write kernel call for the dirtied page. Maybe we should > resurrect a version of WriteBuffer() that forces an immediate kernel > write, and use that in VACUUM. > > Also, we probably need something similar for seqscan-read blocks, but > with an intermediate priority (can we insert them to the middle of the > freelist?) Well, "partial solution" isn't quite what I would call it, and it surely needs integration with sequential scans. I really do expect the whole hack to fall apart if some concurrent seqscans are going on since it not really penalizes the VACUUM-read blocks but more the next caller of GetFreeBuffer(). In my test case that just happens to be VACUUM most of the time. I described it only to demonstrate the existence of potential. Since the whole point of the buffer cache is to avoid the real bad thing, I/O, I don't think that the trivial double-linked list that implements it today is adequate. I can't imagine it completely yet, but what I see vaguely is a cache policy that put's a block into the freelist depending on where it was coming from (cache, seqscan, indexscan, vacuum) and what it is (heap, toast, index). That plus the possibility for vacuum to cause it to be written to kernel immediately might do it. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On Mon, 2003-10-27 at 15:31, Jan Wieck wrote: > Well, "partial solution" isn't quite what I would call it, and it surely > needs integration with sequential scans. I really do expect the whole > hack to fall apart if some concurrent seqscans are going on I'd rather see us implement a buffer replacement policy that considers both frequency + recency (unlike LRU, which considers only recency). Ideally, that would work "automagically". I'm hoping to get a chance to implement ARC[1] during the 7.5 cycle. -Neil [1]: http://citeseer.nj.nec.com/megiddo03arc.html
Neil Conway wrote: > On Mon, 2003-10-27 at 15:31, Jan Wieck wrote: > > Well, "partial solution" isn't quite what I would call it, and it surely > > needs integration with sequential scans. I really do expect the whole > > hack to fall apart if some concurrent seqscans are going on > > I'd rather see us implement a buffer replacement policy that considers > both frequency + recency (unlike LRU, which considers only recency). > Ideally, that would work "automagically". I'm hoping to get a chance to > implement ARC[1] during the 7.5 cycle. Someone just started working on it this week. He emailed Jan and I. He hopes to have a patch in a few days. I will make sure he posts to hackers/patches. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
>I'd rather see us implement a buffer replacement policy that considers >both frequency + recency (unlike LRU, which considers only recency). >Ideally, that would work "automagically". I'm hoping to get a chance to >implement ARC[1] during the 7.5 cycle. Actually i've already done some work back in august. I tried to implement ARC in pgsql but failed to due to lack of time. After renewed discussion with Bruce and Jan, i've decided to give it another try. So hopefully w'll have a patch soon from me or Jan. The algorithm itself is very elegant and works "automagically". -- Cuong Bui Loki IT Solutions http://www.lokishop.nl Powers your webcommerce!