Обсуждение: 7.3.1 takes long time to vacuum table?
Hi everyone, Does anyone know of any issues with 7.3.1 which would cause it to take a long time to do a vacuum full? We have a resource table consisting of about 70M records and we have recently performed an update query on the entire table and now I'm trying to do a vacuum full to reclaim back the disk space. So far on a dual PIII 1.4GHz machine with hardware RAID5 and 2GB RAM, the vacuum full verbose is still running after 30 hours! The reason I am concerned is that before performing the update, the vacuum full would normally take about 12 hours and that was with a trigger and 5 indexes on the table. This time, before initiating the vacuum, all the indexes were dropped, and a single update performed on 1 field over the entire table. I understand that postgres has to compact the valid tuples down to the front of the file after removing the previous ones, but should it really take this long on such a powerful machine? Or have I made a gross error somewhere in the configuration? When I first configured the database, vacuuming the table took nearly 20 hours, but thanks to a post in the archives, I found some suggestions to increase vacuum mem & FSM size and reduce the number of buffers which got it down to its normal time of about 12 hours :) Here are the settings I changed from the defaults based on that post: Max_fsm_relations = 50000 Max_fsm_pages = 5000000 Vacuum_mem = 65535 Fsync = false I have also set shmmax to 800Mb just to give things some breathing space. One thing I have noticed is that the postmaster process running the vacuum has now reached 1Gb of memory and looks like it is beginning to touch swap(!) which is going to slow things even more. Can anyone help me out and reduce the time it takes to do this vacuum? Cheers, Mark. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.
> -----Original Message----- > From: Martijn van Oosterhout [mailto:kleptog@svana.org] > Sent: 19 February 2003 02:14 > To: Mark Cave-Ayland > Cc: PostgreSQL General > Subject: Re: [GENERAL] 7.3.1 takes long time to vacuum table? Hi Martijn, Thanks for taking the time to look at this one. > Ok, this is a hard one. Doing a vacuum full is very hard on the disk > cache. > It basically copies a lot of tuples around. In your case it's going to be > copying every tuple from somewhere near the end of the table to somewhere > near the beginning. This makes the pattern of disk access something like: > > seek, read, seek, write, seek, read, seek, write, ... > > Which, not surprisingly, sucks (especially from RAID5 I think, but I'm not > sure about that). Note this varies a bit between versions of Unix, > postgresql and your C library. Hmmm.... so the news is not good then :( We timed an insert of about 800,000 records into the DB under 7.2.1 and that was just over 10mins... so extrapolating that out it would take at least 15 hours just to write the direct 70M records to disk without the seeking.... *sigh* > My only advice is that you can use strace to work out approximatly where > it's up to. Use /proc/ to work out which file descriptor is the table > you're > working with and then strace the backend (the -p option) to work out which > part it is reading from. It'll look like: > > seek(<fd>,<offset>,SEEK_SET) = <offset> > read(<fd>,"lots of rubbish", 8192) = 8192 > > It's the offset you want, it may jump around a bit but it should be > increasing on the whole. If your table is split into multiple files > because > it's over 1GB, take this into account when working out how far it's in. Ok, I've managed to do that.... offset just seems to keep jumping around though, not much of a specific pattern... although it does seem to be reading from 2 separate files :( But what IS interesting is that between each seek(), postgres does 2 reads of 8k (which I guess is a page) and then 4 writes! This I don't understand? Surely given the memory parameters then it should read as many pages into memory as possible, sort them, then seek back and write them? What appears to be happening is that it is only one or two pages are being moved at a time which seems really inefficient. Or is the assumption here that by limiting the pages being moved around, more memory can be given to the OS so it can cache aggressively? I've uploaded the strace file to http://www.infomapper.com/strace/strace.log.txt if you (or anyone else) would be interested in taking a look - I logged about 700ks worth. > This is probably a good argument to have VACUUM emit a notice every 10 > minutes or so giving some indication of its progress. I don't know how > hard > this would be. The vacuum has emitted a line of statistics within the first few hours which reads the following: INFO: --Relation public.res-- INFO: Pages 9167290: Changed 5366493, reaped 6000857, Empty 0, New 0; Tup 72475389: Vac 105298234, Keep/VTL 0/0, UnUsed 221144, MinLen 72, MaxLen 727; Re-using: Free/Avail. Space 48478844644/48455035400; EndEmpty/Avail. Pages 0/8458521. CPU 733.62s/151.22u sec elapsed 4656.51 sec. But now it's just sat there churning away.... I guess the above was the result of marking which tuples were to be kept and now I'm guessing its in the process of moving data around. So yes, some additional notification during this phase would be very useful for large tables like this. > > I have also set shmmax to 800Mb just to give things some breathing > > space. One thing I have noticed is that the postmaster process running > > the vacuum has now reached 1Gb of memory and looks like it is beginning > > to touch swap(!) which is going to slow things even more. Can anyone > > help me out and reduce the time it takes to do this vacuum? > > The only thing I can suggest is that SELECT * INTO newtables FROM table; > may > have been faster, though it shouldn't be. My instinct would suggest this would be faster, given that the writing would be guaranteed sequential, and a single sequential scan on the table. The problem with this is that we don't have enough disk space to maintain 2 copies of the table on one device at the moment as the table takes about 40Gb space. BTW the vacuum is of course still going as we enter the 50hr stage.... Cheers, Mark. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.
> -----Original Message----- > From: Peter Childs [mailto:Blue.Dragon@blueyonder.co.uk] > Sent: 19 February 2003 06:05 > To: PostgreSQL General > Subject: Re: [GENERAL] 7.3.1 takes long time to vacuum table? > > > This is probably a good argument to have VACUUM emit a notice every 10 > > minutes or so giving some indication of its progress. I don't know how > hard > > this would be. > > > Try Vacuum Verbose; It does not print a message out every 10 minites > but it > should tell you that somthing is happerning, oh and whats happerning if > you > can make any sence of the messages..... > > Peter Childs Hi Peter, I am running vacuum verbose, however after displaying an information line a few hours in, it won't output any information now until the vacuum is complete as I guess it's now just moving pages around.... :( Cheers, Mark. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.
On Wed, Feb 19, 2003 at 11:03:12AM -0000, Mark Cave-Ayland wrote: > Ok, I've managed to do that.... offset just seems to keep jumping around > though, not much of a specific pattern... although it does seem to be > reading from 2 separate files :( But what IS interesting is that between > each seek(), postgres does 2 reads of 8k (which I guess is a page) and > then 4 writes! This I don't understand? Surely given the memory > parameters then it should read as many pages into memory as possible, > sort them, then seek back and write them? What appears to be happening > is that it is only one or two pages are being moved at a time which > seems really inefficient. Hmm, odd. The offset is going *backwards*. It's starting at 981934080 and made it down to 969719808. It looks like it's reading each page. I don't think I need to tell you that from a caching point of view, it's not probably not as good as going forward. > Or is the assumption here that by limiting the pages being moved around, > more memory can be given to the OS so it can cache aggressively? I've > uploaded the strace file to > http://www.infomapper.com/strace/strace.log.txt if you (or anyone else) > would be interested in taking a look - I logged about 700ks worth. How long did it take to get that trace? Also, what are file descriptors 58, 97 and 114? > The vacuum has emitted a line of statistics within the first few hours > which reads the following: > > INFO: --Relation public.res-- > INFO: Pages 9167290: Changed 5366493, reaped 6000857, Empty 0, New 0; > Tup 72475389: Vac 105298234, Keep/VTL 0/0, UnUsed 221144, MinLen 72, > MaxLen 727; Re-using: Free/Avail. Space 48478844644/48455035400; > EndEmpty/Avail. Pages 0/8458521. > CPU 733.62s/151.22u sec elapsed 4656.51 sec. Ok, that means it has finished the vacuum stage (since that's printed at the end of scan_heap). It's now going through the heap compacting. From that output it appears that your table is around 75Gig (9 million 8K pages)! You're into the repair_frag stage where indeed it scans through the table backwards. While I can see the argument for going backwards, from a caching perspective I think it's terrible. Especially in the case where the entire table has been replaced, the entire exercise becomes a very expensive copy operation. > But now it's just sat there churning away.... I guess the above was the > result of marking which tuples were to be kept and now I'm guessing its > in the process of moving data around. So yes, some additional > notification during this phase would be very useful for large tables > like this. Let me know what those file descriptors point to and we can probably work out how far along it is. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Support bacteria! They're the only culture some people have.
Вложения
Hi Martijn, Thanks again for your reply. > -----Original Message----- > From: Martijn van Oosterhout [mailto:kleptog@svana.org] > Sent: 19 February 2003 12:11 > To: Mark Cave-Ayland > Cc: PostgreSQL General > Subject: Re: [GENERAL] 7.3.1 takes long time to vacuum table? > > On Wed, Feb 19, 2003 at 11:03:12AM -0000, Mark Cave-Ayland wrote: > > Ok, I've managed to do that.... offset just seems to keep jumping around > > though, not much of a specific pattern... although it does seem to be > > reading from 2 separate files :( But what IS interesting is that between > > each seek(), postgres does 2 reads of 8k (which I guess is a page) and > > then 4 writes! This I don't understand? Surely given the memory > > parameters then it should read as many pages into memory as possible, > > sort them, then seek back and write them? What appears to be happening > > is that it is only one or two pages are being moved at a time which > > seems really inefficient. > > Hmm, odd. The offset is going *backwards*. It's starting at 981934080 and > made it down to 969719808. It looks like it's reading each page. I don't > think I need to tell you that from a caching point of view, it's not > probably > not as good as going forward. Backwards? Ouch I guess that will break most caching strategies! But as I said before, do you know why it only appears to be moving one or two pages at a time before seeking again??? > How long did it take to get that trace? Also, what are file descriptors > 58, > 97 and 114? The trace lasted about a couple of minutes. I've listed the filesystems from /proc and uploaded them to http://www.infomapper.com/strace/fd.log.txt for you to look - I have a feeling you would find it useful to see the total number of files open in terms of their size and quantity.... > > The vacuum has emitted a line of statistics within the first few hours > > which reads the following: > > > > INFO: --Relation public.res-- > > INFO: Pages 9167290: Changed 5366493, reaped 6000857, Empty 0, New 0; > > Tup 72475389: Vac 105298234, Keep/VTL 0/0, UnUsed 221144, MinLen 72, > > MaxLen 727; Re-using: Free/Avail. Space 48478844644/48455035400; > > EndEmpty/Avail. Pages 0/8458521. > > CPU 733.62s/151.22u sec elapsed 4656.51 sec. > > Ok, that means it has finished the vacuum stage (since that's printed at > the > end of scan_heap). It's now going through the heap compacting. From that > output it appears that your table is around 75Gig (9 million 8K pages)! > You're into the repair_frag stage where indeed it scans through the table > backwards. Hmmm, a little big bigger than I was expecting, although I guess that contains old and new pages. Here is the select result: SELECT relname, relpages FROM pg_class ORDER BY relpages DESC; relname | relpages ---------------------------------+---------- res | 5586167 ...so at 8Kb/page then that's about 42Gb. > While I can see the argument for going backwards, from a caching > perspective > I think it's terrible. Especially in the case where the entire table has > been replaced, the entire exercise becomes a very expensive copy > operation. Agreed! If it ruins any caching then in my view it's something that has to change in order to keep performance. While there may be a penalty to pay on smaller tables, the benefits of caching would more than make up for the cost of going forwards - imagine how slow CPUs would be if everything was a cache miss.... > Let me know what those file descriptors point to and we can probably work > out how far along it is. Yes please, that would be really useful for us to know. Cheers, Mark. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.
On Wednesday 19 Feb 2003 7:25 pm, you wrote: > Agreed! If it ruins any caching then in my view it's something that has > to change in order to keep performance. While there may be a penalty to > pay on smaller tables, the benefits of caching would more than make up > for the cost of going forwards - imagine how slow CPUs would be if > everything was a cache miss.... I have an idea. Don't know what to call it, better or stupid. If you are doing vacuum full, it means you are ready to lock the database. I am just wondering if it would be fast to dump the database, drop it and recreate it. Good old defragmentation of ext2fs..;-) Of course, having a spare 80Gigs might turn out to be a problem but I can almost bet that it would finish before 12 hours.. Shridhar
On Wed, Feb 19, 2003 at 01:55:01PM -0000, Mark Cave-Ayland wrote: > > How long did it take to get that trace? Also, what are file > descriptors > > 58, > > 97 and 114? > > The trace lasted about a couple of minutes. I've listed the filesystems > from /proc and uploaded them to > http://www.infomapper.com/strace/fd.log.txt for you to look - I have a > feeling you would find it useful to see the total number of files open > in terms of their size and quantity.... 58 = 55792/58153.15 97 = 55792/58153.54 114 = pg_xlog/000000A50000000E So it's writing to WAL. I don't know if I'm sure about this but if it's currently at file no 54 and you need to end up with 45 (each file 1GB) and you started on file 69 that means you're over halfway. But I'm really not sure. > Agreed! If it ruins any caching then in my view it's something that has > to change in order to keep performance. While there may be a penalty to > pay on smaller tables, the benefits of caching would more than make up > for the cost of going forwards - imagine how slow CPUs would be if > everything was a cache miss.... I'd like to get one of the developers views on this. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Support bacteria! They're the only culture some people have.
Вложения
Hi Martijn, Shridhar, > > Agreed! If it ruins any caching then in my view it's something that has > > to change in order to keep performance. While there may be a penalty to > > pay on smaller tables, the benefits of caching would more than make up > > for the cost of going forwards - imagine how slow CPUs would be if > > everything was a cache miss.... > > I'd like to get one of the developers views on this. Just looking at Shridhar's email, and Martin's earlier comment about SELECT * INTO newtable FROM oldtable, I'm rapidly seeing that this would be the better way go. Because instead of seeking around millions of pages with the vacuum, it would be less work to do this because then it is just a sequential read and a sequential write. Interestingly this could be used to create a speedy vacuum - that is, create a new table with a temporary name that is invisible to the database (similar to dropped columns), then taking into account the disk space left on the device, pick the last X pages from the old table and write to the new table. Then truncate the file containing the table at point X and repeat until finished. Finally kill the old table and make the new one visible. I appreciate there may be a couple of issues with oids/foreign keys but it sounds like a great solution to me! Why does vacuum bother with reordering rows? I thought that was what the CLUSTER command was for? Any developers care to comment on this? I imagine there must be a technical reason (prob to do with MVCC) as to why this hasn't been done before? In fact, my colleague has just done a test with SELECT..INTO on our dev version and it compacted 600Mb -> 400Mb in just 40s(!). We then did a vacuum full on the same original 600Mb table which is still going after 20mins. Difficult choice! So even in a worse case scenario we could have a fully vacuumed table within a day.... we're looking at dropping some indexes in the db to reclaim enough space to be able to fit another copy of the table on the disk... this is looking very tempting at the moment.... Cheers, Mark. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> Interestingly this could be used to create a speedy vacuum - that is,
> create a new table with a temporary name that is invisible to the
> database (similar to dropped columns), then taking into account the disk
> space left on the device, pick the last X pages from the old table and
> write to the new table. Then truncate the file containing the table at
> point X and repeat until finished. Finally kill the old table and make
> the new one visible.
And if you crash midway through?
> Why does vacuum bother with reordering rows?
It's designed to be fast when there's not very much data motion required
(ie, you only need to pull a relatively small number of rows off the end
to fill in the holes elsewhere).
I have not seen any actual evidence that doing it any other way would be
faster.  Yes, it's reading the source tuples backwards instead of
forwards, but that's at most a third of the total I/O load (you've also
got tuple output and WAL writes to think about).  It's not clear that
any kernel read-ahead optimization could get a chance to work anyhow.
> In fact, my colleague has just done a test with SELECT..INTO on our dev
> version and it compacted 600Mb -> 400Mb in just 40s(!). We then did a
> vacuum full on the same original 600Mb table which is still going after
> 20mins.
Are there indexes on the original table?  If so, this isn't a fair
comparison.
            regards, tom lane
			
		Hi Tom, Thanks for your reply. > -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: 19 February 2003 16:28 > To: Mark Cave-Ayland > Cc: Martijn van Oosterhout; shridhar_daithankar@persistent.co.in; > PostgreSQL General > Subject: Re: [GENERAL] 7.3.1 takes long time to vacuum table? > > "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > > Interestingly this could be used to create a speedy vacuum - that is, > > create a new table with a temporary name that is invisible to the > > database (similar to dropped columns), then taking into account the disk > > space left on the device, pick the last X pages from the old table and > > write to the new table. Then truncate the file containing the table at > > point X and repeat until finished. Finally kill the old table and make > > the new one visible. > > And if you crash midway through? I don't know if I'm looking at this too simplistically but.... Each table could have an attribute to indicate that it is being vacuumed (if one does not already exist). I imagine the rest could be based on the existing transaction code i.e. mark the last X rows of the source table as deleted and insert them (still invisible) into the destination table. On transaction commit, the destination copies become visible and source table is truncated at the file level. Or is truncation not a transaction safe activity? If a crash occurs during the vacuum then postgres can tell from the attribute that the table was in the process of being vacuumed and then can use the WAL to carry on from where it left off.... > > Why does vacuum bother with reordering rows? > > It's designed to be fast when there's not very much data motion required > (ie, you only need to pull a relatively small number of rows off the end > to fill in the holes elsewhere). > > I have not seen any actual evidence that doing it any other way would be > faster. Yes, it's reading the source tuples backwards instead of > forwards, but that's at most a third of the total I/O load (you've also > got tuple output and WAL writes to think about). It's not clear that > any kernel read-ahead optimization could get a chance to work anyhow. I see, maybe I was a little premature in my 'vacuum bashing' :) So it's optimized for the 'few holes' case while we are using it for a 'many holes' case..... things make a bit more sense now. > > In fact, my colleague has just done a test with SELECT..INTO on our dev > > version and it compacted 600Mb -> 400Mb in just 40s(!). We then did a > > vacuum full on the same original 600Mb table which is still going after > > 20mins. > > Are there indexes on the original table? If so, this isn't a fair > comparison. Fair point actually, I should have made it a better comparison. The source table has 5 btree indexes, each on a bigint field. However, it has taken just under a minute to recreate the first! The vacuum full on the original 600Mb table has finished after 100mins, so it looks as if I used the SELECT..INTO method could be up and done in 10mins! I can continue recreating the other indexes to get a proper final time comparison if you are interested? Cheers, Mark. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> In fact, my colleague has just done a test with SELECT..INTO on our
> dev
> version and it compacted 600Mb -> 400Mb in just 40s(!). We then did
> a
> vacuum full on the same original 600Mb table which is still going
> after
> 20mins.
>>
>> Are there indexes on the original table?  If so, this isn't a fair
>> comparison.
> Fair point actually, I should have made it a better comparison. The
> source table has 5 btree indexes, each on a bigint field. However, it
> has taken just under a minute to recreate the first! The vacuum full on
> the original 600Mb table has finished after 100mins, so it looks as if I
> used the SELECT..INTO method could be up and done in 10mins! I can
> continue recreating the other indexes to get a proper final time
> comparison if you are interested?
Yeah.  Also, I don't suppose you made that a VACUUM VERBOSE and kept the
output?  It'd be interesting to see which stages took the most time.
            regards, tom lane
			
		On Wed, Feb 19, 2003 at 11:28:13AM -0500, Tom Lane wrote: > "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > > Interestingly this could be used to create a speedy vacuum - that is, > > create a new table with a temporary name that is invisible to the > > database (similar to dropped columns), then taking into account the disk > > space left on the device, pick the last X pages from the old table and > > write to the new table. Then truncate the file containing the table at > > point X and repeat until finished. Finally kill the old table and make > > the new one visible. > > And if you crash midway through? Messy definitly. > > Why does vacuum bother with reordering rows? > > It's designed to be fast when there's not very much data motion required > (ie, you only need to pull a relatively small number of rows off the end > to fill in the holes elsewhere). > > I have not seen any actual evidence that doing it any other way would be > faster. Yes, it's reading the source tuples backwards instead of > forwards, but that's at most a third of the total I/O load (you've also > got tuple output and WAL writes to think about). It's not clear that > any kernel read-ahead optimization could get a chance to work anyhow. Well, consider that it's reading every single page in the table from the end down to halfway (since every tuple was updated). If you went back in chunks of 128K then the kernel may get a chance to cache the following blocks. On a disk, once you are reading a sector, reading the next 128 sectors is essentially free. The marginal cost of more sectors is almost zero. It could be argued that the kernel should be noticing that you're scanning backward and start its read a meg or two before where you asked. But there are so many levels of cache (both hardware and software) that may have to play along. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Support bacteria! They're the only culture some people have.
Вложения
Martijn van Oosterhout <kleptog@svana.org> writes:
> Well, consider that it's reading every single page in the table from the end
> down to halfway (since every tuple was updated). If you went back in chunks
> of 128K then the kernel may get a chance to cache the following
> blocks.
I fear this would be optimization with blinkers on :-(.  The big reason
that VACUUM FULL scans backwards is that at the very first (last?) page
where it cannot push all the tuples down to lower-numbered pages, it
can abandon any attempt to move more tuples.  The file can't be made
any shorter by internal shuffling, so we should stop.  If you back up
multiple pages and then scan forward, you would usually find yourself
moving the wrong tuples, ie ones that cannot help you shrink the file.
I suspect that what we really want here is a completely different
algorithm (viz copy into a new file, like CLUSTER) when the initial scan
reveals that there's more than X percent of free space in the file.
            regards, tom lane
			
		Martijn van Oosterhout <kleptog@svana.org> writes:
> You could do the jump-back-in-blocks only if more than 30% of the table is
> empty and table is over 1GB. For the example here, a simple defragging
> algorithm would suffice; start at beginning and pack each tuple into the
> beginning of the file. It will move *every* tuple but it's more cache
> friendly. It's pretty extreme though.
And your evidence that it will actually be faster is ... ?
            regards, tom lane
			
		On Wednesday 19 Feb 2003 9:05 pm, you wrote: > Interestingly this could be used to create a speedy vacuum - that is, > create a new table with a temporary name that is invisible to the > database (similar to dropped columns), then taking into account the disk > space left on the device, pick the last X pages from the old table and > write to the new table. Then truncate the file containing the table at > point X and repeat until finished. Finally kill the old table and make > the new one visible. I appreciate there may be a couple of issues with > oids/foreign keys but it sounds like a great solution to me! Why does > vacuum bother with reordering rows? I thought that was what the CLUSTER > command was for? Any developers care to comment on this? I imagine there > must be a technical reason (prob to do with MVCC) as to why this hasn't > been done before? Well, One thing I can think of is the extra space required. The algo. looks good but it would be very difficult to make sure that it works all the time especially given that postgresql does not have sophisticated and/or tunable storage handling( think of tablespaces ). It is always space-time trade-off. On one hand we have vacuum which uses a constant and may be negiliible space but takes time proportional to amount of work. On other hand we have drop/recreate table which takes double the space but is extremely fast i.e. proportinal to data size at max. I/O bandwidth available.. It would be good if there is in between. Of course it would not be easy to do it. But it has to start, isn't it?..:-) > In fact, my colleague has just done a test with SELECT..INTO on our dev > version and it compacted 600Mb -> 400Mb in just 40s(!). We then did a > vacuum full on the same original 600Mb table which is still going after > 20mins. Difficult choice! So even in a worse case scenario we could have > a fully vacuumed table within a day.... we're looking at dropping some > indexes in the db to reclaim enough space to be able to fit another copy > of the table on the disk... this is looking very tempting at the > moment.... I recommend this strategy of "vacuuming" be documented in standard documentation and FAQ. Given that postgresql is routinely deployed for databases >10GB which is greater than small/medium by any definition today, I think this will be a good move. Furthermore this strategy reduces the down time due to vacuum full locks drastically. I would say it is worth buying a 80GB IDE disk for this purpose if you have this big database.. Nice to see that my idea helped somebody..:-) Shridhar
Hi Tom, Martijn, Shridhar,
> Yeah.  Also, I don't suppose you made that a VACUUM VERBOSE and kept
the
> output?  It'd be interesting to see which stages took the most time.
>
>             regards, tom lane
I've got the complete set of timings for the table on our dev box. I'm
afraid the vacuum wasn't run in verbose mode so I can't give you any
more information :(. This information is based on the 600Mb table on our
dev system.
Time to vacuum full:
    Total:                            100m
Time to restore by copying to another table:
    SELECT INTO
40s
    CREATE INDEX 1 (bigint) using btree            59s
    CREATE INDEX 2 (int) using btree            27s
    CREATE INDEX 3 (bigint) using btree            32s
    CREATE INDEX 4 (bigint) using btree            24s
    CREATE INDEX 5 (varchar) using btree        3m 2s
    CREATE INDEX 6 (txtidx) using gist            12m 58s
    CREATE INDEX 7 (txtidx) using gist            31m 20s
    Total:                            51m 22s
OK, so my initial estimate of rebuilding in 10mins was way out because I
forgot about the 2 massive gist fti indexes I had on the table - doh.
However, I feel that the times are still meaningful in that I now have a
replica of the table at 400Mb (down by 200Mb) in just over half the time
that the vacuum full took to do the same job.
We stopped our vacuum on our live system at the 55hr stage whilst it was
still going - it took about another hour from sending the cancel request
until the vacuum stopped. We then deleted several indexes to claim back
enough Gb to hold a second copy of the table and we've just completed
the SELECT INTO into a new table.
And the result? It has taken a total of 1h 45m to generate a copy! Given
that we are rebuilding the table *WITHOUT* the large gist indexes on our
dev version, I guess that it would only be a matter of several hours
before we can rebuild the indexes back up on the table and be using it
again.
I hope that this goes some way to showing that implementing a new type
of vacuum feature, perhaps similar in working to the one suggested
earlier in the thread, would be greatly appreciated by people with
databases even more than 0.5G in size. I would gladly support/help out
anyone who felt they could implement such a feature in this way.
Cheers,
Sparks.
---
Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England
Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446
This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.
			
		On 20 Feb 2003 at 13:03, Mark Cave-Ayland wrote: > And the result? It has taken a total of 1h 45m to generate a copy! Given > that we are rebuilding the table *WITHOUT* the large gist indexes on our > dev version, I guess that it would only be a matter of several hours > before we can rebuild the indexes back up on the table and be using it > again. OK. From last thread, there was one more bell of caution. Having foreign key constraints. What I would suggest you to do is as follows. beign create new table as select into.. create any necessary indexes on new table. rename old table as something else. rename new table as original table commit drop old table. It should take care of mos practical problems that I can think of, right now. Bye Shridhar -- Acceptance testing: An unsuccessful attempt to find bugs.
Hi Shridhar, > -----Original Message----- > From: Shridhar Daithankar<shridhar_daithankar@persistent.co.in> > [mailto:shridhar_daithankar@persistent.co.in] > Sent: 20 February 2003 06:32 > To: PostgreSQL General > Subject: Re: [GENERAL] 7.3.1 takes long time to vacuum table? > > On Wednesday 19 Feb 2003 9:05 pm, you wrote: > > Interestingly this could be used to create a speedy vacuum - that is, > > create a new table with a temporary name that is invisible to the > > database (similar to dropped columns), then taking into account the disk > > space left on the device, pick the last X pages from the old table and > > write to the new table. Then truncate the file containing the table at > > point X and repeat until finished. Finally kill the old table and make > > the new one visible. I appreciate there may be a couple of issues with > > oids/foreign keys but it sounds like a great solution to me! Why does > > vacuum bother with reordering rows? I thought that was what the CLUSTER > > command was for? Any developers care to comment on this? I imagine there > > must be a technical reason (prob to do with MVCC) as to why this hasn't > > been done before? > > Well, One thing I can think of is the extra space required. The algo. > looks > good but it would be very difficult to make sure that it works all the > time > especially given that postgresql does not have sophisticated and/or > tunable > storage handling( think of tablespaces ). In some ways, extra space isn't a problem as long as you know about it. On our dev system, we've had several occasions where the disk has filled and we've had to get in and recover it. When the system was designed, it was planned to have a couple of 10s of GB spare to store additional data, but we did not plan to need to have a second copy of our largest table @ 40Gb a copy! So given that postgres falls over anyway when the disk is full, I would not see this as a reason to NOT develop additional functionality which would make use of more disk space as long as users can be made aware of this need.... > It is always space-time trade-off. On one hand we have vacuum which uses a > constant and may be negiliible space but takes time proportional to amount > of > work. On other hand we have drop/recreate table which takes double the > space > but is extremely fast i.e. proportinal to data size at max. I/O bandwidth > available.. > > It would be good if there is in between. Of course it would not be easy to > do > it. But it has to start, isn't it?..:-) The situation here is that to do a vacuum full requires locking this particular table so our system becomes practically unusable anyway. So having the process take a day as opposed to 3-4 days has been a great benefit to us. > I recommend this strategy of "vacuuming" be documented in standard > documentation and FAQ. Given that postgresql is routinely deployed for > databases >10GB which is greater than small/medium by any definition > today, I > think this will be a good move. Yes, that would be very useful if it could documented in which situations a SELECT INTO would be dramatically more efficient than a vacuum. > Furthermore this strategy reduces the down time due to vacuum full locks > drastically. I would say it is worth buying a 80GB IDE disk for this > purpose > if you have this big database.. > > Nice to see that my idea helped somebody..:-) :) Well, thank YOU very much! As I stated in the previous email, I hope this information goes some way to showing that the facility should be considered more important as people move towards larger postgresql databases. Cheers, Mark. P.S. Have just received your other email while writing this, and the procedure you described is pretty close to what we're doing. However, we need to manually add back various constraints/default values into the table columns which is a bit of a pain... wish it could be a little more automatic. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.
On 20 Feb 2003 at 13:23, Mark Cave-Ayland wrote: > P.S. Have just received your other email while writing this, and the > procedure you described is pretty close to what we're doing. However, we > need to manually add back various constraints/default values into the > table columns which is a bit of a pain... wish it could be a little more > automatic. Well, postgresql DDL's are transaction safe. So if you rename a table in a transaction, that won't be visible until you commit transaction. That is the reason I described transaction in my last mail. If you encapsulate it that way, it should not affect any referential constraints anyway. At least that is the theory. Try it and let us know.. Bye Shridhar -- intoxicated, adj.: When you feel sophisticated without being able to pronounce it.
On Thu, 20 Feb 2003, Shridhar Daithankar wrote: > On 20 Feb 2003 at 13:03, Mark Cave-Ayland wrote: > > And the result? It has taken a total of 1h 45m to generate a copy! Given > > that we are rebuilding the table *WITHOUT* the large gist indexes on our > > dev version, I guess that it would only be a matter of several hours > > before we can rebuild the indexes back up on the table and be using it > > again. > > OK. From last thread, there was one more bell of caution. Having foreign key > constraints. > > What I would suggest you to do is as follows. > > beign > > create new table as select into.. > create any necessary indexes on new table. > rename old table as something else. > rename new table as original table > > commit > > drop old table. > > It should take care of mos practical problems that I can think of, right now. That won't copy foreign key constraints, unfortuntately. Foreign keys aren't to a name, they're to an object, so a constraint to the old table is still to the old table no matter what you rename it to and if something else is renamed to the the old table's old table.
Well, here is another case where partitioning would be usefull. Lets all agree the that vaccuming a small table should be better done by copying to a new one. Now, if a larger table would be partitioned, it would allow vacuuming one partition at a time. JLL P.S. Is there really a need to reorder the vaccumed table??? "Shridhar Daithankar" wrote: > [...] > Well, One thing I can think of is the extra space required. The algo. looks > good but it would be very difficult to make sure that it works all the time > especially given that postgresql does not have sophisticated and/or tunable > storage handling( think of tablespaces ). > > It is always space-time trade-off. On one hand we have vacuum which uses a > constant and may be negiliible space but takes time proportional to amount of > work. On other hand we have drop/recreate table which takes double the space > but is extremely fast i.e. proportinal to data size at max. I/O bandwidth > available.. > > It would be good if there is in between. Of course it would not be easy to do > it. But it has to start, isn't it?..:-) > > > In fact, my colleague has just done a test with SELECT..INTO on our dev > > version and it compacted 600Mb -> 400Mb in just 40s(!). We then did a > > vacuum full on the same original 600Mb table which is still going after > > 20mins. Difficult choice! So even in a worse case scenario we could have > > a fully vacuumed table within a day.... we're looking at dropping some > > indexes in the db to reclaim enough space to be able to fit another copy > > of the table on the disk... this is looking very tempting at the > > moment.... > > I recommend this strategy of "vacuuming" be documented in standard > documentation and FAQ. Given that postgresql is routinely deployed for > databases >10GB which is greater than small/medium by any definition today, I > think this will be a good move. > > Furthermore this strategy reduces the down time due to vacuum full locks > drastically. I would say it is worth buying a 80GB IDE disk for this purpose > if you have this big database.. > > Nice to see that my idea helped somebody..:-) > > Shridhar > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
Hi Stephan, Sridhar, > > OK. From last thread, there was one more bell of caution. Having foreign > key > > constraints. > > > > What I would suggest you to do is as follows. > > > > beign > > > > create new table as select into.. > > create any necessary indexes on new table. > > rename old table as something else. > > rename new table as original table > > > > commit > > > > drop old table. > > > > It should take care of mos practical problems that I can think of, right > now. > > That won't copy foreign key constraints, unfortuntately. Foreign keys > aren't to a name, they're to an object, so a constraint to the old table > is still to the old table no matter what you rename it to and if something > else is renamed to the the old table's old table. Aha I didn't realize the significance of the transaction block in Sridhar's previous email. In this case we should be ok since we're not using inheritance or foreign key constraints - I was just thinking about the more generic case when trying to come up with a better way to vacuum. But it would have been an interesting hack if it had worked :) Progress on clearing up the database is going well, we're getting real close now.... Cheers, Mark. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.
Hi Tom, > I suspect that what we really want here is a completely different > algorithm (viz copy into a new file, like CLUSTER) when the initial scan > reveals that there's more than X percent of free space in the file. Yes! Sorry Tom, I missed the detail of this the first time around. However from the figures here it would need to be a very small percentage of free space, perhaps only upto 10-15% where the existing vacuum would be a better method. Hope they were useful to you. Cheers, Mark. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.
Yes! That sounds like a good idea that could be used to implement 'live' table vacuuming..... Would it be possible to partition each postgres table across 2 files? You define a stripe size, say 1000, which is the number of pages that are written in turn to each file. I guess this would be chosen so as to not penalise sequential table scans too much. So the first 1000 pages would get written to file A, the next 1000 to file B, the next 1000 to file A again.... and so on. Any queries against the table must search both files to return the rows required. The immediate downside is that I guess indexes would have to be updated so they were aware that data was stored across two different files.... but let's carry on for a moment.... So now if you want to vacuum the whole table, you first lock file A and begin vacuuming it using an appropriate method. While this is running, if someone tries to delete a row from file A, the row is simply marked as deleted and moved to the end of the file during the vacuum. A deletion from file B is simply marked as deleted as normal. If someone does an insert or an update on any row in the table then the changed rows are written to file B while file A is locked and vice-versa. Assuming that each file contains roughly the same number of rows then any new updates/inserts to the table should be distributed evenly across both files since the vacuum time of each file should be roughly equal. (If this is not the case then it may be necessary to perform some form of 'balancing' by taking some rows from the larger file and moving them to the smaller one... I guess some details are missing here.). Another downside of this would be that the last < 1000 pages of each file wouldn't be vacuumed - but would people really mind if a table wasn't fully vacuumed? I don't think they would. However, while there are probably several flaws in the example I gave above, I think Jean-Luc's idea of relating partitioning to this thread could be used to eliminate many of the problems of vacuuming that currently exist.... Food for thought, Mark. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person. > -----Original Message----- > From: Jean-Luc Lachance [mailto:jllachan@nsd.ca] > Sent: 20 February 2003 16:43 > To: Shridhar Daithankar<shridhar_daithankar@persistent.co.in> > Cc: PostgreSQL General > Subject: Re: [GENERAL] 7.3.1 takes long time to vacuum table? > > Well, here is another case where partitioning would be usefull. > > Lets all agree the that vaccuming a small table should be better done by > copying to a new one. > Now, if a larger table would be partitioned, it would allow vacuuming > one partition at a time. > > > JLL > > P.S. Is there really a need to reorder the vaccumed table???
On Wed, Feb 19, 2003 at 08:53:42PM -0500, Tom Lane wrote: > Martijn van Oosterhout <kleptog@svana.org> writes: > > Well, consider that it's reading every single page in the table from the end > > down to halfway (since every tuple was updated). If you went back in chunks > > of 128K then the kernel may get a chance to cache the following > > blocks. > > I fear this would be optimization with blinkers on :-(. The big reason > that VACUUM FULL scans backwards is that at the very first (last?) page > where it cannot push all the tuples down to lower-numbered pages, it > can abandon any attempt to move more tuples. The file can't be made > any shorter by internal shuffling, so we should stop. If you back up > multiple pages and then scan forward, you would usually find yourself > moving the wrong tuples, ie ones that cannot help you shrink the file. I agree with the general idea. However, in this case there are 40GB+ of tuples to move; if you moved backwards in steps of 2MB it would make no significant difference on the resulting table. It would only be a problem near the end of the compacting. Then you can stop, the remaining pages can surely be kept track of in the FSM. Next time you do a vacuum you can go back and do the compacting properly. On tables of the size that matter here, I don't think anyone will care if the last 2MB (=0.0044% of table) isn't optimally packed the first time round. Does vacuum full have to produce the optimum result the first time? > I suspect that what we really want here is a completely different > algorithm (viz copy into a new file, like CLUSTER) when the initial scan > reveals that there's more than X percent of free space in the file. You could do the jump-back-in-blocks only if more than 30% of the table is empty and table is over 1GB. For the example here, a simple defragging algorithm would suffice; start at beginning and pack each tuple into the beginning of the file. It will move *every* tuple but it's more cache friendly. It's pretty extreme though. It does preserve table order though whereas the current algorithm will reverse the order of all the tuples in the table, possibly causing similar backward-scan problems later with your index-scans. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Support bacteria! They're the only culture some people have.
Вложения
On Wed, Feb 19, 2003 at 10:37:45PM -0500, Tom Lane wrote: > Martijn van Oosterhout <kleptog@svana.org> writes: > > You could do the jump-back-in-blocks only if more than 30% of the table is > > empty and table is over 1GB. For the example here, a simple defragging > > algorithm would suffice; start at beginning and pack each tuple into the > > beginning of the file. It will move *every* tuple but it's more cache > > friendly. It's pretty extreme though. > > And your evidence that it will actually be faster is ... ? Will, in this guy's example, the first stage of the vacuum (scan_heap) finished in a few hours but the second stage is still going after 50 hours and estimating a bit over half done. The major difference between the two is that the latter goes backwards through the table and the other forwards. Sure, it's anecdotal and a bit hand wavey but I think there's something to it. Come to think of it, if that strace had used -tt it would have been more interesting. Anyway, the test would be to implement it and them time it. The vacuum code looks scary though. I'm not sure if I want to mess with it... -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Support bacteria! They're the only culture some people have.
Вложения
On 20 Feb 2003 at 11:42, Jean-Luc Lachance wrote: > Well, here is another case where partitioning would be usefull. Not really. But point is well understood. > Lets all agree the that vaccuming a small table should be better done by > copying to a new one. > Now, if a larger table would be partitioned, it would allow vacuuming > one partition at a time. postgresql splits data files at 1 GB. If DBA assures postgresql that there is at least 1 gig of free space on data directory, vacuum full can use these data files as partition and compact the space pretty quickly. I think adding a flag to vacuum full to effect of use temporary table space for compaction aggresively, I think it should be possible to implement this pretty quickly. On the other hand, I think partitioining table should be pretty easy since the logic for splitting things across files is already there. Only if we add logic to splitting file location as well. Bye Shridhar -- Air Force Inertia Axiom: Consistency is always easier to defend than correctness.