Discussion: Slow updates, poor IO
I've just had an interesting encounter with the slow full table update
problem that is inherent with MVCC.

The system is 64-bit Linux with a 2.6.25 kernel feeding SCSI disks.

The table is:

CREATE TABLE file (
    fileid integer NOT NULL,
    fileindex integer DEFAULT 0 NOT NULL,
    jobid integer NOT NULL,
    pathid integer NOT NULL,
    filenameid integer NOT NULL,
    markid integer DEFAULT 0 NOT NULL,
    lstat text NOT NULL,
    md5 text NOT NULL,
    perms text
);

ALTER TABLE ONLY file
    ADD CONSTRAINT file_pkey PRIMARY KEY (fileid);

CREATE INDEX file_fp_idx ON file USING btree (filenameid, pathid);
CREATE INDEX file_jobid_idx ON file USING btree (jobid);

There are 2.7M rows.

Running "update file set perms='0664'" took about 10 mins. During this
period, vmstat reported Blocks Out holding in the 4000 to 6000 range.

When I dropped the indexes this query ran in 48 sec, Blocks Out peaking
at 55000.

So there is a double whammy: MVCC requires more work to be done when
indexes are defined, and then this work results in much lower IO,
compounding the problem.

Comments anyone?

--john
On Thu, Sep 25, 2008 at 1:24 PM, John Huttley <John@mib-infotech.co.nz> wrote:
> I've just had an interesting encounter with the slow full table update
> problem that is inherent with MVCC
>
> The system is 64 bit linux with 2.6.25 kernel feeding scsi disks.
>
> the table is
>
> CREATE TABLE file (
>     fileid integer NOT NULL,
>     fileindex integer DEFAULT 0 NOT NULL,
>     jobid integer NOT NULL,
>     pathid integer NOT NULL,
>     filenameid integer NOT NULL,
>     markid integer DEFAULT 0 NOT NULL,
>     lstat text NOT NULL,
>     md5 text NOT NULL,
>     perms text
> );
>
> ALTER TABLE ONLY file
>     ADD CONSTRAINT file_pkey PRIMARY KEY (fileid);
>
> CREATE INDEX file_fp_idx ON file USING btree (filenameid, pathid);
> CREATE INDEX file_jobid_idx ON file USING btree (jobid);
>
> There are 2.7M rows.
>
> running update file set perms='0664' took about 10 mins

So, how many rows would already be set to 0664?  Would adding a where
clause speed it up?

update file set perms='0664' where perms <> '0664';

> during this period, vmstat reported Blocks Out holding in the 4000 to
> 6000 range.
>
> When I dropped the indexes this query ran in 48sec.
> Blocks out peaking at 55000.
>
> So there is a double whammy.
> MVCC requires more work to be done when indexes are defined and then
> this work results in much lower IO, compounding the problem.

That's because it becomes more random and less sequential.  If you had
a large enough drive array you could get that kind of performance for
updating indexes, since the accesses would tend to hit different drives
most of the time.

Under heavy load on the production servers at work we can see 30 to 60
Megs a second random access with 12 drives, meaning 2.5 to 5 Megs per
second per drive.  Sequential throughput is about 5 to 10 times higher.

What you're seeing are likely the effects of running a db on
insufficient drive hardware.
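For a one-off rewrite of every row, the timings quoted above suggest doing explicitly what John did by hand: drop the indexes, run the update, and rebuild. A sketch using the index definitions from the original post (the transaction wrapper and the rebuild order are my additions, not something anyone in the thread ran verbatim):

```sql
BEGIN;
-- Take per-row index maintenance out of the picture first
DROP INDEX file_fp_idx;
DROP INDEX file_jobid_idx;

UPDATE file SET perms = '0664';

-- Rebuilding in bulk is far more sequential than row-at-a-time updates
CREATE INDEX file_fp_idx ON file USING btree (filenameid, pathid);
CREATE INDEX file_jobid_idx ON file USING btree (jobid);
COMMIT;
```

Note that DROP INDEX inside a transaction holds an exclusive lock on the table until commit, so this trade-off only makes sense for maintenance windows, not concurrent workloads.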
On Thursday 25 September 2008, John Huttley <John@mib-infotech.co.nz> wrote:
>> Comments anyone?

Don't do full table updates?  This is not exactly a news flash.

-- Alan
On Fri, 26 Sep 2008, John Huttley wrote:
> running update file set perms='0664' took about 10 mins

What do you have checkpoint_segments and shared_buffers set to?  If you
want something that's doing lots of updates to perform well, you need to
let PostgreSQL have a decent size chunk of memory to buffer the index
writes with, so it's more likely they'll get combined into larger and
therefore more easily sorted blocks rather than as more random ones.
The randomness of the writes is why your write rate is so slow.  You
also need to cut down on the frequency of checkpoints, which are very
costly on this type of statement.

Also: which version of PostgreSQL?  8.3 includes an improvement aimed at
updates like this you might benefit from.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
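Both of Greg's knobs live in postgresql.conf; a hedged example of the kind of values discussed later in this thread (illustrative numbers for a 1 GB machine of that era, not universal recommendations — shared_buffers requires a server restart to change):

```
shared_buffers = 256MB        # room to buffer dirty index blocks instead of
                              # writing each one out immediately
checkpoint_segments = 64      # fewer, larger checkpoints during bulk updates
```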
Hi,

On Fri, Sep 26, 2008 at 07:24:55AM +1200, John Huttley wrote:
> I've just had an interesting encounter with the slow full table update
> problem that is inherent with MVCC

Quite apart from the other excellent observations in this thread, what
makes you think this is an MVCC issue exactly?

A

--
Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/
Hi Andrew,
There are two problems.
The first is that if there is a table with an index and an update is
performed on a non-indexed field, the index is still re-indexed. This is
part of the trade-offs of MVCC.
Apparently this is documented under 'MVCC' in the manual. It should be
documented under 'performance'.
We should reasonably expect that the total amount of IO will go up,
compared with a non-indexed table.
The second thing is that the disk IO throughput goes way down.
This is not an issue with MVCC, as such, except that it exposes the
effect of a write to an indexed field -- even if you don't expect it.
--john
Andrew Sullivan wrote:
> Hi,
>
> On Fri, Sep 26, 2008 at 07:24:55AM +1200, John Huttley wrote:
>> I've just had an interesting encounter with the slow full table update
>> problem that is inherent with MVCC
>
> Quite apart from the other excellent observations in this thread, what
> makes you think this is an MVCC issue exactly?
>
> A
Hi Greg,

I've got 32M shared on a 1G machine and 16 checkpoint segments.
I'll run some tests against 64 segments and see what happens.

Your previous postings were extremely helpful wrt the MVCC issue.
I thank you!

-john

Greg Smith wrote:
> On Fri, 26 Sep 2008, John Huttley wrote:
>
>> running update file set perms='0664' took about 10 mins
>
> What do you have checkpoint_segments and shared_buffers set to?  If
> you want something that's doing lots of updates to perform well, you
> need to let PostgreSQL have a decent size chunk of memory to buffer
> the index writes with, so it's more likely they'll get combined into
> larger and therefore more easily sorted blocks rather than as more
> random ones.  The randomness of the writes is why your write rate is
> so slow.  You also need to cut down on the frequency of checkpoints,
> which are very costly on this type of statement.
>
> Also: which version of PostgreSQL?  8.3 includes an improvement aimed
> at updates like this you might benefit from.
>
> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Sat, 27 Sep 2008, John Huttley wrote:
> I've got 32M shared on a 1G machine and 16 checkpoint segments.
> I'll run some tests against 64 segments and see what happens.

Increase shared_buffers to 256MB as well.  That combination should give
you much better performance with the type of update you're doing.  Right
now the database server has to write the updated index blocks to disk
all the time because it has so little working room to store them in.  If
an index block is updated but there is room to keep it in memory, it
doesn't have to get written out, which considerably lowers the overhead
here.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Fri, Sep 26, 2008 at 5:03 PM, John Huttley <John@mib-infotech.co.nz> wrote:
> Hi Andrew,
> There are two problems.
> The first is that if there is a table with an index and an update is
> performed on a non-indexed field, the index is still re-indexed.

I assume you mean updated, not reindexed, as reindexed has a different
meaning as regards postgresql.  Also, this is no longer true as of
version 8.3.  If you're updating non-indexed fields a lot and you're not
running 8.3 you are doing yourself a huge disservice.

> this is part of the trade-offs of MVCC.

was... was a part of the trade-offs.

> We should reasonably expect that the total amount of IO will go up,
> compared with a non-indexed table.
>
> The second thing is that the disk IO throughput goes way down.
>
> This is not an issue with MVCC, as such, except that it exposes the
> effect of a write to an indexed field.

It's really an effect of parallel updates / writes / accesses, and is
always an issue for a database running on a poor storage subsystem.  A
db with a two-drive mirror set is always going to be at a disadvantage
to one running on a dozen or so drives in a RAID-10.
On Sat, Sep 27, 2008 at 4:33 PM, John Huttley <John@mib-infotech.co.nz> wrote:
>>> this is part of the trade-offs of MVCC.
>>
>> was... was a part of the trade-offs.
>
> You are thinking of HOT?
> I don't think it applies in the case of full table updates??

Sure, you just need a table with plenty of empty space in it, either
from vacuumed previous deletes / inserts or with a low fill factor like
50%.

>> It's really an effect of parallel updates / writes / accesses, and is
>> always an issue for a database running on a poor storage subsystem.
>> A db with a two-drive mirror set is always going to be at a
>> disadvantage to one running on a dozen or so drives in a RAID-10
>
> Oh well, I'm forever going to be disadvantaged.

Why?  A decent caching raid controller and a set of 4 to 8 SATA drives
can make a world of difference, and the cost is not that high for the
gain in performance.  Even going to 4 drives in a software RAID-10 can
make a lot of difference in these situations, and that can be done with
spare machines and hard drives.
Scott Marlowe wrote:
> On Fri, Sep 26, 2008 at 5:03 PM, John Huttley <John@mib-infotech.co.nz> wrote:
>> Hi Andrew,
>> There are two problems.
>> The first is that if there is a table with an index and an update is
>> performed on a non-indexed field, the index is still re-indexed.
>
> I assume you mean updated, not reindexed, as reindexed has a different
> meaning as regards postgresql.  Also, this is no longer true as of
> version 8.3.  If you're updating non-indexed fields a lot and you're
> not running 8.3 you are doing yourself a huge disservice.
Yes sorry, I mean all indexes are updated even when the updated field is not indexed.
I'm running 8.3.3
>>> this is part of the trade-offs of MVCC.
>>
>> was... was a part of the trade-offs.

You are thinking of HOT?
I don't think it applies in the case of full table updates??
>> We should reasonably expect that the total amount of IO will go up,
>> compared with a non-indexed table.
>>
>> The second thing is that the disk IO throughput goes way down.
>>
>> This is not an issue with MVCC, as such, except that it exposes the
>> effect of a write to an indexed field.
>
> It's really an effect of parallel updates / writes / accesses, and is
> always an issue for a database running on a poor storage subsystem.  A
> db with a two-drive mirror set is always going to be at a disadvantage
> to one running on a dozen or so drives in a RAID-10

Oh well, I'm forever going to be disadvantaged.
John Huttley <John@mib-infotech.co.nz> writes:
> Scott Marlowe wrote:
>> was... was a part of the trade-offs.

> You are thinking of HOT?
> I don't think it applies in the case of full table updates??

Sure, as long as there's enough free space on each page.  If you wanted
to make a table that was optimized for this kind of thing, you could try
creating it with fillfactor 50.

			regards, tom lane
I have had great success using FILLFACTOR on certain tables where big updates like this occur, with significant performance gains. It is still not as fast as I would like, but the improvement is real. A big disk array won't help you as much as it should -- yes, it will be faster, but it will still be chugging during one of these sorts of large updates, and very inefficiently at that.
In some of my cases, a FILLFACTOR of 95 or 98 is enough to do the trick. In others, 80 or 70 works.
It depends on the size of your rows versus the size of the modifications you make. A fillfactor of 99 holds between ~80 bytes and one row-width worth of free space in every page, and is all that is needed if you have larger rows and only modify small fields such as ints. I'm not sure why FILLFACTOR = 99 isn't the default, to be honest. The size difference on disk is far less than 1% since most tables can't fit an exact number of rows in one page, and the benefit for updates is huge in certain cases.
On the other hand, your table has a narrow row width and will fit many rows on one page, and if you are modifying text or varchar fields, you may need more space reserved in the fillfactor void and a smaller FILLFACTOR setting on the table -- down to about 50 for updates where the updated values account for a big fraction of the row width.
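The sizing argument in the last two paragraphs is easy to check with back-of-the-envelope arithmetic against PostgreSQL's default 8 kB heap page (a throwaway query of my own, not tied to any table):

```sql
-- Free bytes reserved per 8192-byte heap page at various fillfactors
SELECT ff AS fillfactor,
       8192 * (100 - ff) / 100.0 AS reserved_bytes
FROM (VALUES (99), (95), (80), (50)) AS t(ff);
-- fillfactor 99 leaves ~82 bytes free per page; fillfactor 50 leaves 4096
```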
A second benefit of using a fillfactor is that you can CLUSTER on an index and the table will retain that ordering for longer while inserts/updates/deletes occur. A fillfactor setting, REINDEX, then CLUSTER sequence can have a big impact.
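Putting the fillfactor and CLUSTER pieces above together against the thread's table might look like the following. Note this is a sketch, not a tested recipe: ALTER TABLE ... SET (fillfactor) only affects pages written afterwards, so the CLUSTER rewrite is what makes it take effect for existing rows, and the CLUSTER syntax shown is the pre-8.4 form matching the 8.3 version discussed here.

```sql
-- Reserve half of each heap page for future in-page updates
ALTER TABLE file SET (fillfactor = 50);

-- Rewrite the table in index order; the rewritten pages honour the new
-- fillfactor, and the table retains the index ordering for longer
CLUSTER file_fp_idx ON file;
ANALYZE file;
```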
On Sun, Sep 28, 2008 at 7:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> John Huttley <John@mib-infotech.co.nz> writes:
>> Scott Marlowe wrote:
>>> was... was a part of the trade-offs.
>> You are thinking of HOT?
>> I don't think it applies in the case of full table updates??
Ahh! I've not dealt with that before. I'll look it up.
Thanks Tom.
Tom Lane wrote:
> John Huttley <John@mib-infotech.co.nz> writes:
>> You are thinking of HOT?
>> I don't think it applies in the case of full table updates??
>
> Sure, as long as there's enough free space on each page.  If you
> wanted to make a table that was optimized for this kind of thing, you
> could try creating it with fillfactor 50.
>
> 			regards, tom lane
Thanks to everyone that responded. I've done some benchmarking.

checkpoint_segments=16 is fine; going to 64 made no improvement.

Using "update file set size=99" as a statement, but changing 99 on each
run, with 32M shared memory, times in sec, and leaving the system idle
long enough between runs for auto vacuum to complete:

415
421
470

Then I decided to drop the db and restore from a dump:

1150
1500
1018
1071
1077
1140

Then I tried shared_mem=256M as suggested:

593
544

So that's made a big difference. vmstat showed a higher, more
consistent IO level.

I wondered why it slowed down after a restore. I thought it would
improve -- less fragmentation and all that. So I tried a reindex on all
three indexes:

209
228

So that's it! Lots of RAM, and reindex as part of standard operation.

Interestingly, the reindexing took about 16s each, and the update on the
table with no indexes took about 48 sec, so the aggregate time for each
step would be about 230s. I take that as an indicator that it is now
maximally efficient.

The option of having more spindles for improved IO request processing
isn't feasible in most cases. With the requirement for redundancy, we
end up with a lot of them, needing an external enclosure. They would
have to be expensive SCSI/SAS/FC drives too, since SATA just doesn't
have the IO processing. It will be interesting to see what happens when
good performing SSDs appear. Meanwhile, RAM is cheaper than that drive
array!

It would be nice if things like
* the effect of updates on indexed tables
* fill factor
* reindex after restore
were mentioned in the 'performance' section of the manual, since that's
the part someone will go to when looking for a solution.

Again, thanks to everyone,

--John
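The reindex step John settled on, spelled out against the indexes from the start of the thread (REINDEX TABLE also rebuilds the primary-key index; a sketch of the maintenance, not the exact commands he ran):

```sql
-- Rebuild every index on the table in one statement...
REINDEX TABLE file;

-- ...or one at a time, to spread the work out:
REINDEX INDEX file_jobid_idx;
REINDEX INDEX file_fp_idx;
REINDEX INDEX file_pkey;
```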
On Mon, 29 Sep 2008, John Huttley wrote:
> checkpoint_segments=16 is fine, going to 64 made no improvement.

You might find that it does *after* increasing shared_buffers.  If the
buffer cache is really small, the checkpoints can't have very much work
to do, so their impact on performance is smaller.  Once you've got a
couple of hundred MB on there, the per-checkpoint overhead can be
considerable.

> It would be nice if things like
> * the effect of updates on indexed tables
> * fill factor
> * reindex after restore
> were mentioned in the 'performance' section of the manual, since
> that's the part someone will go to when looking for a solution.

If you have to reindex after restore to get good performance, that means
what you should do instead is drop the indexes on the table during the
restore and then create them once the data is there.  The REINDEX is
more aimed at when the system has been running for a while and getting
fragmented.

Unfortunately most of the people who know enough about those topics to
really do a good treatment of them are too busy fixing slow systems to
have time to write about it.  There are many articles on this general
topic trickling out at
http://wiki.postgresql.org/wiki/Performance_Optimization
you might find valuable in addition to the manual.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
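Greg's drop-then-create advice for restores, sketched against the same table (the load step is a placeholder for however the data actually arrives; a full pg_dump restore already orders things this way, which is why the slowdown points at a data-only restore into an existing schema):

```sql
DROP INDEX file_fp_idx;
DROP INDEX file_jobid_idx;

-- ... bulk load here, e.g. via COPY ...

CREATE INDEX file_fp_idx ON file USING btree (filenameid, pathid);
CREATE INDEX file_jobid_idx ON file USING btree (jobid);
ANALYZE file;
```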
Greg Smith wrote:
> On Mon, 29 Sep 2008, John Huttley wrote:
>
>> checkpoint_segments=16 is fine, going to 64 made no improvement.
>
> You might find that it does *after* increasing shared_buffers.  If the
> buffer cache is really small, the checkpoints can't have very much
> work to do, so their impact on performance is smaller.  Once you've
> got a couple of hundred MB on there, the per-checkpoint overhead can
> be considerable.

Ahh bugger, I've just trashed my test setup.

I've settled on 64Mb shared memory since I've only got 1Gb of RAM and
the system impact of 256M is severe. Also it uses FB-DIMMs, which cost
arm+leg+first born.

>> It would be nice if things like
>> * the effect of updates on indexed tables
>> * fill factor
>> * reindex after restore
>> were mentioned in the 'performance' section of the manual, since
>> that's the part someone will go to when looking for a solution.
>
> If you have to reindex after restore to get good performance, that
> means what you should do instead is drop the indexes on the table
> during the restore and then create them once the data is there.  The
> REINDEX is more aimed at when the system has been running for a while
> and getting fragmented.

I thought that the pg_dump generated files did that, so I dismissed it
initially. Maybe I did a data-only restore into an existing schema..

> Unfortunately most of the people who know enough about those topics to
> really do a good treatment of them are too busy fixing slow systems to
> have time to write about it.  There are many articles on this general
> topic trickling out at
> http://wiki.postgresql.org/wiki/Performance_Optimization you might
> find valuable in addition to the manual.

And of course this is now in the mail archive!

> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Sun, Sep 28, 2008 at 8:01 PM, John Huttley <John@mib-infotech.co.nz> wrote:
> Ahh bugger, I've just trashed my test setup.
> I've settled on 64Mb shared memory since I've only got 1Gb of RAM and
> the system impact of 256M is severe.
> Also it uses FB-DIMMs which cost arm+leg+first born

http://www.crucial.com/search/searchresults.aspx?keywords=buffered

Fully buffered memory there is $56.99 for a 1 Gig stick.  That's hardly
an arm and a leg.  Considering many pgsql DBAs make that in 1 to 3
hours, it's not much at all really.  A lot cheaper than pulling your
hair out trying to make a db server run on 1 Gig.
Ah yess... actually I can get the Kingston stuff locally.
However at the moment I'm happily married and want to keep it that way!
Everything is in pairs, too. Actually it's a lot cheaper than when it
first came out, but still a lot more than your corner-shop DDR-2 stuff.
--John
Scott Marlowe wrote:
On Sep 28, 2008, at 10:01 PM, John Huttley wrote:
> Greg Smith wrote:
>> On Mon, 29 Sep 2008, John Huttley wrote:
>>
>>> checkpoint_segments=16 is fine, going to 64 made no improvement.
>>
>> You might find that it does *after* increasing shared_buffers.  If
>> the buffer cache is really small, the checkpoints can't have very
>> much work to do, so their impact on performance is smaller.  Once
>> you've got a couple of hundred MB on there, the per-checkpoint
>> overhead can be considerable.
>
> Ahh bugger, I've just trashed my test setup.

Pardon?  How did you do that?

--
Dan Langille
http://langille.org/
I've canned the db and got rid of my data. I'm in the midst of doing
some other benchmarking for a possible change to the bacula database:
loading up 1M records into a table of 60M records, complete with
indexes.

It's still going...

--john

Dan Langille wrote:
> On Sep 28, 2008, at 10:01 PM, John Huttley wrote:
>
>> Greg Smith wrote:
>>> On Mon, 29 Sep 2008, John Huttley wrote:
>>>
>>>> checkpoint_segments=16 is fine, going to 64 made no improvement.
>>>
>>> You might find that it does *after* increasing shared_buffers.  If
>>> the buffer cache is really small, the checkpoints can't have very
>>> much work to do, so their impact on performance is smaller.  Once
>>> you've got a couple of hundred MB on there, the per-checkpoint
>>> overhead can be considerable.
>>
>> Ahh bugger, I've just trashed my test setup.
>
> Pardon?  How did you do that?
On Sun, Sep 28, 2008 at 9:08 PM, John Huttley <John@mib-infotech.co.nz> wrote:
> Ah yess... actually I can get the Kingston stuff locally.
> However at the moment I'm happily married and want to keep it that way!
>
> Everything is in pairs too. Actually it's a lot cheaper than when it
> first came out, but still a lot more than your corner-shop DDR-2 stuff.

I don't mean to keep arguing here, but it's not any more expensive than
the same speed DDR-2 667MHz memory.  For ECC memory, they're almost the
same price.

http://www.crucial.com/store/listparts.aspx?model=PowerEdge%201950

That's memory for my Dell PowerEdge web server, and it's $105.99 for 2
1 Gig sticks.  $56.99 * 2 = $113.98.  It's only $7.99 more.

I get the point about not wanting to anger the wife, but maybe if you
ask for it nice for Christmas? :)