High inserts, bulk deletes - autovacuum vs scheduled vacuum
I am developing an application that has very predictable database operations:

- inserts several thousand rows into 3 tables every 5 minutes (the tables contain around 10 million rows each)
- truncates and rebuilds aggregate tables of this data every 5 minutes (several thousand rows each)
- regular reads of the aggregate tables, and sometimes the large tables, driven by user interaction
- every night, hundreds of thousands of rows are deleted from these 3 tables (old data)
- 20-30 other tables get inserted/updated slowly throughout the day

In order to optimize insert performance, I disabled autovacuum/row-level stats and instead run "vacuum analyze" on the whole DB every hour. However, this operation takes around 20 minutes of each hour, which means the database spends 33% of its time vacuuming/analyzing.

I'd like any performance advice, but my main concern is the amount of time vacuum/analyze runs and its possible impact on overall DB performance. Thanks!

I am running 8.2 (will be 8.2.1 soon). The box is Windows with 2GB RAM connected to a SAN over fiber. The data and pg_xlog are on separate partitions.

Modified configuration:
effective_cache_size = 1000MB
random_page_cost = 3
default_statistics_target = 50
maintenance_work_mem = 256MB
shared_buffers = 400MB
temp_buffers = 10MB
work_mem = 10MB
max_fsm_pages = 1500000
checkpoint_segments = 30
stats_row_level = off
stats_start_collector = off
* Jeremy Haile:

> I'd like any performance advice, but my main concern is the amount of
> time vacuum/analyze runs and its possible impact on the overall DB
> performance. Thanks!

You could partition your data tables by date and discard old data simply by dropping the tables. This is far more effective than vacuuming, but obviously this approach cannot be used in all cases (e.g. if you need more dynamic expiry rules).

-- 
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH        http://www.bfk.de/
Kriegsstraße 100               tel: +49-721-96201-1
D-76133 Karlsruhe              fax: +49-721-96201-99
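[Editor's note: on 8.2, Florian's suggestion means inheritance-based partitioning with CHECK constraints and constraint exclusion; declarative partitioning did not exist yet. A minimal sketch, assuming a hypothetical `events` table partitioned by day — all table and column names here are illustrative, not from the thread:]

```sql
-- Parent table holds no rows itself; each child carries a CHECK
-- constraint so the planner can skip irrelevant partitions.
CREATE TABLE events (
    event_time  timestamp NOT NULL,
    payload     text
);

CREATE TABLE events_20070109 (
    CHECK (event_time >= '2007-01-09' AND event_time < '2007-01-10')
) INHERITS (events);

-- Inserts are routed to the current partition by the application
-- (or by a rule/trigger on the parent):
INSERT INTO events_20070109 VALUES (now(), 'some data');

-- Let the planner use the CHECK constraints to prune partitions
-- (off by default in 8.2):
SET constraint_exclusion = on;

-- Expiring a day of old data becomes a cheap catalog operation
-- instead of a bulk DELETE followed by a vacuum:
DROP TABLE events_20070101;
```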
Good advice on the partitioning idea. I may have to restructure some of my queries, since some of them query across the whole range - but it may be a much more performant solution. How is the performance when querying across a set of partitioned tables vs. querying a single table with all rows?

This may be a long-term idea I could tackle, but is probably not feasible for my current time-frame. Does my current approach of disabling autovacuum and manually vacuuming once an hour sound like a good idea, or would I likely have better results by auto-vacuuming and turning row-level stats back on?

On Tue, 09 Jan 2007 19:02:25 +0100, "Florian Weimer" <fweimer@bfk.de> said:
> You could partition your data tables by date and discard old data
> simply by dropping the tables. This is far more effective than
> vacuuming, but obviously, this approach cannot be used in all cases
> (e.g. if you need more dynamic expiry rules).
* Jeremy Haile:

> Good advice on the partitioning idea. I may have to restructure some of
> my queries, since some of them query across the whole range - but it may
> be a much more performant solution. How is the performance when
> querying across a set of partitioned tables vs. querying a single
> table with all rows?

Locality of access decreases, of course, and depending on your data size, you pay something like 2 to 4 additional disk seeks per partition for index-based accesses. Sequential scans are not impacted.

> Does my current approach of disabling autovacuum and manually vacuuming
> once an hour sound like a good idea, or would I likely have better
> results by auto-vacuuming and turning row-level stats back on?

Sorry, I haven't got much experience with autovacuum, since most of our databases are INSERT-only (or get VACUUMed automatically after major updates).

-- 
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH        http://www.bfk.de/
Kriegsstraße 100               tel: +49-721-96201-1
D-76133 Karlsruhe              fax: +49-721-96201-99
On Tue, Jan 09, 2007 at 12:26:41PM -0500, Jeremy Haile wrote:
> I am developing an application that has very predictable database
> operations:
> - inserts several thousand rows into 3 tables every 5 minutes (the
>   tables contain around 10 million rows each)
> - truncates and rebuilds aggregate tables of this data every 5 minutes
>   (several thousand rows each)
> - regular reads of the aggregate tables and sometimes the large tables
>   by user interaction
> - every night, hundreds of thousands of rows are deleted from these 3
>   tables (old data)
> - 20-30 other tables get inserted/updated slowly throughout the day
>
> In order to optimize insert performance, I disabled autovacuum/row-level
> stats and instead run "vacuum analyze" on the whole DB every hour.
> However, this operation takes around 20 minutes of each hour. This means
> that the database is involved in vacuuming/analyzing tables 33% of the
> time.
>
> I'd like any performance advice, but my main concern is the amount of
> time vacuum/analyze runs and its possible impact on the overall DB
> performance. Thanks!

If much of the data in the database isn't changing that often, then why continually re-vacuum the whole thing?

I'd suggest trying autovacuum and seeing how it does (though you might want to tune it to be more or less aggressive, and you'll probably want to enable the cost delay).

The only cases where manual vacuum makes sense to me are if you've got a defined slow period and vacuuming during that slow period is still frequent enough to keep up with demand, or if you've got tables that have a very high churn rate and need to be kept small. In the latter case, I'll usually set up a cronjob to vacuum those tables once a minute with no cost delay. I'm sure there might be some other cases where not using autovac makes sense, but generally I'd much rather let autovac worry about this so I don't have to.

> I am running 8.2 (will be 8.2.1 soon). The box is Windows with 2GB RAM
> connected to a SAN over fiber.
> The data and pg_xlog are on separate partitions.
>
> Modified configuration:
> effective_cache_size = 1000MB
> random_page_cost = 3
> default_statistics_target = 50
> maintenance_work_mem = 256MB
> shared_buffers = 400MB
> temp_buffers = 10MB
> work_mem = 10MB
> max_fsm_pages = 1500000

One other useful manual vacuum to consider is running vacuumdb -av periodically (say, once a month) and looking at the last few lines of output. That will give you a good idea of how large you should set max_fsm_pages. Running the output of vacuumdb -av through pgFouine will give you other useful data.

> checkpoint_segments = 30
> stats_row_level = off
> stats_start_collector = off

Unless you're really trying to get the last ounce of performance out, it's probably not worth turning those stats settings off.

-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
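[Editor's note: the check Jim describes might look something like the sketch below. On 8.2, a verbose database-wide vacuum prints the current free space map usage in its final lines, which you compare against max_fsm_pages; the exact figures depend on your cluster:]

```shell
# Verbose vacuum of every database in the cluster; the trailing lines
# report how many FSM page slots are in use vs. max_fsm_pages.
vacuumdb --all --verbose 2>&1 | tail -n 20
```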
Please cc the list so others can help and learn.

On Wed, Jan 10, 2007 at 03:43:00PM -0500, Jeremy Haile wrote:
> > I'd suggest trying autovacuum and see how it does (though you might want
> > to tune it to be more or less aggressive, and you'll probably want to
> > enable the cost delay).
>
> What are some decent default values for the cost delay vacuum settings?
> I haven't used these before.

I find that simply setting vacuum_cost_delay to 20 is generally a good starting point. I'll usually do that and then run a vacuum while watching disk activity; I try to tune it so that the disk is ~90% utilized with vacuum running. That allows a safety margin without stretching vacuums out forever.

> Also - do the default autovacuum settings make sense for tables on the
> scale of 10 million rows? For example, using the defaults it would
> require about a million rows (250 + 0.1 * 10 million) to be
> inserted/updated/deleted before analyzing - which seems high. (about 2
> million for vacuum) Or am I overestimating how often I would need to
> vacuum/analyze these tables?

Depends on your application... the way I look at it is that a setting of 0.1 means 10% dead space in the table. While 5% or 1% would be better, you hit a point of diminishing returns, since you have to read the entire table and its indexes to vacuum it.

BTW, those are the default values for analyze... the defaults for vacuum are 2x that.

> Do most people use the default autovacuum settings successfully, or are
> they usually modified?

I generally use the 8.2 defaults (which are much better than the 8.1 defaults) unless I'm really trying to tune things. What's more important is to make sure critical tables (such as queue tables) are getting vacuumed frequently so that they stay small. (Of course, you also need to ensure there are no long-running transactions.)

-- 
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net
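[Editor's note: to make the threshold arithmetic above concrete — with the shipped defaults, a 10-million-row table is analyzed after 250 + 0.1 × 10,000,000 = 1,000,250 changed rows, and vacuumed after 500 + 0.2 × 10,000,000 = 2,000,500. One way to tighten this is to lower the global scale factors in postgresql.conf; the values below are illustrative starting points only, not recommendations from the thread:]

```
# postgresql.conf -- illustrative values, tune against your workload
autovacuum = on
vacuum_cost_delay = 20                  # ms pause per cost_limit of work
autovacuum_vacuum_scale_factor = 0.05   # vacuum after ~5% of rows change
autovacuum_analyze_scale_factor = 0.02  # analyze after ~2% of rows change
```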
> BTW, those are the default values for analyze... the defaults for vacuum
> are 2x that.

Yeah - I was actually more concerned that tables would need to be analyzed more often than I was about vacuuming too often, so I used analyze as the example. Since my app is inserting constantly throughout the day and querying for "recent" data, I want to make sure the query planner realizes that there are lots of rows with new timestamps on them. In other words, if I run a query "select * from mytable where timestamp > '9:00am'", I want to make sure it hasn't been a day since the table was analyzed - otherwise the planner will think there are zero rows greater than 9:00am today.

> What's more important
> is to make sure critical tables (such as queue tables) are getting
> vacuumed frequently so that they stay small.

Is the best way to do that usually to lower the scale factors? And is it ever a good approach to lower the scale factor to zero and just set the thresholds to a pure number of rows (when setting it for a specific table)?

Thanks,
Jeremy Haile
On Wed, Jan 10, 2007 at 04:48:42PM -0500, Jeremy Haile wrote:
> Yeah - I was actually more concerned that tables would need to be
> analyzed more often than I was about vacuuming too often, so I used
> analyze as the example. Since my app is inserting constantly throughout
> the day and querying for "recent" data, I want to make sure the query
> planner realizes that there are lots of rows with new timestamps on
> them.

Well, analyze is pretty cheap. At most it'll read only 30,000 pages, which shouldn't take terribly long on a decent system. So you can be a lot more aggressive with it.

> Is the best way to do that usually to lower the scale factors? Is it
> ever a good approach to lower the scale factor to zero and just set the
> thresholds to a pure number of rows? (when setting it for a specific
> table)

The problem is what happens if autovac goes off and starts vacuuming some large table. While that's going on, your queue table is sitting there bloating. If you have a separate cronjob to handle the queue table, it'll stay small, especially in 8.2.

-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
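[Editor's note: the separate cronjob Jim describes can be a single crontab entry; `vacuumdb -t` vacuums just the named table, so the big tables never get in its way. Table and database names here are hypothetical:]

```
# m  h  dom mon dow  command
*    *  *   *   *    vacuumdb --quiet --table=queue_table mydb
```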
Jim C. Nasby wrote:
> > Is the best way to do that usually to lower the scale factors? Is it
> > ever a good approach to lower the scale factor to zero and just set the
> > thresholds to a pure number of rows? (when setting it for a specific
> > table)
>
> The problem is what happens if autovac goes off and starts vacuuming
> some large table. While that's going on, your queue table is sitting
> there bloating. If you have a separate cronjob to handle the queue
> table, it'll stay small, especially in 8.2.

You mean "at least in 8.2". In previous releases, you could vacuum that queue table until you were blue in the face, but it would achieve nothing, because vacuum would consider the dead tuples still visible to a running transaction: the one running the vacuum on the large table. This is an annoyance that was fixed in 8.2.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Thu, Jan 11, 2007 at 12:10:34AM -0300, Alvaro Herrera wrote:
> You mean "at least in 8.2". In previous releases, you could vacuum
> that queue table until you were blue in the face, but it would achieve
> nothing, because vacuum would consider the dead tuples still visible
> to a running transaction: the one running the vacuum on the large
> table. This is an annoyance that was fixed in 8.2.

True, but in many environments there are other transactions that run long enough that additional vacuums during a long vacuum would still help.

-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)