Discussion: optimization ideas for frequent, large(ish) updates in frequently accessed DB?
From:
"Marinos J. Yannikos"
Date:
Hi,

one of our tables has to be updated frequently, but concurrently running SELECT queries must also have low latency (it's being accessed through a web interface). I'm looking for ideas that might improve the interactive performance of the system without slowing down the updates too much. Here are the characteristics of the table and its use:

- approx. 2 million rows
- approx. 4-5 million rows per day are replaced in short bursts of 1-200k rows (average ~3000 rows per update)
- the table needs 6 indexes (not all indexes are used all the time, but keeping them all the time slows the system down less than re-creating some of them just before they're needed and dropping them afterwards)
- an "update" means that 1-200k rows with a common value in a particular field are replaced with an arbitrary number of new rows (with the same value in that field), i.e.:

begin transaction;
  delete from t where id=5;
  insert into t (id,...) values (5,...);
  ... [1-200k rows]
end;

The problem is that a large update of this kind can delay SELECT queries running in parallel for several seconds, so the web interface used by several people will be unusable for a short while. Currently, I'm using temporary tables:

create table xyz as select * from t limit 0;
insert into xyz ...
...
begin transaction;
  delete from t where id=5;
  insert into t select * from xyz;
end;
drop table xyz;

This is slightly faster than inserting directly into t (and probably faster than using COPY, even though using that might reduce the overall load on the database). What other possibilities are there, other than splitting up the 15 columns of that table into several smaller tables, which is something I'd like to avoid? Would replication help? (I doubt it, but haven't tried it yet...)
Writing to one table (without indexes), then periodically creating indexes on it and renaming it to the "read table" in a double-buffering-like fashion, wouldn't work either(?), since the views and triggers would have to be re-created every time as well, and other problems might arise. The postgresql.conf options are already reasonably tweaked for performance (IMHO), but perhaps some settings are particularly critical:

shared_buffers=100000 (I tried many values, this seems to work well for us - 12GB RAM)
wal_buffers=500
sort_mem=800000
checkpoint_segments=16
effective_cache_size=1000000
etc.

Any help/suggestions would be greatly appreciated... Even if it's something like "you need a faster db box, there's no other way" ;-)

Regards,
 Marinos
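The staging approach described above can also be fed with COPY instead of row-by-row INSERTs; a minimal sketch (the data file path is an assumption, and the staging table mirrors the example):

```sql
-- Hypothetical variant of the temp-table approach using COPY,
-- which avoids per-statement parse/plan overhead for the bulk load.
CREATE TABLE xyz AS SELECT * FROM t LIMIT 0;
COPY xyz FROM '/tmp/batch.dat';   -- or COPY ... FROM STDIN via the client

BEGIN;
DELETE FROM t WHERE id = 5;
INSERT INTO t SELECT * FROM xyz;
COMMIT;

DROP TABLE xyz;
```

The short transaction at the end keeps the window during which concurrent SELECTs see lock or I/O contention as small as possible.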
Re: optimization ideas for frequent, large(ish) updates in frequently accessed DB?
From:
Josh Berkus
Date:
Marinos,

> shared_buffers=100000
> (I tried many values, this seems to work well for us - 12GB RAM)
> wal_buffers=500
> sort_mem=800000
> checkpoint_segments=16
> effective_cache_size=1000000
> etc.

800MB for sort mem? Are you sure you typed that correctly? You must be counting on not having a lot of concurrent queries. It sure will speed up index updating, though!

I think you might do well to experiment with the checkpoint_delay and checkpoint_sibilings settings in order to get more efficient batch processing of updates while selects are going on. I would also suggest increasing checkpoint segments as much as your disk space will allow; one reporting database I run that does batch loads is using 128 (which is about a gig of disk, I think).

What have you set max_fsm_relations and max_fsm_pages to? The latter should be very high for you, like 10,000,000.

For that matter, what *version* of PostgreSQL are you running?

Also, make sure that your tables get vacuumed regularly.

> Any help/suggestions would be greatly appreciated... Even if it's
> something like "you need a faster db box, there's no other way" ;-)

Well, a battery-backed RAID controller with a fast cache would certainly help.

You'll also be glad to know that a *lot* of the improvements in the upcoming PostgreSQL 7.5 are aimed at giving better performance on large, high-activity databases like yours.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
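The settings asked about above live in postgresql.conf; a hypothetical fragment sized along these suggestions (values illustrative, 7.4-era parameter names):

```
# postgresql.conf (7.4-era) -- free space map and WAL sizing.
# Values below are illustrative, following the advice in this thread.
max_fsm_relations = 10000       # distinct relations tracked in the FSM
max_fsm_pages = 10000000        # should cover pages freed between vacuums
checkpoint_segments = 128       # roughly 1 GB of WAL segments on disk
```

A larger free space map lets routine VACUUM reclaim the space churned by the delete/insert batches instead of requiring VACUUM FULL.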
Re: optimization ideas for frequent, large(ish) updates in frequently accessed DB?
From:
Manfred Koizar
Date:
On Fri, 13 Feb 2004 01:58:34 +0100, "Marinos J. Yannikos" <mjy@geizhals.at> wrote:

>I'm looking for ideas that might improve the interactive performance of
>the system, without slowing down the updates too much.

IOW, you could accept slower updates. Did you actually try and throttle down the insert rate?

> Here are the
>characteristics of the table and its use:
>
>- approx. 2 million rows

Doesn't sound worrying. What's the min/max/average size of these rows? How large is this table?

	SELECT relpages FROM pg_class WHERE relname='...';

What else is in this database, how many tables, how large is the database (du $PGDATA)?

>- approx. 4-5 million rows per day are replaced in short bursts of
>1-200k rows (average ~3000 rows per update)

How often do you VACUUM [ANALYSE]?

>- the table needs 6 indexes (not all indexes are used all the time, but
>keeping them all the time slows the system down less than re-creating
>some of them just before they're needed and dropping them afterwards)

I agree.

>- an "update" means that 1-200k rows with a common value in a particular
>field are replaced with an arbitrary number of new rows (with the same
>value in that field), i.e.:
>
>begin transaction;
> delete from t where id=5;
> insert into t (id,...) values (5,...);
> ... [1-200k rows]
>end;

This is a wide variation in the number of rows. You told us the average batch size is 3000. Is this also a *typical* batch size? And what is the number of rows where you start to get the feeling that it slows down other sessions?

Where do the new values come from? I don't think they are typed in :-) Do they come from external sources or from the same database? If the latter, INSERT INTO ... SELECT ... might help.

>The problem is, that a large update of this kind can delay SELECT
>queries running in parallel for several seconds, so the web interface
>used by several people will be unusable for a short while.
Silly question: By SELECT you mean pure SELECT transactions and not some transaction that *mostly* reads from the database? I.e., you are sure your SELECT queries are slowed down and not blocked by the "updates".

Show us the EXPLAIN ANALYSE output for the same SELECT, once when it is fast and once when it is slow. BTW, what is fast and what is slow?

>Currently, I'm using temporary tables:
> [...]
>This is slightly faster than inserting directly into t (and probably
>faster than using COPY, even though using that might reduce the overall
>load on the database).

You might try using a prepared INSERT statement or COPY.

>shared_buffers=100000
>(I tried many values, this seems to work well for us - 12GB RAM)
>wal_buffers=500
>sort_mem=800000
>checkpoint_segments=16
>effective_cache_size=1000000

See Josh's comments.

>Any help/suggestions would be greatly appreciated... Even if it's
>something like "you need a faster db box, there's no other way" ;-)

We have to find out what the bottleneck is. Tell us about your environment (hardware, OS, ...). Run top and/or vmstat and look for significant differences between times of normal processing and slow phases. Post top/vmstat output here if you need help.

Servus
 Manfred
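A prepared INSERT, as suggested above, could look roughly like this (PostgreSQL 7.3+ syntax; the two-column layout of t is an assumption for illustration):

```sql
-- Parse and plan the INSERT once, then execute it once per row.
PREPARE ins_t (int, text) AS
    INSERT INTO t (id, val) VALUES ($1, $2);

BEGIN;
DELETE FROM t WHERE id = 5;
EXECUTE ins_t(5, 'first replacement row');
EXECUTE ins_t(5, 'second replacement row');
-- ... one EXECUTE per replacement row ...
COMMIT;

DEALLOCATE ins_t;
```

For very large batches, COPY is usually faster still, since it also avoids one round trip per row.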
Re: optimization ideas for frequent, large(ish) updates in frequently accessed DB?
From:
Manfred Koizar
Date:
Marinos, while you are busy answering my first set of questions :-), here is an idea that might help even out resource consumption.

On Fri, 13 Feb 2004 01:58:34 +0100, "Marinos J. Yannikos" <mjy@geizhals.at> wrote:

>begin transaction;
> delete from t where id=5;
> insert into t (id,...) values (5,...);
> ... [1-200k rows]
>end;
>
>The problem is, that a large update of this kind can delay SELECT
>queries running in parallel for several seconds, so the web interface
>used by several people will be unusable for a short while.

	CREATE TABLE idmap (
		internalid int NOT NULL PRIMARY KEY,
		visibleid int NOT NULL,
		active bool NOT NULL
	);
	CREATE INDEX idmap_visible ON idmap(visibleid);

Populate this table with

	INSERT INTO idmap
	SELECT id, id, true
	FROM t;

Change

	SELECT ...
	FROM t
	WHERE t.id = 5;

to

	SELECT ...
	FROM t INNER JOIN idmap ON (idmap.internalid = t.id AND idmap.active)
	WHERE idmap.visibleid = 5;

When you have to replace the rows in t for id=5, start by

	INSERT INTO idmap VALUES (12345, 5, false);

Then repeatedly

	INSERT INTO t (id, ...) VALUES (12345, ...);

at a rate as slow as you can accept. You don't have to wrap all INSERTs into a single transaction, but batching together a few hundred to a few thousand INSERTs will improve performance.

When all the new values are in the database, you switch to the new id in one short transaction:

	BEGIN;
	UPDATE idmap SET active = false WHERE visibleid = 5 AND active;
	UPDATE idmap SET active = true WHERE internalid = 12345;
	COMMIT;

Do the cleanup in off-peak hours (pseudocode):

	FOR delid IN (SELECT internalid FROM idmap WHERE NOT active)
	BEGIN
		DELETE FROM t WHERE id = delid;
		DELETE FROM idmap WHERE internalid = delid;
	END;
	VACUUM ANALYSE t;
	VACUUM ANALYSE idmap;

To prevent this cleanup from interfering with INSERTs in progress, you might want to add a "beinginserted" flag to idmap.

HTH.
Servus
 Manfred
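The cleanup pseudocode above could be written as a PL/pgSQL function, roughly like this (untested sketch; assumes the idmap schema from this message, and that PL/pgSQL is installed in the database):

```sql
-- Sketch of the off-peak cleanup as a PL/pgSQL function
-- (7.x-style quoted body; run CREATE LANGUAGE plpgsql first if needed).
CREATE FUNCTION cleanup_idmap() RETURNS integer AS '
DECLARE
    r RECORD;
    n integer := 0;
BEGIN
    FOR r IN SELECT internalid FROM idmap WHERE NOT active LOOP
        DELETE FROM t WHERE id = r.internalid;
        DELETE FROM idmap WHERE internalid = r.internalid;
        n := n + 1;
    END LOOP;
    RETURN n;   -- number of inactive ids removed
END;
' LANGUAGE plpgsql;
```

The VACUUM ANALYSE steps have to stay outside the function, since VACUUM cannot run inside a transaction block.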
Re: optimization ideas for frequent, large(ish) updates in frequently accessed DB?
From:
Manfred Koizar
Date:
On Fri, 13 Feb 2004 16:21:29 +0100, I wrote:

>Populate this table with
>	INSERT INTO idmap
>	SELECT id, id, true
>	FROM t;

This should be

	INSERT INTO idmap
	SELECT DISTINCT id, id, true
	FROM t;

Servus
 Manfred
Josh Berkus wrote:

> 800MB for sort mem? Are you sure you typed that correctly? You must be
> counting on not having a lot of concurrent queries. It sure will speed up
> index updating, though!

800MB is correct, yes... There are usually only 10-30 postgres processes active (imagine 5-10 people working on the web front-end while cron jobs access the db occasionally). Very few queries can use such large amounts of memory for sorting, but they do exist.

> I think you might do well to experiment with using the checkpoint_delay and
> checkpoint_sibilings settings in order to get more efficient batch processing
> of updates while selects are going on.

[commit_*?] I thought that could improve only concurrent transactions...

> What have you set max_fsm_relations and max_fsm_pages to? The latter should
> be very high for you, like 10,000,000

Good guess ;-) The former is set to 10,000 (I'm not sure how useful this is for those temporary tables).

> For that matter, what *version* of PostgreSQL are you running?

7.4.1

> Also, make sure that your tables get vacuumed regularly.

There is a noticeable difference between a properly vacuumed db (nightly "vacuum full") and a non-vacuumed one, and people will start complaining immediately if something goes wrong there...

> Well, a battery-backed RAID controller with a fast cache would certainly help.

http://www.lsilogic.com/products/ultra320_scsi_megaraid_storage_adapters/320x4128t.html
(RAID-5 with 9 15k rpm drives; in hindsight, perhaps we should have tried a 0+1)

> You'll also be glad to know that a *lot* of the improvements in the upcoming
> PostgreSQL 7.5 are aimed at giving better performance on large, high-activity
> databases like yours.

That's good to hear...

Regards,
 Marinos
> 800MB is correct, yes... There are usually only 10-30 postgres processes
> active (imagine 5-10 people working on the web front-end while cron
> jobs access the db occasionally). Very few queries can use such large
> amounts of memory for sorting, but they do exist.

But remember that this means that if you have 4 people doing 2 sorts each at the same time, postgres will use 6.4GB RAM maximum. The sort_mem parameter means that if a sort is larger than the max, it will spill to disk.

Chris
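Chris's arithmetic, spelled out (decimal gigabytes; sort_mem is specified in kilobytes in postgresql.conf):

```python
# Worst case: every concurrent sort may use up to sort_mem kilobytes.
sort_mem_kb = 800_000   # the setting from this thread (~800 MB)
people = 4
sorts_each = 2

worst_case_gb = sort_mem_kb * people * sorts_each / 1_000_000
print(worst_case_gb)  # 6.4
```

With 12 GB of RAM, a burst of a few such queries at once could exhaust memory, which is why the later replies suggest a lower default plus a per-session override.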
On Feb 14, 2004, at 9:02 PM, Marinos J. Yannikos wrote:

> Josh Berkus wrote:
>
>> 800MB for sort mem? Are you sure you typed that correctly? You
>> must be counting on not having a lot of concurrent queries. It sure
>> will speed up index updating, though!
>
> 800MB is correct, yes... There are usually only 10-30 postgres
> processes active (imagine 5-10 people working on the web front-end
> while cron jobs access the db occasionally). Very few queries can use
> such large amounts of memory for sorting, but they do exist.

Remember that it is going to allocate 800MB per sort. It is not "you can allocate up to 800MB, so if you need 1 meg, use one meg". Some queries may end up having a few sort steps.

In terms of sort mem it is best to set a system default to a nice good value for most queries. and then in your reporting queries or other ones set sort_mem for that session (set sort_mem = 800000) then only that session will use the looney sort_mem

It would be interesting to know if your machine is swapping.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/
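The per-session override described above is just a SET at the start of the reporting session, e.g.:

```sql
-- Raise sort_mem only for this session's big reporting query
-- (value is in kB; sort_mem is the 7.4-era name, renamed work_mem in 8.0).
SET sort_mem = 800000;
-- ... run the large reporting query here ...
RESET sort_mem;   -- back to the postgresql.conf default
```

This keeps the system-wide default conservative while still letting the occasional big sort stay in memory.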
Jeff Trout wrote:

> Remember that it is going to allocate 800MB per sort. It is not "you
> can allocate up to 800MB, so if you need 1 meg, use one meg". Some
> queries may end up having a few sort steps.

I didn't know that it always allocates the full amount of memory specified in the configuration (e.g. the annotated configuration guide says: "Note that for a complex query, several sorts might be running in parallel, and each one _will be allowed to use_ as much memory as this value specifies before it starts to put data into temporary files."). The individual postgres processes don't look like they're using the full amount either (but that could be because the memory isn't written to).

> In terms of sort mem it is best to set a system default to a nice good
> value for most queries. and then in your reporting queries or other
> ones set sort_mem for that session (set sort_mem = 800000) then only
> that session will use the looney sort_mem

Queries from the web front-end use up to ~130MB sort memory (according to pgsql_tmp), so I set this to 150MB - thanks.

> It would be interesting to know if your machine is swapping.

It's not being monitored closely (other than with the occasional "top"), but it's highly unlikely:

Mem:  12441864k total, 10860648k used,  1581216k free,    84552k buffers
Swap:  4008176k total,     2828k used,  4005348k free,  9762628k cached

(that's a typical situation - the "2828k used" are probably some rarely used processes that have lower priority than the cache...)

Regards,
 Marinos
"Marinos J. Yannikos" <mjy@geizhals.at> writes: > Jeff Trout wrote: >> Remember that it is going to allocate 800MB per sort. > I didn't know that it always allocates the full amount of memory > specificed in the configuration It doesn't ... but it could use *up to* that much before starting to spill to disk. If you are certain your sorts won't use that much, then you could set the limit lower, hm? Also keep in mind that sort_mem controls hash table size as well as sort size. The hashtable code is not nearly as accurate as the sort code about honoring the specified limit exactly. So you really oughta figure that you could need some multiple of sort_mem per active backend. regards, tom lane