Discussion: Turning off transactions completely.
Hi. I'm writing a very specialised app which has absolutely NO need for
transactions. However, I have to milk every drop of performance out of it
because I'm performing some fairly complex queries on tens of millions of
rows.

Is there any way I can completely avoid the overhead of the implicit
transaction per SQL statement? Is there some switch, or some (fairly easy)
editing that can be done to the code? My transaction log files often reach
800 MB in size and I'd like to avoid the overhead if at all possible...

Thanks in advance.

--Arsalan.

-------------------------------------------------------------------
People often hate those things which they do not know, or cannot understand.
--Ali Ibn Abi Talib (AS)
On Sat, Jan 05, 2002 at 02:21:44PM +0530, Arsalan Zaidi wrote:
> Is there any way I can completely avoid the overhead of the implicit
> transaction per SQL statement? Is there some switch, or some (fairly easy)
> editing that can be done to the code? My transaction log files often reach
> 800 MB in size and I'd like to avoid the overhead if at all possible...

Well, every statement must be within a transaction, so to reduce the number
of transactions, start one explicitly and commit when you're done. Then you
only have one transaction for the whole thing.

HTH,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Terrorists can only take my life. Only my government can take my freedom.
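[Editor's note: the one-explicit-transaction advice can be sketched with Python's stdlib sqlite3 as a stand-in for any transactional database; this is not the OP's Java/JDBC code, and the table name is made up. The point is that per-statement commits pay the commit cost N times, while one explicit transaction pays it once.]

```python
import sqlite3

# Autocommit mode: every statement is its own transaction unless we open
# one explicitly (a stand-in for the implicit transaction per statement).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t (id INTEGER, payload TEXT)")

# One transaction per statement: each INSERT pays a full commit.
for i in range(100):
    conn.execute("INSERT INTO t VALUES (?, ?)", (i, "row"))

# One explicit transaction around the whole batch: a single commit at the end.
conn.execute("BEGIN")
for i in range(100, 200):
    conn.execute("INSERT INTO t VALUES (?, ?)", (i, "row"))
conn.execute("COMMIT")

print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 200
```

With PostgreSQL the same idea is just BEGIN; ...many statements...; COMMIT; from the client.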
On Sat, Jan 05, 2002 at 02:21:44PM +0530, Arsalan Zaidi wrote:
>> My transaction log files often reach
>> 800 MB in size and I'd like to avoid the overhead if at all possible...

They shouldn't get that big. What PG version are you running? (Any answer
less than 7.1.3 is the wrong answer...)

regards, tom lane
On Mon, Jan 07, 2002 at 11:56:12AM +0530, Arsalan Zaidi wrote:
> > Well, every statement must be within a transaction, so to reduce the
> > number of transactions start one explicitly and commit when you're done.
> > Then you only have one transaction for the whole thing.
>
> I already do this. However, I don't want *any* transactions at all. I
> don't care if I lose data in the middle of, say, an update. The nature of
> the app is such that that data can be reconstructed.

Well, no transactions is not possible. Almost the whole data storage system
is built around them. Besides, as long as everything is in one transaction,
there is *no* overhead, IIRC.

> However, as it currently stands, the app takes around 30 hrs to finish its
> run. I wish to reduce that to 24 hrs or less.

Wow. I can insert hundreds of rows per second within a transaction, and my
hardware is not even particularly good. That would be 21 million rows in
that time. How big is your data set? Are you using COPY or INSERT to insert
the data?

> I found a comment from 1999 where someone asked a similar Q and Momjian
> responded that that was not possible in pg. Is that still true? Can it be
> easily changed in the code?

I think you're working under the assumption that transactions == overhead,
whereas I don't believe that's true. Work out where the bottleneck is.

--
Martijn van Oosterhout <kleptog@svana.org>
> Well, no transactions is not possible. Almost the whole data storage
> system is built around them. Besides, as long as everything is in one
> transaction, there is *no* overhead, IIRC.

How can that be? You're doing *some* record keeping, after all, so there has
to be some overhead!

> Wow. I can insert hundreds of rows per second within a transaction, and my
> hardware is not even particularly good. That would be 21 million rows in
> that time. How big is your data set? Are you using COPY or INSERT to
> insert the data?

Using COPY for the initial load and INSERT for subsequent handling. Can't
use COPY there because there are WHERE clauses I need.

I start off with 3.3 GB of data and things just grow from that, with a fresh
3 GB file every day! Plenty of duplicate data in it, though. Lots of complex
queries get run on this data, which generates even more info... At the end
of the initial run I get ~50 GB of data in the pg data dir (of course that
holds a lot more than just a plain text file...), with ~10 GB per subsequent
run. Haven't checked yet, but I think I get ~30 million rows in the initial
run.

BTW, I've got a dual-proc machine with a RAID-0 array and 1 GB of RAM, but
pg only uses one CPU at a time. Would have been great if it had been
multi-threaded or something.

> I think you're working under the assumption that transactions == overhead,
> whereas I don't believe that's true. Work out where the bottleneck is.

I've got too much data? :-)

--Arsalan.
> Well, every statement must be within a transaction, so to reduce the
> number of transactions start one explicitly and commit when you're done.
> Then you only have one transaction for the whole thing.

I already do this. However, I don't want *any* transactions at all. I don't
care if I lose data in the middle of, say, an update. The nature of the app
is such that that data can be reconstructed.

However, as it currently stands, the app takes around 30 hrs to finish its
run. I wish to reduce that to 24 hrs or less.

I found a comment from 1999 where someone asked a similar Q and Momjian
responded that that was not possible in pg. Is that still true? Can it be
easily changed in the code?

--Arsalan.
Hi,

> Using COPY for the initial load and INSERT for subsequent handling. Can't
> use COPY there because there are WHERE clauses I need.
>
> I start off with 3.3 GB of data and things just grow from that, with a
> fresh 3 GB file every day! Plenty of duplicate data in it, though. Lots of
> complex queries get run on this data, which generates even more info...
> At the end of the initial run I get ~50 GB of data in the pg data dir
> (of course that holds a lot more than just a plain text file...), with
> ~10 GB per subsequent run.

I would suggest optimizing your application! Try to:

- cache previously inserted IDs and other things (e.g. use Perl hashes)
- create simpler WHERE clauses
- look at your indexes; perhaps you can create an index on two columns?

And: use VACUUM ANALYZE after the database is freshly built and filled with
the first ~100,000 rows. Later, VACUUM ANALYZE every 1,000,000 or
10,000,000 rows...

> BTW, I've got a dual-proc machine with a RAID-0 array and 1 GB of RAM, but
> pg only uses one CPU at a time. Would have been great if it had been
> multi-threaded or something.

If you use two inserting processes, Postgres should also use two. AFAIK! :)

Ciao
Alvar

--
http://www.teletrust.info/ || http://www.odem.org/
AGI :: Hohnerstrasse 23, 70469 Stuttgart
Fon +49 (0)711.490 320-0, Fax +49 (0)711.490 320-150
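[Editor's note: a minimal sketch of the "cache previously inserted IDs" suggestion, using a plain Python dict where Alvar suggests Perl hashes (the OP is on Java/JDBC, where a HashMap would play the same role). The table and helper names are made up; the idea is simply to skip the database round-trip for keys already seen.]

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

id_cache = {}  # name -> id, filled as rows are inserted

def get_or_insert(name):
    if name in id_cache:                      # cache hit: no round-trip
        return id_cache[name]
    cur = conn.execute("INSERT INTO names (name) VALUES (?)", (name,))
    id_cache[name] = cur.lastrowid
    return cur.lastrowid

a = get_or_insert("alpha")
b = get_or_insert("beta")
assert get_or_insert("alpha") == a            # served from the cache
print(len(id_cache))  # 2
```

With lots of duplicate input data, as the OP describes, the hit rate (and therefore the saving) can be substantial.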
> I found a comment from 1999 where someone asked a similar Q and Momjian
> responded that that was not possible in pg. Is that still true? Can it be
> easily changed in the code?

It is possible to turn off transactions. The problem is whether the system
would be any faster, and whether a single INSERT failure due to a constraint
check would make the system unusable.

You can turn off fsync in postgresql.conf, and maybe there is a way to
prevent pre-change image logging to WAL, but other than that, I can't think
of any more speedups that keep the system usable.

--
Bruce Momjian                   | http://candle.pha.pa.us
pgman@candle.pha.pa.us          | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup.    | Drexel Hill, Pennsylvania 19026
"Arsalan Zaidi" <azaidi@directi.com> writes:
> I already do this. However, I don't want *any* transactions at all.

Perhaps you want MySQL ;-)

Seriously, transactions per se are not your problem, and you can't turn them
off anyway, so there's no point in continuing to bark up that tree. What I
think you are really unhappy about is the WAL/checkpoint activity, and that
you can control to some extent. For starters, have you turned off fsync?

Another thing to consider is increasing the CHECKPOINT_SEGMENTS parameter
(and possibly WAL_FILES too), which will increase the amount of disk space
used for the WAL log but should reduce the amount of datafile disk traffic.
If the postmaster log shows a lot of snippy little notices about "consider
increasing WAL_FILES", then you should do that.

If you have multiple disk drives, see if you can put the WAL files (the
pg_xlog subdirectory) on a different disk spindle than the data files.

There are also the usual tuning questions about whether you've set the
number of shared buffers to something reasonable, etc., etc.

regards, tom lane
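[Editor's note: the settings Tom names map onto a handful of postgresql.conf lines. A sketch of what such a fragment might look like in the 7.1/7.2 era; the values are illustrative guesses, not recommendations, and the parameter names should be checked against the documentation for your exact version.]

```
# postgresql.conf fragment -- illustrative values only
fsync = false              # skip fsync at commit; data loss possible on crash
wal_files = 8              # preallocate more WAL segment files
checkpoint_segments = 16   # checkpoint less often -> less datafile traffic
shared_buffers = 8192      # a "reasonable" number of shared buffers
```

The trade in every case is disk space and crash safety for fewer forced writes during the bulk run.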
> I would suggest optimizing your application!

First thing I did! :-)

> - cache previously inserted IDs and other things
>   (e.g. use Perl hashes)

I'm using Java/JDBC. Also, there's very little data I can cache and reuse.

> - create simpler WHERE clauses

Sorry, I need to be specific! Besides, I've used EXPLAIN extensively and
these are the fastest possible queries (at least I hope so) which do what I
want them to do.

> - look at your indexes; perhaps you can create an index on two columns?

Got them up the wazoo. Two-column ones as well...

Just want to know, is an index on (foo,bar) different from (bar,foo)? Does
the order in which they appear in the index creation statement and in
subsequent queries make a difference?

> And: use VACUUM ANALYZE after the database is freshly built and filled
> with the first ~100,000 rows. Later, VACUUM ANALYZE every 1,000,000 or
> 10,000,000 rows...

I'm vacuuming like crazy (after the initial data COPY, and then once the run
ends, to prepare for the next run), but there are two issues:

1. VACUUM is dead slow. The CPU monitor shows long stretches when there's
barely any activity at all while the vacuum is going on.

2. I do VACUUM ANALYZE to help the database figure out how best to run my
queries. But it *still* doesn't use some of them...

> If you use two inserting processes, Postgres should also use two. AFAIK! :)

Ah, interesting point. I spent the last two (working) days converting my
single-process app into a multi-threaded one, to allow me to fire off
multiple queries at the same time. I was hoping this would lead to a more
optimal use of system resources... Didn't work. Best case, the
multi-threaded app is just as fast as the single-threaded one, and in the
worst case it's much slower.

I'm guessing that the various complex queries being fired at the same time
are forcing the use of swap (yes, it's on another HDD entirely), which is
slowing down the machine.

--Arsalan.
> Perhaps you want MySQL ;-)

I plan to benchmark it today. Pg performs fine for normal business use and
we're using it for our other needs, but this app is rather unusual...

> Seriously, transactions per se are not your problem, and you can't turn
> them off anyway, so there's no point in continuing to bark up that tree.

'Kay; it was just a thought.

<info about checkpoint tuning>

Am already doing this and have increased the checkpoint segments, WAL files
and buffers. I no longer get any advice in the logs about increasing them...

> There are also the usual tuning questions about whether you've set the
> number of shared buffers to something reasonable, etc., etc.

Already done...
Arsalan writes:
> > - look at your indexes; perhaps you can create an index on two columns?
>
> Got them up the wazoo. Two-column ones as well...

Make sure you don't have any you don't absolutely need. Each index you have
adds overhead to any DML statement, as it needs to be adjusted.

> Just want to know, is an index on (foo,bar) different from (bar,foo)? Does
> the order in which they appear in the index creation statement and in
> subsequent queries make a difference?

Yes, very much so. An index on (foo, bar, qux) will only be used for queries
in which the WHERE clause contains one, two, or three of the fields starting
from the first (left). So "foo = ?", or "bar = ? AND foo = ?", or "foo = ?
AND qux = ? AND bar = ?" will use the index. The textual order in the query
is irrelevant as long as the fields themselves are there. These, "bar = ?",
"qux = ?", and "qux = ? AND bar = ?", will not use the index, and this,
"foo = ? AND qux = ?", will only partially use the index (for the foo
lookup). Unfortunately I can't explain it better, hence the long example.

Also remember the optimiser will only choose a single index for use in a
query and discard any others you have. So plan wisely, as you're balancing
INSERT/UPDATE overhead with SELECT speed.
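[Editor's note: the leftmost-prefix rule described above can be demonstrated with stdlib sqlite3's query planner as a stand-in; PostgreSQL's btree indexes behave analogously on this point. Table, column, and index names are made up for the illustration.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (foo INTEGER, bar INTEGER, extra TEXT)")
conn.execute("CREATE INDEX idx_foo_bar ON t (foo, bar)")
conn.executemany("INSERT INTO t VALUES (?, ?, 'x')",
                 [(i, i % 10) for i in range(1000)])
conn.commit()

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the plan description in column 3.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Leading column present: the (foo, bar) index is usable.
print(plan("SELECT extra FROM t WHERE foo = 5"))
# Only the trailing column: the planner falls back to a full table scan.
print(plan("SELECT extra FROM t WHERE bar = 5"))
```

The first plan mentions idx_foo_bar; the second does not, which is exactly why (foo,bar) and (bar,foo) are different indexes.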
> Make sure you don't have any you don't absolutely need. Each index you
> have adds overhead to any DML statement, as it needs to be adjusted.

DML? If you're talking about slower inserts because of indexes, I drop them
on the really heavy tables before I do a COPY/INSERT.

> <snipped nice explanation>

Cool. My indexes are in order, so that's OK. I just wanted to be sure about
this.
On 01/08/2002 10:02:39 AM Arsalan Zaidi wrote:
>
> > - look on your indexes, perhaps you can create an index on two columns?
>
> Got them up the wazoo. Two column ones as well...
You know that too many indices hurts insert/update/delete performance, right? For each of these actions, all related indices would need to be updated. So if you have any indices that are not used or you could do without, remove them.
> Just want to know, is an index on (foo,bar) different from (bar,foo)? Does
> the order in which they appear in the index creation statement and in
> subsequent queries make a difference?
Yes, the order does make a difference. For one, if you have an index on (foo,bar) and you have a where-clause that only restricts on bar (but not foo), this index won't be used.
> Ah, interesting point. I spent the last two (working) days converting my
> single process app into a multi threaded one, to allow me to fire off
> multiple queries at the same time. I was hoping this would lead to a more
> optimum use of system resources... Didn't work. Best case, the
> multi-threaded app is just as fast as the single -threaded one and in the
> worst case; it's much slower.
Are you using the same database connection for each thread in your multi-threaded approach? PostgreSQL will only benefit from multiple processors if there are multiple postgres processes running. In application terms, that means you need to have multiple database connections open (i.e. one backend process per connection).
Maarten
----
Maarten Boekhold, maarten.boekhold@reuters.com
Reuters Consulting / TIBCO Finance Technology Inc.
Dubai Media City
Building 1, 5th Floor
PO Box 1426
Dubai, United Arab Emirates
tel:+971(0)4 3918300 ext 249
fax:+971(0)4 3918333
mob:+971(0)505526539
On Tue, Jan 08, 2002 at 11:32:39AM +0530, Arsalan Zaidi wrote:
> I'm vacuuming like crazy (after the initial data COPY, and then once the
> run ends, to prepare for the next run), but there are two issues...

Quick question: you are creating the indices *after* you insert the data,
right?

> 1. VACUUM is dead slow. The CPU monitor shows long stretches when there's
> barely any activity at all while the vacuum is going on.
>
> 2. I do VACUUM ANALYZE to help the database figure out how best to run my
> queries. But it *still* doesn't use some of them...

A VACUUM ANALYZE does a vacuum also, so you only need to do the former.

> Ah, interesting point. I spent the last two (working) days converting my
> single-process app into a multi-threaded one, to allow me to fire off
> multiple queries at the same time. I was hoping this would lead to a more
> optimal use of system resources... Didn't work. Best case, the
> multi-threaded app is just as fast as the single-threaded one, and in the
> worst case it's much slower.

Strange; that would indicate a serious bottleneck. Are you saturating the
disks? You can use vmstat to work out the amount of disk activity.

> I'm guessing that the various complex queries being fired at the same time
> are forcing the use of swap (yes, it's on another HDD entirely), which is
> slowing down the machine.

You say you had 1 GB of memory? Could you give an example of a query that
uses this amount of memory?

HTH,
--
Martijn van Oosterhout <kleptog@svana.org>
> You know that too many indices hurts insert/update/delete performance,
> right? For each of these actions, all related indices would need to be
> updated. So if you have any indices that are not used or you could do
> without, remove them.

As mentioned, tables which see a lot of data modification have their indexes
removed and then re-created. I don't bother doing this with tables which
aren't so heavily modified.

> Are you using the same database connection for each thread in your
> multi-threaded approach? PostgreSQL will only benefit from multiple
> processors if there are multiple postgres processes running. In
> application terms, that means you need to have multiple database
> connections open (i.e. one backend process per connection).

Each thread opens up a new connection. Using (g)top I have already verified
that there are several postgres instances running. The machine is entirely
dedicated to this task and there were no other users.

Just to re-iterate:

1. I've done quite a bit of tweaking with the WAL* settings, shared buffers
and sort mem over several weeks. They're about as optimal as I can get them.
Turning off fsync was one of the first things I did, BTW.

2. The executable was compiled on the machine with as many optimisations as
gcc could reasonably support.

3. The queries are just about as good as I can get them and have been
thoroughly EXPLAIN'ed with live/large amounts of data in the tables. No INs
used, only EXISTS (where required). This one bit me right at the start; I
had to kill one query after it ran for ~36 hrs! BTW, EXISTS seems to be far
more efficient than a JOIN. Is this always true?

4. Temp tables are used to simplify complex queries (and speed them up, I
hope).

5. RAID-0 (SCSI/HW) + dual proc + 1 GB RAM. Linux 2.4.17 (SMP; pure Linus,
no other patches). Swap is on a separate IDE drive.

6. COPY is used wherever it can be. Indexes are dropped before heavy
modification and then recreated. I don't use CLUSTER.

7. The driver app was multi-threaded. It made things worse. BTW, the app's
job consists largely of firing off SQL queries in the correct sequence, so
you'd better not go blaming my code!

Is there anything I've missed out?

--Arsalan.
> Quick question: you are creating the indices *after* you insert the data,
> right?

That is correct.

> A VACUUM ANALYZE does a vacuum also, so you only need to do the former.

Am aware of that. I only do ANALYZEs. Sorry if I didn't make that clear. :-)

> Strange; that would indicate a serious bottleneck. Are you saturating the
> disks? You can use vmstat to work out the amount of disk activity.

Right. I guess I'll do that.

> You say you had 1 GB of memory? Could you give an example of a query that
> uses this amount of memory?

Okay... when I create a double index on a table with ~3.3 GB of data (the
size of the initial plain text file). Of course, on the second run it'll
have ~6.6 GB of data... I haven't reached that far yet; I ran out of space
on my array on the first run itself. BTW, that's production data. For
testing I use a 300 MB file.

The sort mem is set like so: -S 350000

So that could be the reason. I don't mind the mem being used in the
single-threaded app (there's no one else there), but with multiple threads
it could be a problem...

--Arsalan
Hi,

> > I would suggest optimizing your application!
>
> First thing I did! :-)

OK, then sorry :) I have an application where the time consumption and
amount of data are similar, so this was my first idea.

> 1. VACUUM is dead slow. The CPU monitor shows long stretches when there's
> barely any activity at all while the vacuum is going on.

Yes :-( I have hopes for 7.2; it seems that VACUUM is much faster there
(but I haven't tested it yet).

> 2. I do VACUUM ANALYZE to help the database figure out how best to run
> my queries. But it *still* doesn't use some of them...

I had the same problem; a simple (but dirty) solution was to disable
sequential scans in postgresql.conf:

enable_seqscan = false

Ciao
Alvar
On Tue, 8 Jan 2002, Arsalan Zaidi wrote:
> Just to re-iterate:
> 1. I've done quite a bit of tweaking with the WAL* settings, shared
> buffers and sort mem over several weeks. They're about as optimal as I
> can get them.

Could you expand on that? What makes you say they are "optimal"?

> 3. The queries are just about as good as I can get them and have been
> thoroughly EXPLAIN'ed with live/large amounts of data in the tables. No
> INs used, only EXISTS (where required).

I am far from an SQL expert, but it may not hurt to share with the list what
these queries are. Perhaps even post them.

> 4. Temp tables are used to simplify complex queries (and speed them up, I
> hope).

Could you expand on that?

> 5. RAID-0 (SCSI/HW) + dual proc + 1 GB RAM. Linux 2.4.17 (SMP; pure Linus,
> no other patches). Swap is on a separate IDE drive.

How about more memory? This would certainly help substantially. What is the
speed of the HDs: 10K RPM? 15K RPM? Is the swap ever hit? Hitting swap
always hurts, but given that you put swap on an IDE drive, it would slow you
down even more. What is the speed of the CPUs? What type of memory:
PC100/PC133/Rambus/DDR? How many disks are in your RAID-0?

Although there is much you could perhaps do through software configuration,
once you get to the levels you are describing there is no substitute for
fast hardware. Probably the cheapest upgrade you can do is adding more
memory. This will likely help.

You also need to pay close attention to your schema. You want to have your
most heavily used data in a small table. To give you an example, let me make
up a scenario similar to something I did. Let's say I have a people table
(not particularly correct SQL, just to give you the idea):

  id        serial
  name      varchar(20)
  last      varchar(20)
  address   varchar(30)
  address2  varchar(30)
  state     char(2)
  zip       int
  comment   varchar(30)
  education char(1)

Now let's say that I use this table extremely heavily when doing joins with
other tables (i.e. OLAP-type analyses against other tables). What I did was
break it into two tables, keeping only the fields which I used 90%+ of the
time, something like:

  id
  name
  zip

Then I put the rest in another table, linked by id. I didn't do any
particular timing analysis, but I got a substantial improvement. In my
actual case I had about 50 fields which totalled something in the
neighborhood of 300 bytes; after breaking it up, the abbreviated version of
the table was less than 50 bytes.

> 7. The driver app was multi-threaded. It made things worse.

Expand on this. This is where the number of buffers vs. the amount of memory
comes into play. If by firing multiple copies you ended up hitting swap,
this would have made things much worse. Remember, PostgreSQL fires up
processes; it doesn't use threads.

> BTW, the app's job consists largely of firing off SQL queries
> in the correct sequence;

It may help if you tell us more about these queries. In particular the one
you mentioned that takes 36 hours.
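[Editor's note: the vertical split described above can be sketched as follows, with stdlib sqlite3 standing in and made-up table names. The narrow "hot" table carries the heavily used columns; the wide "cold" table is joined by id only when its columns are actually needed.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people_hot  (id INTEGER PRIMARY KEY, name TEXT, zip INTEGER);
    CREATE TABLE people_cold (id INTEGER PRIMARY KEY, address TEXT,
                              state TEXT, comment TEXT, education TEXT);
    INSERT INTO people_hot  VALUES (1, 'Ada', 12345);
    INSERT INTO people_cold VALUES (1, '1 Main St', 'PA', '', 'C');
""")

# The common case touches only the small, cache-friendly table...
hot = conn.execute("SELECT name FROM people_hot WHERE zip = 12345").fetchone()

# ...and the join cost is paid only for the rare wide lookups.
wide = conn.execute("""
    SELECT h.name, c.address
    FROM people_hot h JOIN people_cold c ON c.id = h.id
    WHERE h.id = 1
""").fetchone()

print(hot[0], wide[1])
```

The win comes from scanning many more hot rows per disk page during the heavy joins, at the cost of one extra join for the infrequent full-record reads.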