Discussion: [jules@jellybean.co.uk: Performance on inserts]
Weird.  When I sent this 24 hours ago, it didn't go through to the list,
although I successfully sent a message to the list only an hour or so
before that... as far as my mail-server can tell it went off in the
right direction...

----- Forwarded message from Jules Bean <jules@jellybean.co.uk> -----

Date: Thu, 24 Aug 2000 11:04:25 +0100
From: Jules Bean <jules@jellybean.co.uk>
To: pgsql-hackers@postgresql.org
Subject: Performance on inserts
User-Agent: Mutt/1.2i

[Meta: Please advise if this is the wrong list.  I think, since this
observation relates to Pg internals, this might be the right one, but
feel free to move it back to -general if I'm wrong; I subscribe to both
in any case.]

As some of you will have inferred if you've read my last couple of
posts, I'm working on a database with a structure like this (I'm
abstracting; unfortunately I'm not allowed to show you what my client
is really doing here):

A table called 'things' with two columns, 'name' and 'category'.  The
pair ('name','category') is a primary key.  There are ~ 10 000 000 rows
in the table, and category takes values from a more-or-less fixed set
of ~ 1000 possibilities.  As previously described, the most popular
category holds around half the rows, the next most popular holds nearly
half of those left, and most categories occur very rarely indeed.  The
median is probably around 1000 (which is less than the 10 000 you'd
expect).

Anyhow, this question isn't about speeding up queries --- we already
have that in the histogram thread.  This question is about speeding up
inserts.

My standard configuration is to have a unique index on (name,category)
and a non-unique index on (category).  The main table is ~ 2G on disk,
the index on (name,cat) is about the same size, and the index on (cat)
is around 0.6G.  In this set-up, inserts have dropped to the
terrifyingly slow rate of several hours per 10 000 insertions.  This is
not adequate to my needs; I occasionally have to process 1 000 000
insertions or more!

I have several ideas for speeding this up at the SQL level (including
inserting into a temp table and then using INSERT ... SELECT to remove
the overhead of separate inserts), but that's not what I want to talk
about either...

What I did today, which made a staggering difference, was to drop the
non-unique index on (category).  Suddenly I can insert at approx 40 000
insertions per minute, which is fine for my needs!

So why is updating the huge (2G) unique index on (name,cat) not too
much of a problem, but dropping the small (600M) non-unique index on
(cat) sufficient to increase speed by around two orders of magnitude?

A possible reason has occurred to me: the real slow-down is noticeable
when I'm doing a bulk insert where all new rows belong to the most
popular category.  I know that some btree implementations don't behave
particularly sanely with several million rows in a single key... is the
btree implementation used here too slow in this case?

I haven't collected all the performance statistics I'd like to have,
due to external time pressures, but I can say that under the new faster
configuration the insertion process is CPU bound, with disk access far
below the levels the machine is capable of.  If I have a chance I'll
collect these stats for the old method too.
Any ideas as to what's going on here appreciated (if not, perhaps it
will point you towards an optimisation you ought to make for 7.1).

Jules

----- End forwarded message -----

-- 
Jules Bean                  |  Any sufficiently advanced
jules@debian.org            |  technology is indistinguishable
jules@jellybean.co.uk       |  from a perl script
Jules Bean <jules@jellybean.co.uk> writes:
> So why is updating the huge (2G) unique index on (name,cat) not too
> much of a problem, but dropping the small (600M) non-unique index on
> (cat) sufficient to increase speed by around two orders of magnitude?

> The real slow-down is noticeable when I'm doing a bulk insert where
> all new rows belong to the most popular category.  I know that some
> btree implementations don't behave particularly sanely with several
> million rows in a single key... is the btree implementation used here
> too slow in this case?

Indeed, the btree code used in 7.0 and before is pretty awful in the
face of large numbers of equal keys.  I have improved it somewhat, but
it's still got a problem once there are enough equal keys to span many
index pages.

I just did this experiment:

    create table fill(f1 text);
    create index filli on fill(f1);
    insert into fill values('foo');
    insert into fill values('bar');
    insert into fill values('baz');

    insert into fill select * from fill;    -- repeat this many times

The runtime of the insert/select scales in a pretty horrid fashion:

    # tuples inserted       6.5         current
    1536                    <1sec       <1sec
    3072                    1.56        <1sec
    6144                    3.70        1.84
    12288                   9.73        4.07
    24576                   93.26       38.72
    49152                   363.23      231.56

At the end of this process we have about 100 pages of index entries for
each of the three distinct key values, with about 330 items per page.
The fixes that I applied a month or two back improve behavior for
multiple equal keys within a page, but they do little for multiple
pages full of the same key value.

Here's the problem: the code for INSERT starts by finding the first
occurrence of the target key (using btree descent, so that's fast
enough).  But then it does this:

    /*
     * If we will need to split the page to put the item here,
     * check whether we can put the tuple somewhere to the right,
     * instead.  Keep scanning until we find enough free space or
     * reach the last page where the tuple can legally go.
     */
    while (PageGetFreeSpace(page) < itemsz &&
           !P_RIGHTMOST(lpageop) &&
           _bt_compare(rel, keysz, scankey, page, P_HIKEY) == 0)
    {
        /* step right one page */
        BlockNumber rblkno = lpageop->btpo_next;

        _bt_relbuf(rel, buf, BT_WRITE);
        buf = _bt_getbuf(rel, rblkno, BT_WRITE);
        page = BufferGetPage(buf);
        lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
    }

In other words, we do a linear scan over all the pages containing equal
key values, in the hope of finding one where we can shoehorn in the new
entry.  If the keys are roughly equal-sized, that hope is usually vain,
and we end up splitting the rightmost page that contains any keys equal
to the target key.  Subsequent inserts of the same key value fill first
the left-half split page, then the right-half, then split the
right-half page again and repeat --- but for each insert we first had
to scan over all the preceding pages containing equal keys.  That means
inserting N equal keys takes O(N^2) time, for large enough N.

After thinking about this for a while, I propose a probabilistic
solution.  Suppose that we add to the above loop a random-number check
that causes us to fall out of the loop with probability 1/3 or so; that
is, if the current page is full of the target key, then we move right
to check the next page with probability 2/3, but with probability 1/3
we give up and split the current page to insert the new key.

With this approach we'd hardly ever move more than, say, 10 pages
before splitting, so the time to insert a new key is bounded no matter
how many duplicates it has.
The reason for using a random decision is that if we always stopped
after scanning a fixed number of pages, then the space freed up by the
previous decision to split would always be just out of reach, and we'd
end up with lots of half-full pages of the same key.  With a random
decision we will visit and fill both the left and right halves of a
previously split page.

Comments?  Is there a better way?  What's the best probability to use?

			regards, tom lane
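A minimal sketch of how the random check might be folded into the loop
quoted above (illustrative only, not the actual patch; MOVE_RIGHT_PROB
is a hypothetical constant, and the use of random()/MAX_RANDOM_VALUE is
an assumption about the surrounding code):

    #define MOVE_RIGHT_PROB  0.67   /* hypothetical knob: 2/3 as first
                                     * proposed; tuned to ~0.99 later
                                     * in this thread */

    while (PageGetFreeSpace(page) < itemsz &&
           !P_RIGHTMOST(lpageop) &&
           _bt_compare(rel, keysz, scankey, page, P_HIKEY) == 0 &&
           /* keep moving right with probability MOVE_RIGHT_PROB ... */
           random() <= (long) (MOVE_RIGHT_PROB * MAX_RANDOM_VALUE))
    {
        /* step right one page, exactly as before */
        BlockNumber rblkno = lpageop->btpo_next;

        _bt_relbuf(rel, buf, BT_WRITE);
        buf = _bt_getbuf(rel, rblkno, BT_WRITE);
        page = BufferGetPage(buf);
        lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
    }
    /* ... otherwise fall out and split the current page */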
* Tom Lane <tgl@sss.pgh.pa.us> [000825 08:25] wrote:
>
> In other words, we do a linear scan over all the pages containing equal
> key values, in the hope of finding one where we can shoehorn in the new
> entry.  If the keys are roughly equal-sized, that hope is usually vain,
> and we end up splitting the rightmost page that contains any keys
> equal to the target key.  Subsequent inserts of the same key value fill
> first the left-half split page, then the right-half, then split the
> right-half page again and repeat --- but for each insert we first had
> to scan over all the preceding pages containing equal keys.  That means
> inserting N equal keys takes O(N^2) time, for large enough N.
>
> After thinking about this for a while, I propose a probabilistic
> solution.  Suppose that we add to the above loop a random-number check
> that causes us to fall out of the loop with probability 1/3 or so;
> that is, if the current page is full of the target key, then we move
> right to check the next page with probability 2/3, but with probability
> 1/3 we give up and split the current page to insert the new key.
>
> With this approach we'd hardly ever move more than, say, 10 pages before
> splitting, so the time to insert a new key is bounded no matter how many
> duplicates it has.
>
> The reason for using a random decision is that if we always stopped
> after scanning a fixed number of pages, then the space freed up by the
> previous decision to split would always be just out of reach, and we'd
> end up with lots of half-full pages of the same key.  With a random
> decision we will visit and fill both the left and right halves of a
> previously split page.
>
> Comments?  Is there a better way?  What's the best probability to use?

I'm unsure if it's possible, but somehow storing the last place one
'gave up' and decided to split the page could offer a useful next-start
for the next insert.  Sort of splitting the work amongst multiple
requests.

For some reason it looks like your algorithm might cause problems
because it plain gives up after 10 pages?

hope this helps,
-Alfred
Alfred Perlstein <bright@wintelcom.net> writes:
> I'm unsure if it's possible, but somehow storing the last place one
> 'gave up' and decided to split the page could offer a useful next-start
> for the next insert.

I think that that would create problems with concurrency --- the
Lehman-Yao btree algorithm is designed around the assumption that
writers only move right and never want to go back left to change a
prior page.  So once we've moved right we don't get to go back to the
start of the chain of duplicates.

> For some reason it looks like your algorithm might cause
> problems because it plain gives up after 10 pages?

"Give up" just means "stop looking for already-existing free space,
and make some the hard way".  The steady-state average space utilization
of this way would be somewhat worse than the existing code, probably,
but I don't see that as a big problem.

			regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [000825 11:30] wrote:
> Alfred Perlstein <bright@wintelcom.net> writes:
> > I'm unsure if it's possible, but somehow storing the last place one
> > 'gave up' and decided to split the page could offer a useful next-start
> > for the next insert.
>
> I think that that would create problems with concurrency --- the
> Lehman-Yao btree algorithm is designed around the assumption that
> writers only move right and never want to go back left to change
> a prior page.  So once we've moved right we don't get to go back to
> the start of the chain of duplicates.

Yes, but inconsistencies about the starting page should be handled,
because I'm sure the code takes care of multiple scans from the start.
A slightly incorrect startscan point is better than starting from the
beginning every time.

It's a hack, but no more than a random guess; in my situation (if it's
doable) it would eventually get past the first bunch of full pages.

> > For some reason it looks like your algorithm might cause
> > problems because it plain gives up after 10 pages?
>
> "Give up" just means "stop looking for already-existing free space,
> and make some the hard way".  The steady-state average space utilization
> of this way would be somewhat worse than the existing code, probably,
> but I don't see that as a big problem.

Well, there's a possibility of the end of the sequence containing free
space; allowing previous failures to be accounted for can make a
difference.

But it's just a suggestion from someone who really ought to be studying
the internals a bit more. :)

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
Alfred Perlstein <bright@wintelcom.net> writes:
> A slightly incorrect startscan point is better than starting
> from the beginning every time.

Not for lookups, it isn't ;-).  I did think about making the
btree-descending code act differently for inserts than lookups, but
AFAICS that wouldn't buy anything for this problem, since until you get
down to the leaf level you can't see where there's free space anyway.
You'd simply be exchanging the problem of missing free space that's too
far to the right for the problem of missing free space that's to the
left of where you chose to start looking.

The algorithm does seem to work quite nicely just the way I described
it, although it turns out I was off base about a good probability
setting.  I find that something up around 0.99 seems to be good.  Using
the same (perhaps overly simplistic) test case:

    # tuples inserted    6.5                    current+random hack @ 0.99
                         Time     index size    Time     index size
    1536                 <1sec    90112         <1sec    106496
    3072                 1.56     163840        <1sec    188416
    6144                 3.70     286720        1.40     376832
    12288                9.73     532480        2.65     688128
    24576                93.26    1024000       5.22     1368064
    49152                363.23   2007040       10.34    2727936
    98304                                       22.07    5545984
    196608                                      45.60    11141120
    393216                                      92.53    22290432

I tried probabilities from 0.67 to 0.999 and found that runtimes didn't
vary a whole lot (though this is near the minimum), while index size
consistently got larger as the probability of moving right decreased.
The runtime is nicely linear throughout the range.

The index size increase might look like a bit of a jump, but actually
this is right where we want it to be.  The old code was effectively
packing each page as full as it could be under these conditions.
That's not what you want for a btree.  Steady-state load factor for
btrees is usually quoted as somewhere around 70%, and this method
manages to approximate that pretty well with a move-right probability
of 0.99 or a bit less.

			regards, tom lane
> Comments?  Is there a better way?  What's the best probability to use?

For this particular example, "partial indices" seems to be the best
fit.  The index can be chosen to omit the most common value(s), since
those would indicate a sequential scan anyway.

Other DBs allow a parameter to set the "fill ratio" of index pages,
which might also help.  But probably not as much as you might like when
one is doing a large number of inserts at a time.  Your "randomized"
algorithm looks very promising.

What is the status of partial indices?  Are they functional now, or
have they been broken forever (I'm not recalling)?

                    - Thomas
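To make the idea concrete against Jules's abstracted schema, a partial
index of this sort might look as follows (a sketch only; as the next
messages discuss, the syntax was disabled in the grammar at the time,
and 'bigcat' is a hypothetical stand-in for the most common category):

    -- index everything except the overwhelmingly common category
    CREATE INDEX things_cat_partial ON things (category)
        WHERE category <> 'bigcat';

    -- can use the partial index (query implies the predicate):
    SELECT * FROM things WHERE category = 'rarecat';

    -- cannot use it, and falls back to the seqscan it wanted anyway:
    SELECT * FROM things WHERE category = 'bigcat';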
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> What is the status of
> partial indices?  Are they functional now, or have they been broken
> forever (I'm not recalling)?

They've been diked out of gram.y's syntax for CREATE INDEX at least
since Postgres95.  No way to tell who did that, why or when, AFAIK.
There is still an awful lot of support code for them, however.

I have no good way to guess how much bit-rot has occurred in all that
unexercised code ... but it'd be interesting to try to get it going
again.

			regards, tom lane
> I have no good way to guess how much bit-rot has occurred in all that
> unexercised code ... but it'd be interesting to try to get it going
> again.

Yes, it is a *great* feature, since it directly addresses the problems
associated with one of the most common non-random data distributions
(the index can be constructed to completely ignore those most common
values, and hence be smaller, less often updated, and holding only
those values which might actually be used in an index scan).

If we don't get to outer joins, this would be a good second choice for
7.1 ;)

                    - Thomas
Tom Lane wrote:
>
> Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> > What is the status of
> > partial indices?  Are they functional now, or have they been broken
> > forever (I'm not recalling)?
>
> They've been diked out of gram.y's syntax for CREATE INDEX at least
> since Postgres95.  No way to tell who did that, why or when, AFAIK.
> There is still an awful lot of support code for them, however.

I suspect that current indexes don't store nulls (and are thereby
partial indexes with respect to nulls ;).  At least the following
suggests it:

---8<------------8<------------8<------------8<---------
hannu=> explain select * from test1 where i=777;
NOTICE:  QUERY PLAN:

Index Scan using test1_i_ndx on test1  (cost=2.05 rows=2 width=8)

EXPLAIN
hannu=> explain select * from test1 where i is null;
NOTICE:  QUERY PLAN:

Seq Scan on test1  (cost=3144.36 rows=27307 width=8)
---8<------------8<------------8<------------8<---------

As the logic to include or not include something in an index seems to
be there for nulls (and thus can't be very badly bit-rotten), it should
be possible to extend it to simpler =x or IN (x,y,z) conditions.

> I have no good way to guess how much bit-rot has occurred in all that
> unexercised code ... but it'd be interesting to try to get it going
> again.

Of course the IS NULL case may just be hard-wired in the optimiser, in
which case there may be not much use for the current code.

IIRC Postgres95 was the first Postgres with SQL, so if even that did
not have partial indexes, then no SQL grammar for PostgreSQL ever had
them.  (Or maybe Illustra did, but it was a separate effort anyway ;)

---------------
Hannu
On Fri, Aug 25, 2000 at 07:00:22PM -0400, Tom Lane wrote:
> The algorithm does seem to work quite nicely just the way I described
> it, although it turns out I was off base about a good probability
> setting.  I find that something up around 0.99 seems to be good.
> Using the same (perhaps overly simplistic) test case:
>
>     # tuples inserted    6.5                    current+random hack @ 0.99
>                          Time     index size    Time     index size
>     1536                 <1sec    90112         <1sec    106496
>     3072                 1.56     163840        <1sec    188416
>     6144                 3.70     286720        1.40     376832
>     12288                9.73     532480        2.65     688128
>     24576                93.26    1024000       5.22     1368064
>     49152                363.23   2007040       10.34    2727936
>     98304                                       22.07    5545984
>     196608                                      45.60    11141120
>     393216                                      92.53    22290432
>
> I tried probabilities from 0.67 to 0.999 and found that runtimes didn't
> vary a whole lot (though this is near the minimum), while index size
> consistently got larger as the probability of moving right decreased.
> The runtime is nicely linear throughout the range.

That looks brilliant!!  (Bearing in mind that I have over 10 million
tuples in my table, you can imagine what performance was like for me!)
Is there any chance you could generate a patch against released 7.0.2
to add just this functionality... It would be the kiss of life for my
code!

(Not in a hurry, I'm not back in work until Wednesday, as it happens.)

And, of course, what would /really/ get my code going speedily would be
the partial indices mentioned elsewhere in this thread.  If the backend
could automagically drop keys containing > 10% (tunable) of the rows
from the index, then my index would be (a) about 70% smaller! and (b)
only used when it's faster.  [This means it would have to update some
simple histogram data.  However, I can't see that being much of an
overhead.]

For the short term, if I can get a working version of the above
randomisation patch, I think I shall 'fake' a partial index by manually
setting 'enable_seqscan=off' for all but the 4 or 5 most common
categories.  Those two factors combined will speed up my bulk inserts a
lot.

One other idea, though:

Is there any simple way for Pg to combine inserts into one bulk?
Specifically, their effect on the index files.  It has always seemed to
me to be one of the (many) glaring flaws in SQL that the INSERT
statement only takes one row at a time.

But, using INSERT ... SELECT, I can imagine that it might be possible
to do 'bulk' index updating, so that the scanning process is done once
per 'batch'.  If I can make an analogy with sorted files (which indices
are rather like), if I wanted to add another 100 lines to a 1000-line
sorted file, I'd sort the 100 first, and then 'merge' them in.  Whilst
I realise that indices aren't stored sorted (no need), I think it ought
to be possible to construct an efficient algorithm for merging two
btrees?

Jules

-- 
Jules Bean                  |  Any sufficiently advanced
jules@debian.org            |  technology is indistinguishable
jules@jellybean.co.uk       |  from a perl script
On Sat, 26 Aug 2000, Jules Bean wrote:

> Is there any simple way for Pg to combine inserts into one bulk?
> Specifically, their effect on the index files.  It has always seemed
> to me to be one of the (many) glaring flaws in SQL that the INSERT
> statement only takes one row at a time.

One of MySQL's little syntax abuses allows:

    INSERT INTO tab (col1, ..) VALUES (val1, ..), (val2, ..);

which is nice for avoiding database round trips.  It's one of the
reasons that mysql can do a bulk import so quickly.

> But, using INSERT ... SELECT, I can imagine that it might be possible
> to do 'bulk' index updating, so that the scanning process is done once
> per 'batch'.

Logic for these two cases would be excellent.

Matthew.
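Spelled out against Jules's abstracted table, the multi-row form would
look like this (illustrative; 'bigcat' and 'rarecat' are hypothetical
category values, and PostgreSQL did not yet accept this syntax at the
time of this thread):

    INSERT INTO things (name, category) VALUES
        ('fred',   'bigcat'),
        ('wilma',  'bigcat'),
        ('barney', 'rarecat');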
* Matthew Kirkwood <matthew@hairy.beasts.org> [000826 04:22] wrote:
> On Sat, 26 Aug 2000, Jules Bean wrote:
>
> > Is there any simple way for Pg to combine inserts into one bulk?
> > Specifically, their effect on the index files.  It has always seemed
> > to me to be one of the (many) glaring flaws in SQL that the INSERT
> > statement only takes one row at a time.
>
> One of MySQL's little syntax abuses allows:
>
>     INSERT INTO tab (col1, ..) VALUES (val1, ..), (val2, ..);
>
> which is nice for avoiding database round trips.  It's one
> of the reasons that mysql can do a bulk import so quickly.

That would be an _extremely_ useful feature if it made a difference in
postgresql's insert speed.

> > But, using INSERT ... SELECT, I can imagine that it might be possible
> > to do 'bulk' index updating, so that the scanning process is done once
> > per 'batch'.
>
> Logic for these two cases would be excellent.

We do this sometimes, works pretty nicely.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
Matthew Kirkwood <matthew@hairy.beasts.org> writes:
> One of MySQL's little syntax abuses allows:
> INSERT INTO tab (col1, ..) VALUES (val1, ..), (val2, ..);

Actually, that's perfectly standard SQL92, just an item we haven't got
round to supporting yet.  (Until we do the fabled querytree
restructuring, it seems a lot harder than it is worth.)

COPY FROM stdin is definitely the fastest way of inserting data,
however, since you avoid a ton of parse/plan overhead that way.  Of
course you also lose the ability to have column defaults computed for
you, etc ... there's no free lunch ...

			regards, tom lane
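For reference, the COPY route looks like this in a psql session (a
minimal sketch against Jules's abstracted table; the data lines are
tab-separated, and a backslash-period line terminates the input):

    COPY things FROM stdin;
    fred	bigcat
    wilma	bigcat
    barney	rarecat
    \.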
Jules Bean <jules@jellybean.co.uk> writes:
> Is there any chance you could generate a patch against released 7.0.2
> to add just this functionality... It would be the kiss of life for my
> code!

Will look at it.  Are you brave enough to want to try the rest of the
7.1 rewrite of the btree code, or do you just want this one hack?

> And, of course, what would /really/ get my code going speedily would
> be the partial indices mentioned elsewhere in this thread.  If the
> backend could automagically drop keys containing > 10% (tunable) of
> the rows from the index, then my index would be (a) about 70% smaller!

I don't think anyone was envisioning "automagic" drop of most common
values.  The partial-index support that's partially there ;-) is
designed around manual specification of a predicate, ie, you'd say

    CREATE INDEX mypartialindex ON table (column)
        WHERE column != 42 AND column != 1066

if you wanted a partial index omitting values 42 and 1066.  The backend
would then consider using the index to process queries wherein it can
prove that the query's WHERE implies the index predicate.  For example

    SELECT * FROM table WHERE column = 11

would be able to use this index, but

    SELECT * FROM table WHERE column < 100

would not.

You could certainly write a little periodic-maintenance script to
determine the most common values in your tables and recreate your
partial indexes accordingly ... but I doubt it'd make sense to try to
get the system to do that automatically on-the-fly.

> For the short term, if I can get a working version of the above
> randomisation patch, I think I shall 'fake' a partial index by
> manually setting 'enable_seqscan=off' for all but the 4 or 5 most
> common categories.  Those two factors combined will speed up my bulk
> inserts a lot.

Uh, enable_seqscan has nothing to do with how inserts are handled...

> Is there any simple way for Pg to combine inserts into one bulk?

COPY.

> Specifically, their effect on the index files.

This particular problem couldn't be cured by batching inserts anyway.
The performance problem was coming from the actual act of inserting a
key (or more specifically, making room for the key), and that's just
got to be done for each key AFAICS.

			regards, tom lane
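A sketch of the query such a maintenance script might run to pick the
values worth excluding (illustrative only; the 'things' table and the
10% threshold are assumptions carried over from earlier in the thread):

    -- list categories holding more than 10% of the rows: candidates
    -- to exclude when the partial index is next recreated
    SELECT category, count(*) AS n
    FROM things
    GROUP BY category
    HAVING count(*) > (SELECT count(*) / 10 FROM things)
    ORDER BY n DESC;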
On Sat, Aug 26, 2000 at 12:32:20PM -0400, Tom Lane wrote:
> Jules Bean <jules@jellybean.co.uk> writes:
> > Is there any chance you could generate a patch against released 7.0.2
> > to add just this functionality... It would be the kiss of life for my
> > code!
>
> Will look at it.  Are you brave enough to want to try the rest of the
> 7.1 rewrite of the btree code, or do you just want this one hack?

Hmm.  I don't know :-) because I don't really know the scope of the
changes... I do need this database to be stable, but OTOH, changes to
the index code are unlikely to corrupt my data.

> I don't think anyone was envisioning "automagic" drop of most common
> values.  The partial-index support that's partially there ;-) is
> designed around manual specification of a predicate, ie, you'd say
>
>     CREATE INDEX mypartialindex ON table (column)
>         WHERE column != 42 AND column != 1066

Fair enough.

> > For the short term, if I can get a working version of the above
> > randomisation patch, I think I shall 'fake' a partial index by
> > manually setting 'enable_seqscan=off' for all but the 4 or 5 most
> > common categories.  Those two factors combined will speed up my bulk
> > inserts a lot.
>
> Uh, enable_seqscan has nothing to do with how inserts are handled...

Um, that was a thinko!  What I was trying to say is: the randomisation
will speed up my bulk inserts, whereas enable_seqscan=off will speed up
the other slow queries in my application, namely "SELECT * WHERE
category='foo'" type queries.

> > Specifically, their effect on the index files.
>
> This particular problem couldn't be cured by batching inserts anyway.
> The performance problem was coming from the actual act of inserting
> a key (or more specifically, making room for the key) and that's just
> got to be done for each key AFAICS.

In principle, it could be helpful to know you're inserting 10000 tuples
with category='xyz'.  Then you only make the left-to-right scan once,
and you fill up every hole you see, before finally splitting the last
page and inserting approx (10000)/(num_per_page) new pages all at once.
This is surely quicker than doing the left-to-right scan for each tuple
(even though the difference would be far less noticeable in the
presence of your probabilistic patch).

Jules

-- 
Jules Bean                  |  Any sufficiently advanced
jules@debian.org            |  technology is indistinguishable
jules@jellybean.co.uk       |  from a perl script
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Jules Bean <jules@jellybean.co.uk> writes:
>> Is there any chance you could generate a patch against released 7.0.2
>> to add just this functionality... It would be the kiss of life for my
>> code!

> Will look at it.

I looked at it and decided that I don't want to mess with it.  The
BTP_CHAIN logic in 7.0 is so weird and fragile that it's hard to tell
what will happen if we try to split anything but the last page of a
chain of duplicates.  The idea I'd had of just dropping in the whole
btree module from current sources doesn't look very practical either;
a lot of changes that span module boundaries would have to be backed
out of it.

My recommendation is to try out current sources (from a nightly
snapshot or CVS).  I realize that running a non-released version might
make you uncomfortable, and rightly so; but I believe that current
sources are in pretty good shape right now.  In any case, working out
any bugs lurking in current strikes me as a much more profitable
activity than trying to back-patch the old btree code.

			regards, tom lane
Matthew Kirkwood wrote:
>
> On Sat, 26 Aug 2000, Jules Bean wrote:
>
> > Is there any simple way for Pg to combine inserts into one bulk?
> > Specifically, their effect on the index files.  It has always seemed
> > to me to be one of the (many) glaring flaws in SQL that the INSERT
> > statement only takes one row at a time.
>
> One of MySQL's little syntax abuses allows:
>
> INSERT INTO tab (col1, ..) VALUES (val1, ..), (val2, ..);

Wouldn't...

    INSERT INTO tab (col1, ..) VALUES (val1, ..);
    INSERT INTO tab (col1, ..) VALUES (val2, ..);

also avoid a database round trip??
On Sat, Aug 26, 2000 at 12:14:06PM +0100, Matthew Kirkwood wrote:
> On Sat, 26 Aug 2000, Jules Bean wrote:
>
> > Is there any simple way for Pg to combine inserts into one bulk?
> > Specifically, their effect on the index files.  It has always seemed
> > to me to be one of the (many) glaring flaws in SQL that the INSERT
> > statement only takes one row at a time.
>
> One of MySQL's little syntax abuses allows:
>
> INSERT INTO tab (col1, ..) VALUES (val1, ..), (val2, ..);
>
> which is nice for avoiding database round trips.  It's one
> of the reasons that mysql can do a bulk import so quickly.

COPY seems to be very fast ... I don't know why all these people use
multiple inserts ??? Just use COPY ...

yours,
oliver teuber
Oliver Teuber wrote:
>
> COPY seems to be very fast ... I don't know why all these people use
> multiple inserts ??? Just use COPY ...

Could COPY be extended to support a more SQL-friendly syntax, like

    COPY tablename FROM VALUES(
        (x1,y1,z1),
        (x2,y2,z2),
        (x3,y3,z3)
    );

The main reason I have not used COPY very much is its non-SQL syntax
for field values.  Extending the COPY command would probably be much
easier than speeding up INSERTs.

-------------
Hannu
> Could COPY be extended to support a more SQL-friendly syntax, like
>
>     COPY tablename FROM VALUES(
>         (x1,y1,z1),
>         (x2,y2,z2),
>         (x3,y3,z3)
>     );
>
> Extending the COPY command would probably be much easier than speeding
> up INSERTs.

That syntax is a lot like a real SQL9x INSERT.  Supporting multiple
rows for inserts is probably not that difficult; but since INSERT is
used so much, we would have to make sure we don't screw up something
else.  At the moment, expanding a line of SQL into multiple internal
querytree nodes is a bit crufty but is possible.

                    - Thomas
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> That syntax is a lot like a real SQL9x INSERT.

Multiple row constructors in INSERT is one of my to-do items for the
planned querytree redesign.  I have not thought it was worth messing
with until we're ready to bite that bullet, however.

			regards, tom lane
Thomas Lockhart wrote:
>
> > Could COPY be extended to support a more SQL-friendly syntax, like
> >     COPY tablename FROM VALUES(
> >         (x1,y1,z1),
> >         (x2,y2,z2),
> >         (x3,y3,z3)
> >     );
> > Extending the COPY command would probably be much easier than speeding
> > up INSERTs.
>
> That syntax is a lot like a real SQL9x INSERT.  Supporting multiple rows
> for inserts is probably not that difficult; but since INSERT is used so
> much, we would have to make sure we don't screw up something else.  At the
> moment, expanding a line of SQL into multiple internal querytree nodes
> is a bit crufty but is possible.

What I actually had in mind was a more SQL-like syntax for COPY, i.e.
no default arguments, all fields required etc., that would be easy to
bolt onto the current COPY machinery but still use 'SQL' syntax (no .
or \. or \\. for EOD, NULL for NULL values, quotes around strings ...).

------------
Hannu
Tom Lane wrote:
>
> Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> > That syntax is a lot like a real SQL9x INSERT.
>
> Multiple row constructors in INSERT is one of my to-do items for the
> planned querytree redesign.  I have not thought it was worth messing
> with until we're ready to bite that bullet, however.

What is the status of this querytree redesign?

I've spent some time trying to get a grip on the _exact_ meaning of the
WITH RECURSIVE syntax in SQL3/99, as I badly need it in a project of
mine.  (I know the _general_ meaning - it is for querying
tree-structured data ;)

The things the new querytree should address should be (at least ;) -

1. OUTER JOINs

2. WITH RECURSIVE

3. Support for positioned UPDATE & DELETE (probably requires a lot
   more than just querytree redesign)

4. Easier special-casing of optimisations (like using an index on x
   for 'select max(x) from t', but not for 'select max(x) from t
   where n=7')

Is the special mailing-list for querytree redesign active?

--------------
Hannu
Hannu Krosing <hannu@tm.ee> writes:
> What is the status of this querytree redesign?

Waiting for 7.2 cycle, as far as I know.

> The things the new querytree should address should be (at least ;) -
> 2. WITH RECURSIVE

I don't think RECURSIVE is a querytree issue --- it looks like a much
bigger problem than that :-(

The things I'm concerned about fixing with querytree redesign are
* full SQL92 joins
* subselects in FROM
* view bugs (grouping and aggregates in views)
* INSERT ... SELECT bugs
* reimplement UNION/INTERSECT/EXCEPT in a less hacky way; make cases
  like SELECT ... UNION ... ORDER BY work.  Not to mention UNION etc.
  in a subselect or in INSERT ... SELECT.
* convert WHERE x IN (subselect) to a join-like representation

These are all things that have gone unfixed for years because they're
essentially unfixable with the current single-level representation of
a query.

			regards, tom lane
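As an illustration of the UNION-plus-ORDER-BY case (a hypothetical
query, reusing the 'things' table from earlier in the thread):

    -- the ORDER BY applies to the whole UNION, which a single-level
    -- querytree has no clean place to represent
    SELECT name FROM things WHERE category = 'a'
    UNION
    SELECT name FROM things WHERE category = 'b'
    ORDER BY name;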
Hannu Krosing <hannu@tm.ee> writes:
> What I actually had in mind was a more SQL-like syntax for COPY,
> i.e. no default arguments, all fields required etc., that would be easy
> to bolt onto the current COPY machinery but still use 'SQL' syntax (no .
> or \. or \\. for EOD, NULL for NULL values, quotes around strings ...)

Seems like a reasonable idea, although I'd recommend sticking to the
convention that \. on a line means EOD, to avoid having to hack the
client-side libraries.  As long as you leave that alone, libpq,
libpgtcl, etc etc should be transparent to the copy data format.

			regards, tom lane
> The things I'm concerned about fixing with querytree redesign are ...

And it will be a building block for
* distributed databases
* replication
* CORBA/alternate protocols
* other cool stuff ;)

                    - Thomas
On Sat, Aug 26, 2000 at 05:09:53PM -0400, Tom Lane wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> > Jules Bean <jules@jellybean.co.uk> writes:
> >> Is there any chance you could generate a patch against released 7.0.2
> >> to add just this functionality... It would be the kiss of life for my
> >> code!
>
> > Will look at it.
>
> I looked at it and decided that I don't want to mess with it.  The
> BTP_CHAIN logic in 7.0 is so weird and fragile that it's hard to tell
> what will happen if we try to split anything but the last page of a
> chain of duplicates.  The idea I'd had of just dropping in the whole
> btree module from current sources doesn't look very practical either;
> a lot of changes that span module boundaries would have to be backed
> out of it.
>
> My recommendation is to try out current sources (from a nightly
> snapshot or CVS).  I realize that running a non-released version might
> make you uncomfortable, and rightly so; but I believe that current
> sources are in pretty good shape right now.  In any case, working out
> any bugs lurking in current strikes me as a much more profitable
> activity than trying to back-patch the old btree code.

OK.  Thanks very much for going through this with me.  I'm actually
going to simply do without the index for this release of my software --
it's an internal project, and the speed problems of having no index
aren't disastrous.  However, I'd rather not fight any instabilities;
this project runs long-term and mostly unattended.

I'll certainly keenly upgrade to 7.1 when it comes out, though, with
the new btree logic.

Jules
> Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> > What is the status of
> > partial indices?  Are they functional now, or have they been broken
> > forever (I'm not recalling)?
>
> They've been diked out of gram.y's syntax for CREATE INDEX at least
> since Postgres95.  No way to tell who did that, why or when, AFAIK.
> There is still an awful lot of support code for them, however.

It may have been me.  Not sure.  At the time, no one was sure what they
did, or if we even wanted them.  They were broken, I think.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> >     98304                                       22.07    5545984
> >     196608                                      45.60    11141120
> >     393216                                      92.53    22290432
> >
> > I tried probabilities from 0.67 to 0.999 and found that runtimes didn't
> > vary a whole lot (though this is near the minimum), while index size
> > consistently got larger as the probability of moving right decreased.
> > The runtime is nicely linear throughout the range.
>
> That looks brilliant!!  (Bearing in mind that I have over 10 million
> tuples in my table, you can imagine what performance was like for me!)
> Is there any chance you could generate a patch against released 7.0.2
> to add just this functionality... It would be the kiss of life for my
> code!
>
> (Not in a hurry, I'm not back in work until Wednesday, as it happens.)
>
> And, of course, what would /really/ get my code going speedily would
> be the partial indices mentioned elsewhere in this thread.  If the
> backend could automagically drop keys containing > 10% (tunable) of
> the rows from the index, then my index would be (a) about 70% smaller!
> and (b) only used when it's faster.  [This means it would have to
> update some simple histogram data.  However, I can't see that being
> much of an overhead.]
>
> For the short term, if I can get a working version of the above
> randomisation patch, I think I shall 'fake' a partial index by
> manually setting 'enable_seqscan=off' for all but the 4 or 5 most
> common categories.  Those two factors combined will speed up my bulk
> inserts a lot.

What would be really nifty is to take the most common value found by
VACUUM ANALYZE, and cause sequential scans if that value represents
more than 50% of the entries in the table.

Added to TODO:

* Prevent index lookups (or index entries using partial index) on most
  common values; instead use sequential scan

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> Matthew Kirkwood <matthew@hairy.beasts.org> writes:
> > One of MySQL's little syntax abuses allows:
> > INSERT INTO tab (col1, ..) VALUES (val1, ..), (val2, ..);
>
> Actually, that's perfectly standard SQL92, just an item we haven't
> got round to supporting yet.  (Until we do the fabled querytree
> restructuring, it seems a lot harder than it is worth.)

Added to TODO:

* INSERT INTO tab (col1, ..) VALUES (val1, ..), (val2, ..);

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
>>>> What is the status of partial indices?
>>
>> They've been diked out of gram.y's syntax for CREATE INDEX at least
>> since Postgres95.  No way to tell who did that, why or when, AFAIK.
>> There is still an awful lot of support code for them, however.

> It may have been me.  Not sure.  At the time, no one was sure what they
> did, or if we even wanted them.  They were broken, I think.

Wasn't you, unless you were hacking the code before it entered our CVS
system.  I checked the oldest CVS version (Postgres95 release) and saw
that partial indexes were disabled in gram.y in that version.

			regards, tom lane
> * Prevent index lookups (or index entries using partial index) on most
>   common values; instead use sequential scan

This behavior already exists for the most common value, and would exist
for any additional values that we had stats for.  Don't see why you
think a separate TODO item is needed.

			regards, tom lane
> > * Prevent index lookups (or index entries using partial index) on most
> >   common values; instead use sequential scan
>
> This behavior already exists for the most common value, and would
> exist for any additional values that we had stats for.  Don't see
> why you think a separate TODO item is needed.

You mean the optimizer already skips an index lookup for the most
common value, and instead does a sequential scan?  Seems you are way
ahead of me.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>>>> * Prevent index lookups (or index entries using partial index) on most
>>>>   common values; instead use sequential scan
>>
>> This behavior already exists for the most common value, and would
>> exist for any additional values that we had stats for.  Don't see
>> why you think a separate TODO item is needed.

> You mean the optimizer already skips an index lookup for the most common
> value, and instead does a sequential scan?

No, it goes for the sequential scan if it estimates the cost of the
indexscan as more than sequential.  Indexscan cost depends on estimated
number of retrieved rows --- which it can estimate from pg_statistic if
the query is WHERE column = mostcommonvalue.  So which plan you get
depends on just how common the most common value is.

Hard-wiring either choice of plan for the most common value would be
inferior to what the code already does, AFAICS.  But for values other
than the-most-common, we don't have adequate stats in pg_statistic, and
so you may or may not get a good estimated row count and hence a good
choice of plan.  That's what needs to be fixed.

			regards, tom lane
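Put in terms of Jules's table, the effect is easy to see with EXPLAIN
(a hypothetical session; which plan appears depends entirely on the
statistics gathered by VACUUM ANALYZE):

    -- most common value: estimated row count is huge, so the planner
    -- prices the indexscan above a sequential scan
    EXPLAIN SELECT * FROM things WHERE category = 'bigcat';

    -- rare value: small estimated row count, so the indexscan wins
    EXPLAIN SELECT * FROM things WHERE category = 'rarecat';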
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >>>> * Prevent index lookups (or index entries using partial index) on most
> >>>>   common values; instead use sequential scan
> >>
> >> This behavior already exists for the most common value, and would
> >> exist for any additional values that we had stats for.  Don't see
> >> why you think a separate TODO item is needed.
>
> > You mean the optimizer already skips an index lookup for the most common
> > value, and instead does a sequential scan?
>
> No, it goes for the sequential scan if it estimates the cost of the
> indexscan as more than sequential.  Indexscan cost depends on estimated
> number of retrieved rows --- which it can estimate from pg_statistic
> if the query is WHERE column = mostcommonvalue.  So which plan you get
> depends on just how common the most common value is.
>
> Hard-wiring either choice of plan for the most common value would be
> inferior to what the code already does, AFAICS.  But for values other
> than the-most-common, we don't have adequate stats in pg_statistic,
> and so you may or may not get a good estimated row count and hence
> a good choice of plan.  That's what needs to be fixed.

OK, I remember now.  If the most common value is used as a constant, it
uses the value from pg_statistic for most common, rather than using the
dispersion value.  That is great.

What I am more concerned about is a join that uses the most common
value.  We do an index scan in that case.  I wonder if we could get
something into the executor that would switch to sequential scan when
the most common value is hit.  Is that worth pursuing?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> What I am more concerned about is a join that uses the most common
> value.  We do an index scan in that case.

No, we do whichever plan looks cheapest.  Again, it's all about
statistics.

Right now, eqjoinsel() is just a stub that returns a constant
selectivity estimate.  It might be useful to compute some more
sophisticated value based on pg_statistic entries for the two columns,
but right now I doubt you could tell much.  Should keep the join case
in mind when we extend the statistics...

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > What I am more concerned about is a join that uses the most common
> > value.  We do an index scan in that case.
>
> No, we do whichever plan looks cheapest.  Again, it's all about
> statistics.
>
> Right now, eqjoinsel() is just a stub that returns a constant
> selectivity estimate.  It might be useful to compute some more
> sophisticated value based on pg_statistic entries for the two
> columns, but right now I doubt you could tell much.  Should keep
> the join case in mind when we extend the statistics...

OK, let me be more specific.  Suppose the most common value in a column
is 3.  For a query "col = 3", we know 3 is most common, and use the
most common statistics rather than the dispersion statistic, right?

OK, let's assume use of the most common statistic causes a sequential
scan, but use of dispersion causes an index scan.  The query "col = 3"
uses a sequential scan.  In the query "col = tab2.col2", the dispersion
statistic is used, causing an index scan.

However, assume tab2.col2 equals 3.  I assume this would cause an index
scan because the executor doesn't know about the most common value,
right?  Is it worth trying to improve that?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> However, assume tab2.col2 equals 3.  I assume this would cause an index
> scan because the executor doesn't know about the most common value,
> right?  Is it worth trying to improve that?

Oh, I see: you are assuming that a nestloop join is being done, and
wondering if it's worthwhile to switch dynamically between seqscan and
indexscan for each scan of the inner relation, depending on exactly
what value is being supplied from the outer relation for that scan.
Hmm.

Not sure if it's worth the trouble or not.  Nestloop is usually a
last-resort join strategy anyway, and is unlikely to be picked when the
tables are large enough to make performance be a big issue.

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > However, assume tab2.col2 equals 3.  I assume this would cause an index
> > scan because the executor doesn't know about the most common value,
> > right?  Is it worth trying to improve that?
>
> Oh, I see: you are assuming that a nestloop join is being done, and
> wondering if it's worthwhile to switch dynamically between seqscan
> and indexscan for each scan of the inner relation, depending on exactly
> what value is being supplied from the outer relation for that scan.
> Hmm.
>
> Not sure if it's worth the trouble or not.  Nestloop is usually a
> last-resort join strategy anyway, and is unlikely to be picked when the
> tables are large enough to make performance be a big issue.

Yes, I realize only nested loop has this problem.  Mergejoin and
hashjoin actually would grab the whole table via sequential scan, so
the index is not involved, right?

Let me ask: if I do the query "tab1.col = tab2.col and tab2.col = 3",
the system would use an index to get tab2.col, but then what method
would it use to join to tab1?  Nested loop, because it thinks it is
going to get only one row from tab1.col1.  If 3 is the most common
value in tab1.col, it is going to get lots more than one row, right?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Yes, I realize only nested loop has this problem.  Mergejoin and
> hashjoin actually would grab the whole table via sequential scan, so
> the index is not involved, right?

They'd grab the whole table after applying restriction clauses.  An
indexscan might be used if there's an appropriate restriction clause
for either table, or to sort a table for merge join...

> Let me ask: if I do the query "tab1.col = tab2.col and tab2.col = 3",
> the system would use an index to get tab2.col, but then what method
> would it use to join to tab1?  Nested loop, because it thinks it is
> going to get only one row from tab1.col1.

I don't think it'd think that.  The optimizer is not presently smart
enough to make the transitive deduction that tab1.col = 3 (it does
recognize transitive equality of Vars, but doesn't extend that to
non-variable values).  So it won't see any restriction clause for tab1
here.

If it thinks that tab2.col = 3 will yield one row, it might well choose
a nested loop with tab2 as the outer, rather than merge or hash join.
So an inner indexscan for tab1 is definitely a possible plan.

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Yes, I realize only nested loop has this problem.  Mergejoin and
> > hashjoin actually would grab the whole table via sequential scan, so
> > the index is not involved, right?
>
> They'd grab the whole table after applying restriction clauses.  An
> indexscan might be used if there's an appropriate restriction clause
> for either table, or to sort a table for merge join...
>
> > Let me ask: if I do the query "tab1.col = tab2.col and tab2.col = 3",
> > the system would use an index to get tab2.col, but then what method
> > would it use to join to tab1?  Nested loop, because it thinks it is
> > going to get only one row from tab1.col1.
>
> I don't think it'd think that.  The optimizer is not presently smart
> enough to make the transitive deduction that tab1.col = 3 (it does
> recognize transitive equality of Vars, but doesn't extend that to
> non-variable values).  So it won't see any restriction clause for
> tab1 here.
>
> If it thinks that tab2.col = 3 will yield one row, it might well choose
> a nested loop with tab2 as the outer, rather than merge or hash join.
> So an inner indexscan for tab1 is definitely a possible plan.

Yes, that was my point: a nested loop could easily be involved if the
joined table has a restriction.  Is there a TODO item here?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> So an inner indexscan for tab1 is definitely a possible plan.

> Yes, that was my point: a nested loop could easily be involved if the
> joined table has a restriction.  Is there a TODO item here?

More like a "to investigate" --- I'm not sold on the idea that a
dynamic switch in plan types would be a win.  Maybe it would be, but...

One thing to think about is that it'd be critically dependent on having
accurate statistics.  Currently, the planner only places bets on the
average behavior over a whole join.  If you make a separate bet on each
scan, then you open up the risk of betting wrong every time, should
your stats be out-of-date or otherwise misleading.

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> So an inner indexscan for tab1 is definitely a possible plan.
>
> > Yes, that was my point: a nested loop could easily be involved if the
> > joined table has a restriction.  Is there a TODO item here?
>
> More like a "to investigate" --- I'm not sold on the idea that a
> dynamic switch in plan types would be a win.  Maybe it would be,
> but...
>
> One thing to think about is that it'd be critically dependent on having
> accurate statistics.  Currently, the planner only places bets on the
> average behavior over a whole join.  If you make a separate bet on each
> scan, then you open up the risk of betting wrong every time, should
> your stats be out-of-date or otherwise misleading.

I agree.  Not sure how to approach this, but I am sure it is dealt with
by most database systems.  Can someone find out how other DBs handle
this?  Is there any research on it?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026