Discussion: CLUSTER and indisclustered
Hi all,

It occurred to me on the plane home that now that CLUSTER is fixed we may be able to put pg_index.indisclustered to use. If CLUSTER was to set indisclustered to true when it clusters a heap according to the given index, we could speed up sequential scans. There are two possible ways.

1) Planner determines that a seqscan is appropriate *and* the retrieval is qualified by the key(s) of one of the relation's indexes
2) Planner determines that the relation is clustered on disk according to the index over the key(s) used to qualify the retrieval
3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?)
4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster()?
5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie, different from SeqNext) called SeqClusterNext
6) SeqClusterNext() has all the heapgettup() logic with two exceptions: a) we find the first tuple more intelligently (instead of scanning from the first page); b) if we have already found tuple(s) matching the ScanKey when we encounter a non-matching tuple (via HeapTupleSatisfies()?), we return a NULL'ed out tuple, terminating the scan

Any reason this isn't possible? Any reason it couldn't dramatically speed up the performance of the type of query I've mentioned?

Gavin
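A rough sketch of step 6's two refinements, in Python rather than the executor's C (the page layout, ScanKey machinery, and the `cluster_scan` helper are all invented for illustration; `pages` stands in for a heap that is perfectly clustered on the key):

```python
# Toy model of a cluster-aware seqscan: binary-search for the first page
# that can contain the key range, then stop at the first page past it.
import bisect

def cluster_scan(pages, lo_key, hi_key):
    """pages: list of sorted key lists (a heap clustered on the key).
    Return all keys k with lo_key <= k <= hi_key, touching only the
    pages that can contain them."""
    firsts = [p[0] for p in pages]                 # first key on each page
    start = max(bisect.bisect_right(firsts, lo_key) - 1, 0)
    out = []
    for page in pages[start:]:
        if page[0] > hi_key:                       # clustered => nothing later matches
            break
        out.extend(k for k in page if lo_key <= k <= hi_key)
    return out

pages = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
print(cluster_scan(pages, 5, 8))                   # [5, 6, 7, 8]; only pages 1..2 scanned
```

The same scan without clustering knowledge would have to read all four pages.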
Gavin Sherry <swm@linuxworld.com.au> writes: > It occurred to me on the plane home that now that CLUSTER is fixed we may > be able to put pg_index.indisclustered to use. If CLUSTER was to set > indisclustered to true when it clusters a heap according to the given > index, we could speed up sequential scans. AFAICT you're assuming that the table is *exactly* ordered by the clustered attribute. While this is true at the instant CLUSTER completes, the exact ordering will be destroyed by the first insert or update :-(. I can't see much value in creating a whole new scan type that's only usable on a perfectly-clustered table. The existing approach to making the planner smart about clustered tables is to compute a physical-vs-logical-order-correlation statistic and use that to adjust the estimated cost of indexscans. I believe this is a more robust approach than considering a table to be "clustered" or "not clustered", since it can deal with the gradual degradation of clustered order over time. However, I will not make any great claims for the specific equations currently used for this purpose --- they're surely in need of improvement. Feel free to take a look and see if you have any ideas. The collection of the statistic is in commands/analyze.c and the use of it is in optimizer/path/costsize.c. regards, tom lane
On Sat, 3 Aug 2002, Tom Lane wrote: > Gavin Sherry <swm@linuxworld.com.au> writes: > > It occurred to me on the plane home that now that CLUSTER is fixed we may > > be able to put pg_index.indisclustered to use. If CLUSTER was to set > > indisclustered to true when it clusters a heap according to the given > > index, we could speed up sequential scans. > > AFAICT you're assuming that the table is *exactly* ordered by the > clustered attribute. While this is true at the instant CLUSTER > completes, the exact ordering will be destroyed by the first insert or > update :-(. I can't see much value in creating a whole new scan type Sorry, I meant to say that heap_insert() etc would need to set indisclustered to false. I do see some worth in this, however. Naturally, in a situation where a database is being modified very often this is of little value. However, for applications focussed on analysing large amounts of static data this could increase performance significantly. Once I get some time I will attempt to explore this further in `diff -c` format :-). Gavin
Tom Lane wrote: > Gavin Sherry <swm@linuxworld.com.au> writes: > > It occurred to me on the plane home that now that CLUSTER is fixed we may > > be able to put pg_index.indisclustered to use. If CLUSTER was to set > > indisclustered to true when it clusters a heap according to the given > > index, we could speed up sequential scans. > > AFAICT you're assuming that the table is *exactly* ordered by the > clustered attribute. While this is true at the instant CLUSTER > completes, the exact ordering will be destroyed by the first insert or > update :-(. I can't see much value in creating a whole new scan type > that's only usable on a perfectly-clustered table. > > The existing approach to making the planner smart about clustered tables > is to compute a physical-vs-logical-order-correlation statistic and use > that to adjust the estimated cost of indexscans. I believe this is a > more robust approach than considering a table to be "clustered" or "not > clustered", since it can deal with the gradual degradation of clustered > order over time. However, I will not make any great claims for the > specific equations currently used for this purpose --- they're surely in > need of improvement. Feel free to take a look and see if you have any > ideas. The collection of the statistic is in commands/analyze.c and the > use of it is in optimizer/path/costsize.c. Tom, should we be updating that flag after we CLUSTER instead of requiring an ANALYZE after the CLUSTER? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Gavin Sherry wrote: > Hi all, > > It occurred to me on the plane home that now that CLUSTER is fixed we may > be able to put pg_index.indisclustered to use. If CLUSTER was to set > indisclustered to true when it clusters a heap according to the given > index, we could speed up sequential scans. There are two possible ways. > > 1) Planner determines that a seqscan is appropriate *and* the retrieval is > qualified by the key(s) of one of the relation's indexes > 2) Planner determines that the relation is clustered on disk according to > the index over the key(s) used to qualify the retrieval > 3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?) > 4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster() ? > 5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie, > different from SeqNext) called SeqClusterNext > 6) SeqClusterNext() has all the heapgettup() logic with two > exceptions: a) we find the first tuple more intelligently (instead of > scanning from the first page) b) if we have found tuple(s) matching the > ScanKey when we encounter a non-matching tuple (via > HeapTupleSatisfies() ?) we return a NULL'ed out tuple, terminating the > scan Gavin, is that a big win compared to just using the index and looping through the entries, knowing that the index matches are on the same page, and the heap matches are on the same page? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Sat, 3 Aug 2002, Bruce Momjian wrote: > Gavin Sherry wrote: > > Hi all, > > > > It occurred to me on the plane home that now that CLUSTER is fixed we may > > be able to put pg_index.indisclustered to use. If CLUSTER was to set > > indisclustered to true when it clusters a heap according to the given > > index, we could speed up sequential scans. There are two possible ways. > > > > 1) Planner determines that a seqscan is appropriate *and* the retrieval is > > qualified by the key(s) of one of the relation's indexes > > 2) Planner determines that the relation is clustered on disk according to > > the index over the key(s) used to qualify the retrieval > > 3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?) > > 4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster() ? > > 5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie, > > different from SeqNext) called SeqClusterNext > > 6) SeqClusterNext() has all the heapgettup() logic with two > > exceptions: a) we find the first tuple more intelligently (instead of > > scanning from the first page) b) if we have found tuple(s) matching the > > ScanKey when we encounter a non-matching tuple (via > > HeapTupleSatisfies() ?) we return a NULL'ed out tuple, terminating the > > scan > > Gavin, is that a big win compared to just using the index and looping > through the entries, knowing that the index matches are on the same > page, and the heap matches are on the same page? Bruce, It would cut out the index overhead. Besides, at (1) (above) we would have determined that an index scan was too expensive and we would be using a SeqScan instead. This would just be faster, since a) we would locate the tuples more intelligently b) we wouldn't need to scan the whole heap once we'd found all tuples matching the scan key. Gavin
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom, should we be updating that flag after we CLUSTER instead of > > requiring an ANALYZE after the CLUSTER? > > Could do that I suppose, but I'm not super-excited about it. ANALYZE is > quite cheap these days (especially in comparison to CLUSTER ;-)). I'd > settle for a note in the CLUSTER docs that recommends a subsequent > ANALYZE --- this seems no different from recommending ANALYZE after bulk > data load or other major update of a table. OK. I am sure it is not obvious to people to ANALYZE because the data in their table hasn't changed, just the ordering. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Gavin Sherry wrote: > > Gavin, is that a big win compared to just using the index and looping > > through the entries, knowing that the index matches are on the same > > page, and the heap matches are on the same page? > > Bruce, > > It would cut out the index overhead. Besides, at (1) (above) we would have > determined that an index scan was too expensive and we would be using a > SeqScan instead. This would just be faster, since a) we would locate the > tuples more intelligently b) we wouldn't need to scan the whole heap once > we'd found all tuples matching the scan key. Yes, but in a clustered table, an index scan is _never_ (?) more expensive than a sequential scan, at least if the optimizer is working correctly. Index scans are slower only because they assume random heap access, but with a clustered table, there is no random heap access. The index takes you right to the spot to start. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Gavin Sherry <swm@linuxworld.com.au> writes: > On Sat, 3 Aug 2002, Tom Lane wrote: >> AFAICT you're assuming that the table is *exactly* ordered by the >> clustered attribute. While this is true at the instant CLUSTER >> completes, the exact ordering will be destroyed by the first insert or >> update :-(. I can't see much value in creating a whole new scan type > Sorry, I meant to say that heap_insert() etc would need to set > indisclustered to false. <<itch>> You could do that, but only if you are prepared to invent a mechanism that will instantly invalidate any existing query plans that assume the clustered ordering is good. Up to now we've only allowed the planner to make decisions that impact performance, not correctness of the result. I'm uncomfortable with the idea that a "clusterscan" plan could silently return wrong answers after someone else updates the table and doesn't tell us they did. regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom, should we be updating that flag after we CLUSTER instead of > requiring an ANALYZE after the CLUSTER? Could do that I suppose, but I'm not super-excited about it. ANALYZE is quite cheap these days (especially in comparison to CLUSTER ;-)). I'd settle for a note in the CLUSTER docs that recommends a subsequent ANALYZE --- this seems no different from recommending ANALYZE after bulk data load or other major update of a table. regards, tom lane
Gavin Sherry wrote: >On Sat, 3 Aug 2002, Bruce Momjian wrote: > >>Gavin Sherry wrote: >> >>>Hi all, >>> >>>It occurred to me on the plane home that now that CLUSTER is fixed we may >>>be able to put pg_index.indisclustered to use. If CLUSTER was to set >>>indisclustered to true when it clusters a heap according to the given >>>index, we could speed up sequential scans. There are two possible ways. >>> >>>1) Planner determines that a seqscan is appropriate *and* the retrieval is >>>qualified by the key(s) of one of the relation's indexes >>>2) Planner determines that the relation is clustered on disk according to >>>the index over the key(s) used to qualify the retrieval >>>3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?) >>>4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster() ? >>>5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie, >>>different from SeqNext) called SeqClusterNext >>>6) SeqClusterNext() has all the heapgettup() logic with two >>>exceptions: a) we find the first tuple more intelligently (instead of >>>scanning from the first page) b) if we have found tuple(s) matching the >>>ScanKey when we encounter a non-matching tuple (via >>>HeapTupleSatisfies() ?) we return a NULL'ed out tuple, terminating the >>>scan >>> >>Gavin, is that a big win compared to just using the index and looping >>through the entries, knowing that the index matches are on the same >>page, and the heap matches are on the same page? >> > >Bruce, > >It would cut out the index overhead. Besides, at (1) (above) we would have >determined that an index scan was too expensive and we would be using a >SeqScan instead. This would just be faster, since a) we would locate the >tuples more intelligently b) we wouldn't need to scan the whole heap once >we'd found all tuples matching the scan key.
> >Gavin > Gavin and Bruce, I am not so sure index access in these cases is such an overhead - since the clustered nature of the table means that many index entries will point to table data that is located in a few pages. Ok, this change would save you the initial access of the index structure itself - but isn't the usual killer for indexes the "thrashing" that happens when the "pointed to" table data is spread over many pages? I did some tests on this a while ago (7.1 era) and discovered that for a clustered table a sequential scan did not start to win until the target dataset was ~30% of the table itself. While the suggested change would of course mean that seq scans would do much better than they did in my tests, it would be interesting to see some timings... regards Mark
On Sun, 4 Aug 2002, Mark Kirkwood wrote: > Ok, this change would save you the initial access of the index > structure itself - but isn't the usual killer for indexes the > "thrashing" that happens when the "pointed to" table data is spread > over many pages? Yeah, no kidding on this one. I've reduced queries from 75 seconds to 0.6 seconds by clustering on the appropriate field. But after doing some benchmarking of various sorts of random reads and writes, it occurred to me that there might be optimizations that could help a lot with this sort of thing. What if, when we've got an index block with a bunch of entries, instead of doing the reads in the order of the entries, we do them in the order of the blocks the entries point to? That would introduce a certain amount of "sequentialness" to the reads that the OS is not capable of introducing (since it can't reschedule the reads you're doing, the way it could reschedule, say, random writes). cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
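Curt's reordering idea fits in a few lines of Python (the helper name and TID tuples are invented for illustration; in the server this would happen inside the index scan code, on real block numbers and line pointers):

```python
# Sort the TIDs from one index page into heap-block order and group
# them, so each heap block is read exactly once and in ascending order.
from itertools import groupby

def order_fetches(tids):
    """tids: list of (block, offset) pairs in index order."""
    by_block = sorted(tids)
    return [(blk, [off for _, off in grp])
            for blk, grp in groupby(by_block, key=lambda t: t[0])]

index_order = [(17, 3), (2, 1), (17, 8), (5, 2), (2, 9)]
print(order_fetches(index_order))
# [(2, [1, 9]), (5, [2]), (17, [3, 8])] -- blocks 2, 5, 17, one read each
```

Five index-order fetches (with two revisits of block 17 and 2) collapse into three ascending block reads, which is exactly the "sequentialness" the OS cannot create on its own.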
On Wed, 7 Aug 2002, Curt Sampson wrote: > But after doing some benchmarking of various sorts of random reads > and writes, it occurred to me that there might be optimizations > that could help a lot with this sort of thing. What if, when we've > got an index block with a bunch of entries, instead of doing the > reads in the order of the entries, we do them in the order of the > blocks the entries point to? That would introduce a certain amount > of "sequentialness" to the reads that the OS is not capable of > introducing (since it can't reschedule the reads you're doing, the > way it could reschedule, say, random writes). This sounds more or less like the method employed by Firebird as described by Ann Douglas to Tom at OSCON (correct me if I get this wrong). Basically, firebird populates a bitmap with entries the scan is interested in. The bitmap is populated in page order so that all entries on the same heap page can be fetched at once. This is totally different to the way postgres does things and would require significant modification to the index access methods. Gavin
Curt Sampson <cjs@cynic.net> writes: > But after doing some benchmarking of various sorts of random reads > and writes, it occurred to me that there might be optimizations > that could help a lot with this sort of thing. What if, when we've > got an index block with a bunch of entries, instead of doing the > reads in the order of the entries, we do them in the order of the > blocks the entries point to? I thought to myself "didn't I just post something about that?" and then realized it was on a different mailing list. Here ya go (and no, this is not the first time around on this list either...) I am currently thinking that bitmap indexes per se are not all that interesting. What does interest me is bitmapped index lookup, which came back into mind after hearing Ann Harrison describe how FireBird/InterBase does it. The idea is that you don't scan the index and base table concurrently as we presently do it. Instead, you scan the index and make a list of the TIDs of the table tuples you need to visit. This list can be conveniently represented as a sparse bitmap. After you've finished looking at the index, you visit all the required table tuples *in physical order* using the bitmap. This eliminates multiple fetches of the same heap page, and can possibly let you get some win from sequential access. Once you have built this mechanism, you can then move on to using multiple indexes in interesting ways: you can do several indexscans in one query and then AND or OR their bitmaps before doing the heap scan. This would allow, for example, "WHERE a = foo and b = bar" to be handled by ANDing results from separate indexes on the a and b columns, rather than having to choose only one index to use as we do now. Some thoughts about implementation: FireBird's implementation seems to depend on an assumption about a fixed number of tuple pointers per page. We don't have that, but we could probably get away with just allocating BLCKSZ/sizeof(HeapTupleHeaderData) bits per page.
Also, the main downside of this approach is that the bitmap could get large --- but you could have some logic that causes you to fall back to plain sequential scan if you get too many index hits. (It's interesting to think of this as lossy compression of the bitmap... which leads to the idea of only being fuzzy in limited areas of the bitmap, rather than losing all the information you have.) A possibly nasty issue is that lazy VACUUM has some assumptions in it about indexscans holding pins on index pages --- that's what prevents it from removing heap tuples that a concurrent indexscan is just about to visit. It might be that there is no problem: even if lazy VACUUM removes a heap tuple and someone else then installs a new tuple in that same TID slot, you should be okay because the new tuple is too new to pass your visibility test. But I'm not convinced this is safe. regards, tom lane
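A toy model of the bitmapped lookup Tom describes, using Python sets in place of the real (nontrivial) bitmap structure; the index contents and names (`idx_a`, `idx_b`, `index_probe`) are invented for illustration:

```python
# Scan each index to collect TIDs, AND the "bitmaps", then visit the
# heap in physical (block, offset) order -- each heap page only once.
def index_probe(index, key):
    """index: dict mapping key -> set of (block, offset) TIDs."""
    return set(index.get(key, ()))

def heap_visit_order(bitmap):
    """Matching tuples in physical order, ready for the heap scan."""
    return sorted(bitmap)

# Two hypothetical single-column indexes on the same table.
idx_a = {'foo': {(1, 2), (4, 1), (9, 3)}}
idx_b = {'bar': {(4, 1), (9, 3), (2, 5)}}

# WHERE a = 'foo' AND b = 'bar' via two index scans plus a bitmap AND:
matches = index_probe(idx_a, 'foo') & index_probe(idx_b, 'bar')
print(heap_visit_order(matches))   # [(4, 1), (9, 3)]
```

An OR qual would use `|` instead of `&`; either way only one ordered pass over the heap remains.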
On Wed, 7 Aug 2002, Tom Lane wrote: > I thought to myself "didn't I just post something about that?" > and then realized it was on a different mailing list. Here ya go > (and no, this is not the first time around on this list either...) Wow. I'm glad to see you looking at this, because this feature would do *so* much for the performance of some of my queries, and really, really impress my "billion-row-database" client. > The idea is that you don't scan the index and base table concurrently > as we presently do it. Instead, you scan the index and make a list > of the TIDs of the table tuples you need to visit. Right. > Also, the main downside of this approach is that the bitmap could > get large --- but you could have some logic that causes you to fall > back to plain sequential scan if you get too many index hits. Well, what I was thinking of, should the list of TIDs to fetch get too long, was just to break it down in to chunks. If you want to limit to, say, 1000 TIDs, and your index has 3000, just do the first 1000, then the next 1000, then the last 1000. This would still result in much less disk head movement and speed the query immensely. (BTW, I have verified this empirically during testing of random read vs. random write on a RAID controller. The writes were 5-10 times faster than the reads because the controller was caching a number of writes and then doing them in the best possible order, whereas the reads had to be satisfied in the order they were submitted to the controller.) cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
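Curt's chunking fallback is easy to sketch (function name and chunk size are illustrative only): sort each fixed-size chunk of TIDs into block order, so most of the seek savings survive even when the whole list won't fit in memory.

```python
# Process an over-long TID list in chunks, sorting each chunk into
# heap-block order before fetching.
def chunked_fetch_order(tids, chunk=1000):
    out = []
    for i in range(0, len(tids), chunk):
        out.extend(sorted(tids[i:i + chunk]))   # block order within chunk
    return out

tids = [(9, 1), (2, 3), (7, 2), (1, 1), (8, 4), (3, 2)]
print(chunked_fetch_order(tids, chunk=3))
# [(2, 3), (7, 2), (9, 1), (1, 1), (3, 2), (8, 4)]
```

As Tom notes in the follow-up, the price of chunking is that whole-index bitmaps are never materialized, so cross-index AND/OR combining is lost.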
Curt Sampson <cjs@cynic.net> writes: > On Wed, 7 Aug 2002, Tom Lane wrote: >> Also, the main downside of this approach is that the bitmap could >> get large --- but you could have some logic that causes you to fall >> back to plain sequential scan if you get too many index hits. > Well, what I was thinking of, should the list of TIDs to fetch get too > long, was just to break it down in to chunks. But then you lose the possibility of combining multiple indexes through bitmap AND/OR steps, which seems quite interesting to me. If you've visited only a part of each index then you can't apply that concept. Another point to keep in mind is that the bigger the bitmap gets, the less useful an indexscan is, by definition --- sooner or later you might as well fall back to a seqscan. So the idea of lossy compression of a large bitmap seems really ideal to me. In principle you could seqscan the parts of the table where matching tuples are thick on the ground, and indexscan the parts where they ain't. Maybe this seems natural to me as an old JPEG campaigner, but if you don't see the logic I recommend thinking about it a little ... regards, tom lane
On Wed, 7 Aug 2002, Tom Lane wrote: > But then you lose the possibility of combining multiple indexes through > bitmap AND/OR steps, which seems quite interesting to me. If you've > visited only a part of each index then you can't apply that concept. Right. It'd be a shame to lose that, but a little is better than nothing at all, if one ends up being faced with that decision. > Another point to keep in mind is that the bigger the bitmap gets, the > less useful an indexscan is, by definition --- sooner or later you might > as well fall back to a seqscan. Well, yes, so long as you choose the correct values of "big." I'd want this to be able to optimize queries against a two billion row table about 150 GB in size. And that might even get bigger in a few years. > Maybe this seems natural > to me as an old JPEG campaigner, but if you don't see the logic I > recommend thinking about it a little ... Well, photos are certainly not random, but database tables may be in essentially random order far more often. How much that applies, I'm not sure, since I don't really know a lot about this stuff. I'll take your word for it on what's best there. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
On Wed, 2002-08-07 at 10:12, Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > On Wed, 7 Aug 2002, Tom Lane wrote: > >> Also, the main downside of this approach is that the bitmap could > >> get large --- but you could have some logic that causes you to fall > >> back to plain sequential scan if you get too many index hits. > > > Well, what I was thinking of, should the list of TIDs to fetch get too > > long, was just to break it down in to chunks. > > But then you lose the possibility of combining multiple indexes through > bitmap AND/OR steps, which seems quite interesting to me. If you've > visited only a part of each index then you can't apply that concept. When the tuples are small relative to pagesize, you may get some "compression" by saving just pages and not the actual tids in the bitmap. ------------- Hannu
On Wed, 2002-08-07 at 06:46, Hannu Krosing wrote: > On Wed, 2002-08-07 at 10:12, Tom Lane wrote: > > Curt Sampson <cjs@cynic.net> writes: > > > On Wed, 7 Aug 2002, Tom Lane wrote: > > >> Also, the main downside of this approach is that the bitmap could > > >> get large --- but you could have some logic that causes you to fall > > >> back to plain sequential scan if you get too many index hits. > > > > > Well, what I was thinking of, should the list of TIDs to fetch get too > > > long, was just to break it down in to chunks. > > > > But then you lose the possibility of combining multiple indexes through > > bitmap AND/OR steps, which seems quite interesting to me. If you've > > visited only a part of each index then you can't apply that concept. > > When the tuples are small relative to pagesize, you may get some > "compression" by saving just pages and not the actual tids in the > bitmap. Now I remembered my original preference for page bitmaps (vs. tuple bitmaps): one can't actually make good use of a bitmap of tuples because there is no fixed tuples/page ratio and thus no way to quickly go from bit position to actual tuple. You mention the same problem but propose a different solution. Using page bitmap, we will at least avoid fetching any unneeded pages - essentially we will have a sequential scan over possibly interesting pages. If we were to use page-bitmap index for something with only a few values like booleans, some insert-time local clustering should be useful, so that TRUEs and FALSEs end up on different pages. But I guess that CLUSTER support for INSERT will not be touched for 7.3 as will real bitmap indexes ;) --------------- Hannu
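Hannu's page-level "compression" amounts to collapsing TIDs to page numbers (Python set standing in for a real bitmap; since per-tuple information is discarded, every flagged page must have the qual rechecked on its tuples, an assumption made explicit here):

```python
# Keep one bit per heap page instead of one per tuple: smaller bitmap,
# fixed bit-to-page mapping, at the cost of a recheck on each page.
def page_bitmap(tids):
    """Collapse a TID set to the set of page numbers needing a visit."""
    return {block for block, _ in tids}

tids = {(4, 1), (4, 7), (4, 9), (11, 2)}
pages = page_bitmap(tids)
print(sorted(pages))   # [4, 11] -- 4 tuple entries collapsed to 2 page visits
```

This also sidesteps the variable tuples/page problem Hannu raises: bit N always means heap page N.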
Hannu Krosing <hannu@tm.ee> writes: > Now I remembered my original preference for page bitmaps (vs. tuple > bitmaps): one can't actually make good use of a bitmap of tuples because > there is no fixed tuples/page ratio and thus no way to quickly go from > bit position to actual tuple. You mention the same problem but propose a > different solution. > Using page bitmap, we will at least avoid fetching any unneeded pages - > essentially we will have a sequential scan over possibly interesting > pages. Right. One form of the "lossy compression" idea I suggested is to switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets too large to work with. Again, one could imagine doing that only in denser areas of the bitmap. > But I guess that CLUSTER support for INSERT will not be touched for 7.3 > as will real bitmap indexes ;) All of this is far-future work I think. Adding a new scan type to the executor would probably be pretty localized, but the ramifications in the planner could be extensive --- especially if you want to do plans involving ANDed or ORed bitmaps. regards, tom lane
On Wed, 2002-08-07 at 15:26, Tom Lane wrote: > Hannu Krosing <hannu@tm.ee> writes: > > Now I remembered my original preference for page bitmaps (vs. tuple > > bitmaps): one can't actually make good use of a bitmap of tuples because > > there is no fixed tuples/page ratio and thus no way to quickly go from > > bit position to actual tuple. You mention the same problem but propose a > > different solution. > > > Using page bitmap, we will at least avoid fetching any unneeded pages - > > essentially we will have a sequential scan over possibly interesting > > pages. > > Right. One form of the "lossy compression" idea I suggested is to > switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets > too large to work with. If it is a real bitmap, should it not be easiest to allocate at the start? A page bitmap for a 100,000,000 tuple table with 10 tuples/page will be sized 10,000,000 bits / 8 = 1.25 MB, which does not look too big to me for that amount of data (the data table itself would occupy 80 GB). Even having the bitmap of 16 bits/page (with the bits 0-14 meaning tuples 0-14 and bit 15 meaning "seq scan the rest of page") would consume just 20 MB of _local_ memory, and would be quite justifiable for a query on a table that large. For a real bitmap index the tuples-per-page should be a user-supplied tuning parameter. > Again, one could imagine doing that only in denser areas of the bitmap. I would hardly call the resulting structure "a bitmap" ;) And I'm not sure the overhead for a more complex structure would win us any additional performance for most cases. > > But I guess that CLUSTER support for INSERT will not be touched for 7.3 > > as will real bitmap indexes ;) > > All of this is far-future work I think.
After we do that we will probably be able to claim support for "datawarehousing" ;) > Adding a new scan type to the > executor would probably be pretty localized, but the ramifications in > the planner could be extensive --- especially if you want to do plans > involving ANDed or ORed bitmaps. Also, going to a "smart inserter" which can do local clustering on sets of real bitmap indexes for INSERTs (and the INSERT side of UPDATE) would probably be a major change from our current "stupid inserter" ;) This will not be needed for bitmap resolution higher than 1 bit/page, but default local clustering on bitmap indexes will probably buy us some extra performance by avoiding data page fetches when such indexes are used. And anyway, the support for INSERT being aware of clustering will probably come up sometime. ------------ Hannu
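Hannu's page-bitmap sizing from earlier in this message can be re-derived (assuming 8 KB heap pages and decimal megabytes; the variable names are ours):

```python
# 100M tuples at 10 tuples/page => 10M heap pages.
tuples = 100_000_000
pages = tuples // 10                      # 10,000,000 heap pages

one_bit_mb = pages / 8 / 1e6              # 1 bit per page   -> 1.25 MB
sixteen_bit_mb = pages * 16 / 8 / 1e6     # 16 bits per page -> 20.0 MB
heap_gb = pages * 8192 / 1e9              # 8 KB per page    -> ~82 GB ("80 GB" rounded)

print(one_bit_mb, sixteen_bit_mb)         # 1.25 20.0
```

So even the 16-bits-per-page variant stays around 20 MB of local memory against an ~80 GB heap, as claimed.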
On Wed, 2002-08-07 at 04:31, Curt Sampson wrote: > On Sun, 4 Aug 2002, Mark Kirkwood wrote: > > > Ok, this change would save you the initial access of the index > > structure itself - but isn't the usual killer for indexes the > > "thrashing" that happens when the "pointed to" table data is spread > > over many pages? > > Yeah, no kidding on this one. I've reduced queries from 75 seconds > to 0.6 seconds by clustering on the appropriate field. > > But after doing some benchmarking of various sorts of random reads > and writes, it occurred to me that there might be optimizations > that could help a lot with this sort of thing. What if, when we've > got an index block with a bunch of entries, instead of doing the > reads in the order of the entries, we do them in the order of the > blocks the entries point to? That would introduce a certain amount > of "sequentialness" to the reads that the OS is not capable of > introducing (since it can't reschedule the reads you're doing, the > way it could reschedule, say, random writes). > I guess this could be solved elegantly using threading - one thread scans index and pushes tids into a btree or some other sorted structure, while other thread loops continuously (or "elevatorly" back and forth) over that structure in tuple order and does the actual data reads. This would have the added benefit of better utilising multiprocessor computers. --------------- Hannu
Hannu Krosing <hannu@tm.ee> writes: > On Wed, 2002-08-07 at 15:26, Tom Lane wrote: >> Right. One form of the "lossy compression" idea I suggested is to >> switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets >> too large to work with. > If it is a real bitmap, should it not be easiest to allocate at the > start? But it isn't a "real bitmap". That would be a really poor implementation, both for space and speed --- do you really want to scan over a couple of megs of zeroes to find the few one-bits you care about, in the typical case? "Bitmap" is a convenient term because it describes the abstract behavior we want, but the actual data structure will probably be nontrivial. If I recall Ann's description correctly, Firebird's implementation uses run length coding of some kind (anyone care to dig in their source and get all the details?). If we tried anything in the way of lossy compression then there'd be even more stuff lurking under the hood. regards, tom lane
On Wed, 2002-08-07 at 16:26, Tom Lane wrote:
> Hannu Krosing <hannu@tm.ee> writes:
> > On Wed, 2002-08-07 at 15:26, Tom Lane wrote:
> >> Right. One form of the "lossy compression" idea I suggested is to
> >> switch from a per-tuple bitmap to a per-page bitmap once the bitmap
> >> gets too large to work with.
>
> > If it is a real bitmap, should it not be easiest to allocate at the
> > start?
>
> But it isn't a "real bitmap". That would be a really poor
> implementation, both for space and speed --- do you really want to scan
> over a couple of megs of zeroes to find the few one-bits you care about,
> in the typical case?

I guess that depends on the data. The typical case should be something
the stats process will find out, so the optimiser can use it.

The bitmap must be less than 1/48 full (48 bits being the size of a
TID) for the best uncompressed "active-TID-list" to be smaller than a
plain bitmap. If there were some structure above the list, this ratio
would be even higher.

I have had good experience using "compressed delta lists", which scale
well over the whole "fullness" spectrum of a bitmap, but this is for
storage, not for initial construction of the lists.

> "Bitmap" is a convenient term because it describes
> the abstract behavior we want, but the actual data structure will
> probably be nontrivial. If I recall Ann's description correctly,
> Firebird's implementation uses run length coding of some kind (anyone
> care to dig in their source and get all the details?).

Plain RLL is probably a good way to store it, and for merging two or
more bitmaps, but not as good for constructing it bit-by-bit. I guess
the most effective structure for updating is often still a plain bitmap
(maybe not if it is very sparse and all of it does not fit in cache),
followed by some kind of balanced tree (maybe an rb-tree). If the
bitmap is relatively full, then the plain bitmap is almost always the
most effective to update.
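[The 1/48 figure follows from the sizes involved: a heap TID (ItemPointerData) is 6 bytes = 48 bits, so an uncompressed TID list costs 48 bits per matching tuple, while a plain bitmap costs one bit per tuple regardless of how many match. A quick check of the crossover, purely illustrative:]

```python
TID_BITS = 48  # a heap TID (ItemPointerData) is 6 bytes = 48 bits

def smaller_structure(n_tuples, n_matches):
    """Compare an uncompressed TID list against a plain one-bit-per-tuple
    bitmap and report which representation takes fewer bits."""
    list_bits = n_matches * TID_BITS   # 48 bits per match
    bitmap_bits = n_tuples             # 1 bit per tuple, matching or not
    return "tid-list" if list_bits < bitmap_bits else "bitmap"

# With 1,000,000 tuples the crossover sits at 1/48 fullness (~2%):
print(smaller_structure(1_000_000, 10_000))   # 1% full  -> tid-list wins
print(smaller_structure(1_000_000, 100_000))  # 10% full -> bitmap wins
```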
> If we tried anything in the way of lossy compression then there'd
> be even more stuff lurking under the hood.

Having a three-valued (0, 1, maybe) RLL-encoded "tritmap" would be a
good way to represent lossy compression, and it would also be quite
straightforward to merge two of these using AND or OR. It may even be
possible to easily construct it using a fixed-length b-tree and going
from 1 to "maybe" for nodes that get too dense.

---------------
Hannu
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom, should we be updating that flag after we CLUSTER instead of
> > requiring an ANALYZE after the CLUSTER?
>
> Could do that I suppose, but I'm not super-excited about it. ANALYZE is
> quite cheap these days (especially in comparison to CLUSTER ;-)). I'd
> settle for a note in the CLUSTER docs that recommends a subsequent
> ANALYZE --- this seems no different from recommending ANALYZE after bulk
> data load or other major update of a table.

What if I [try to] extend the grammar to support an additional ANALYZE
in CLUSTER, so that it analyzes the table automatically? Say

	CLUSTER <index> ON <table> [ANALYZE];

Or maybe just do an analyze of the table automatically after the
CLUSTERing. What does everybody think?

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"To have more, you must desire less"
Alvaro Herrera <alvherre@atentus.com> writes:
> What if I [try to] extend the grammar to support an additional ANALYZE
> in CLUSTER, so that it analyzes the table automatically?

I don't like this -- it seems like bloat. What's the advantage of

	CLUSTER foo ON bar ANALYZE;

over

	CLUSTER foo ON bar;
	ANALYZE;

> Or maybe just do an analyze of the table automatically after the
> CLUSTERing.

Hmmm... I don't really see the problem with adding a note in the docs
suggesting that users follow a CLUSTER with an ANALYZE (of course, that
assumes that the CLUSTER will significantly change the ordering of the
data in the table, which isn't always the case -- which is another
reason why making this automatic seems unwarranted, IMHO).

It seems like you're looking for a solution to a non-existent problem.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
Neil Conway wrote:
> Alvaro Herrera <alvherre@atentus.com> writes:
> > What if I [try to] extend the grammar to support an additional ANALYZE
> > in CLUSTER, so that it analyzes the table automatically?
>
> I don't like this -- it seems like bloat.

Maybe you are right.

> > Or maybe just do an analyze of the table automatically after the
> > CLUSTERing.
>
> Hmmm... I don't really see the problem with adding a note in the docs
> suggesting that users follow a CLUSTER with an ANALYZE (...).

ANALYZE is an inexpensive operation (compared to CLUSTER, anyway), so
it can't hurt to have it done automatically.

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"Linux transformed my computer from a 'machine for doing things' into a
genuinely entertaining device, about which I learn something new every
day" (Jaime Salinas)
> > > Or maybe just do an analyze of the table automatically after the
> > > CLUSTERing.
> >
> > Hmmm... I don't really see the problem with adding a note in the docs
> > suggesting that users follow a CLUSTER with an ANALYZE (...).
>
> ANALYZE is an inexpensive operation (compared to CLUSTER, anyway), so it
> can't hurt to have it done automatically.

Well, we have previously had discussions on the topic of adding ANALYZE
to the end of dumps, etc., and the result has always been in favour of
keeping the command set orthogonal and not doing an automatic
analyze...

Chris
Christopher Kings-Lynne wrote:
> > > > Or maybe just do an analyze of the table automatically after the
> > > > CLUSTERing.
>
> Well we have previously had discussions on the topic of adding analyze
> to the end of dumps, etc. and the result has always been in favour of
> keeping the command set orthogonal and not doing an automatic
> analyze...

Oh. Sorry for the noise.

I'm trying to look at other things in the TODO so I stop pestering
about CLUSTER.

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"To think that the specter we see is illusory does not rob it of its
horror; it only adds the new terror of madness" (Perelandra, C.S. Lewis)
> > Well we have previously had discussions on the topic of adding
> > analyze to the end of dumps, etc. and the result has always been in
> > favour of keeping the command set orthogonal and not doing an
> > automatic analyze...
>
> Oh. Sorry for the noise.
>
> I'm trying to look at other things in the TODO so I stop pestering
> about CLUSTER.

All I can say is - thanks for fixing CLUSTER. As soon as we upgrade to
7.3 I'm going on a CLUSTERing spree :)

Chris
If you're looking for something very useful to work on, see if Gavin
Sherry(?) can post his old CREATE OR REPLACE VIEW code. I'm pretty sure
he (or someone) said that he had an old patch that needed to be synced
with HEAD... This functionality is pretty essential for 7.3...

Chris

> -----Original Message-----
> From: Alvaro Herrera [mailto:alvherre@atentus.com]
> Sent: Friday, 9 August 2002 10:21 AM
> To: Christopher Kings-Lynne
> Cc: Neil Conway; Tom Lane; Bruce Momjian; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] CLUSTER and indisclustered
>
> Christopher Kings-Lynne wrote:
> > > > > Or maybe just do an analyze of the table automatically after the
> > > > > CLUSTERing.
> >
> > Well we have previously had discussions on the topic of adding
> > analyze to the end of dumps, etc. and the result has always been in
> > favour of keeping the command set orthogonal and not doing an
> > automatic analyze...
>
> Oh. Sorry for the noise.
>
> I'm trying to look at other things in the TODO so I stop pestering
> about CLUSTER.
>
> --
> Alvaro Herrera (<alvherre[a]atentus.com>)
> "To think that the specter we see is illusory does not rob it of its
> horror; it only adds the new terror of madness" (Perelandra, C.S. Lewis)
Neil Conway <nconway@klamath.dyndns.org> writes:
> Alvaro Herrera <alvherre@atentus.com> writes:
>> What if I [try to] extend the grammar to support an additional ANALYZE
>> in CLUSTER, so that it analyzes the table automatically?

> I don't like this -- it seems like bloat.

My reaction exactly.

			regards, tom lane
> -----Original Message-----
> From: Christopher Kings-Lynne [mailto:chriskl@familyhealth.com.au]
> Sent: 09 August 2002 03:57
> To: Alvaro Herrera
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] CLUSTER and indisclustered
>
> If you're looking for something very useful to work on, see if Gavin
> Sherry(?) can post his old CREATE OR REPLACE VIEW code. I'm pretty
> sure he (or someone) said that he had an old patch that needed to be
> synced with HEAD... This functionality is pretty essential for 7.3...

I'll second that...

Dave.
I wanted to comment on this bitmapped index discussion because I am
hearing a lot about star joins, data warehousing, and bitmapped indexes
recently.

It seems we have several uses for bitmapped indexes:

	Do index lookups in sequential heap order
	Allow joining of bitmapped indexes to construct arbitrary indexes

There is a web page about "star joins" used a lot in data warehousing,
where you don't know what queries are going to be required and what
indexes to create:

	http://www.dbdomain.com/a100397.htm

They show some sample queries, which is good. Here is some interesting
text:

	Star Transformation

	If there are bitmap indexes on SALES_REP_ID, PRODUCT_ID, and
	DEPARTMENT_ID in the SALES table, then Oracle can resolve the
	query using merges of the bitmap indexes.

	Because Oracle can efficiently merge multiple bitmap indexes,
	you can create a single bitmap index on each of the foreign-key
	columns in the fact table rather than on every possible
	combination of columns. This lets you support all possible
	combinations of dimensions without creating an unreasonable
	number of indexes.

Added to TODO:

	* Use bitmaps to fetch heap pages in sequential order [performance]
	* Use bitmaps to combine existing indexes [performance]

and I will add some of these emails to TODO.detail/performance.

---------------------------------------------------------------------------

Tom Lane wrote:
> Curt Sampson <cjs@cynic.net> writes:
> > But after doing some benchmarking of various sorts of random reads
> > and writes, it occurred to me that there might be optimizations
> > that could help a lot with this sort of thing. What if, when we've
> > got an index block with a bunch of entries, instead of doing the
> > reads in the order of the entries, we do them in the order of the
> > blocks the entries point to?
>
> I thought to myself "didn't I just post something about that?"
> and then realized it was on a different mailing list. Here ya go
> (and no, this is not the first time around on this list either...)
> I am currently thinking that bitmap indexes per se are not all that
> interesting. What does interest me is bitmapped index lookup, which
> came back into mind after hearing Ann Harrison describe how FireBird/
> InterBase does it.
>
> The idea is that you don't scan the index and base table concurrently
> as we presently do it. Instead, you scan the index and make a list
> of the TIDs of the table tuples you need to visit. This list can
> be conveniently represented as a sparse bitmap. After you've finished
> looking at the index, you visit all the required table tuples *in
> physical order* using the bitmap. This eliminates multiple fetches
> of the same heap page, and can possibly let you get some win from
> sequential access.
>
> Once you have built this mechanism, you can then move on to using
> multiple indexes in interesting ways: you can do several indexscans
> in one query and then AND or OR their bitmaps before doing the heap
> scan. This would allow, for example, "WHERE a = foo and b = bar"
> to be handled by ANDing results from separate indexes on the a and b
> columns, rather than having to choose only one index to use as we do
> now.
>
> Some thoughts about implementation: FireBird's implementation seems
> to depend on an assumption about a fixed number of tuple pointers
> per page. We don't have that, but we could probably get away with
> just allocating BLCKSZ/sizeof(HeapTupleHeaderData) bits per page.
> Also, the main downside of this approach is that the bitmap could
> get large --- but you could have some logic that causes you to fall
> back to plain sequential scan if you get too many index hits. (It's
> interesting to think of this as lossy compression of the bitmap...
> which leads to the idea of only being fuzzy in limited areas of the
> bitmap, rather than losing all the information you have.)
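[The AND/OR combination step Tom describes can be sketched with sparse "bitmaps" held as sets of (block, offset) TIDs. This illustrates only the abstract behavior; as discussed above, the real data structure would likely be something compressed, and the index contents here are made up:]

```python
def index_scan(index, key):
    """Stand-in for an index scan: return the sparse "bitmap" (set of
    TIDs) whose indexed value equals key. 'index' maps value -> TID set."""
    return index.get(key, set())

# Hypothetical indexes on columns a and b of the same table,
# mapping values to (block, offset) TIDs.
index_a = {"foo": {(1, 3), (4, 1), (9, 2)}}
index_b = {"bar": {(4, 1), (9, 2), (12, 5)}}

# WHERE a = 'foo' AND b = 'bar': AND (intersect) the two TID bitmaps,
# then visit the surviving tuples in physical (block, offset) order.
hits = index_scan(index_a, "foo") & index_scan(index_b, "bar")
print(sorted(hits))  # [(4, 1), (9, 2)]
```

[An OR qualification would use set union (`|`) instead of intersection; either way, the final sort puts the heap visits in physical order.]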
> A possibly nasty issue is that lazy VACUUM has some assumptions in it
> about indexscans holding pins on index pages --- that's what prevents
> it from removing heap tuples that a concurrent indexscan is just about
> to visit. It might be that there is no problem: even if lazy VACUUM
> removes a heap tuple and someone else then installs a new tuple in that
> same TID slot, you should be okay because the new tuple is too new to
> pass your visibility test. But I'm not convinced this is safe.
>
> 			regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Tue, 2002-08-13 at 09:25, Bruce Momjian wrote:
> There is a web page about "star joins" used a lot in data warehousing,
> where you don't know what queries are going to be required and what
> indexes to create:
>
> 	http://www.dbdomain.com/a100397.htm
>
> They show some sample queries, which is good. Here is some
> interesting text:
>
> 	Star Transformation
>
> 	If there are bitmap indexes on SALES_REP_ID, PRODUCT_ID, and
> 	DEPARTMENT_ID in the SALES table, then Oracle can resolve the
> 	query using merges of the bitmap indexes.
>
> 	Because Oracle can efficiently merge multiple bitmap indexes,
> 	you can create a single bitmap index on each of the foreign-key
> 	columns in the fact table rather than on every possible
> 	combination of columns.

Another way to achieve a similar result would be using segmented hash
indexes, where each column maps directly to some part of the hash
value.

> 	This lets you support all possible combinations of dimensions
> 	without creating an unreasonable number of indexes.

-----------
Hannu
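[The segmented hash idea can be illustrated like this: give each column a fixed bit segment of one composite hash, so a query constraining any subset of the columns fixes the corresponding segments and leaves the rest as wildcard bits. A toy sketch of the concept — the names and the 8-bit segment width are assumptions for illustration, not working index code:]

```python
SEG_BITS = 8  # bits of the composite hash reserved per column (assumed)

def segment_hash(value):
    """Hash one column value into its SEG_BITS-wide segment."""
    return hash(value) & ((1 << SEG_BITS) - 1)

def composite_hash(values):
    """Pack one hash segment per column into a single composite hash."""
    h = 0
    for i, v in enumerate(values):
        h |= segment_hash(v) << (i * SEG_BITS)
    return h

def probe_mask(constraints):
    """Build (hash, mask) for a query constraining a subset of columns
    (constraints maps column index -> value). Unconstrained segments
    become wildcard (zero) bits in the mask."""
    h = mask = 0
    for i, v in constraints.items():
        h |= segment_hash(v) << (i * SEG_BITS)
        mask |= ((1 << SEG_BITS) - 1) << (i * SEG_BITS)
    return h, mask

# One fact-table row hashed on (SALES_REP_ID, PRODUCT_ID, DEPARTMENT_ID):
row = composite_hash(["rep7", "prodX", "dept3"])

# Constrain only the middle column; the row matches because its
# PRODUCT_ID segment agrees and the other segments are masked out.
h, mask = probe_mask({1: "prodX"})
print((row & mask) == h)  # True
```

[The appeal is the same as in the star transformation above: one index supports any combination of constrained dimensions, at the cost of hash-segment collisions that must be rechecked against the heap.]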