Обсуждение: Best way to scan on-disk bitmaps
Greetings. I have questions on how to implement on-disk bitmap scan. I've been working on on-disk bitmaps as an ordinary Index Access Method, but now it's clear to me, that it'll lose all it's strengths this way. One on-disk bitmap has exactly one list of indexed table's TIDs and potentially unlimited number of bitmaps (number of index attributes multiplied by attribute's cardinality, to be precise). So, for better performance, one should first retrieve all the needed bitmaps from the index, then join all bitmaps according to the query clauses and, as the last phase, retrieve TIDs from index, that matches final bitmap. According to the docs "the index access method is responsible for regurgitating the TIDs ...", but for on-disk bitmaps index scan is devided into 3 phases. So, to perform the scan in phases, to my mind, executor should be involved. (I'd like to mention again, that this is the first time I got so deep inside postgres code). I wanted to use Tom's nodeBitmap* stuff, but it's problematic a bit. Bitmaps in nodeBitmap* are built upon list of TIDs retrieved during relation scan. For on-disk bitmap indexes, there's no need for that, as all bitmaps are already inside the index. The question is: Is it possible to extend nodeBitmap functionality in such a way, that Executor can either build bitmap after list of TIDs, obtained from RelationScan, or ask index access method to give bitmaps it contain (and TIDs at given position in the map later)? This will, probably, require more functions in the pg_am catalog. Or should I create a completely new node for on-disk bitmaps? -- Victor Y. Yegorov
"Victor Y. Yegorov" <viy@mits.lv> writes: > I have questions on how to implement on-disk bitmap scan. I think your best plan might be 1. Be sure that all the indexable WHERE conditions are passed to the indexscan as indexquals. This might be, say,WHEREa = 42 and b = 'foo' 2. Within the index AM, form the AND of the relevant bitmaps (here the ones for a = 42 and b = 'foo'). 3. Within the index AM, pick up the TIDs for the remaining one-bits, and pass them back. 4. Let the existing machinery handle the OR-ing problem as well as actual fetching of the heap rows. This can be done without any restructuring of the index AM API. regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [12.05.2005 23:09]: > 1. Be sure that all the indexable WHERE conditions are passed to the > indexscan as indexquals. This might be, say, > WHERE a = 42 and b = 'foo' If I have on-disk bitmapON (a, b, c) will the planner pick an index scan forWHERE a = 42 AND b = 'foo' (i.e. only part of the index attributes are involved)? Any modifications needed to achieve this functionality? To my mind, bitmap scan even for 1 attribute of a multi-column index would be a win, though I haven't tested this yet. -- Victor Y. Yegorov
"Victor Y. Yegorov" <viy@mits.lv> writes: > If I have on-disk bitmap > ON (a, b, c) > will the planner pick an index scan for > WHERE a = 42 AND b = 'foo' > (i.e. only part of the index attributes are involved)? Any modifications > needed to achieve this functionality? Hmm. That particular case will work, but the planner believes that only consecutive columns in the index are usable --- that is, if you have quals for a and c but not for b, it will think that the condition for c isn't usable with the index. This is true for btree and gist indexes, so I suppose we'd need to introduce a pg_am column that tells what to do. [ thinks some more ... ] Plan B would be to remove that restriction and teach btree and gist to cope. While a btree couldn't use a nonconsecutive restriction as part of its where-to-scan logic, I don't see any good reason why it couldn't still perform the test before returning the TID, thus possibly saving a trip to the heap. Offhand it seems this should be true of gist as well, but I don't know that code well enough to be sure. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Plan B would be to remove that restriction and teach btree and gist to > cope. While a btree couldn't use a nonconsecutive restriction as part > of its where-to-scan logic, I don't see any good reason why it couldn't > still perform the test before returning the TID, thus possibly saving a > trip to the heap. Offhand it seems this should be true of gist as well, > but I don't know that code well enough to be sure. Not long ago there was some discussion about how gist indexes don't really handle multicolumn indexes usefully currently. They use only the first column to determine page splits. The discussion wandered and it became clear that it wasn't even clear what a multicolumn gist index should mean. I suggested treating each column as independent axes. Independently ask each column's datatype for the "distance" value and then calculate the inner product as a kind of geometric n-dimensional distance. There was some resistance since this limits gist indexes to always basing page splits on a single "distance" based algorithm. (Though all current gist index methods in the postgres source tree do work this way, mostly with copy-pasted code in fact.) In this model the columns listed in the gist index are unordered. Any subset of columns can be used to perform an index lookup. Making it more like the bitmap index behaviour you're looking at than the btree index behaviour. -- greg
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> Plan B would be to remove that restriction and teach btree and gist to >> cope. While a btree couldn't use a nonconsecutive restriction as part >> of its where-to-scan logic, I don't see any good reason why it couldn't >> still perform the test before returning the TID, thus possibly saving a >> trip to the heap. > [ snip ] > In this model the columns listed in the gist index are unordered. Any subset > of columns can be used to perform an index lookup. Making it more like the > bitmap index behaviour you're looking at than the btree index behaviour. I thought some more about this since sending my earlier message. As far as I can recall at the moment, there really isn't anything fundamental that depends on the consecutive-columns rule. The one place where the rubber meets the road is in the index cost estimation functions: if we were to relax that rule, then btcostestimate would have to be taught to include only the consecutive columns when estimating how much of a btree index is going to be touched. And more than that: if you've studied the btree code at all, you realize that that's only an incomplete heuristic anyway. For instance, if the leading key is a > xxx, second keys like b > yyy and b < yyy act completely differently in terms of indexscan cost, but btcostestimate doesn't presently know that. I wonder if we shouldn't migrate the amcostestimate functions into the individual index AMs (which would mean adding a column to pg_am, but so what). btcostestimate could be much less phony about this if it had access to the same infrastructure that _bt_first uses to examine the index clauses. regards, tom lane
Tom Lane wrote: > "Victor Y. Yegorov" <viy@mits.lv> writes: > > If I have on-disk bitmap > > ON (a, b, c) > > will the planner pick an index scan for > > WHERE a = 42 AND b = 'foo' > > (i.e. only part of the index attributes are involved)? Any modifications > > needed to achieve this functionality? > > Hmm. That particular case will work, but the planner believes that only > consecutive columns in the index are usable --- that is, if you have > quals for a and c but not for b, it will think that the condition for c > isn't usable with the index. This is true for btree and gist indexes, > so I suppose we'd need to introduce a pg_am column that tells what to > do. We do have a TODO for this: * Use index to restrict rows returned by multi-key index when used with non-consecutive keys to reduce heap accesses For an index on col1,col2,col3, and a WHERE clause of col1 = 5 and col3 = 9, spin though the index checking for col1 andcol3 matches, rather than just col1; also called skip-scanning. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Hmm. That particular case will work, but the planner believes that only > > consecutive columns in the index are usable --- that is, if you have > > quals for a and c but not for b, it will think that the condition for c > > isn't usable with the index. This is true for btree and gist indexes, > > so I suppose we'd need to introduce a pg_am column that tells what to > > do. > > We do have a TODO for this: > > * Use index to restrict rows returned by multi-key index when used with > non-consecutive keys to reduce heap accesses > > For an index on col1,col2,col3, and a WHERE clause of col1 = 5 and > col3 = 9, spin though the index checking for col1 and col3 matches, > rather than just col1; also called skip-scanning. That TODO is something else. Though it is related in that it is another example of why the existing code is too simplistic and will eventually need to be enhanced. Not only would bitmap indexes and (possibly) gist indexes, but even btree indexes would need to do so if this TODO were implemented. -- greg
On Thu, 12 May 2005 17:40:06 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote: >the planner believes that only >consecutive columns in the index are usable --- that is, if you have >quals for a and c but not for b, it will think that the condition for c >isn't usable with the index. This is true for btree [...] It's not difficult to setup a test case where an index is used, but only with a=42 as an index condition, and c='foo' is applied as a filter condition. Now add a redundant condition on b... AND b BETWEEN minb AND maxb ... and watch how c='foo' moves into the index condition. I did this test some time ago and I believe that adding the condition on b did not change the estimated cost, only the actual execution time. ServusManfred
Greg Stark <gsstark@mit.edu> writes: > Bruce Momjian <pgman@candle.pha.pa.us> writes: >>> Hmm. That particular case will work, but the planner believes that only >>> consecutive columns in the index are usable --- that is, if you have >>> quals for a and c but not for b, it will think that the condition for c >>> isn't usable with the index. >> >> We do have a TODO for this: >> >> * Use index to restrict rows returned by multi-key index when used with >> non-consecutive keys to reduce heap accesses > That TODO is something else. No, I think Bruce is right --- it's essentially the same thing. It certainly would be a good idea to change btrees to work like that, if we are going to modify the planner to relax the restriction for other index types. I think it would be easy to change the planner and btree to handle this (where "easy" means "I remember where all the skeletons are buried"). But I don't know the gist code hardly at all. Can anyone offer an informed opinion on whether gist can handle this now, and if not what it would take to handle it? (hash and rtree are not at issue since they don't support multi-key indexes.) regards, tom lane
On Sun, 15 May 2005, Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: >> Bruce Momjian <pgman@candle.pha.pa.us> writes: >>>> Hmm. That particular case will work, but the planner believes that only >>>> consecutive columns in the index are usable --- that is, if you have >>>> quals for a and c but not for b, it will think that the condition for c >>>> isn't usable with the index. >>> >>> We do have a TODO for this: >>> >>> * Use index to restrict rows returned by multi-key index when used with >>> non-consecutive keys to reduce heap accesses > >> That TODO is something else. > > No, I think Bruce is right --- it's essentially the same thing. It > certainly would be a good idea to change btrees to work like that, > if we are going to modify the planner to relax the restriction for > other index types. > > I think it would be easy to change the planner and btree to handle > this (where "easy" means "I remember where all the skeletons are > buried"). But I don't know the gist code hardly at all. Can anyone > offer an informed opinion on whether gist can handle this now, and > if not what it would take to handle it? I think that handling this in GiST is depends solely on how users consistent function designed to handle NULLs in keys. Other words, it should works as soon as users consistent function "know" what to do with NULLs in internal keys. Teodor will comment multi-key GiST tomorrow. We used Paul Aoki paper "Generalizing ''Search'' in Generalized Search Trees", (available from http://www.sai.msu.su/~megera/postgres/gist/papers/csd-97-950.pdf ) for implementation of multi-key GiST index support. It's true, that first key is used for splitting, but elements with duplicated first key could be shuffled to get better clustering on second key. > > (hash and rtree are not at issue since they don't support multi-key > indexes.) > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Tom Lane <tgl@sss.pgh.pa.us> writes: > I think it would be easy to change the planner and btree to handle > this (where "easy" means "I remember where all the skeletons are > buried"). But I don't know the gist code hardly at all. Can anyone > offer an informed opinion on whether gist can handle this now, and > if not what it would take to handle it? Currently gist indexes only use the first column for page splits, making multi-key gist indexes basically useless. The problem is it's hard to imagine an API for a pickSplit function that could handle multi-key indexes with disparate data types and operator classes. I had an idea of a new way to deal with gist indexes that simplified the API and side-stepped the whole issue but you raised concerns that it might be too limiting. Unfortunately the mailing list archive seems to have missed this discussion. I've attached three messages from the discussion at the time. Oleg & Teodor, If I understand the code correctly, GiST will only pass the first attribute of each index tuple to the user-defined PickSplit method when it wants to split a node. (see circa line 1269 of gist.c) Is this a wise design decision? Granted, in many situations the first attribute in the index is sufficient to make a reasonable decision about how to divide the node into two halves, but I don't think that is universally true. For example, consider a two column index whose first attribute has a small number of distinct values. It could well be that all the first attribute values in a node-to-be-split would be the same. Only passing the first attribute to PickSplit would result in an essentially random distribution of tuples among the split nodes, rather than allowing the GiST extension to make use of the second attribution to partition the nodes. That's an extreme example, but it is easy to construct more realistic scenarios (basically, any situation in which the cardinality of the first index attribute is low -- a reasonably common occurrence with a multi-column index, I believe). I'm not sure whether this would be a problem in practice. Speculation: repeated invocations of PickSplit are one of the main factors in deriving the ultimate shape of the GiST tree. Poor distribution of keys resulting from PickSplit would eventually result in unnecessarily loose bounding predicates in internal nodes, which would hurt performance. Comments welcome, Neil ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings Greg Stark <gsstark@MIT.EDU> writes: > I'm not sure that GiST indexes behave the same way as btree indexes for the > multi-column case. > > In a btree index the second column is entirely subordinate to the first > column. In a GiST index the data is multi-dimensional, and all dimensions are > equally important. In fact on further consideration I do have a proposal. If you look at the GiST implementations for various index types you'll see that many (all?) take the same approach for PickSplit. In fact they pretty much all have the same code copy/pasted to handle it. The approach they take is to have a function which calculates an abstract "distance" between any two entries. There's an algorithm that they use to pick the split based on this distance function. If you abandoned "PickSplit" and instead exposed this distance function as the external API then the behaviour for multi-column indexes is clear. You calculate the distance along all the axes and calculate the diagonal distance. I think abandoning PickSplit for the distance function might also mean you don't need a separate function for Penalty either, but I'm not sure on that. -- greg Greg Stark <gsstark@MIT.EDU> writes: > The approach they take is to have a function which calculates an > abstract "distance" between any two entries. There's an algorithm that > they use to pick the split based on this distance function. > If you abandoned "PickSplit" and instead exposed this distance > function as the external API then the behaviour for multi-column > indexes is clear. You calculate the distance along all the axes and > calculate the diagonal distance. Hmm ... the problem with that is the assumption that different opclasses will compute similarly-scaled distances. If opclass A generates distances in the range (0,1e6) while B generates in the range (0,1), combining them with Euclidean distance won't work well at all. OTOH you can't blindly normalize, because in some cases maybe the data is such that a massive difference in distances is truly appropriate. I'm also a bit leery of the assumption that every GiST application can reduce its PickSplit logic to Euclidean distances. regards, tom lane > (hash and rtree are not at issue since they don't support multi-key > indexes.) It seems like it would be trivial to make hash indexes do something useful. Maybe it's not clear which useful something to do though. a) hash each column and xor the results. This requires all columns be present for an index lookup. Again this requires making the planner logic more flexible and specific to each index method. b) Partition take only a few bits from each column's hash. This would actually let the hash index use any subset of columns though it seems like it would be prohibitively expensive at first blush. And it requires more than just planner hacking to take advantage of. -- greg
About page splitting algorithm in GiST in multikey case. For the beginning, page is splitted by calling pickSplit method of key of first column (pickSplit method is defined for opclass and it is a user function), then it try to find equal values of first column in left and right pages ( gist.c lines 1264-1901 ). If it is, then GiST core will try to resort tuples with first equal keys between left and right pages using penalty method for second and higher column's key. If it's not, it leave pages untouched. But unions for parent page of second and higher column's keys will be formed. So, if index is defined as 'using gist (a,b,c)' then, in principle, GiST index can speed up queries like 'a=V1 and c=V2'. But it will not usable for queries ( b=V3 and c=V2 ). By the way, instead of '=' operation may be used other operations. Number of supported operations by GiST is indefinite unlike, for example, btree which supported only five: <, <=, =, =>, >. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
If people have GIST TODOs, please post them. --------------------------------------------------------------------------- Teodor Sigaev wrote: > About page splitting algorithm in GiST in multikey case. For the beginning, page > is splitted by calling pickSplit method of key of first column (pickSplit method > is defined for opclass and it is a user function), then it try to find equal > values of first column in left and right pages ( gist.c lines 1264-1901 ). If > it is, then GiST core will try to resort tuples with first equal keys between > left and right pages using penalty method for second and higher column's key. If > it's not, it leave pages untouched. But unions for parent page of second and > higher column's keys will be formed. > > So, if index is defined as 'using gist (a,b,c)' then, in principle, GiST index > can speed up queries like 'a=V1 and c=V2'. But it will not usable for queries > ( b=V3 and c=V2 ). By the way, instead of '=' operation may be used other > operations. Number of supported operations by GiST is indefinite unlike, for > example, btree which supported only five: <, <=, =, =>, >. > > > > -- > Teodor Sigaev E-mail: teodor@sigaev.ru > WWW: http://www.sigaev.ru/ > > ---------------------------(end of broadcast)--------------------------- > TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian wrote: > If people have GIST TODOs, please post them. Concurrency :)
On Tue, May 17, 2005 at 09:39:20AM +0800, Christopher Kings-Lynne wrote: > > > Bruce Momjian wrote: > >If people have GIST TODOs, please post them. > > Concurrency :) And WAL support. -- Alvaro Herrera (<alvherre[a]surnet.cl>) "No necesitamos banderasNo reconocemos fronteras" (Jorge González)
Alvaro Herrera wrote: > On Tue, May 17, 2005 at 09:39:20AM +0800, Christopher Kings-Lynne wrote: > > > > > > Bruce Momjian wrote: > > >If people have GIST TODOs, please post them. > > > > Concurrency :) > > And WAL support. Already there: * Add WAL index reliability improvement to non-btree indexes and this too: * Add concurrency to GIST -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Teodor Sigaev <teodor@sigaev.ru> wrote: > ... So, if index is defined as 'using gist (a,b,c)' then, in > principle, GiST index can speed up queries like 'a=V1 and c=V2'. But > it will not usable for queries ( b=V3 and c=V2 ). By the way, instead > of '=' operation may be used other operations. Number of supported > operations by GiST is indefinite unlike, for example, btree which > supported only five: <, <=, =, =>, >. I have committed changes to the planner to arrange that a GiST indexscan must supply at least one restriction clause for the first index column, and can supply restriction clauses for any, all, or none of the remaining columns; the old left-to-right heuristic is gone. As far as I can tell, this doesn't require any changes to the GiST code, but please take another look if you aren't too sure about it. regards, tom lane
On Mon, 13 Jun 2005, Tom Lane wrote: > Teodor Sigaev <teodor@sigaev.ru> wrote: >> ... So, if index is defined as 'using gist (a,b,c)' then, in >> principle, GiST index can speed up queries like 'a=V1 and c=V2'. But >> it will not usable for queries ( b=V3 and c=V2 ). By the way, instead >> of '=' operation may be used other operations. Number of supported >> operations by GiST is indefinite unlike, for example, btree which >> supported only five: <, <=, =, =>, >. > > I have committed changes to the planner to arrange that a GiST indexscan > must supply at least one restriction clause for the first index column, > and can supply restriction clauses for any, all, or none of the > remaining columns; the old left-to-right heuristic is gone. > > As far as I can tell, this doesn't require any changes to the GiST code, > but please take another look if you aren't too sure about it. I did quick test and found no problem with gist(a,b,c) and index does used for (a,*,c) case > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83