Обсуждение: bitmap scans, btree scans, and tid order
About this time last year I was holding forth on the value of visiting
the heap in TID order, even when index scan tuples are randomly ordered.
Today I decided to start working on the problem stated in this TODO
item:
       Fetch heap pages matching index entries in sequential order               Rather than randomly accessing heap
pagesbased on index       entries, mark heap pages needing access in a bitmap and do the       lookups in sequential
order.Another method would be to sort       heap ctids matching the index before accessing the heap rows.       
 
I see that Tom has already done the infrastructure work by adding
getmulti, but getmulti isn't used by nodeIndexscan.c, only
nodeBitmapIndexscan.c.  Will btree index scans be executed by creating
in-memory bitmaps in 8.1, or will some scans still be executed the usual
way?  If the former, I'd be wasting time, but in the latter case it
would be worth it.
-jwb
			
		"Jeffrey W. Baker" <jwbaker@acm.org> writes:
> I see that Tom has already done the infrastructure work by adding
> getmulti, but getmulti isn't used by nodeIndexscan.c, only
> nodeBitmapIndexscan.c.  Will btree index scans be executed by creating
> in-memory bitmaps in 8.1, or will some scans still be executed the usual
> way?
We aren't going to remove the existing indexscan behavior, because
bitmap scans lose the ordering of the underlying index.  There are many
situations where that ordering is important.  (See for instance the
recent changes to make MAX/MIN use that behavior.)
        regards, tom lane
			
		Tom Lane wrote: > "Jeffrey W. Baker" <jwbaker@acm.org> writes: > >>I see that Tom has already done the infrastructure work by adding >>getmulti, but getmulti isn't used by nodeIndexscan.c, only >>nodeBitmapIndexscan.c. Will btree index scans be executed by creating >>in-memory bitmaps in 8.1, or will some scans still be executed the usual >>way? > > > We aren't going to remove the existing indexscan behavior, because > bitmap scans lose the ordering of the underlying index. There are many > situations where that ordering is important. (See for instance the > recent changes to make MAX/MIN use that behavior.) Would you take a patch that retained the optimized executions of plans returning 1 tuple and also fixed the random heap problem? -jwb
Jeffrey Baker wrote: > Would you take a patch that retained the optimized executions of plans > returning 1 tuple and also fixed the random heap problem? Can you elaborate on what you're proposing? Obviously sorted b+-tree output is important for a lot more than just min()/max(). I don't see an obvious way to produce sorted output from a bitmap tree index scan without requiring an additional sort step (which would be rather pointless -- the whole point of the optimization is to avoid an additional sort). -Neil
Neil Conway wrote: > Jeffrey Baker wrote: > >> Would you take a patch that retained the optimized executions of plans >> returning 1 tuple and also fixed the random heap problem? > > > Can you elaborate on what you're proposing? Obviously sorted b+-tree > output is important for a lot more than just min()/max(). I don't see an > obvious way to produce sorted output from a bitmap tree index scan > without requiring an additional sort step (which would be rather > pointless -- the whole point of the optimization is to avoid an > additional sort). I understand the importance of returning tuples in index order for many plans (although I probably haven't thought of all the cases. min/max is the most obvious; order by + limit is another). The only problem I'm trying to solve is when an indexscan returns a large result, causing the heap to be visited in index order, which is to say random order, from the disk's perspective. When I investigated this last year, sorting the intermediate result of the index scan in disk order was good for a reduction by two-thirds in actual execution time, and sorting the scan result in chunks of 1000 tuples was enough to reduce the time by half. I'm considering one of the following courses of action: Change nodeIndexscan.c to call index_getmulti, and to handle multiple tuples returned. That code would sort the tuple array and store the tuples in the result in disk order. -or- Change the planner/executor to use the bitmap scan in all cases where index order is unimportant. From my reading of the current code, the bitmap scan is only used in case of a join. -jwb
On Mon, 2005-05-16 at 09:53 -0400, Tom Lane wrote: > Jeffrey Baker <jwbaker@acm.org> writes: > > Change the planner/executor to use the bitmap scan in all cases where > > index order is unimportant. From my reading of the current code, the > > bitmap scan is only used in case of a join. > > This is a fallacy, and I think your concern is largely mistaken. Have > you experimented with the cases you are worried about? Perhaps I have not stated the problem clearly. Believe me, I have experimented extensively with this problem. So here's the statement of the problem: during an index scan, the heap is visited in index order, which can be and frequently is random order on the disk. An index scan that currently takes 15 minutes on my database takes only 5 when the tuples are fetched in strict disk order, and takes 8 minutes if the tuples are fetched in disk order in groups of 1000. The problem exists when the scan is expected to return a lot of tuples, but the planner chooses an index scan anyway. In many cases, sequential scan would be faster than index scan + random heap visits. > It's entirely possible that the current cost model needs adjustment to > make the planner pick the bitmap scan in more cases. However, it is > easy to demonstrate (either by thought-experiment or actual trial) that > a bitmap scan isn't necessarily a win. I agree. There's certainly a threshold below which the bitmap would not make sense. It's also possible that changing the btree scan to work in groups of tuples instead of single tuples would make more sense, which is why I ventured two different solution to the problem. > The overhead of maintaining the > bitmap is far from zero, and you don't get anything unless the resulting > pattern of heap page fetches is significantly cleaner than before. Bitmaps scans naturally come back in TID order. I realize this isn't 1:1 correspondence with disk order, but it's a good start. I also like the way the index scan and heap scan are decoupled in the bitmap code. It leaves room for more serious optimization of disk access, which is relatively difficult in the synchronous, one-at-a-time btree code. > Therefore, a patch that eliminates cost-estimation in favor of trying to > force one choice or the other is definitely not likely to be received > favorably ... That's not the idea. -jwb
On P, 2005-05-15 at 23:58 -0700, Jeffrey Baker wrote: > I'm considering one of the following courses of action: > > Change nodeIndexscan.c to call index_getmulti, and to handle multiple > tuples returned. That code would sort the tuple array and store the > tuples in the result in disk order. > > -or- > > Change the planner/executor to use the bitmap scan in all cases where > index order is unimportant. From my reading of the current code, the > bitmap scan is only used in case of a join. I think the 2nd option is probably the way to go. Probably not _all_ cases - it's probably cheper to not build a bitmap for cases where the index returns only a few (or just one) rows. OTOH, some scheme, where you fill sort_mem-sized memory structure with tuples, which are fetched in heap order but stored in that structure in index order could also be an interesting optimisastion for sort-order preserving btree-index scans. this could also be used for bigger sort as first step by saving each chunk to disk and then merging them back into result. in pseudocode one iteretion of this could go like this TIDARRAY = ARRAY[1000] I = 0 FOR TID in FETC_FROM_BTREE(0, 1000): ARRAY [I] := (TID, I) SORT_ARRAY_ON_TID() # this must be much faster than sorting # whole tuples for this methodto be a win # over bitmap_scan + sort. # using some self-orderingstructure like # btree/rbtree for TIDARRAY is also an option OUTARRAY = ARRAY[1000] FOR (TID, I) IN TIDARRAY: OUTARRAY[I] = FETCH_FROM_HEAP(TID) RETURN OUTARRAY This can probably not use bitmap scan code, but has the nice property of doing the disk accesses in file order but still returning the result in index order. -- Hannu Krosing <hannu@skype.net>
"Jeffrey W. Baker" <jwbaker@acm.org> writes:
> On Mon, 2005-05-16 at 09:53 -0400, Tom Lane wrote:
>> This is a fallacy, and I think your concern is largely mistaken.  Have
>> you experimented with the cases you are worried about?
> Perhaps I have not stated the problem clearly.  Believe me, I have
> experimented extensively with this problem.
Sorry, perhaps I wasn't clear: have you experimented *with CVS tip*?
The current code is certainly capable of choosing either seqscan,
bitmap scan, or traditional index scan depending on the given query.
For example,
regression=# explain analyze select * from tenk1 where unique1 between 100 and 1000;
                  QUERY PLAN
 
--------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on tenk1  (cost=9.58..381.53 rows=930 width=244) (actual time=6.185..18.034 rows=901 loops=1)  Recheck Cond:
((unique1>= 100) AND (unique1 <= 1000))  ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..9.58 rows=930 width=0)
(actualtime=4.522..4.522 rows=901 loops=1)        Index Cond: ((unique1 >= 100) AND (unique1 <= 1000))Total runtime:
23.784ms
 
(5 rows)
regression=# explain analyze select * from tenk1 where unique2 between 100 and 1000;
                   QUERY PLAN
 
-----------------------------------------------------------------------------------------------------------------------------Index
Scanusing tenk1_unique2 on tenk1  (cost=0.00..45.88 rows=805 width=244) (actual time=0.154..14.856 rows=901 loops=1)
IndexCond: ((unique2 >= 100) AND (unique2 <= 1000))Total runtime: 20.331 ms
 
(3 rows)
(The reason these apparently-identical queries get different plans is
that the unique2 column is physically ordered, so the plain indexscan
gets a very high correlation discount.)  There are enable_foo switches
to let you force selection of any one of the three ways for testing
purposes.
> It's also possible that changing the btree scan to work in
> groups of tuples instead of single tuples would make more sense, which
> is why I ventured two different solution to the problem.
My feeling is that that would add a lot of complexity for dubious win.
The reason it's dubious is that it would penalize some cases, for
instance LIMIT-type queries where you aren't going to fetch many tuples
anyway.  I think that at least for the time being (8.1 time frame) we
should leave traditional indexscans alone and concentrate on being sure
we are getting the most we can out of the new bitmap scan code.  Only
after that dust has settled will we have hard facts with which to argue
about whether compromise behaviors might be wins.
I think the work that's most needed at the moment is to test the
bitmap-scan cost model to see if it has much to do with reality...
        regards, tom lane
			
		On Mon, 2005-05-16 at 14:35 -0400, Tom Lane wrote: > "Jeffrey W. Baker" <jwbaker@acm.org> writes: > > It's also possible that changing the btree scan to work in > > groups of tuples instead of single tuples would make more sense, which > > is why I ventured two different solution to the problem. > > My feeling is that that would add a lot of complexity for dubious win. > The reason it's dubious is that it would penalize some cases, for > instance LIMIT-type queries where you aren't going to fetch many tuples > anyway. I think that at least for the time being (8.1 time frame) we > should leave traditional indexscans alone and concentrate on being sure > we are getting the most we can out of the new bitmap scan code. Only > after that dust has settled will we have hard facts with which to argue > about whether compromise behaviors might be wins. I agree. I'll look at how my workload behaves with CVS code. I wasn't proposing this for 8.1 inclusion, and the TODO isn't marked for 8.1. > I think the work that's most needed at the moment is to test the > bitmap-scan cost model to see if it has much to do with reality... Alright. -jwb
Tom Lane <tgl@sss.pgh.pa.us> writes: > "Jeffrey W. Baker" <jwbaker@acm.org> writes: > > I see that Tom has already done the infrastructure work by adding > > getmulti, but getmulti isn't used by nodeIndexscan.c, only > > nodeBitmapIndexscan.c. Will btree index scans be executed by creating > > in-memory bitmaps in 8.1, or will some scans still be executed the usual > > way? > > We aren't going to remove the existing indexscan behavior, because > bitmap scans lose the ordering of the underlying index. There are many > situations where that ordering is important. (See for instance the > recent changes to make MAX/MIN use that behavior.) Hm. There are other circumstances where the ordering doesn't matter. When there's another unrelated ORDER BY clause or merge join wrapped around the index scan for example. This suggests one new 8.1 optimization strategy may be to add strategic no-op OR clauses to cause 8.1 to use a bitmapOr node. For example something like this where "flag" isn't very selective (say 25%) might run more slowly than a sequential scan because of the random access pattern. SELECT id,name,flag FROM tabWHERE flagORDER BY name But adding a no-op bitmapOr node like: SELECT id,name,flag FROM tabWHERE flag OR indexed_never_true_flagORDER BY name Might run faster, perhaps even more quickly than the sequential scan because the bitmap avoids the random access pattern but doesn't have to read the whole table. -- greg
On 5/16/05, Tom Lane <tgl@sss.pgh.pa.us> wrote: > regression=# explain analyze select * from tenk1 where unique1 between 100 and 1000; > QUERY PLAN > -------------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on tenk1 (cost=9.58..381.53 rows=930 width=244) (actual time=6.185..18.034 rows=901 loops=1) > Recheck Cond: ((unique1 >= 100) AND (unique1 <= 1000)) > -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..9.58 rows=930 width=0) (actual time=4.522..4.522 rows=901 loops=1) > Index Cond: ((unique1 >= 100) AND (unique1 <= 1000)) > Total runtime: 23.784 ms > (5 rows) > > regression=# explain analyze select * from tenk1 where unique2 between 100 and 1000; > QUERY PLAN > ----------------------------------------------------------------------------------------------------------------------------- > Index Scan using tenk1_unique2 on tenk1 (cost=0.00..45.88 rows=805 width=244) (actual time=0.154..14.856 rows=901 loops=1) > Index Cond: ((unique2 >= 100) AND (unique2 <= 1000)) > Total runtime: 20.331 ms > (3 rows) > Tom (or anyone with some round tuits and CVS-tip savy), if you have a chance at some point would you post the non-bitmap version of the query for tenk2 from above? I'd be very interested to see if the bitmap forced TID order fetch actually does help, or if it's outweighed by the bitmap setup overhead. TIA -- Mike Rylander mrylander@gmail.com GPLS -- PINES Development Database Developer http://open-ils.org
Jeffrey Baker <jwbaker@acm.org> writes:
> Change the planner/executor to use the bitmap scan in all cases where 
> index order is unimportant.  From my reading of the current code, the 
> bitmap scan is only used in case of a join.
This is a fallacy, and I think your concern is largely mistaken.  Have
you experimented with the cases you are worried about?
It's entirely possible that the current cost model needs adjustment to
make the planner pick the bitmap scan in more cases.  However, it is
easy to demonstrate (either by thought-experiment or actual trial) that
a bitmap scan isn't necessarily a win.  The overhead of maintaining the
bitmap is far from zero, and you don't get anything unless the resulting
pattern of heap page fetches is significantly cleaner than before.
Therefore, a patch that eliminates cost-estimation in favor of trying to
force one choice or the other is definitely not likely to be received
favorably ...
        regards, tom lane
			
		Bitmap scan cost model (was Re: bitmap scans, btree scans, and tid order)
От
 
		    	"Jeffrey W. Baker"
		    Дата:
		        On Mon, 2005-05-16 at 14:35 -0400, Tom Lane wrote:
> "Jeffrey W. Baker" <jwbaker@acm.org> writes:
> > On Mon, 2005-05-16 at 09:53 -0400, Tom Lane wrote:
> >> This is a fallacy, and I think your concern is largely mistaken.  Have
> >> you experimented with the cases you are worried about?
> 
> > Perhaps I have not stated the problem clearly.  Believe me, I have
> > experimented extensively with this problem.
> 
> Sorry, perhaps I wasn't clear: have you experimented *with CVS tip*?
> The current code is certainly capable of choosing either seqscan,
> bitmap scan, or traditional index scan depending on the given query.
> For example,
[...]
> I think the work that's most needed at the moment is to test the
> bitmap-scan cost model to see if it has much to do with reality...
The cost model seems to work pretty darn well, as a matter of fact.  
This table is in-order by id, in-order by date, and randomly ordered by
"random".
scratch=# \d test    Table "public.test"Column |  Type   | Modifiers 
--------+---------+-----------id     | integer | date   | date    | random | integer | 
Indexes:   "test_pk" UNIQUE, btree (id)   "test_date_idx" btree (date)   "test_rand_idx" btree (random)
scratch=# select count(1) from test; count   
----------10000000
The size of the tables and indexes is about 1G, or double physical
memory.  I tested by selecting on the random column.  For 100 random
values, I selected the row that matches exactly, then in ranges of 1000,
10000, 100000, and 1000000.  These touch roughly 5, 50, 500, and 5000
tuples, respectively.  The test is repeated for index scan, bitmap scan,
and sequential scan.
Example query:
select count(1)          from test          where random >= 1429076987           and random < 1429076987 + 10000;
                                                         QUERY PLAN
    
 
------------------------------------------------------------------------------------------------------------------------------Aggregate
(cost=171.31..171.31 rows=1 width=0) (actual time=0.996..1.000 rows=1 loops=1)  ->  Bitmap Heap Scan on test
(cost=2.26..171.20rows=43 width=0) (actual time=0.145..0.829 rows=42 loops=1)        Recheck Cond: ((random >=
1429076987)AND (random < 1429086987))        ->  Bitmap Index Scan on test_rand_idx  (cost=0.00..2.26 rows=43 width=0)
(actualtime=0.063..0.063 rows=42 loops=1)              Index Cond: ((random >= 1429076987) AND (random <
1429086987))Totalruntime: 1.114 ms
 
100 queries returning |  1  |  5  |  50  |  500  |  5000  |
----------------------+-----+-----+------+-------+--------+
bitmap scan           |  1s |  2s |  13s | 1m41s | 11m16s |
index scan            |  1s |  1s |  12s | 2m6s  | 24m19s |
----------------------+-----+-----+------+-------+--------+
sequential scan       |             28m20s                | 
The planner uses index scan for the first two columns, and chooses
bitmap scan for the rest.  This would seem to be a reasonable decision,
given the results.  The only real problem with the cost model in further
testing is that it doesn't switch to seq scan quickly enough.  If bitmap
scan is disabled, the planner will pick index scan even in cases when
sequential scan is 10x faster:
scratch=# set enable_bitmapscan to off;
SET
scratch=# explain analyze select count(1) from test where random >= 1429076987  and  random < 1429076987 + 10000000;
                                                           QUERY PLAN
            
 
--------------------------------------------------------------------------------------------------------------------------------------------Aggregate
(cost=170142.03..170142.03 rows=1 width=0) (actual time=177419.182..177419.185 rows=1 loops=1)  ->  Index Scan using
test_rand_idxon test  (cost=0.00..170034.11 rows=43167 width=0) (actual time=0.035..177255.696 rows=46764 loops=1)
 Index Cond: ((random >= 1429076987) AND (random < 1439076987))Total runtime: 177419.302 ms
 
(4 rows)
scratch=# set enable_indexscan to off;
SET
scratch=# explain analyze select count(1) from test where random >= 1429076987  and  random < 1429076987 + 10000000;
                                                QUERY PLAN                                                      
 
----------------------------------------------------------------------------------------------------------------------Aggregate
(cost=204165.55..204165.55 rows=1 width=0) (actual time=12334.042..12334.045 rows=1 loops=1)  ->  Seq Scan on test
(cost=0.00..204057.62rows=43167 width=0) (actual time=17.436..12174.150 rows=46764 loops=1)        Filter: ((random >=
1429076987)AND (random < 1439076987))Total runtime: 12334.156 ms
 
(4 rows)
Obviously in this case sequential scan was (would have been) a huge win.
Incrementing random_page_cost from 4 (the default) to 5 causes the
planner to make a better decision.
-jwb
			
		Re: Bitmap scan cost model (was Re: bitmap scans, btree scans, and tid order)
От
 
		    	Manfred Koizar
		    Дата:
		        On Tue, 17 May 2005 22:12:17 -0700, "Jeffrey W. Baker" <jwbaker@acm.org> wrote: >Incrementing random_page_cost from 4 (the default) to 5 causes the >planner to make a better decision. We have such a low default random_page_cost primarily to mask other problems in the optimizer, two of which are . multi-column index correlation . interpolation between min_IO_Cost and max_IO_cost which approximates max_IO_cost too fast. ServusManfred
"Jeffrey W. Baker" <jwbaker@acm.org> writes:
> ... If bitmap
> scan is disabled, the planner will pick index scan even in cases when
> sequential scan is 10x faster:
> scratch=# set enable_bitmapscan to off;
> SET
> scratch=# explain analyze select count(1) from test where random >= 1429076987  and  random < 1429076987 + 10000000;
>                                                                  QUERY PLAN
                    
 
>
--------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=170142.03..170142.03 rows=1 width=0) (actual time=177419.182..177419.185 rows=1 loops=1)
>    ->  Index Scan using test_rand_idx on test  (cost=0.00..170034.11 rows=43167 width=0) (actual
time=0.035..177255.696rows=46764 loops=1)
 
>          Index Cond: ((random >= 1429076987) AND (random < 1439076987))
>  Total runtime: 177419.302 ms
> (4 rows)
> scratch=# set enable_indexscan to off;
> SET
> scratch=# explain analyze select count(1) from test where random >= 1429076987  and  random < 1429076987 + 10000000;
>                                                       QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=204165.55..204165.55 rows=1 width=0) (actual time=12334.042..12334.045 rows=1 loops=1)
>    ->  Seq Scan on test  (cost=0.00..204057.62 rows=43167 width=0) (actual time=17.436..12174.150 rows=46764
loops=1)
>          Filter: ((random >= 1429076987) AND (random < 1439076987))
>  Total runtime: 12334.156 ms
> (4 rows)
> Obviously in this case sequential scan was (would have been) a huge win.
> Incrementing random_page_cost from 4 (the default) to 5 causes the
> planner to make a better decision.
But to get the estimated cost ratio to match up with the actual cost
ratio, we'd have to raise random_page_cost to nearly 70, which is a bit
hard to credit.  What was the platform being tested here?
        regards, tom lane
			
		On Wed, 2005-05-18 at 11:27 -0400, Tom Lane wrote: > "Jeffrey W. Baker" <jwbaker@acm.org> writes: > > Obviously in this case sequential scan was (would have been) a huge win. > > Incrementing random_page_cost from 4 (the default) to 5 causes the > > planner to make a better decision. > > But to get the estimated cost ratio to match up with the actual cost > ratio, we'd have to raise random_page_cost to nearly 70, which is a bit > hard to credit. What was the platform being tested here? i686 Linux 2.6.8 with a single 7200RPM SATA disk. -jwb