Обсуждение: CLUSTER and indisclustered

Поиск
Список
Период
Сортировка

CLUSTER and indisclustered

От
Gavin Sherry
Дата:
Hi all,

It occured to me on the plane home that now that CLUSTER is fixed we may
be able to put pg_index.indisclustered to use. If CLUSTER was to set
indisclustered to true when it clusters a heap according to the given
index, we could speed up sequantial scans. There are two possible ways.

1) Planner determines that a seqscan is appropriate *and* the retrieval is
qualified by the key(s) of one of the relation's indexes
2) Planner determines that the relation is clustered on disk according to
the index over the key(s) used to qualify the retrieval
3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?)
4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster() ?
5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie,
different from SeqNext) called SeqClusterNext
6) SeqClusterNext() has all the heapgettup() logic with two
exceptions: a) we find the first tuple more intelligently (instead of
scanning from the first page) b) if we have found tuple(s) matching the
ScanKey when we encounter an non-matching tuple (via
HeapTupleSatisfies() ?) we return a NULL'ed out tuple, terminating the
scan

Any reason this isn't possible? Any reason it couldn't dramatically speed
up the performance of the type of query i've mentioned?

Gavin



Re: CLUSTER and indisclustered

От
Tom Lane
Дата:
Gavin Sherry <swm@linuxworld.com.au> writes:
> It occured to me on the plane home that now that CLUSTER is fixed we may
> be able to put pg_index.indisclustered to use. If CLUSTER was to set
> indisclustered to true when it clusters a heap according to the given
> index, we could speed up sequantial scans.

AFAICT you're assuming that the table is *exactly* ordered by the
clustered attribute.  While this is true at the instant CLUSTER
completes, the exact ordering will be destroyed by the first insert or
update :-(.  I can't see much value in creating a whole new scan type
that's only usable on a perfectly-clustered table.

The existing approach to making the planner smart about clustered tables
is to compute a physical-vs-logical-order-correlation statistic and use
that to adjust the estimated cost of indexscans.  I believe this is a
more robust approach than considering a table to be "clustered" or "not
clustered", since it can deal with the gradual degradation of clustered
order over time.  However, I will not make any great claims for the
specific equations currently used for this purpose --- they're surely in
need of improvement.  Feel free to take a look and see if you have any
ideas.  The collection of the statistic is in commands/analyze.c and the
use of it is in optimizer/path/costsize.c.
        regards, tom lane


Re: CLUSTER and indisclustered

От
Gavin Sherry
Дата:
On Sat, 3 Aug 2002, Tom Lane wrote:

> Gavin Sherry <swm@linuxworld.com.au> writes:
> > It occured to me on the plane home that now that CLUSTER is fixed we may
> > be able to put pg_index.indisclustered to use. If CLUSTER was to set
> > indisclustered to true when it clusters a heap according to the given
> > index, we could speed up sequantial scans.
> 
> AFAICT you're assuming that the table is *exactly* ordered by the
> clustered attribute.  While this is true at the instant CLUSTER
> completes, the exact ordering will be destroyed by the first insert or
> update :-(.  I can't see much value in creating a whole new scan type

Sorry, I meant to say that heap_insert() etc would need to set
indisclustered to false.

I do see some worth in this however. Naturally, in a situation where a
database is being modified very often this is of little value. However,
for applications focussed on analysing large amounts of static data this
could increase performance significantly. Once I get some time I will
attempt to explore this further in `diff -c` format :-).

Gavin



Re: CLUSTER and indisclustered

От
Bruce Momjian
Дата:
Tom Lane wrote:
> Gavin Sherry <swm@linuxworld.com.au> writes:
> > It occured to me on the plane home that now that CLUSTER is fixed we may
> > be able to put pg_index.indisclustered to use. If CLUSTER was to set
> > indisclustered to true when it clusters a heap according to the given
> > index, we could speed up sequantial scans.
> 
> AFAICT you're assuming that the table is *exactly* ordered by the
> clustered attribute.  While this is true at the instant CLUSTER
> completes, the exact ordering will be destroyed by the first insert or
> update :-(.  I can't see much value in creating a whole new scan type
> that's only usable on a perfectly-clustered table.
> 
> The existing approach to making the planner smart about clustered tables
> is to compute a physical-vs-logical-order-correlation statistic and use
> that to adjust the estimated cost of indexscans.  I believe this is a
> more robust approach than considering a table to be "clustered" or "not
> clustered", since it can deal with the gradual degradation of clustered
> order over time.  However, I will not make any great claims for the
> specific equations currently used for this purpose --- they're surely in
> need of improvement.  Feel free to take a look and see if you have any
> ideas.  The collection of the statistic is in commands/analyze.c and the
> use of it is in optimizer/path/costsize.c.

Tom, should we be updating that flag after we CLUSTER instead of
requiring an ANALYZE after the CLUSTER?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: CLUSTER and indisclustered

От
Bruce Momjian
Дата:
Gavin Sherry wrote:
> Hi all,
> 
> It occured to me on the plane home that now that CLUSTER is fixed we may
> be able to put pg_index.indisclustered to use. If CLUSTER was to set
> indisclustered to true when it clusters a heap according to the given
> index, we could speed up sequantial scans. There are two possible ways.
> 
> 1) Planner determines that a seqscan is appropriate *and* the retrieval is
> qualified by the key(s) of one of the relation's indexes
> 2) Planner determines that the relation is clustered on disk according to
> the index over the key(s) used to qualify the retrieval
> 3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?)
> 4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster() ?
> 5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie,
> different from SeqNext) called SeqClusterNext
> 6) SeqClusterNext() has all the heapgettup() logic with two
> exceptions: a) we find the first tuple more intelligently (instead of
> scanning from the first page) b) if we have found tuple(s) matching the
> ScanKey when we encounter an non-matching tuple (via
> HeapTupleSatisfies() ?) we return a NULL'ed out tuple, terminating the
> scan

Gavin, is that a big win compared to just using the index and looping
through the entries, knowing that the index matches are on the same
page, and the heap matches are on the same page.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: CLUSTER and indisclustered

От
Gavin Sherry
Дата:
On Sat, 3 Aug 2002, Bruce Momjian wrote:

> Gavin Sherry wrote:
> > Hi all,
> > 
> > It occured to me on the plane home that now that CLUSTER is fixed we may
> > be able to put pg_index.indisclustered to use. If CLUSTER was to set
> > indisclustered to true when it clusters a heap according to the given
> > index, we could speed up sequantial scans. There are two possible ways.
> > 
> > 1) Planner determines that a seqscan is appropriate *and* the retrieval is
> > qualified by the key(s) of one of the relation's indexes
> > 2) Planner determines that the relation is clustered on disk according to
> > the index over the key(s) used to qualify the retrieval
> > 3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?)
> > 4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster() ?
> > 5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie,
> > different from SeqNext) called SeqClusterNext
> > 6) SeqClusterNext() has all the heapgettup() logic with two
> > exceptions: a) we find the first tuple more intelligently (instead of
> > scanning from the first page) b) if we have found tuple(s) matching the
> > ScanKey when we encounter an non-matching tuple (via
> > HeapTupleSatisfies() ?) we return a NULL'ed out tuple, terminating the
> > scan
> 
> Gavin, is that a big win compared to just using the index and looping
> through the entries, knowing that the index matches are on the same
> page, and the heap matches are on the same page.

Bruce,

It would cut out the index over head. Besides at (1) (above) we would have
determined that an index scan was too expensive and we would be using a
SeqScan instead. This would just be faster, since a) we would locate the
tuples more intelligently b) we wouldn't need to scan the whole heap once
we'd found all tuples matching the scan key.

Gavin



Re: CLUSTER and indisclustered

От
Bruce Momjian
Дата:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom, should we be updating that flag after we CLUSTER instead of
> > requiring an ANALYZE after the CLUSTER?
> 
> Could do that I suppose, but I'm not super-excited about it.  ANALYZE is
> quite cheap these days (especially in comparison to CLUSTER ;-)).  I'd
> settle for a note in the CLUSTER docs that recommends a subsequent
> ANALYZE --- this seems no different from recommending ANALYZE after bulk
> data load or other major update of a table.

OK.  I am sure it is not obvious to people to ANALYZE because the data
in their table hasn't changed, just the ordering.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: CLUSTER and indisclustered

От
Bruce Momjian
Дата:
Gavin Sherry wrote:
> > Gavin, is that a big win compared to just using the index and looping
> > through the entries, knowing that the index matches are on the same
> > page, and the heap matches are on the same page.
> 
> Bruce,
> 
> It would cut out the index over head. Besides at (1) (above) we would have
> determined that an index scan was too expensive and we would be using a
> SeqScan instead. This would just be faster, since a) we would locate the
> tuples more intelligently b) we wouldn't need to scan the whole heap once
> we'd found all tuples matching the scan key.

Yes, but in a clustered table, an index scan is _never_ (?) more
expensive than a sequential scan, at least if the optimizer is working
correctly.  Index scans are slower only because they assume random heap
access, but with a clustered table, there is no random heap access.  The
index takes to right to the spot to start.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: CLUSTER and indisclustered

От
Tom Lane
Дата:
Gavin Sherry <swm@linuxworld.com.au> writes:
> On Sat, 3 Aug 2002, Tom Lane wrote:
>> AFAICT you're assuming that the table is *exactly* ordered by the
>> clustered attribute.  While this is true at the instant CLUSTER
>> completes, the exact ordering will be destroyed by the first insert or
>> update :-(.  I can't see much value in creating a whole new scan type

> Sorry, I meant to say that heap_insert() etc would need to set
> indisclustered to false.

<<itch>>  You could do that, but only if you are prepared to invent
a mechanism that will instantly invalidate any existing query plans
that assume the clustered ordering is good.

Up to now we've only allowed the planner to make decisions that impact
performace, not correctness of the result.  I'm uncomfortable with the
idea that a "clusterscan" plan could silently return wrong answers after
someone else updates the table and doesn't tell us they did.
        regards, tom lane


Re: CLUSTER and indisclustered

От
Tom Lane
Дата:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom, should we be updating that flag after we CLUSTER instead of
> requiring an ANALYZE after the CLUSTER?

Could do that I suppose, but I'm not super-excited about it.  ANALYZE is
quite cheap these days (especially in comparison to CLUSTER ;-)).  I'd
settle for a note in the CLUSTER docs that recommends a subsequent
ANALYZE --- this seems no different from recommending ANALYZE after bulk
data load or other major update of a table.
        regards, tom lane


Re: CLUSTER and indisclustered

От
mark Kirkwood
Дата:
Gavin Sherry wrote:

>On Sat, 3 Aug 2002, Bruce Momjian wrote:
>
>>Gavin Sherry wrote:
>>
>>>Hi all,
>>>
>>>It occured to me on the plane home that now that CLUSTER is fixed we may
>>>be able to put pg_index.indisclustered to use. If CLUSTER was to set
>>>indisclustered to true when it clusters a heap according to the given
>>>index, we could speed up sequantial scans. There are two possible ways.
>>>
>>>1) Planner determines that a seqscan is appropriate *and* the retrieval is
>>>qualified by the key(s) of one of the relation's indexes
>>>2) Planner determines that the relation is clustered on disk according to
>>>the index over the key(s) used to qualify the retrieval
>>>3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?)
>>>4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster() ?
>>>5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie,
>>>different from SeqNext) called SeqClusterNext
>>>6) SeqClusterNext() has all the heapgettup() logic with two
>>>exceptions: a) we find the first tuple more intelligently (instead of
>>>scanning from the first page) b) if we have found tuple(s) matching the
>>>ScanKey when we encounter an non-matching tuple (via
>>>HeapTupleSatisfies() ?) we return a NULL'ed out tuple, terminating the
>>>scan
>>>
>>Gavin, is that a big win compared to just using the index and looping
>>through the entries, knowing that the index matches are on the same
>>page, and the heap matches are on the same page.
>>
>
>Bruce,
>
>It would cut out the index over head. Besides at (1) (above) we would have
>determined that an index scan was too expensive and we would be using a
>SeqScan instead. This would just be faster, since a) we would locate the
>tuples more intelligently b) we wouldn't need to scan the whole heap once
>we'd found all tuples matching the scan key.
>
>Gavin
>


Gavin and Bruce,

I am not so sure index access in these cases is such an overhead - since 
the clustered nature of the table means that many index elements will 
point to table data that is located  in a few pages .


Ok, this change would save you the initial access of the index structure 
itself - but isnt
the usual killer for indexes is the "thrashing" that happens when the 
"pointed to" table data is spread over a many pages.

I did some tests on this a while ago ( 7.1 era) and discovered that for 
a clustered table a sequential scan did not start to win until the 
target dataset was ~30% of the table itself.
While the suggested change would of course mean that seq scans would do 
much better than they did in my tests, it would be interesting to see 
some timings...


regards

Mark




Re: CLUSTER and indisclustered

От
Curt Sampson
Дата:
On Sun, 4 Aug 2002, mark Kirkwood wrote:

> Ok, this change would save you the initial access of the index
> structure itself - but isnt the usual killer for indexes is the
> "thrashing" that happens when the "pointed to" table data is spread
> over a many pages.

Yeah, no kidding on this one. I've reduced queries from 75 seconds
to 0.6 seconds by clustering on the appropriate field.

But after doing some benchmarking of various sorts of random reads
and writes, it occurred to me that there might be optimizations
that could help a lot with this sort of thing. What if, when we've
got an index block with a bunch of entries, instead of doing the
reads in the order of the entries, we do them in the order of the
blocks the entries point to? That would introduce a certain amount
of "sequentialness" to the reads that the OS is not capable of
introducing (since it can't reschedule the reads you're doing, the
way it could reschedule, say, random writes).

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org   Don't you know, in this new Dark Age, we're
alllight.  --XTC
 



Re: CLUSTER and indisclustered

От
Gavin Sherry
Дата:
On Wed, 7 Aug 2002, Curt Sampson wrote:

> But after doing some benchmarking of various sorts of random reads
> and writes, it occurred to me that there might be optimizations
> that could help a lot with this sort of thing. What if, when we've
> got an index block with a bunch of entries, instead of doing the
> reads in the order of the entries, we do them in the order of the
> blocks the entries point to? That would introduce a certain amount
> of "sequentialness" to the reads that the OS is not capable of
> introducing (since it can't reschedule the reads you're doing, the
> way it could reschedule, say, random writes).

This sounds more or less like the method employed by Firebird as described
by Ann Douglas to Tom at OSCON (correct me if I get this wrong).

Basically, firebird populates a bitmap with entries the scan is interested
in. The bitmap is populated in page order so that all entries on the same
heap page can be fetched at once.

This is totally different to the way postgres does things and would
require significant modification to the index access methods.

Gavin



Re: CLUSTER and indisclustered

От
Tom Lane
Дата:
Curt Sampson <cjs@cynic.net> writes:
> But after doing some benchmarking of various sorts of random reads
> and writes, it occurred to me that there might be optimizations
> that could help a lot with this sort of thing. What if, when we've
> got an index block with a bunch of entries, instead of doing the
> reads in the order of the entries, we do them in the order of the
> blocks the entries point to?

I thought to myself "didn't I just post something about that?"
and then realized it was on a different mailing list.  Here ya go
(and no, this is not the first time around on this list either...)


I am currently thinking that bitmap indexes per se are not all that
interesting.  What does interest me is bitmapped index lookup, which
came back into mind after hearing Ann Harrison describe how FireBird/
InterBase does it.

The idea is that you don't scan the index and base table concurrently
as we presently do it.  Instead, you scan the index and make a list
of the TIDs of the table tuples you need to visit.  This list can
be conveniently represented as a sparse bitmap.  After you've finished
looking at the index, you visit all the required table tuples *in
physical order* using the bitmap.  This eliminates multiple fetches
of the same heap page, and can possibly let you get some win from
sequential access.

Once you have built this mechanism, you can then move on to using
multiple indexes in interesting ways: you can do several indexscans
in one query and then AND or OR their bitmaps before doing the heap
scan.  This would allow, for example, "WHERE a = foo and b = bar"
to be handled by ANDing results from separate indexes on the a and b
columns, rather than having to choose only one index to use as we do
now.

Some thoughts about implementation: FireBird's implementation seems
to depend on an assumption about a fixed number of tuple pointers
per page.  We don't have that, but we could probably get away with
just allocating BLCKSZ/sizeof(HeapTupleHeaderData) bits per page.
Also, the main downside of this approach is that the bitmap could
get large --- but you could have some logic that causes you to fall
back to plain sequential scan if you get too many index hits.  (It's
interesting to think of this as lossy compression of the bitmap...
which leads to the idea of only being fuzzy in limited areas of the
bitmap, rather than losing all the information you have.)

A possibly nasty issue is that lazy VACUUM has some assumptions in it
about indexscans holding pins on index pages --- that's what prevents
it from removing heap tuples that a concurrent indexscan is just about
to visit.  It might be that there is no problem: even if lazy VACUUM
removes a heap tuple and someone else then installs a new tuple in that
same TID slot, you should be okay because the new tuple is too new to
pass your visibility test.  But I'm not convinced this is safe.
        regards, tom lane


Re: CLUSTER and indisclustered

От
Curt Sampson
Дата:
On Wed, 7 Aug 2002, Tom Lane wrote:

> I thought to myself "didn't I just post something about that?"
> and then realized it was on a different mailing list.  Here ya go
> (and no, this is not the first time around on this list either...)

Wow. I'm glad to see you looking at this, because this feature would so
*so* much for the performance of some of my queries, and really, really
impress my "billion-row-database" client.

> The idea is that you don't scan the index and base table concurrently
> as we presently do it.  Instead, you scan the index and make a list
> of the TIDs of the table tuples you need to visit.

Right.

> Also, the main downside of this approach is that the bitmap could
> get large --- but you could have some logic that causes you to fall
> back to plain sequential scan if you get too many index hits.

Well, what I was thinking of, should the list of TIDs to fetch get too
long, was just to break it down in to chunks. If you want to limit to,
say, 1000 TIDs, and your index has 3000, just do the first 1000, then
the next 1000, then the last 1000. This would still result in much less
disk head movement and speed the query immensely.

(BTW, I have verified this emperically during testing of random read vs.
random write on a RAID controller. The writes were 5-10 times faster
than the reads because the controller was caching a number of writes and
then doing them in the best possible order, whereas the reads had to be
satisfied in the order they were submitted to the controller.)

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org   Don't you know, in this new Dark Age, we're
alllight.  --XTC
 



Re: CLUSTER and indisclustered

От
Tom Lane
Дата:
Curt Sampson <cjs@cynic.net> writes:
> On Wed, 7 Aug 2002, Tom Lane wrote:
>> Also, the main downside of this approach is that the bitmap could
>> get large --- but you could have some logic that causes you to fall
>> back to plain sequential scan if you get too many index hits.

> Well, what I was thinking of, should the list of TIDs to fetch get too
> long, was just to break it down in to chunks.

But then you lose the possibility of combining multiple indexes through
bitmap AND/OR steps, which seems quite interesting to me.  If you've
visited only a part of each index then you can't apply that concept.

Another point to keep in mind is that the bigger the bitmap gets, the
less useful an indexscan is, by definition --- sooner or later you might
as well fall back to a seqscan.  So the idea of lossy compression of a
large bitmap seems really ideal to me.  In principle you could seqscan
the parts of the table where matching tuples are thick on the ground,
and indexscan the parts where they ain't.  Maybe this seems natural
to me as an old JPEG campaigner, but if you don't see the logic I
recommend thinking about it a little ...
        regards, tom lane


Re: CLUSTER and indisclustered

От
Curt Sampson
Дата:
On Wed, 7 Aug 2002, Tom Lane wrote:

> But then you lose the possibility of combining multiple indexes through
> bitmap AND/OR steps, which seems quite interesting to me.  If you've
> visited only a part of each index then you can't apply that concept.

Right. It'd be a shame to lose that, but a little is better than nothing
at all, if one ends up being faced with that decision.

> Another point to keep in mind is that the bigger the bitmap gets, the
> less useful an indexscan is, by definition --- sooner or later you might
> as well fall back to a seqscan.

Well, yes, so long as you chose the correct values of "big." I'd want
this to be able to optimize queries against a two billion row table
about 150 GB in size. And that might even get bigger in a few years.

> Maybe this seems natural
> to me as an old JPEG campaigner, but if you don't see the logic I
> recommend thinking about it a little ...

Well, photos are certainly not random, but database tables may be
in essentially random order far more often. How much that applies,
I'm not sure, since I don't really know a lot about this stuff.
I'll take your word for it on what's best there.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org   Don't you know, in this new Dark Age, we're
alllight.  --XTC
 



Re: CLUSTER and indisclustered

От
Hannu Krosing
Дата:
On Wed, 2002-08-07 at 10:12, Tom Lane wrote:
> Curt Sampson <cjs@cynic.net> writes:
> > On Wed, 7 Aug 2002, Tom Lane wrote:
> >> Also, the main downside of this approach is that the bitmap could
> >> get large --- but you could have some logic that causes you to fall
> >> back to plain sequential scan if you get too many index hits.
> 
> > Well, what I was thinking of, should the list of TIDs to fetch get too
> > long, was just to break it down in to chunks.
> 
> But then you lose the possibility of combining multiple indexes through
> bitmap AND/OR steps, which seems quite interesting to me.  If you've
> visited only a part of each index then you can't apply that concept.

When the tuples are small relative to pagesize, you may get some
"compression" by saving just pages and not the actual tids in the the
bitmap.

-------------
Hannu


Re: CLUSTER and indisclustered

От
Hannu Krosing
Дата:
On Wed, 2002-08-07 at 06:46, Hannu Krosing wrote:
> On Wed, 2002-08-07 at 10:12, Tom Lane wrote:
> > Curt Sampson <cjs@cynic.net> writes:
> > > On Wed, 7 Aug 2002, Tom Lane wrote:
> > >> Also, the main downside of this approach is that the bitmap could
> > >> get large --- but you could have some logic that causes you to fall
> > >> back to plain sequential scan if you get too many index hits.
> > 
> > > Well, what I was thinking of, should the list of TIDs to fetch get too
> > > long, was just to break it down in to chunks.
> > 
> > But then you lose the possibility of combining multiple indexes through
> > bitmap AND/OR steps, which seems quite interesting to me.  If you've
> > visited only a part of each index then you can't apply that concept.
> 
> When the tuples are small relative to pagesize, you may get some
> "compression" by saving just pages and not the actual tids in the the
> bitmap.

Now I remembered my original preference for page bitmaps (vs. tuple
bitmaps): one can't actually make good use of a bitmap of tuples because
there is no fixed tuples/page ratio and thus no way to quickly go from
bit position to actual tuple. You mention the same problem but propose a
different solution.

Using page bitmap, we will at least avoid fetching any unneeded pages -
essentially we will have a sequential scan over possibly interesting
pages.

If we were to use page-bitmap index for something with only a few values
like booleans, some insert-time local clustering should be useful, so
that TRUEs and FALSEs end up on different pages.

But I guess that CLUSTER support for INSERT will not be touched for 7.3
as will real bitmap indexes ;)

---------------
Hannu



Re: CLUSTER and indisclustered

От
Tom Lane
Дата:
Hannu Krosing <hannu@tm.ee> writes:
> Now I remembered my original preference for page bitmaps (vs. tuple
> bitmaps): one can't actually make good use of a bitmap of tuples because
> there is no fixed tuples/page ratio and thus no way to quickly go from
> bit position to actual tuple. You mention the same problem but propose a
> different solution.

> Using page bitmap, we will at least avoid fetching any unneeded pages -
> essentially we will have a sequential scan over possibly interesting
> pages.

Right.  One form of the "lossy compression" idea I suggested is to
switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
too large to work with.  Again, one could imagine doing that only in
denser areas of the bitmap.

> But I guess that CLUSTER support for INSERT will not be touched for 7.3
> as will real bitmap indexes ;)

All of this is far-future work I think.  Adding a new scan type to the
executor would probably be pretty localized, but the ramifications in
the planner could be extensive --- especially if you want to do plans
involving ANDed or ORed bitmaps.
        regards, tom lane


Re: CLUSTER and indisclustered

От
Hannu Krosing
Дата:
On Wed, 2002-08-07 at 15:26, Tom Lane wrote:
> Hannu Krosing <hannu@tm.ee> writes:
> > Now I remembered my original preference for page bitmaps (vs. tuple
> > bitmaps): one can't actually make good use of a bitmap of tuples because
> > there is no fixed tuples/page ratio and thus no way to quickly go from
> > bit position to actual tuple. You mention the same problem but propose a
> > different solution.
> 
> > Using page bitmap, we will at least avoid fetching any unneeded pages -
> > essentially we will have a sequential scan over possibly interesting
> > pages.
> 
> Right.  One form of the "lossy compression" idea I suggested is to
> switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
> too large to work with.  

If it is a real bitmap, should it not be easyeast to allocate at the
start ?

a page bitmap for a 100 000 000 tuple table with 10 tuples/page will be
sized 10000000/8 = 1.25 MB, which does not look too big for me for that
amount of data (the data table itself would occupy 80 GB).

Even having the bitmap of 16 bits/page (with the bits 0-14 meaning
tuples 0-14 and bit 15 meaning "seq scan the rest of page") would
consume just 20 MB of _local_ memory, and would be quite justifyiable
for a query on a table that large.

For a real bitmap index the tuples-per-page should be a user-supplied
tuning parameter.

> Again, one could imagine doing that only in denser areas of the bitmap.

I would hardly call the resulting structure "a bitmap" ;)

And I'm not sure the overhead for a more complex structure would win us
any additional performance for most cases.

> > But I guess that CLUSTER support for INSERT will not be touched for 7.3
> > as will real bitmap indexes ;)
> 
> All of this is far-future work I think. 

After we do that we will probably be able claim support for
"datawarehousing" ;)

> Adding a new scan type to the
> executor would probably be pretty localized, but the ramifications in
> the planner could be extensive --- especially if you want to do plans
> involving ANDed or ORed bitmaps.

Also going to "smart inserter" which can do local clustering on sets of
real bitmap indexes for INSERTS (and INSERT side of UPDATE) would
probably be a major change from our current "stupid inserter" ;)

This will not be needed for bitmap resolution higher than 1bit/page but
default local clustering on bitmap indexes will probably buy us some
extra performance. by avoiding data page fetches when such indexes are
used.

AN anyway the support for INSERT being aware of clustering will probably
come up sometime.

------------
Hannu




Re: CLUSTER and indisclustered

От
Hannu Krosing
Дата:
On Wed, 2002-08-07 at 04:31, Curt Sampson wrote:
> On Sun, 4 Aug 2002, mark Kirkwood wrote:
> 
> > Ok, this change would save you the initial access of the index
> > structure itself - but isnt the usual killer for indexes is the
> > "thrashing" that happens when the "pointed to" table data is spread
> > over a many pages.
> 
> Yeah, no kidding on this one. I've reduced queries from 75 seconds
> to 0.6 seconds by clustering on the appropriate field.
> 
> But after doing some benchmarking of various sorts of random reads
> and writes, it occurred to me that there might be optimizations
> that could help a lot with this sort of thing. What if, when we've
> got an index block with a bunch of entries, instead of doing the
> reads in the order of the entries, we do them in the order of the
> blocks the entries point to? That would introduce a certain amount
> of "sequentialness" to the reads that the OS is not capable of
> introducing (since it can't reschedule the reads you're doing, the
> way it could reschedule, say, random writes).
>

I guess this could be solved elegantly using threading - one thread
scans index and pushes tids into a btree or some other sorted structure,
while other thread loops continuously (or "elevatorly" back and forth)
over that structure in tuple order and does the actual data reads. 

This would have the added benefit of better utilising multiprocessor
computers.

---------------
Hannu



Re: CLUSTER and indisclustered

От
Tom Lane
Дата:
Hannu Krosing <hannu@tm.ee> writes:
> On Wed, 2002-08-07 at 15:26, Tom Lane wrote:
>> Right.  One form of the "lossy compression" idea I suggested is to
>> switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
>> too large to work with.  

> If it is a real bitmap, should it not be easyeast to allocate at the
> start ?

But it isn't a "real bitmap".  That would be a really poor
implementation, both for space and speed --- do you really want to scan
over a couple of megs of zeroes to find the few one-bits you care about,
in the typical case?  "Bitmap" is a convenient term because it describes
the abstract behavior we want, but the actual data structure will
probably be nontrivial.  If I recall Ann's description correctly,
Firebird's implementation uses run length coding of some kind (anyone
care to dig in their source and get all the details?).  If we tried
anything in the way of lossy compression then there'd be even more stuff
lurking under the hood.
        regards, tom lane


Re: CLUSTER and indisclustered

От
Hannu Krosing
Дата:
On Wed, 2002-08-07 at 16:26, Tom Lane wrote:
> Hannu Krosing <hannu@tm.ee> writes:
> > On Wed, 2002-08-07 at 15:26, Tom Lane wrote:
> >> Right.  One form of the "lossy compression" idea I suggested is to
> >> switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
> >> too large to work with.  
> 
> > If it is a real bitmap, should it not be easyeast to allocate at the
> > start ?
> 
> But it isn't a "real bitmap".  That would be a really poor
> implementation, both for space and speed --- do you really want to scan
> over a couple of megs of zeroes to find the few one-bits you care about,
> in the typical case?

I guess that depends on data. The typical case should be somthing the
stats process will find out so the optimiser can use it

The bitmap must be less than 1/48 (size of TID) full for best
uncompressed "active-tid-list" to be smaller than plain bitmap. If there
were some structure above list then this ratio would be even higher.

I have had good experience using "compressed delta lists", which will
scale well ofer the whole "fullness" spectrum of bitmap, but this is for
storage, not for initial constructing of lists.

>  "Bitmap" is a convenient term because it describes
> the abstract behavior we want, but the actual data structure will
> probably be nontrivial.  If I recall Ann's description correctly,
> Firebird's implementation uses run length coding of some kind (anyone
> care to dig in their source and get all the details?).

Plain RLL is probably a good way to store it and for merging two or more
bitmaps, but not as good for constructing it bit-by-bit. I guess the
most effective structure for updating is often still a plain bitmap
(maybe not if it is very sparse and all of it does not fit in cache),
followed by some kind of balanced tree (maybe rb-tree).

If the bitmap is relatively full then the plain bitmap is almost always
the most effective to update.

> If we tried anything in the way of lossy compression then there'd
> be even more stuff lurking under the hood.

Having three-valued (0,1,maybe) RLL-encoded "tritmap" would be a good
way to represent lossy compression, and it would also be quite
straightforward to merge two of these using AND or OR. It may even be
possible to easily construct it using a fixed-length b-tree and going
from 1 to "maybe" for nodes that get too dense.

---------------
Hannu



Re: CLUSTER and indisclustered

От
Alvaro Herrera
Дата:
Tom Lane dijo: 

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom, should we be updating that flag after we CLUSTER instead of
> > requiring an ANALYZE after the CLUSTER?
> 
> Could do that I suppose, but I'm not super-excited about it.  ANALYZE is
> quite cheap these days (especially in comparison to CLUSTER ;-)).  I'd
> settle for a note in the CLUSTER docs that recommends a subsequent
> ANALYZE --- this seems no different from recommending ANALYZE after bulk
> data load or other major update of a table.

What if I [try to] extend the grammar to support an additional ANALYZE
in CLUSTER, so that it analyzes the table automatically? Say

CLUSTER <index> ON <table> [ANALYZE];

Or maybe just do an analyze of the table automatically after the
CLUSTERing.

What does everybody think?

-- 
Alvaro Herrera (<alvherre[a]atentus.com>)
"Para tener mas hay que desear menos"



Re: CLUSTER and indisclustered

От
Neil Conway
Дата:
Alvaro Herrera <alvherre@atentus.com> writes:
> What if I [try to] extend the grammar to support an additional ANALYZE
> in CLUSTER, so that it analyzes the table automatically?

I don't like this -- it seems like bloat. What's the advantage of

CLUSTER foo ON bar ANALYZE;

over

CLUSTER foo ON bar;
ANALYZE;

> Or maybe just do an analyze of the table automatically after the
> CLUSTERing.

Hmmm... I don't really see the problem with adding a note in the docs
suggesting that users following a CLUSTER with an ANALYZE (of course,
that assumes that the CLUSTER will significantly change the ordering
of the data in the table, which isn't always the case -- which is
another reason why make this automatic seems unwarranted, IMHO). It
seems like you're looking for a solution to a non-existent problem.

Cheers,

Neil

-- 
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC



Re: CLUSTER and indisclustered

От
Alvaro Herrera
Дата:
Neil Conway dijo: 

> Alvaro Herrera <alvherre@atentus.com> writes:
> > What if I [try to] extend the grammar to support an additional ANALYZE
> > in CLUSTER, so that it analyzes the table automatically?
> 
> I don't like this -- it seems like bloat.

Maybe you are right.


> > Or maybe just do an analyze of the table automatically after the
> > CLUSTERing.
> 
> Hmmm... I don't really see the problem with adding a note in the docs
> suggesting that users following a CLUSTER with an ANALYZE (...).

ANALYZE is an inexpensive operation (compared to CLUSTER, anyway), so it
can't hurt to have it done automatically.

-- 
Alvaro Herrera (<alvherre[a]atentus.com>)
"Linux transformó mi computadora, de una `máquina para hacer cosas',
en un aparato realmente entretenido, sobre el cual cada día aprendo
algo nuevo" (Jaime Salinas)




Re: CLUSTER and indisclustered

От
"Christopher Kings-Lynne"
Дата:
> > > Or maybe just do an analyze of the table automatically after the
> > > CLUSTERing.
> >
> > Hmmm... I don't really see the problem with adding a note in the docs
> > suggesting that users following a CLUSTER with an ANALYZE (...).
>
> ANALYZE is an inexpensive operation (compared to CLUSTER, anyway), so it
> can't hurt to have it done automatically.

Well we have previously had discussions on the topic of adding analyze to
the end of dumps, etc. and the result has always been in favour of keeping
the command set orthogonal and not doing an automatic analyze...

Chris



Re: CLUSTER and indisclustered

От
Alvaro Herrera
Дата:
Christopher Kings-Lynne dijo: 

> > > > Or maybe just do an analyze of the table automatically after the
> > > > CLUSTERing.
> 
> Well we have previously had discussions on the topic of adding analyze to
> the end of dumps, etc. and the result has always been in favour of keeping
> the command set orthogonal and not doing an automatic analyze...

Oh.  Sorry for the noise.

I'm trying to look at other things in the TODO so I stop pestering about
CLUSTER.

-- 
Alvaro Herrera (<alvherre[a]atentus.com>)
"Pensar que el espectro que vemos es ilusorio no lo despoja de espanto,
sólo le suma el nuevo terror de la locura" (Perelandra, CSLewis)



Re: CLUSTER and indisclustered

От
"Christopher Kings-Lynne"
Дата:
> > Well we have previously had discussions on the topic of adding
> analyze to
> > the end of dumps, etc. and the result has always been in favour
> of keeping
> > the command set orthogonal and not doing an automatic analyze...
>
> Oh.  Sorry for the noise.
>
> I'm trying to look at other things in the TODO so I stop pestering about
> CLUSTER.

All I can say is - thanks for fixing CLUSTER.  As soon as we upgrade to 7.3
I'm going on a CLUSTERing spree :)

Chris



Re: CLUSTER and indisclustered

От
"Christopher Kings-Lynne"
Дата:
If you're looking for something very useful to work on, see if Gavin
Sherry(?) can post his old CREATE OR REPLACE VIEW code.  I'm pretty sure he
(or someone) said that he had an old patch, that needed to be synced with
HEAD...  This functionality is pretty essential for 7.3...

Chris

> -----Original Message-----
> From: Alvaro Herrera [mailto:alvherre@atentus.com]
> Sent: Friday, 9 August 2002 10:21 AM
> To: Christopher Kings-Lynne
> Cc: Neil Conway; Tom Lane; Bruce Momjian; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] CLUSTER and indisclustered
>
>
> Christopher Kings-Lynne dijo:
>
> > > > > Or maybe just do an analyze of the table automatically after the
> > > > > CLUSTERing.
> >
> > Well we have previously had discussions on the topic of adding
> analyze to
> > the end of dumps, etc. and the result has always been in favour
> of keeping
> > the command set orthogonal and not doing an automatic analyze...
>
> Oh.  Sorry for the noise.
>
> I'm trying to look at other things in the TODO so I stop pestering about
> CLUSTER.
>
> --
> Alvaro Herrera (<alvherre[a]atentus.com>)
> "Pensar que el espectro que vemos es ilusorio no lo despoja de espanto,
> sólo le suma el nuevo terror de la locura" (Perelandra, CSLewis)
>



Re: CLUSTER and indisclustered

От
Tom Lane
Дата:
Neil Conway <nconway@klamath.dyndns.org> writes:
> Alvaro Herrera <alvherre@atentus.com> writes:
>> What if I [try to] extend the grammar to support an additional ANALYZE
>> in CLUSTER, so that it analyzes the table automatically?

> I don't like this -- it seems like bloat.

My reaction exactly.
        regards, tom lane


Re: CLUSTER and indisclustered

От
"Dave Page"
Дата:

> -----Original Message-----
> From: Christopher Kings-Lynne [mailto:chriskl@familyhealth.com.au]
> Sent: 09 August 2002 03:57
> To: Alvaro Herrera
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] CLUSTER and indisclustered
>
>
> If you're looking for something very useful to work on, see if Gavin
> Sherry(?) can post his old CREATE OR REPLACE VIEW code.  I'm
> pretty sure he (or someone) said that he had an old patch,
> that needed to be synced with HEAD...  This functionality is
> pretty essential for 7.3...
>

I'll second that...

Dave.


Re: CLUSTER and indisclustered

От
Bruce Momjian
Дата:
I wanted to comment on this bitmapped index discussion because I am
hearing a lot about star joins, data warehousing, and bitmapped indexes
recently.

It seems we have several uses for bitmapped indexes:
Do index lookups in sequential heap orderAllow joining of bitmapped indexes to construct arbitrary indexes

There is a web page about "star joins" used a lot in data warehousing,
where you don't know what queries are going to be required and what
indexes to create:
http://www.dbdomain.com/a100397.htm

They show some sample queries, which is good.  Here is some
interesting text:
Star Transformation
If there are bitmap indexes on SALES_REP_ID, PRODUCT_ID, andDEPARTMENT_ID in the SALES table, then Oracle can resolve
thequeryusing merges of the bitmap indexes.Because Oracle can efficiently merge multiple bitmap indexes, you can create
asingle bitmap index on each of the foreign-key columns in thefact table rather than on every possible combination of
columns.Thislets you support all possible combinations of dimensions withoutcreating an unreasonable number of
indexes.

Added to TODO:* Use bitmaps to fetch heap pages in sequential order [performance]     * Use bitmaps to combine existing
indexes[performance]
 

and I will add some of these emails to TODO.detail/performance.

---------------------------------------------------------------------------

Tom Lane wrote:
> Curt Sampson <cjs@cynic.net> writes:
> > But after doing some benchmarking of various sorts of random reads
> > and writes, it occurred to me that there might be optimizations
> > that could help a lot with this sort of thing. What if, when we've
> > got an index block with a bunch of entries, instead of doing the
> > reads in the order of the entries, we do them in the order of the
> > blocks the entries point to?
> 
> I thought to myself "didn't I just post something about that?"
> and then realized it was on a different mailing list.  Here ya go
> (and no, this is not the first time around on this list either...)
> 
> 
> I am currently thinking that bitmap indexes per se are not all that
> interesting.  What does interest me is bitmapped index lookup, which
> came back into mind after hearing Ann Harrison describe how FireBird/
> InterBase does it.
> 
> The idea is that you don't scan the index and base table concurrently
> as we presently do it.  Instead, you scan the index and make a list
> of the TIDs of the table tuples you need to visit.  This list can
> be conveniently represented as a sparse bitmap.  After you've finished
> looking at the index, you visit all the required table tuples *in
> physical order* using the bitmap.  This eliminates multiple fetches
> of the same heap page, and can possibly let you get some win from
> sequential access.
> 
> Once you have built this mechanism, you can then move on to using
> multiple indexes in interesting ways: you can do several indexscans
> in one query and then AND or OR their bitmaps before doing the heap
> scan.  This would allow, for example, "WHERE a = foo and b = bar"
> to be handled by ANDing results from separate indexes on the a and b
> columns, rather than having to choose only one index to use as we do
> now.
> 
> Some thoughts about implementation: FireBird's implementation seems
> to depend on an assumption about a fixed number of tuple pointers
> per page.  We don't have that, but we could probably get away with
> just allocating BLCKSZ/sizeof(HeapTupleHeaderData) bits per page.
> Also, the main downside of this approach is that the bitmap could
> get large --- but you could have some logic that causes you to fall
> back to plain sequential scan if you get too many index hits.  (It's
> interesting to think of this as lossy compression of the bitmap...
> which leads to the idea of only being fuzzy in limited areas of the
> bitmap, rather than losing all the information you have.)
> 
> A possibly nasty issue is that lazy VACUUM has some assumptions in it
> about indexscans holding pins on index pages --- that's what prevents
> it from removing heap tuples that a concurrent indexscan is just about
> to visit.  It might be that there is no problem: even if lazy VACUUM
> removes a heap tuple and someone else then installs a new tuple in that
> same TID slot, you should be okay because the new tuple is too new to
> pass your visibility test.  But I'm not convinced this is safe.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
> http://archives.postgresql.org
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: CLUSTER and indisclustered

От
Hannu Krosing
Дата:
On Tue, 2002-08-13 at 09:25, Bruce Momjian wrote:
> 
> There is a web page about "star joins" used a lot in data warehousing,
> where you don't know what queries are going to be required and what
> indexes to create:
> 
>     http://www.dbdomain.com/a100397.htm
> 
> They show some sample queries, which is good.  Here is some
> interesting text:
> 
>     Star Transformation
> 
>     If there are bitmap indexes on SALES_REP_ID, PRODUCT_ID, and
>     DEPARTMENT_ID in the SALES table, then Oracle can resolve the query
>     using merges of the bitmap indexes.
>     
>     Because Oracle can efficiently merge multiple bitmap indexes, you can 
>     create a single bitmap index on each of the foreign-key columns in the
>     fact table rather than on every possible combination of columns.

Another way to achive the similar result would be using segmented hash
indexes, where each column maps directly to some part of hash value.

> This
>     lets you support all possible combinations of dimensions without
>     creating an unreasonable number of indexes.

-----------
Hannu