Обсуждение: No heap lookups on index
Allow me a brief introduction. I work in a company who contracts intelligence analysis software to the government. We are currently developing a product which is using PostgreSQL at it's core. Due to the licensing of the product and the integration with perl this is our first choice in database solutions. We are, however, currently stuck. We are storing millions of rows and require very high query performance. We have spent the last several months tweaking, list lurking and researching all the various tweaks and performance enhancements and have come to the conclusion that our biggest slowdown is validating the index rows which match our selection criteria against the heap values. In general cases there is a very small amount required for this, but in our extreme use cases we are finding this to slow our queries by an unacceptable amount of time. We would like to resolve this issue. In that endeavor we have done some feasibility analysis (either to write a patch ourselves or attempt to commission an expert to do so), starting with the archives for this list. We found several posts discussing the issue and it seems that the complexity of storing the tuple visibility information inside of the index rows is prohibitive for simple indexes. I have used SQL Server in the past and have noticed that bookmark lookups are avoided because they force the query executor to actually fetch the data page off of disk, rather then return the values that exist in the index. I have verified times against the PostgreSQL installation and SQL Server to verify that the SQL Server queries come back at roughly the same speed when avoiding bookmark lookups as Postgres queries accessing clustered tables using the index the table is clustered on. Since I am sure everyone is tired of the intro by now, I'll get to the questions: Do commercial databases implement MVCC in a way that allows an efficient implementation of index lookups that can avoid heap lookups? Is there any way to modify PostgreSQL to allowindex lookups without heap validation that doesn't involve re-writing the MVCC implementation of keeping dead rows on the live table? Is the additional overhead of keeping full tuple visibility information inside of the index so odious to the Postgres community as to prevent a patch with this solution from being applied back to the head? Maybe as an optional use feature? We would prefer this solution for our needs over the bitmap of heap pages listed in the TODO list because we want to ensure optimal query times, regardless of the state of the cache and because we are concerned with performance in the face of concurrent updates on the page level. Thanks for any thoughts on this, I know this is a perennial topic, but we are seriously considering contributing either code or money to the solution of this problem. David Scott Applied Technical Systems, Inc.
David Scott <davids@apptechsys.com> writes:
>     Is the additional overhead of keeping full tuple visibility 
> information inside of the index so odious to the Postgres community as 
> to prevent a patch with this solution from being applied back to the 
> head?
This has been discussed and rejected before (multiple times).  If you
want it considered you'll have to present stronger arguments than have
so far been made.  The current consensus is that the probability of a
net performance win is not good enough to justify the large amount of
development effort that would be required.
What sort of problems are you dealing with exactly?  There has been
some discussion of changes that would improve certain scenarios.  For
instance it might be plausible to do joins using index information and
only go back to the heap for entries that appear to pass the join test.
        regards, tom lane
			
		
			
				David,
You can find some of this discussion in "Much Ado About COUNT(*)". Related to that discussion, I had written a patch which added visibility information to the indexes.
If you're interested in the patch and/or consulting, contact me offline.
-Jonah
			
		
		
	You can find some of this discussion in "Much Ado About COUNT(*)". Related to that discussion, I had written a patch which added visibility information to the indexes.
If you're interested in the patch and/or consulting, contact me offline.
-Jonah
On 1/18/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Scott <davids@apptechsys.com> writes:
> Is the additional overhead of keeping full tuple visibility
> information inside of the index so odious to the Postgres community as
> to prevent a patch with this solution from being applied back to the
> head?
This has been discussed and rejected before (multiple times). If you
want it considered you'll have to present stronger arguments than have
so far been made. The current consensus is that the probability of a
net performance win is not good enough to justify the large amount of
development effort that would be required.
What sort of problems are you dealing with exactly? There has been
some discussion of changes that would improve certain scenarios. For
instance it might be plausible to do joins using index information and
only go back to the heap for entries that appear to pass the join test.
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
On Wed, 2006-01-18 at 12:14 -0800, David Scott wrote: > Is the additional overhead of keeping full tuple visibility > information inside of the index so odious to the Postgres community > as > to prevent a patch with this solution from being applied back to the > head? Maybe as an optional use feature? You might want to consider the thought of "organised heaps" as an alternative thought to index improvements. That way there is no heap to avoid visiting because the index is also the main data structure. Teradata provides hash or value-ordered tables Oracle offers index organised tables DB2 offers multi-dimensional clustering Tandem offered value ordered tables This would offer performance, but would be one of the largest patches seen in recent times. You may find some co-backers. Best Regards, Simon Riggs
On Wed, Jan 18, 2006 at 12:14:12PM -0800, David Scott wrote: > Do commercial databases implement MVCC in a way that allows an > efficient implementation of index lookups that can avoid heap lookups? Oracle does, but you pay in other ways. Instead of keeping dead tuples in the main heap, they shuffle them off to an 'undo log'. This has some downsides: Rollbacks take *forever*, though this usually isn't much of an issue unless you need to abort a really big transaction. Every update/delete means two seperate writes to disk, one for the base table and one for the undo log (well, there's also the redo log, equivalent to our WAL). Though writes to undo can (and presumably are) grouped together, so they should normally be a lot more efficient than the updates to the base table unless you're updating data in table order. Of course there's downsides to our MVCC as well; the cost of index scans is just one. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Wed, Jan 18, 2006 at 04:02:45PM -0500, Jonah H. Harris wrote: > David, > > You can find some of this discussion in "Much Ado About COUNT(*)". Related > to that discussion, I had written a patch which added visibility information > to the indexes. > > If you're interested in the patch and/or consulting, contact me offline. Does the patch change all indexes across the board? Do you have any performance numbers? I suspect that in some situations storing visibility info in the index would be a big win; if that's the case it would be very good if there was an option that allowed it. Perhaps this could be done using a different index access method... -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Simon Riggs <simon@2ndquadrant.com> writes:
> You might want to consider the thought of "organised heaps" as an
> alternative thought to index improvements. That way there is no heap to
> avoid visiting because the index is also the main data structure.
> This would offer performance, but would be one of the largest patches
> seen in recent times. You may find some co-backers.
Either way it would be a pretty monstrous patch :-( ... in this case
because of the amount of code that knows about the properties of heap
storage, and in what David is thinking about because of the implications
of trying to keep multiple copies of tuple state up-to-date.
We'd probably end up with a cleaner system structure if we tried to
create an API separating out the knowledge of heap structure, but the
amount of work needed seems out of proportion to the benefit.
It might be possible to compromise though.  Imagine an index that
contains only the upper levels of a search tree --- links to what
would be the leaf level point into the associated heap.  In this design
the heap is still a heap in the sense that you can seqscan it without
any awareness of the index structure.  What you can't do is insert
tuples or move them around without the index AM's say-so.
RelationGetBufferForTuple would become an index AM call, but otherwise
I think the impact on existing code wouldn't be large.
There are some limitations.  For instance I don't think that the index
AM could control the order of items within a heap page, because of the
need for TIDs to be persistent; so within-page searches would still be
kinda slow.  But it's interesting to think about.
        regards, tom lane
			
		On Wed, 2006-01-18 at 18:27 -0500, Tom Lane wrote: > Imagine an index that > contains only the upper levels of a search tree --- links to what > would be the leaf level point into the associated heap. In this > design > the heap is still a heap in the sense that you can seqscan it without > any awareness of the index structure. What you can't do is insert > tuples or move them around without the index AM's say-so. > RelationGetBufferForTuple would become an index AM call, but otherwise > I think the impact on existing code wouldn't be large. Eureka! I had been thinking of a "block level index" which sounds almost the same thing (as opposed to the row level indexes we have now). We only need to index the row with the lowest value on any page so the main index would get 100 times smaller. The main part of the index would not need to be written to except when a block overflows. I had imagined an ordering within a block to allow fast uniqueness checks, but it would be pretty fast either way. Merge joins with the same index become block-level joins without sorts. We would just do an individual block sort before merging, so no need for very large sort-merges. Even if the block level indexes differ, we only need to sort one of the tables. Hopefully we could avoid trying to support GIST-heaps? Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> Hopefully we could avoid trying to support GIST-heaps?
Well, that would be an extra index AM that someone might or might not
get around to writing someday.  I was thinking that both btree and hash
index AMs might be interesting for this, though.  Hash in particular
would adapt pretty trivially ...
        regards, tom lane
			
		Simon Riggs <simon@2ndquadrant.com> writes:
> We only need to index the row with the lowest value on any page so the main
> index would get 100 times smaller. The main part of the index would not
> need to be written to except when a block overflows.
BTW, the above is equivalent to saying that the leaf-level index pages
aren't there: the downlink pointers on the level-1 index pages are
pointers to heap pages, instead, and you're right that they effectively
only index the lowest value per page (actually IIRC the highest value
per page, but same difference).
I think the "100x" figure is overoptimistic though.  There will be a lot
fewer entries per leaf page because actual heap tuples will be a lot
larger than index entries (typically at least).  Hence, you need more
level-1 entries and so the upper index levels are bigger than in a
simple index.  Another point is that the heap will be somewhat bloated
compared to a simple heap because of containing more unused space.
The traditional rule-of-thumb is that a btree index is only about 2/3rds
full at steady state, and I suppose this would apply to a
btree-organized heap too.
Still, it seems like an idea worth investigating.
> Merge joins with the same index become block-level joins without sorts.
> We would just do an individual block sort before merging, so no need for
> very large sort-merges. Even if the block level indexes differ, we only
> need to sort one of the tables.
I'd phrase that a little differently: an indexscan on such an index
would normally deliver unordered output, but you could demand ordered
output and get it by doing successive one-page sorts.  I doubt it's
worth inventing a new layer of mergejoin code to do this rather than
keeping it at the index access level.
Come to think of it, the idea also seems to map nicely into bitmap index
scans: the index will directly hand back a list of potential pages to
look at, but they are all marked "lossy" because the index doesn't know
exactly which tuple(s) on the target pages match the query.  The
existing bitmap-heap-scan code can take it from there.
        regards, tom lane
			
		> Oracle does, but you pay in other ways. Instead of keeping dead tuples > in the main heap, they shuffle them off to an 'undo log'. This has some > downsides: > > Rollbacks take *forever*, though this usually isn't much of an issue > unless you need to abort a really big transaction. It's a good point though. Surely a database should be optimised for the most common operation - commits, rather than rollbacks? Chris
On Thu, Jan 19, 2006 at 09:18:55AM +0800, Christopher Kings-Lynne wrote: > >Oracle does, but you pay in other ways. Instead of keeping dead tuples > >in the main heap, they shuffle them off to an 'undo log'. This has some > >downsides: > > > >Rollbacks take *forever*, though this usually isn't much of an issue > >unless you need to abort a really big transaction. > > It's a good point though. Surely a database should be optimised for the > most common operation - commits, rather than rollbacks? Generally true, but keep in mind this counter-argument... our MVCC performs fewer disk writes (since generally you can find some free space on the page you're modifying) and you can control when you take the hit of cleaning up dead space. In fact, you can take that hit at a reduced priority (vacuum_cost_*). -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>> Oracle does, but you pay in other ways. Instead of keeping dead tuples
>> in the main heap, they shuffle them off to an 'undo log'. This has some
>> downsides:
>> Rollbacks take *forever*, though this usually isn't much of an issue
>> unless you need to abort a really big transaction.
> It's a good point though.  Surely a database should be optimised for the 
> most common operation - commits, rather than rollbacks?
The "shuffling off" of the data is expensive in itself, so I'm not sure
you can argue that the Oracle way is more optimal for commits either.
        regards, tom lane
			
		On Wed, Jan 18, 2006 at 08:13:59PM -0500, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > We only need to index the row with the lowest value on any page so the main > > index would get 100 times smaller. The main part of the index would not > > need to be written to except when a block overflows. > > BTW, the above is equivalent to saying that the leaf-level index pages > aren't there: the downlink pointers on the level-1 index pages are > pointers to heap pages, instead, and you're right that they effectively > only index the lowest value per page (actually IIRC the highest value > per page, but same difference). Would this open the door for allowing tables to be maintained in CLUSTER order (at least at the block level if not within the blocks)? Though I have no idea how you'd handle page splits without a lot of pain, but perhaps it would be possible to strive for a certain tuple ordering that would allow for a periodic re-cluster that doesn't have to move a lot of data. One thought is to strive for the same amount of free space on each page, so if you're touching a tuple on a page that has less than desired free space you move it's new version to either the next or previous page. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Would this open the door for allowing tables to be maintained in CLUSTER
> order (at least at the block level if not within the blocks)? Though I
> have no idea how you'd handle page splits without a lot of pain
I think the way you'd attack that is by building the table with a pretty
low fill factor, so that there's room on each page for a number of
updates before you have to split.  Since the index AM is going to be
dictating space allocation, this is all in its hands.
The existing CLUSTER code would probably be totally inapplicable to
this sort of organization --- we'd have to provide some alternate code
path for index-organized heaps.
        regards, tom lane
			
		On Thu, 2006-01-19 at 09:18 +0800, Christopher Kings-Lynne wrote: > > Oracle does, but you pay in other ways. Instead of keeping dead tuples > > in the main heap, they shuffle them off to an 'undo log'. This has some > > downsides: > > > > Rollbacks take *forever*, though this usually isn't much of an issue > > unless you need to abort a really big transaction. > > It's a good point though. Surely a database should be optimised for the > most common operation Yes. > - commits, rather than rollbacks? Commits are most common because most databases are optimized for them. Lots of programs go through a ton pre-checking to avoid a rollback that they don't need to do under PostgreSQL. I've found that for small systems I tend to rely very heavily on frequent vacuums and database level exceptions for virtually all data checking. Rollbacks are nearly as common as commits in those environments if not more-so. --
Tom Lane <tgl@sss.pgh.pa.us> writes: > Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: > >> Oracle does, but you pay in other ways. Instead of keeping dead tuples > >> in the main heap, they shuffle them off to an 'undo log'. This has some > >> downsides: > >> Rollbacks take *forever*, though this usually isn't much of an issue > >> unless you need to abort a really big transaction. > > > It's a good point though. Surely a database should be optimised for the > > most common operation - commits, rather than rollbacks? > > The "shuffling off" of the data is expensive in itself, so I'm not sure > you can argue that the Oracle way is more optimal for commits either. You pay in Oracle when you read these records too. If there are pending updates you have to do a second read to the rollback segment to get the old record. This hits long-running batch queries especially hard since by the time they finish a large number of the records they're reading could have been updated and require a second read to the rollback segments. You also pay if the new value is too big to fit in the same space as the old record. Then you get to have to follow a pointer to the new location. Oracle tries to minimize that by intentionally leaving extra free space but that has costs too. And lastly rollback segments are of limited size. No matter how big you make them there's always the risk that a long running query will take long enough that data it needs will have expired from the rollback segments. Oh, and note that optimizing for the common case has limits. Rollbacks may be rare but one of the cases where they are effectively happening is on recovery after a crash. And that's one process you *really* don't want to take longer than necessary... -- greg
David Scott <davids@apptechsys.com> writes: > Since I am sure everyone is tired of the intro by now, I'll get to the > questions: ... > Is there any way to modify PostgreSQL to allow index lookups without heap > validation that doesn't involve re-writing the MVCC implementation of > keeping dead rows on the live table? Is the additional overhead of keeping > full tuple visibility information inside of the index so odious to the > Postgres community as to prevent a patch with this solution from being > applied back to the head? The consequences of full visibility information in indexes would indeed be pretty odious. However the general gist the conversation led last time it came up had what sounded like a feasible compromise: Keep a very compact bitmap outside the table (not attached to any single index) with one bit per tuple indicating whether the tuple was known to be visible to every transaction. The hope being this bitmap would be small enough to sit in memory pretty much permanently. Even if not then it should be much smaller than the table and impose a pretty small i/o overhead. If most of the records in the table are old records that are visible to every transaction then the index scan would be able to avoid reading in pages of the heap. "Most" would have to be a pretty big percentage though since even a single tuple with unknown visibility would have to be read in. The bitmap would be useful for vacuum too. Any page that contained only tuples with known visibility could be skipped. That would mean running vacuum for extremely large tables that have only moderate activity wouldn't have to scan all those static pages. (There could be an issue with people whose FSM can't track all the free space but expect it to be found on subsequent vacuums, but details details.) I wonder if the bitmap can actually be one bit per page actually. A single update has to set the bit for the tuple, and that will make the whole page have to be read in for both vacuum and index lookups. Only a vacuum will be able to verify that all the tuples in the page are known-visible and index entries have been cleaned up, and the vacuum is going to be operating on the whole page anyways. A one-bit-per-page bitmap will easily fit in RAM even for very large tables. -- greg
Greg Stark <gsstark@mit.edu> writes:
> You pay in Oracle when you read these records too. If there are pending
> updates you have to do a second read to the rollback segment to get the old
> record. This hits long-running batch queries especially hard since by the time
> they finish a large number of the records they're reading could have been
> updated and require a second read to the rollback segments.
If not third or fourth read, by the time you've found the version you're
supposed to be able to see.
I recall discussing this several years ago with somebody who knew quite
a bit about Oracle innards (though he didn't say how he knew...)
According to him, heavy accesses to the rollback segments have another
problem, which is contention for ownership of locks protecting access
to the rollback segments.  I got the impression that it would be like
us needing to take the WALWriteLock anytime we wanted to look at any
not-the-very-latest row version --- there's plenty of write traffic
that needs that lock, and you don't want to load it down with read
traffic too.
        regards, tom lane
			
		Greg Stark <gsstark@mit.edu> writes:
> I wonder if the bitmap can actually be one bit per page actually.
Yeah, I think we'd agreed that per-page was the way to go.  Per-tuple
bitmaps are painful to manage because of the variable number of tuples
per page.  And really all you need to know is whether to read the page
or not --- once you have, examining multiple tuples on it doesn't cost
much.
        regards, tom lane
			
		
			
				As an Oracle internals person myself, I don't see how making a comparison between the specifics of Oracle's MVCC to PostgreSQL's MVCC is relevant to this discussion.
As does *MOST* other commercial databases, Oracle's storage manager performs an update-in-place whereas PostgreSQL's (for the most part) does not. There are several ways to implement update-in-place, and Oracle has chosen their own rollback segment methodology which has issues that without tuning, are major hassles. I'm not saying that one is better than the other in ALL cases, but I and many other Oracle consultants have tuned Oracle installations to eliminate the headaches others in this list have mentioned. Any knowledgable Oracle person evaluating PostgreSQL that may be reading this list is just going to see it as a lot of anti-Oracle discussion with no basis in fact.
Regardless, there is NO WAY to perform an apples-to-apples comparison between the implementations, locking strategies, etc. as the MVCC implementations and goals are completely different. In short, Oracle's implementation is not perfect; neither is ours. Oracle's initial design (as a commercial database) is much different than PostgreSQL's (as a research database).
While I'm always game for looking at other implementations when designing and discussing new features for PostgreSQL, let's focus on making PostgreSQL better rather than spending time discussing unrealistic comparisons.
If we want to do a comparison on the how/why Oracle's index implementation is faster in the context of this situation and how we could make PostgreSQL's faster, let's stick to that.
			
		
		
	As does *MOST* other commercial databases, Oracle's storage manager performs an update-in-place whereas PostgreSQL's (for the most part) does not. There are several ways to implement update-in-place, and Oracle has chosen their own rollback segment methodology which has issues that without tuning, are major hassles. I'm not saying that one is better than the other in ALL cases, but I and many other Oracle consultants have tuned Oracle installations to eliminate the headaches others in this list have mentioned. Any knowledgable Oracle person evaluating PostgreSQL that may be reading this list is just going to see it as a lot of anti-Oracle discussion with no basis in fact.
Regardless, there is NO WAY to perform an apples-to-apples comparison between the implementations, locking strategies, etc. as the MVCC implementations and goals are completely different. In short, Oracle's implementation is not perfect; neither is ours. Oracle's initial design (as a commercial database) is much different than PostgreSQL's (as a research database).
While I'm always game for looking at other implementations when designing and discussing new features for PostgreSQL, let's focus on making PostgreSQL better rather than spending time discussing unrealistic comparisons.
If we want to do a comparison on the how/why Oracle's index implementation is faster in the context of this situation and how we could make PostgreSQL's faster, let's stick to that.
 On 1/19/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Greg Stark <gsstark@mit.edu> writes:
> I wonder if the bitmap can actually be one bit per page actually.
Yeah, I think we'd agreed that per-page was the way to go. Per-tuple
bitmaps are painful to manage because of the variable number of tuples
per page. And really all you need to know is whether to read the page
or not --- once you have, examining multiple tuples on it doesn't cost
much.
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
On Thu, Jan 19, 2006 at 01:56:51AM -0500, Greg Stark wrote: > Tom Lane <tgl@sss.pgh.pa.us> writes: > > > Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: > > >> Oracle does, but you pay in other ways. Instead of keeping dead tuples > > >> in the main heap, they shuffle them off to an 'undo log'. This has some > > >> downsides: > > >> Rollbacks take *forever*, though this usually isn't much of an issue > > >> unless you need to abort a really big transaction. > > > > > It's a good point though. Surely a database should be optimised for the > > > most common operation - commits, rather than rollbacks? > > > > The "shuffling off" of the data is expensive in itself, so I'm not sure > > you can argue that the Oracle way is more optimal for commits either. > > You pay in Oracle when you read these records too. If there are pending > updates you have to do a second read to the rollback segment to get the old > record. This hits long-running batch queries especially hard since by the time > they finish a large number of the records they're reading could have been > updated and require a second read to the rollback segments. You pay the same cost in PostgreSQL though... If you index-scan to a dead tuple, you get pointed to where the new one is. And if you're seqscanning, well, you'll be reading everything anyway. > You also pay if the new value is too big to fit in the same space as the old > record. Then you get to have to follow a pointer to the new location. Oracle > tries to minimize that by intentionally leaving extra free space but that has > costs too. Again, similar to the cost with our MVCC. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > As an Oracle internals person myself, I don't see how making a comparison > between the specifics of Oracle's MVCC to PostgreSQL's MVCC is relevant to > this discussion. > > As does *MOST* other commercial databases, Oracle's storage manager performs > an update-in-place whereas PostgreSQL's (for the most part) does not. There > are several ways to implement update-in-place, and Oracle has chosen their > own rollback segment methodology which has issues that without tuning, are > major hassles. I'm not saying that one is better than the other in ALL > cases, but I and many other Oracle consultants have tuned Oracle > installations to eliminate the headaches others in this list have > mentioned. Any knowledgable Oracle person evaluating PostgreSQL that may be > reading this list is just going to see it as a lot of anti-Oracle discussion > with no basis in fact. > > Regardless, there is NO WAY to perform an apples-to-apples comparison > between the implementations, locking strategies, etc. as the MVCC > implementations and goals are completely different. ... Well it seems there were lots of facts posted. Yes you can avoid headaches caused by these issues, but we're not really talking about the headaches. We're comparing the performance costs of what are update-in-place and non-update-in-place approach. All of the costs named so far are to some degree fundamental costs of update-in-place. All you can hope to do in tuning a system is make sure the costs are kept within manageable bounds. There are fundamental costs to non-update-in-place as well. The table sizes are bloated by the amount of space used to store older versions and the dead tuples that haven't been reused yet. Whether this slows down Postgres as much as having to do a second (or third or fourth) read to a rollback segment is a valid area for discussion. It's especially interesting to discuss since the two costs hit different sets of queries unequally. > If we want to do a comparison on the how/why Oracle's index implementation > is faster in the context of this situation and how we could make > PostgreSQL's faster, let's stick to that. Well the main difference is the MVCC implementation. Talking about Oracle's index implementation while avoiding mentioning the elephant in the room would be sort of pointless. -- greg
On 19 Jan 2006 11:25:21 -0500, Greg Stark <gsstark@mit.edu> wrote:
Several were mentioned; some of which could generally be avoided by good tuning.
As PostgreSQL is not an update-in-place system, what is the point in discussing the costs? How does this solve David's original problem?
I agree, but again, we're not talking apples-to-apples. There's far too many variables to compare Oracle's speed to PostgreSQL's for most types of operations in the varying types of database deployments.
I agree that Oracle's MVCC plays *a little* into this index discussion, but isn't it pointless to discuss the pitfalls of an MVCC implementation that PostgreSQL does not have? Similarly, how does it solve David's original question.
Again, I'm fine with discussing these things, but let's keep on topic for David's sake. He posted a problem that we have discussed many times over. Let's focus on that problem and give him possible options.
David has stated that the index to heap visibility check is slowing him down, so what are the possible options:
- Visibility in indexes (-hackers archives cover the pros/cons)
- True organized heaps
- Block level index (Tom/Simon's earlier discussion)
			
		Well it seems there were lots of facts posted. Yes you can avoid headaches
caused by these issues, but we're not really talking about the headaches.
Several were mentioned; some of which could generally be avoided by good tuning.
We're comparing the performance costs of what are update-in-place and
non-update-in-place approach.
As PostgreSQL is not an update-in-place system, what is the point in discussing the costs? How does this solve David's original problem?
There are fundamental costs to non-update-in-place as well. The table sizes
are bloated by the amount of space used to store older versions and the dead
tuples that haven't been reused yet. Whether this slows down Postgres as much
as having to do a second (or third or fourth) read to a rollback segment is a
valid area for discussion. It's especially interesting to discuss since the
two costs hit different sets of queries unequally.
I agree, but again, we're not talking apples-to-apples. There's far too many variables to compare Oracle's speed to PostgreSQL's for most types of operations in the varying types of database deployments.
Well the main difference is the MVCC implementation. Talking about Oracle's
index implementation while avoiding mentioning the elephant in the room would
be sort of pointless.
I agree that Oracle's MVCC plays *a little* into this index discussion, but isn't it pointless to discuss the pitfalls of an MVCC implementation that PostgreSQL does not have? Similarly, how does it solve David's original question.
Again, I'm fine with discussing these things, but let's keep on topic for David's sake. He posted a problem that we have discussed many times over. Let's focus on that problem and give him possible options.
David has stated that the index to heap visibility check is slowing him down, so what are the possible options:
- Visibility in indexes (-hackers archives cover the pros/cons)
- True organized heaps
- Block level index (Tom/Simon's earlier discussion)
Jonah, > David has stated that the index to heap visibility check is slowing him > down, so what are the possible options: > > - Visibility in indexes (-hackers archives cover the pros/cons) > - True organized heaps > - Block level index (Tom/Simon's earlier discussion) also - Frozen relations This last solution was proposed as a possibility for the data warehousing case. For a time-partitioned table, we're going to know that all but one of the partitions has not been updated anywhere within visible transaction scope, and therefore index-only access is a possibility. also - join tables One of the other most valuable targets for index-only access is the "many-to-many join table" whose primary key consists of two (or more) foreign keys to two (or more) other tables. It's actually not necessary to check visibility on this kind of table as the visibility of tuples in the join table will be determined by the visibility of tuples in the two data tables. Since often join tables consist *only* of the join key, being able to do index-only access on them could dramatically speed up certian kinds of queries. Both of the above are "corner cases" but are very common ones and might be much easier to implement than the other solutions. --Josh
Thanks for all the help and thought to our problem. Jonah H. Harris wrote: > > David has stated that the index to heap visibility check is slowing > him down, so what are the possible options: > > - Visibility in indexes (-hackers archives cover the pros/cons) > - True organized heaps > - Block level index (Tom/Simon's earlier discussion) > > Several people seem to have a problem with full index visibility, however both last thread and this thread people seem to agree that keeping a page level visibility would be an acceptable compromise. Perhaps we should implement a rough patch and run some time stats to post for everyone to weigh the pros and cons against hard figures? The discussion on block level indexing and true organized heaps, if I am interpreting this topic correctly, would essentially keep the table optimized for accessing the data through one index? This seems like a very interesting idea, but I don't know that it would really solve our current problem because we need to use more then one index to efficiently access a table. We have a very driving need to solve this problem and so don't mind doing legwork to do feasibility testing or stat collection to help the community decide the best solution for Postgres.
Tom Lane wrote: >What sort of problems are you dealing with exactly? There has been >some discussion of changes that would improve certain scenarios. For >instance it might be plausible to do joins using index information and >only go back to the heap for entries that appear to pass the join test. > > > We tried that scenario, writing a "dirty index hack" to experiment with, that returned values whether they were valid or not. We saw some definite improvements inside of joins and sub queries, but we were still slowed down at the end because we still had to validate every row being returned. My hands are very tied as to what specific examples I can send, so I apologize for how long it took to get back to you on this. A simple (generalized) example of one the types of queries we are running: SELECT col1, col2, cool_func(stat_count, COALESCE(raw_counts.raw_count, (SELECT alt_count FROM alt_raw_table WHERE alt_raw_table.pk = col2))) as cool_func FROM (SELECT col1, col2, stat_count FROM pair_table WHERE col1 = $1) pair_table LEFT JOIN raw_counts ON pair_table.col2 = raw_counts.pk We tried not validating the return of both of these as we only want to see the rows which have a high value for cool_func, but it was still necessary to validate the rows which did match the criteria. So we did see an improvement, but not enough.
On Thu, Jan 19, 2006 at 10:19:01AM -0800, Josh Berkus wrote: > One of the other most valuable targets for index-only access is the > "many-to-many join table" whose primary key consists of two (or more) > foreign keys to two (or more) other tables. It's actually not necessary > to check visibility on this kind of table as the visibility of tuples in > the join table will be determined by the visibility of tuples in the two > data tables. Since often join tables consist *only* of the join key, > being able to do index-only access on them could dramatically speed up > certian kinds of queries. How would that handle 'delinking' item A from foobaz 2? (IE: DELETE FROM join_table WHERE id1=231 and id2=24842) The only way I can see this working is if it is required that items in both tables as well as the link in the many-many table are only inserted and deleted in the same transaction, which seems to be really pushing this into corner-case territory. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Thu, Jan 19, 2006 at 10:35:30AM -0800, David Scott wrote: > Tom Lane wrote: > > >What sort of problems are you dealing with exactly? There has been > >some discussion of changes that would improve certain scenarios. For > >instance it might be plausible to do joins using index information and > >only go back to the heap for entries that appear to pass the join test. Do you still have that patch that folks could look at? ISTM that this technique would be rather dependant on your actual workload, and as such could result in a big win for certain types of queries. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Wed, 2006-01-18 at 20:13 -0500, Tom Lane wrote: > Come to think of it, the idea also seems to map nicely into bitmap index > scans: the index will directly hand back a list of potential pages to > look at, but they are all marked "lossy" because the index doesn't know > exactly which tuple(s) on the target pages match the query. The > existing bitmap-heap-scan code can take it from there. Yes, I've privately suggested this solution in that context. I think there is enough meat there to make this topic worth discussing further, but not on list again just yet. Best Regards, Simon Riggs
On Thu, 19 Jan 2006, Jim C. Nasby wrote: > Do you still have that patch that folks could look at? ISTM that this > technique would be rather dependant on your actual workload, and as such > could result in a big win for certain types of queries. It is not a patch, per se. It is a c language function which calls some of the nbtree functions to return things from the index. The syntax for calling it is rather obtuse, since those of us who don't understand the parser are doomed to attempt circumventing it ;P. I tarred up the code, and put it on a web server so that interested parties can play with it. The url is http://linux.apptechsys.com/~jeremyd/postgresql/fakeidxscan.tar.gz It is very hackish, so definately do not assume that it is in any way correct, rather assume the opposite. I have run it on x86 and x86_64 boxes, and it compiles and runs on those. Here is an example of its usage, so you can see the nasty syntax required and perhaps grok how to use it better. create table test_table (a integer, b integer); create index test_table_a_b_idx on test_table (a, b); insert into test_table (a, b) select a, b from generate_series(1,100) a,generate_series(1,100) b; select * from fakeidxrowscomposite( 'test_table', -- relation 'test_table_a_b_idx', -- index 1, --number of scan keys ARRAY[1, 2]::smallint[], -- numbers of the index attributes to return ARRAY[1]::smallint[], -- numbers of the attrs the scankeys apply to ARRAY['=(integer,integer)'::regoperator], -- operators for the scankeys ARRAY[3]::smallint[], -- btree strategy for the scankeys (42,0) -- values for the scankeys to compare against (if there is only -- one, you have to put a fake one in since otherwisethe parser -- does not think it is a record) ) AS (a integer, b integer); -- tell the parser what columns to expect This example returns 100 rows in which the first column contains 42 and the second column contains the numbers between 1 and 100, in order. Feel free to do whatever with this, it's pretty fast for tables where seeks to validate tuples would hurt, but you do get back dead things... -- When you know absolutely nothing about the topic, make your forecast by asking a carefully selected probability sample of 300 others who don't know the answer either. -- Edgar R. Fiedler
On Thu, Jan 19, 2006 at 02:50:39PM -0800, Jeremy Drake wrote: > On Thu, 19 Jan 2006, Jim C. Nasby wrote: > > > Do you still have that patch that folks could look at? ISTM that this > > technique would be rather dependant on your actual workload, and as such > > could result in a big win for certain types of queries. ... > Feel free to do whatever with this, it's pretty fast for tables where > seeks to validate tuples would hurt, but you do get back dead things... How'd you then weed out the dead tuples? Basically, numbers talk. If there were convincing numbers for something that wasn't a corner-case that showed a marked improvement then there'd be much more interest in getting this into the backend in some fashion. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Basically, numbers talk. If there were convincing numbers for something
> that wasn't a corner-case that showed a marked improvement then there'd
> be much more interest in getting this into the backend in some fashion.
I've got no doubt that there are *some* "non corner" cases for which
this would be a win.  The lower the update load on the database, the
better it's going to look.  The issue really is where does it start
being a loss, and can you convince us that those cases are all "corner"
ones?
(The subtext here, of course, is the assumption that it's an
all-or-nothing choice.  I'm of the opinion that supporting both options
would be infeasible from a code complexity and maintenance standpoint;
but a simple patch that did both would of course prove that opinion
wrong ...)
        regards, tom lane
			
		On Thu, 19 Jan 2006, Jim C. Nasby wrote: > > Feel free to do whatever with this, it's pretty fast for tables where > > seeks to validate tuples would hurt, but you do get back dead things... > > How'd you then weed out the dead tuples? I didn't get that far with it. The purpose of this function was to quickly put something together to demonstrate that the overhead of seeking to the proper tuples in the heap to determine their visibility was the main component of the time being spent to satisfy our queries. > Basically, numbers talk. If there were convincing numbers for something > that wasn't a corner-case that showed a marked improvement then there'd > be much more interest in getting this into the backend in some fashion. I could get some numbers of how much time validating tuples adds to a query, but I don't think that that would be horribly novel. BTW, hopefully I did not make you think that I intended to get this into the official backend. This function was only meant to demonstrate to the people around here that the visibility check was the bottleneck we were seeing. The function may also be interesting as a demonstration of how indexes are handled in postgres, as you can see when tuples are flagged as no longer valid and when they are not. I have put xmin into an index so that I could use this function to better visualize when index tuples are left behind (I tried to put xmax in there too, but I never saw them change, after checking the code it turns out that the index is never told about changes in xmax). We were seeing this case: All rows in our table are visible (we are the only transaction on the machine and we did a VACUUM FULL ANALYZE before). We rebooted to ensure no caching. We were seeing times which, upon division by the number of rows returned by the index scan, were remarkably close to the average seek time listed on the specs for the hard drive in the testing box. This was about 5ms, which doesn't sound like much, but given a large enough number of rows and a few joins, 5ms per tuple adds up quickly. This implies that we were seeing approximately the worst case as far as the distribution of the relevant tuples on pages, ie each tuple we wanted was on a different heap page. Digging back to some times we had collected from this experiment, apparently we were taking about 15 to 20 seconds to run a particular query, and when we used the function I previously posted those times were reduced to 5 seconds. This was a while ago, however, so these times are probably not very accurate and we probably made other tweaks to speed things up since then. But it gives an idea. We could come up with more absolute numbers, but I think people already know what they would look like.