Обсуждение: Possibility of Index-Only access in PostgreSQL?

Поиск
Список
Период
Сортировка

Possibility of Index-Only access in PostgreSQL?

От
Gunther Schadow
Дата:
Hello,

we have a simple query here that uses a table such as Foo(id*, bar, baz)
were id is a primary key and then we have an additional index on
Foo(bar, baz). Now, we have a simple query such as this:

SELECT bar, baz FROM Foo WHERE bar='HIT';

The Foo table has about 25 million rows and the above query selects 35000
rows.

It takes 10 minutes, that's the problem.

Yes, it's doing an index scan according to EXPLAIN, and yes VACUUM
ANALYZE is done, the machine is decent (dual 1 GHz i*86 processor,
2 GB RAM doing software RAID 0-1 under Linux.) The shared memory
buffer is 512 MB and there is no significant swapping activity.
The index and the table are on the same spindle, yes, but that's
nothing I can change right now.

I found that this above query runs in about 8 second if it's a rerun.
Presumably our large cache then contains most of the index if not also
the data rows. Apparently I am stuck in an I/O bottleneck.

So, what can I do? Again I can't touch the hardware for now. Besides,
we are running this same query on Oracle (on an already aging Alpha
mini) and it consistently runs in 17 seconds. May be there is also
caching involved, but there is something Oracle does better about it.

Two things come into mind:

- Oracle is content with data read from an index if that is all that's  needed. Indeed, I only need bar and baz and
thoseare in the index.  The benefits of not having to seek the scattered rows from the data  table saves 35000 back and
forthhead movements / seeks.
 

- Even if we had to go to the data rows, could it be that if a good  big chunk of the index would be read ahead and
bufferedin one piece  and then iterated over to retrieve the data rows without making the  r/w head jump back and forth
toand from the index would be good.
 
  Is PostgreSQL doing this index read-ahead? Is there a tuning variable  I failed to set to make it read-ahead more?

I presume that PostgreSQL does good reading-ahead of the index, right?

But I heard it could not do index-only reads. Now, since I have several
of these problems (each time with different indexes) and our Oracle side
of the project uses index-only reads a lot, I would be very eager to have
index-only reads in PostgreSQL.

Naively it doesn't sound like a hard problem to implement in PostgreSQL,
does it? All we would need to do is get the data off the index that we
already read, and not do the row lookup. Probably the optimizer has
to make different plans here, so the main work will probably have to be
in the optimizer code. And in addition it might happen that some lower
level constraint exists in the executor who need to go to the rows for
some reason. In the worst case the index does not contain all data but
only just enough of a portion that distinguishes rows. That would then
require to rewrite all index code to have complete data in indexes.
But I'm only naively conjecturing.

So, please advise, what are the real barriers to this kind of
optimization? Is there anything I could do to help the progress on
this issue. (I'm generally capable of hacking certain features into
existing code if there isn't some subtle logic involved that I do not 
understand, because it isn't documented.)

I appreciate your help,
thank you,
-Gunther


-- 
Gunther Schadow, M.D., Ph.D.                    gschadow@regenstrief.org
Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org






Re: Possibility of Index-Only access in PostgreSQL?

От
Tom Lane
Дата:
Gunther Schadow <gunther@aurora.regenstrief.org> writes:
> I would be very eager to have
> index-only reads in PostgreSQL.

It won't happen.  See the many, many prior discussions of this point in
the archives.

Something we perhaps *could* do is to "batch" index reads: pull a bunch
of TIDs from the index, sort these by page number, and then access the
corresponding heap tuples in page number order.  Unfortunately you'd
probably need to batch some thousands of TIDs to really improve the
locality of reference this way, and that would mean holding pins on
quite a few index pages, which would complicate the bookkeeping and
probably create a severe concurrency hit (if not actually introduce a
risk of deadlock; I haven't thought hard enough about it to be sure).
We could avoid these potential downsides by only sorting one index
page's worth of TIDs at a time, but most likely that doesn't buy enough
locality of reference to be worth the trouble.

Still it seems like a useful avenue to investigate.  If you want to
pursue it further, this is not the list to be discussing it on;
pgsql-hackers is the place for such discussions.
        regards, tom lane




Re: Possibility of Index-Only access in PostgreSQL?

От
"Christopher Kings-Lynne"
Дата:
> - Oracle is content with data read from an index if that is all that's
>    needed. Indeed, I only need bar and baz and those are in the index.
>    The benefits of not having to seek the scattered rows from the data
>    table saves 35000 back and forth head movements / seeks.

Postgres can't use the data from its indices, because of the MVCC mechanism.
Maybe you should try CLUSTERing your table to improve performance???

Chris






Re: Possibility of Index-Only access in PostgreSQL?

От
Gunther Schadow
Дата:
Tom Lane wrote:

> It won't happen.  See the many, many prior discussions of this point in
> the archives.


Hmm, I searched the archives both in my local mailbox and on Google,
could you give me some pointer to a relevant thread? I certainly
want to learn from past discussions.

So, it's that MVCC issue? I will then pursue distributing indexes
and tables over more spindles and will also consider CLUSTERing the
table (although that's difficult for two *different* use cases I
have.)

However, when everything is said and done, the "it won't happen"
still sounds frightening to me. It puts PostgreSQL at a BIG disadvantage,
limiting it to 150 records per second random access ceiling when the
data needed is really all available for sequential index scan.

I'm eager to learn more about it and may be put versioning into the
index rows (or whatever it is.) Who am I to doubt Tom Lane's words
"it won't happen"? But isn't this a very serious conclusion? In any
of this we have that challenge/competition with Oracle in mind, want
it or not. If we want to use PostgreSQL seriously for our medical
record, an "it won't happen" answer can make a huge negative impression
on those I am working with. PostgreSQL for our medical record system
-- it won't happen!? ... Heck no! This can't be the answer!

regards
-Gunther

-- 
Gunther Schadow, M.D., Ph.D.                    gschadow@regenstrief.org
Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org