Discussion: proposal for smaller indexes on index-ordered tables
The way I read it, the current btree index stores the index value and the TID of every tuple having that value. When you have a table with three columns, you index one of them and you get an index which is practically as large as the table itself.

Supposing the table is generally or strictly ordered by the column to be indexed, it would be more compact if the index stored ranges of tuples. Instead of storing the TID of every tuple with that value, the index would store a first and last TID, between which all tuples have the value.

Example: table with one million rows indexed on a column having one thousand distinct values. Table is in-order by the indexed column. The traditional index would contain a million TIDs, whereas a range index would contain only two thousand. The range index would be 500 times smaller, more likely to be cached, etc.

Thoughts?

-jwb
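The arithmetic in the proposal can be sketched in a few lines of Python. This is a toy model, not PostgreSQL internals: each index entry becomes a (value, first TID, last TID) triple, so a million ordered rows over a thousand distinct values collapse to a thousand entries holding two thousand TIDs.

```python
# Toy model of the proposed range index. Assumes the table is already
# ordered by the indexed column, so each distinct value occupies one
# contiguous run of TIDs.

def build_range_index(values):
    """values[tid] is the indexed column of the tuple at TID tid."""
    ranges = []
    for tid, v in enumerate(values):
        if ranges and ranges[-1][0] == v:
            ranges[-1] = (v, ranges[-1][1], tid)   # extend the current range
        else:
            ranges.append((v, tid, tid))           # start a new range
    return ranges

# One million rows, one thousand distinct values, table in-order:
values = [i // 1000 for i in range(1_000_000)]
idx = build_range_index(values)
print(len(idx))   # 1000 entries (2000 TIDs) instead of 1,000,000 TIDs
```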
On Tue, Jun 24, 2008 at 4:34 PM, Jeffrey Baker <jwbaker@gmail.com> wrote:
> Supposing the table is generally or strictly ordered by the column to be
> indexed, it would be more compact if the index stored ranges of tuples.
> Instead of storing the TID of every tuple with that value, the index would
> store a first and last TID, between which all tuples have the value.

There are several databases which implement this idea. Unfortunately, Postgres does not yet ensure that indexed tables remain indexed. As such, an index such as this would soon be ineffective. IIRC, Heikki has done some work on keeping clustered tables clustered, but it hasn't yet made it into core.

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com
Edison, NJ 08837 | http://www.enterprisedb.com/
Jeffrey Baker wrote:
> The way I read it, the current btree index stores the index value and
> the TID of every tuple having that value. When you have a table with
> three columns, you index one of them and you get an index which is
> practically as large as the table itself.
>
> Supposing the table is generally or strictly ordered by the column to
> be indexed, it would be more compact if the index stored ranges of
> tuples. Instead of storing the TID of every tuple with that value,
> the index would store a first and last TID, between which all tuples
> have the value.
>
> Example: table with one million rows indexed on a column having one
> thousand distinct values. Table is in-order by the indexed column.
> The traditional index would contain a million TIDs, whereas a range
> index would contain only two thousand. The range index would be 500
> times smaller, more likely to be cached, etc.
>
> Thoughts?
>
> -jwb

An example against your theory: one (not yet committed) transaction changes a tuple that was in the middle of a range, such that the tuple's indexed column changed. What would you do? You need to keep track of multiple index versions:

1. the range has to be split for the not-yet-committed modifying transaction, which might need to re-read the same table;
2. the old range has to be kept for reader transactions that still see the old data.

Imagine you have thousands of UPDATEs in flight on different rows. Alternatively, you introduce "readers have to wait for writers" and "updaters have to wait for other updaters on the same range" locks, which the MVCC implementation nicely avoids. Have a look at MaxDB once; you'll appreciate PostgreSQL afterwards. MaxDB stores table tuples in primary-key order, using a balanced btree for that. This means slower INSERTs and UPDATEs and decreased concurrency compared to PostgreSQL.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/
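Zoltán's objection can be made concrete with a toy model (illustrative Python, not PostgreSQL's MVCC machinery): once one TID in the middle of a range changes its key, the single (value, first, last) entry must be split, and the pre-split entry must survive for older snapshots.

```python
# Toy illustration: updating the indexed key of one tuple in the middle
# of a range forces the range to split into two pieces (the updated
# tuple lands elsewhere), while readers with an older snapshot still
# need the unsplit version.

def split_range(rng, tid):
    """Split (value, first, last) around tid, whose key changed."""
    value, first, last = rng
    pieces = []
    if first <= tid - 1:
        pieces.append((value, first, tid - 1))
    if tid + 1 <= last:
        pieces.append((value, tid + 1, last))
    return pieces

old_version = (42, 100, 199)             # must be kept for older snapshots
new_version = split_range(old_version, 150)
print(new_version)   # [(42, 100, 149), (42, 151, 199)]
```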
On Tue, Jun 24, 2008 at 1:59 PM, Zoltan Boszormenyi <zb@cybertec.at> wrote:
Insert the new tuple at the end of the table and add another range to the index. Leave the old tuple in place and don't touch the original index range.
This is only true if you update the tuple in-place.
I'm quite aware of the problems of maintaining such a table and index, but the fact is that data warehouse type tables may never be updated after being created. The particular application I'm struggling with does a SELECT ... INTO ... ORDER BY to make an ordered table for querying every night. The problem is it takes longer, much longer, to create the index than to create the table, and in the end the index is as big as half the table anyway.
So this type of index would only be useful for an essentially read-only table. I agree.
Quite another proposal would be to somehow instruct the database that the table is strictly in-order by a column and allow a binary search access method. Then you don't need any index at all.
-jwb
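The binary-search access method Jeffrey sketches above could look like this in miniature (plain Python, purely illustrative): the table is known to be strictly ordered by one column, so an equality probe can binary-search the rows directly with no index at all.

```python
# Miniature of the "no index at all" idea: the rows are kept strictly
# ordered by the first field, so an equality probe is two binary
# searches over the table itself.

from bisect import bisect_left, bisect_right

rows = sorted((i % 10, i) for i in range(100))   # table ordered by first field
keys = [r[0] for r in rows]                      # the ordered column

def lookup(key):
    lo = bisect_left(keys, key)    # O(log n) comparisons, no separate index
    hi = bisect_right(keys, key)
    return rows[lo:hi]

matches = lookup(7)
print(len(matches))   # 10 rows, all with first field == 7
```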
Jeffrey Baker wrote:
> On Tue, Jun 24, 2008 at 1:59 PM, Zoltan Boszormenyi <zb@cybertec.at> wrote:
>> Jeffrey Baker wrote:
>>> The way I read it, the current btree index stores the index value and
>>> the TID of every tuple having that value. When you have a table with
>>> three columns, you index one of them and you get an index which is
>>> practically as large as the table itself.
>>>
>>> Supposing the table is generally or strictly ordered by the column to
>>> be indexed, it would be more compact if the index stored ranges of
>>> tuples. Instead of storing the TID of every tuple with that value,
>>> the index would store a first and last TID, between which all tuples
>>> have the value.
>>>
>>> Example: table with one million rows indexed on a column having one
>>> thousand distinct values. Table is in-order by the indexed column.
>>> The traditional index would contain a million TIDs, whereas a range
>>> index would contain only two thousand. The range index would be 500
>>> times smaller, more likely to be cached, etc.
>>>
>>> Thoughts?
>>>
>>> -jwb
>>
>> Example with your theory:
>> One (not yet committed) transaction changes one tuple
>> that was in the middle of a range before but the tuple's indexed
>> column changed. What would you do?
>
> Insert the new tuple at the end of the table and add another range to
> the index. Leave the old tuple in place and don't touch the original
> index range.

This is what I described below, but I only mentioned the index part:

>> You need to keep track of multiple index versions:
>> 1. the range has to be split for the not-yet-committed modifier
>> transaction, it might need to re-read the same table.
>> 2. the old range has to be kept for reader transactions that still see
>> the old data
>
> This is only true if you update the tuple in-place.

Why? If you update in-place then the above is not needed. You just need to serialize transactions, but there goes concurrency.

>> Imagine you have thousands of UPDATEs in flight on different rows.
>
> I'm quite aware of the problems of maintaining such a table and index,
> but the fact is that data warehouse type tables may never be updated
> after being created. The particular application I'm struggling with
> does a SELECT ... INTO ... ORDER BY to make an ordered table for
> querying every night. The problem is it takes longer, much longer, to
> create the index than to create the table, and in the end the index is
> as big as half the table anyway.
>
> So this type of index would only be useful for an essentially
> read-only table. I agree.
>
> Quite another proposal would be to somehow instruct the database that
> the table is strictly in-order by a column and allow a binary search
> access method. Then you don't need any index at all.

CLUSTER tablename USING indexname;

It's useful for large tables that change little.

> -jwb

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/
"Jeffrey Baker" <jwbaker@gmail.com> writes:
> I'm quite aware of the problems of maintaining such a table and index, but
> the fact is that data warehouse type tables may never be updated after being
> created. The particular application I'm struggling with does a SELECT ...
> INTO ... ORDER BY to make an ordered table for querying every night. The
> problem is it takes longer, much longer, to create the index than to create
> the table, and in the end the index is as big as half the table anyway.

There's something wrong with that: sorting the table rows surely ought to take longer than sorting the same number of (smaller) index entries. Have you done any profiling to find out what the problem is? Perhaps there's something wrong with the setting of maintenance_work_mem (vs. work_mem).

regards, tom lane
On Tue, Jun 24, 2008 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
For this query, work_mem is 100MB and maintenance_work_mem is 1GB, on a system with 8GB of memory. Notably I just installed a new storage subsystem and upgraded to 8.3.1 less than a week ago, so my experience with this instance is somewhat limited. Creating the table in this case takes half an hour and then indexing it requires almost an hour. Subsequently analyzing the table takes less than a minute, with statistics set to maximum.
Query performance is excellent. I was just brainstorming on ways to save time on the creation.
-jwb
"Jeffrey Baker" <jwbaker@gmail.com> writes:
> For this query, work_mem is 100MB and maintenance_work_mem is 1GB, on a
> system with 8GB of memory. Notably I just installed a new storage subsystem
> and upgraded to 8.3.1 less than a week ago, so my experience with this
> instance is somewhat limited. Creating the table in this case takes half an
> hour and then indexing it requires almost an hour.

These numbers seem to me to be pretty strong evidence that maintenance_work_mem = 1GB is a mistake. Try it at 100MB and then some intermediate values.

Now, *why* it is a mistake is interesting to speculate about, but let's confirm the theory first.

regards, tom lane
>>> On Tue, Jun 24, 2008 at 4:54 PM, in message <7020.1214344479@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Jeffrey Baker" <jwbaker@gmail.com> writes:
>> Creating the table in this case takes half an
>> hour and then indexing it requires almost an hour.
>
> These numbers seem to me to be pretty strong evidence that
> maintenance_work_mem = 1GB is a mistake. Try it at 100MB and then some
> intermediate values.
>
> Now, *why* it is a mistake is interesting to speculate about, but
> let's confirm the theory first.

Could this be related to hint bit rewrites during indexing? Would a vacuum between creation and indexing be a good way to tell?

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Now, *why* it is a mistake is interesting to speculate about, but
>> let's confirm the theory first.
>
> Could this be related to hint bit rewrites during indexing?

If so, changing maintenance_work_mem won't improve the situation. What I personally suspect is that Jeff's index build is swapping like crazy, or else there's just some problem in the sort code for such a large sort arena. But let's get some evidence about how the index build time varies with maintenance_work_mem before jumping to conclusions.

> Would a vacuum between creation and indexing be a good way to tell?

Yeah, that might be a useful experiment to try too. It wouldn't improve the overall time AFAICS, but it would give us some idea how much of the indexing time was really spent on hintbits.

regards, tom lane
On Tue, Jun 24, 2008 at 3:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Well, it definitely isn't that, because the machine doesn't even have a swap area defined. vmstat during the table creation and the index creation looks quite different: during the table sort there's heavy r/w traffic of 12-20MB/s, while during the index creation it's lower. But the build seems to be CPU limited (i.e. one CPU is maxed out the whole time, and iowait is not very high).
I guess nobody has any interest in my proposal, only in the departure of my described experience from expected behavior :-(
On Tue, Jun 24, 2008 at 3:50 PM, Jonah H. Harris <jonah.harris@gmail.com> wrote:
> On Tue, Jun 24, 2008 at 4:34 PM, Jeffrey Baker <jwbaker@gmail.com> wrote:
>> Supposing the table is generally or strictly ordered by the column to be
>> indexed, it would be more compact if the index stored ranges of tuples.
>> Instead of storing the TID of every tuple with that value, the index would
>> store a first and last TID, between which all tuples have the value.
>
> There are several databases which implement this idea. Unfortunately,
> Postgres does not yet ensure that indexed tables remain indexed.

Just for the record, you mean *ordered* tables, don't you? Postgres does not yet ensure that ordered tables remain ordered.

--
regards,
Jaime Casanova
PostgreSQL support and training
Guayaquil - Ecuador
Cel. (593) 87171157
"Jeffrey Baker" <jwbaker@gmail.com> writes:
> I guess nobody has any interest in my proposal, only in the departure of my
> described experience from expected behavior :-(

Well, we certainly should try to understand the unexpected behavior in detail before we consider solutions. Per Sir A.C. Doyle, it is a capital mistake to theorize in advance of the data.

(It's probably also worth noting that this community's historical interest has not been in read-only or even read-mostly data. We'd not be willing to pay all that MVCC overhead if we thought we were just a warehouse of static data. If that's what you want, maybe you need some other DBMS.)

regards, tom lane
Jeffrey Baker wrote:
> The way I read it, the current btree index stores the index value and the
> TID of every tuple having that value. When you have a table with three
> columns, you index one of them and you get an index which is practically as
> large as the table itself.
>
> Supposing the table is generally or strictly ordered by the column to be
> indexed, it would be more compact if the index stored ranges of tuples.
> Instead of storing the TID of every tuple with that value, the index would
> store a first and last TID, between which all tuples have the value.

Search the archives for the Grouped Index Tuples patch, also known as the Clustered Indexes patch, which I worked on in spring 2007. It did almost exactly that. It didn't make it into 8.3 for various reasons: the patch was quite invasive, the design and performance characteristics were not well understood by fellow hackers, and there was not that much interest in it back then.

I'm not working on it or planning to work on it for now, but if you're interested, the patch is still out there. It requires a lot of work, but the design is still viable. I'm certainly willing to help with it if someone wants to pick it up.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com