Обсуждение: Incremental clustering?
The docs say: "Clustering is a one-time operation: when the table is subsequently updated, the changes are not clustered. That is, no attempt is made to store new or updated rows according to their index order. If one wishes, one can periodically recluster by issuing the command again." and "During the cluster operation, a temporary copy of the table is created that contains the table data in the index order. Temporary copies of each index on the table are created as well. Therefore, you need free space on disk at least equal to the sum of the table size and the index sizes." I don't relish totally locking and making a temporary copy of my biggest table, but that's exactly the table that seems to need clustering the most. Will subsequent cluster command also make a complete copy? Some form of "incremental clustering" would be nice... -John
A long time ago, in a galaxy far, far away, siracusa@mindspring.com (John Siracusa) wrote: > The docs say: > > "Clustering is a one-time operation: when the table is subsequently > updated, the changes are not clustered. That is, no attempt is made > to store new or updated rows according to their index order. If one > wishes, one can periodically recluster by issuing the command > again." > > and > > "During the cluster operation, a temporary copy of the table is > created that contains the table data in the index order. Temporary > copies of each index on the table are created as well. Therefore, > you need free space on disk at least equal to the sum of the table > size and the index sizes." > > I don't relish totally locking and making a temporary copy of my > biggest table, but that's exactly the table that seems to need > clustering the most. Will subsequent cluster command also make a > complete copy? Some form of "incremental clustering" would be > nice... Unfortunately, making this work in a manner that allows concurrency is likely to be really troublesome. The cluster operation potentially has to reorder all the tuples, and the fact that the table is already _partially_ organized only diminishes the potential. If the new data, generally added "at the end," has values that are fairly uniformly distributed across the index, then the operation really will have to reorder all of the tuples... It would be pretty sweet to have a process analagous to 'non-blocking VACUUM' as opposed to 'VACUUM FULL.' But there's no particularly easy way, and, to do so, you'd essentially have to throw away a fair chunk of the benefits of the "clustered" properties. -- "cbbrowne","@","acm.org" http://www.ntlug.org/~cbbrowne/rdbms.html "When you have eliminated the impossible, whatever remains, however improbable, must be the truth." -- Sir Arthur Conan Doyle (1859-1930), English author. Sherlock Holmes, in The Sign of Four, ch. 6 (1889). [...but see the Holmesian Fallacy, due to Bob Frankston... <http://www.frankston.com/public/Essays/Holmesian%20Fallacy.asp>]
On 1/4/04 6:24 PM, Christopher Browne wrote: > The cluster operation potentially has to reorder all the tuples, and > the fact that the table is already _partially_ organized only > diminishes the potential. If the new data, generally added "at the > end," has values that are fairly uniformly distributed across the > index, then the operation really will have to reorder all of the > tuples... What about the special case of a table that is clustered on a column and all subsequent inserts will add rows with ever-increasing values of that column? This would be the case for creation dates or even a column created from a sequence. Basically, after clustering, it would be nice if you could tell the system to "only add to the end" and to "add in clustered order." Programming for special cases is annoying, but sometimes it really helps. -John
John, et al, We too have an interest in reclustering large tables, but in our case most of the transactions are spread throughout the table (though in some cases not uniformly). I have been pondering a program that selects all the rows in the table in cluster order and then, as a single transaction, deletes a block-full of rows and then re-inserts them. The program would then move on to the next block-full and repeat the operation. Note that if there are triggers on the table, this may have unintended side-effects. This would require having a pretty clear idea of the space required for each row, and would probably require frequent vacuums during the process. If there were a way to tell the block address of each row, I suppose you could leave some rows where they are. In the end, you might end up with the same space requirements (a full copy as workspace), but I'm not sure. Depending on the data, it may be possible to add new rows to the table while this process is going on. Any new rows added by other processes will certainly not be in order, and may interfere with new rows being added in a contiguous fashion (I'm not sure of the allocation algorithm used by PG). Thoughts? Comments? Ray On Mon, Jan 05, 2004 at 01:54:16PM -0500, John Siracusa wrote: > On 1/4/04 6:24 PM, Christopher Browne wrote: > > The cluster operation potentially has to reorder all the tuples, and > > the fact that the table is already _partially_ organized only > > diminishes the potential. If the new data, generally added "at the > > end," has values that are fairly uniformly distributed across the > > index, then the operation really will have to reorder all of the > > tuples... > > What about the special case of a table that is clustered on a column and all > subsequent inserts will add rows with ever-increasing values of that column? > This would be the case for creation dates or even a column created from a > sequence. Basically, after clustering, it would be nice if you could tell > the system to "only add to the end" and to "add in clustered order." > > Programming for special cases is annoying, but sometimes it really helps. > > -John > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org ---------------------------------------------------------------------- Ray Ontko rayo@ontko.com Phone 1.765.935.4283 Fax 1.765.962.9788 Ray Ontko & Co. Software Consulting Services http://www.ontko.com/