Обсуждение: Incremental clustering?

Поиск
Список
Период
Сортировка

Incremental clustering?

От
John Siracusa
Дата:
The docs say:

"Clustering is a one-time operation: when the table is subsequently updated,
the changes are not clustered. That is, no attempt is made to store new or
updated rows according to their index order. If one wishes, one can
periodically recluster by issuing the command again."

and

"During the cluster operation, a temporary copy of the table is created that
contains the table data in the index order. Temporary copies of each index
on the table are created as well. Therefore, you need free space on disk at
least equal to the sum of the table size and the index sizes."

I don't relish totally locking and making a temporary copy of my biggest
table, but that's exactly the table that seems to need clustering the most.
Will subsequent cluster command also make a complete copy?  Some form of
"incremental clustering" would be nice...

-John


Re: Incremental clustering?

От
Christopher Browne
Дата:
A long time ago, in a galaxy far, far away, siracusa@mindspring.com (John Siracusa) wrote:
> The docs say:
>
> "Clustering is a one-time operation: when the table is subsequently
> updated, the changes are not clustered. That is, no attempt is made
> to store new or updated rows according to their index order. If one
> wishes, one can periodically recluster by issuing the command
> again."
>
> and
>
> "During the cluster operation, a temporary copy of the table is
> created that contains the table data in the index order. Temporary
> copies of each index on the table are created as well. Therefore,
> you need free space on disk at least equal to the sum of the table
> size and the index sizes."
>
> I don't relish totally locking and making a temporary copy of my
> biggest table, but that's exactly the table that seems to need
> clustering the most.  Will subsequent cluster command also make a
> complete copy?  Some form of "incremental clustering" would be
> nice...

Unfortunately, making this work in a manner that allows concurrency is
likely to be really troublesome.

The cluster operation potentially has to reorder all the tuples, and
the fact that the table is already _partially_ organized only
diminishes the potential.  If the new data, generally added "at the
end," has values that are fairly uniformly distributed across the
index, then the operation really will have to reorder all of the
tuples...

It would be pretty sweet to have a process analagous to 'non-blocking
VACUUM' as opposed to 'VACUUM FULL.'  But there's no particularly easy
way, and, to do so, you'd essentially have to throw away a fair chunk
of the benefits of the "clustered" properties.
--
"cbbrowne","@","acm.org"
http://www.ntlug.org/~cbbrowne/rdbms.html
"When you have eliminated the impossible, whatever remains, however
improbable, must be the truth." -- Sir Arthur Conan Doyle (1859-1930),
English author. Sherlock Holmes, in The Sign of Four, ch. 6 (1889).
[...but see the Holmesian Fallacy, due to Bob Frankston...
<http://www.frankston.com/public/Essays/Holmesian%20Fallacy.asp>]

Re: Incremental clustering?

От
John Siracusa
Дата:
On 1/4/04 6:24 PM, Christopher Browne wrote:
> The cluster operation potentially has to reorder all the tuples, and
> the fact that the table is already _partially_ organized only
> diminishes the potential.  If the new data, generally added "at the
> end," has values that are fairly uniformly distributed across the
> index, then the operation really will have to reorder all of the
> tuples...

What about the special case of a table that is clustered on a column and all
subsequent inserts will add rows with ever-increasing values of that column?
This would be the case for creation dates or even a column created from a
sequence.  Basically, after clustering, it would be nice if you could tell
the system to "only add to the end" and to "add in clustered order."

Programming for special cases is annoying, but sometimes it really helps.

-John


Re: Incremental clustering?

От
Ray Ontko
Дата:
John, et al,

We too have an interest in reclustering large tables, but in our
case most of the transactions are spread throughout the table
(though in some cases not uniformly).

I have been pondering a program that selects all the rows in
the table in cluster order and then, as a single transaction,
deletes a block-full of rows and then re-inserts them.  The program
would then move on to the next block-full and repeat the operation.
Note that if there are triggers on the table, this may have unintended
side-effects.

This would require having a pretty clear idea of the space required
for each row, and would probably require frequent vacuums during
the process.  If there were a way to tell the block address of each
row, I suppose you could leave some rows where they are.  In the
end, you might end up with the same space requirements (a full
copy as workspace), but I'm not sure.

Depending on the data, it may be possible to add new rows to the
table while this process is going on.  Any new rows added by other
processes will certainly not be in order, and may interfere with
new rows being added in a contiguous fashion (I'm not sure of the
allocation algorithm used by PG).

Thoughts?  Comments?

Ray

On Mon, Jan 05, 2004 at 01:54:16PM -0500, John Siracusa wrote:
> On 1/4/04 6:24 PM, Christopher Browne wrote:
> > The cluster operation potentially has to reorder all the tuples, and
> > the fact that the table is already _partially_ organized only
> > diminishes the potential.  If the new data, generally added "at the
> > end," has values that are fairly uniformly distributed across the
> > index, then the operation really will have to reorder all of the
> > tuples...
>
> What about the special case of a table that is clustered on a column and all
> subsequent inserts will add rows with ever-increasing values of that column?
> This would be the case for creation dates or even a column created from a
> sequence.  Basically, after clustering, it would be nice if you could tell
> the system to "only add to the end" and to "add in clustered order."
>
> Programming for special cases is annoying, but sometimes it really helps.
>
> -John
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
>                http://archives.postgresql.org
----------------------------------------------------------------------
Ray Ontko   rayo@ontko.com   Phone 1.765.935.4283   Fax 1.765.962.9788
Ray Ontko & Co.   Software Consulting Services   http://www.ontko.com/