Re: Weird indices

From: Tom Lane
Subject: Re: Weird indices
Date:
Msg-id: 12193.982791429@sss.pgh.pa.us
In reply to: Re: Weird indices  (Martijn van Oosterhout <kleptog@svana.org>)
List: pgsql-general
Martijn van Oosterhout <kleptog@svana.org> writes:
> We have a table with over 1 million rows and the statistics Postgres gathers
> are not particularly useful. There is not one (non-null) value that occurs
> significantly more often than other values but the distribution looks a lot
> like a 1/x curve I guess. The most common value occurs 5249 times but the
> average is only 95, so Postgres chooses seq scan almost always. We actually
> now set enable_seqscan=off in many areas of our code to speed it up to a
> useful rate. (This table also happens to have an (accidental) clustering on
> this column.)

> What is the reasoning behind estimating like that? Why not just the average
> or the average + 1 SD?

Can you think of a reasonable algorithm for VACUUM to obtain the true
average frequency?  It has a tough enough time estimating the most
common frequency with any reliability.  Given complaints in nearby
threads that VACUUM ANALYZE is too slow, it'd be a good idea if your
algorithm was faster than the current one, too ;-)
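
To see why it's hard, note that the true average frequency is just total
rows over distinct values, and getting either number exactly costs a full
pass plus memory per distinct value.  A rough sketch (Python, names made
up) of the brute-force version:

    from collections import Counter

    def exact_frequency_stats(values):
        # Exact statistics require touching every row and remembering
        # every distinct value -- as expensive as the scan itself.
        counts = Counter(values)
        _, most_common_freq = counts.most_common(1)[0]
        avg_freq = len(values) / len(counts)  # rows per distinct value
        return most_common_freq, avg_freq

On the table you describe this would report roughly (5249, 95): the most
common frequency dwarfs the average, so a planner keying on the former
badly overestimates the rows an index scan would fetch.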

We have kicked around the idea of scanning a btree index (if available)
rather than the main table.  This would make it a *lot* easier to obtain
reliable frequency statistics, since you'd know that all the instances
of a given value would be seen in sequence, and you could count them
with trivial logic rather than having a difficult estimation problem.
(The fact that some might represent dead tuples probably isn't a problem,
especially since we just vacuumed.)  Not done yet though, and there are
some issues still to be surmounted.
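
The counting itself would be nothing more than, say (a sketch, ignoring
dead tuples and the actual btree machinery):

    from itertools import groupby

    def frequencies_in_index_order(sorted_values):
        # Every instance of a value is adjacent in index order, so one
        # pass with a run counter yields exact counts -- trivial logic,
        # no estimation involved.
        return {v: sum(1 for _ in run) for v, run in groupby(sorted_values)}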

> Another idea: is there a use for making a "cohesiveness" index?  I.e., if
> you're looking up X via the index, on average, how many other matching
> tuples will be in the next 8k (or whatever size)?  Since these are likely to
> be in the cache, the cost of retrieval would be much lower. This would mean
> that an index on a clustered column would have a much lower estimated cost
> than an index on other columns. This would make clustering more useful.

Again, estimating this number with any reliability seems a hard problem.
Got any ideas?
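
For what it's worth, the statistic you're describing might look something
like this (entirely hypothetical) sketch:

    def cohesiveness(match_pages):
        # match_pages: page numbers of the matching tuples, in index
        # order.  Fraction of matches landing on the same page as the
        # previous match: near 1.0 for a clustered column, near 0.0
        # for a scattered one.
        if len(match_pages) < 2:
            return 1.0
        same = sum(1 for a, b in zip(match_pages, match_pages[1:])
                   if a == b)
        return same / (len(match_pages) - 1)

The hard part isn't defining the number, it's gathering it cheaply at
VACUUM time.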

What I've been thinking of is simply maintaining a flag that says "this
table has been clustered on this index" (a fact not now stored anywhere)
and having the planner change cost estimates accordingly.  The accuracy
of the cost estimates would degrade as the table is updated and drifts
away from clustered order, but it seems *very* hard to estimate that
process.  One answer is not to try, but to assume that the dbadmin will
re-cluster the table often enough so that it stays in pretty good order.
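
The planner-side change would then be straightforward; very roughly (a
sketch, not the real cost equations):

    def index_scan_pages(tuples_fetched, rel_pages, rel_tuples,
                         is_clustered):
        # With the clustered flag set, matching tuples sit on adjacent
        # pages, so the fetch estimate is a proportional slice of the
        # table; otherwise pessimistically assume one page per tuple.
        if is_clustered:
            return max(1.0, rel_pages * tuples_fetched / rel_tuples)
        return float(tuples_fetched)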

At the moment I'm hesitant to do anything that encourages use of CLUSTER
at all, because of the horrible side-effects it has.  So personally I
think this is further down the to-do queue than rewriting CLUSTER.

            regards, tom lane
