Column correlation drifts, index ignored again

Поиск

Список

Период

Сортировка

От	John Siracusa
Тема	Column correlation drifts, index ignored again
Дата	21 февраля 2004 г. 23:18:04
Msg-id	94FA2D7A-64CC-11D8-9DEA-000A95BA4506@mindspring.com обсуждение исходный текст
Ответы	Re: Column correlation drifts, index ignored again Re: Column correlation drifts, index ignored again
Список	pgsql-performance

Дерево обсуждения

This is a follow-up to an old thread of mine, but I can't find it now
so I'll just re-summarize.

I have a ~1 million row table that I mostly want to query by date
range.  The rows are pretty uniformly spread over a 3 year date range.
I have an index on the date column, but it wasn't always used in the
past.  I disabled the seqscan plan before running my query as a first
fix, but it bothered me that I had to do that.

Next, thanks to my earlier thread, I clustered the table on the date
column and then "SET STATISTICS" on the date column to be 100.  That
did the trick, and I stopped explicitly disabling seqscan.

Today, I noticed that Postgres (still 7.4) stopped using the date index
again.  I checked the correlation for the date column and it was down
to 0.4.  So I guess that stat does drift away from 1.0 after
clustering.  That's a bummer, because clustering locks up the table
while it works, which I can't really afford to do often.  Even at a
correlation of 0.4 on the date column, using the date index was still
much faster than the seqscan plan that Postgres was choosing.  Anyway,
it's reclustering now.

A common query looks like this:

SELECT
     SUM(amount),
     SUM(quantity),
     date_trunc('day', date) AS date
FROM
     mytable
WHERE
     col1 IS NOT NULL AND
     col2 = 'foo'     AND
     col3 = 'bar'     AND
     date BETWEEN '2004-02-01 00:00:00' AND '2004-02-28 23:59:59'
GROUP BY
     date_trunc('day', date)
ORDER BY
     date;

The EXPLAIN ANALYZE output should look like this:

  Sort  (cost=4781.75..4824.15 rows=16963 width=23) (actual
time=2243.595..2243.619 rows=21 loops=1)
    Sort Key: date_trunc('day'::text, date)
    ->  HashAggregate  (cost=3462.87..3590.09 rows=16963 width=23)
(actual time=2241.773..2243.454 rows=21 loops=1)
          ->  Index Scan using mytable_date_idx on mytable
(cost=0.00..3071.70 rows=52155 width=23) (actual time=2.610..1688.111
rows=49679 loops=1)
                Index Cond: ((date >= '2004-02-01 00:00:00'::timestamp
without time zone) AND (date <= '2004-02-28 23:59:59'::timestamp
without time zone))
                Filter: ((col1 IS NOT NULL) AND ((col2)::text =
'foo'::text) AND ((col3)::text = 'bar'::text))
  Total runtime: 2244.391 ms

Unfortunately, since I just re-clustered, I can't get the old EXPLAIN
output, but just imagine "Seq Scan" in place of "Index Scan using
mytable_date_idx" to get the idea.

My question is: what other options do I have?  Should I "SET
STATISTICS" on the date column to 200?  500? The maximum value of 1000?
  I want to do something that will convince Postgres that using the date
index is, by far, the best plan when running my queries, even when the
date column correlation stat drops well below 1.0.

-John

В списке pgsql-performance по дате отправления:

Предыдущее

От: Don Bowman
Дата: 21 февраля 2004 г., 20:12:39
Сообщение: conceptual method to create high performance query involving time

Следующее

От: Josh Berkus
Дата: 22 февраля 2004 г., 18:02:57
Сообщение: Re: JOIN order, 15K, 15K, 7MM rows

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Column correlation drifts, index ignored again

Предыдущее

Следующее