Thread: Query not using index, please explain.


Query not using index, please explain.

From: Matthew Hagerty
Date:
Greetings,

I have a really simple table with a timestamp field.  The timestamp field
has an index on it.  But the index does not seem to be taken into account
for selects that return rows:

pglog=# explain select time_stamp from history_entries where time_stamp < 
'03-01-2000';
NOTICE:  QUERY PLAN:

Index Scan using hist_entries_timestamp on 
history_entries  (cost=0.00..12810.36 rows=3246 width=8)

EXPLAIN
pglog=# explain select time_stamp from history_entries where time_stamp < 
'04-01-2000';
NOTICE:  QUERY PLAN:

Seq Scan on history_entries  (cost=0.00..160289.71 rows=138215 width=8)

EXPLAIN
pglog=# set enable_seqscan to off;
SET VARIABLE
pglog=# explain select time_stamp from history_entries where time_stamp < 
'04-01-2000';
NOTICE:  QUERY PLAN:

Index Scan using hist_entries_timestamp on 
history_entries  (cost=0.00..368241.51 rows=138215 width=8)

EXPLAIN
pglog=# set enable_seqscan to on;
SET VARIABLE
pglog=#

The query where time_stamp < '03-01-2000' does not return any rows; the
'04-01-2000' date does return rows.  When I disable seqscan the query is
almost instant, but with it on, it takes about 3 or 4 minutes.  Why can't
the query planner use the index in the latter case?

Thanks,
Matthew



Re: Query not using index, please explain.

From: Matthew Hagerty
Date:
Richard,

Thanks for the response, I guess I should have included a little more 
information.  The table contains 3.5 million rows.  The indexes were 
created after the data was imported into the table and I had just run 
vacuum and vacuum analyze on the database before trying the queries and 
sending this question to hackers.

When I turned the seqscan variable off and ran the query with the
'04-01-2000' date, the results were literally instantaneous.  Turn
seqscan back on and it takes right around 3 minutes.  Also, the query for
any date earlier than '04-01-2000' returns zero rows.  The actual number
of rows for the '04-01-2000' select is right around 8300.
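
To make the mismatch explicit, the planner's guess can be compared
directly against an actual count; a quick sketch, with the numbers
taken from the runs above:

-- Planner's guess for this predicate (from EXPLAIN above): rows=138215
EXPLAIN SELECT time_stamp FROM history_entries WHERE time_stamp < '04-01-2000';

-- Actual number of qualifying rows (right around 8300):
SELECT count(*) FROM history_entries WHERE time_stamp < '04-01-2000';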

Here is the table for more information:

pglog=# \d history_entries
       Table "history_entries"
 Attribute  |    Type     | Modifier
------------+-------------+----------
 domain     | varchar(80) |
 time_stamp | timestamp   |
 response   | integer     |
 transfered | integer     |
 reqtime    | integer     |
 entry      | text        |
Indices: hist_entries_domain,
         hist_entries_timestamp
I'm also having problems with this query:

select domain from history_entries group by domain;

To me, since there is an index on domain, it seems like this should be
rather fast.  It takes a *very* long time, whether I turn seqscan on or
off.

pglog=# select version();
                                 version
--------------------------------------------------------------------------
 PostgreSQL 7.0.3 on i386-unknown-freebsdelf3.4, compiled by gcc 2.7.2.3
(1 row)

Thanks,
Matthew


At 07:18 PM 3/8/2001 +0000, you wrote:
>On Thu, Mar 08, 2001 at 01:49:42PM -0500, Matthew Hagerty wrote:
> > Greetings,
> >
> > I have a real simple table with a timestamp field.  The timestamp field
> > has an index on it.  But, the index does not seem to be taken into
> > account for selects that return rows:
> >
> > pglog=# explain select time_stamp from history_entries where time_stamp <
> > '03-01-2000';
> > NOTICE:  QUERY PLAN:
> >
> > Index Scan using hist_entries_timestamp on
> > history_entries  (cost=0.00..12810.36 rows=3246 width=8)
> >
> > EXPLAIN
> > pglog=# explain select time_stamp from history_entries where time_stamp <
> > '04-01-2000';
> > NOTICE:  QUERY PLAN:
> >
> > Seq Scan on history_entries  (cost=0.00..160289.71 rows=138215 width=8)
> >
> > EXPLAIN
> > pglog=# set enable_seqscan to off;
> > SET VARIABLE
> > pglog=# explain select time_stamp from history_entries where time_stamp <
> > '04-01-2000';
> > NOTICE:  QUERY PLAN:
> >
> > Index Scan using hist_entries_timestamp on
> > history_entries  (cost=0.00..368241.51 rows=138215 width=8)
> >
> > EXPLAIN
> > pglog=# set enable_seqscan to on;
> > SET VARIABLE
> > pglog=#
> >
> > The query where the time_stamp < '03-01-2000' does not return any rows;
> > the 04-01-2000 date does return rows.  When I disable seqscan the query
> > is almost instant, but with it on, it takes about 3 or 4 minutes.  Why
> > can't the query planner use the index in the latter case?
>
>Well, it can; it just chooses not to.  Your second EXPLAIN shows that
>it thinks it's going to get 138215 rows from that select; it then
>calculates that it would be more expensive to use the index than simply
>to scan the table.  Presumably it actually returns many fewer rows than
>that.  Have you done a VACUUM ANALYZE recently?  If you get plans this
>badly wrong immediately after a VACUUM ANALYZE, *then* is the time to
>ask -hackers about it (FAQ item 4.9).
>
>Richard



Re: Query not using index, please explain.

From: Tom Lane
Date:
Matthew Hagerty <mhagerty@voyager.net> writes:
> The query where the time_stamp < '03-01-2000' does not return any rows;
> the 04-01-2000 date does return rows.  When I disable seqscan the query is
> almost instant, but with it on, it takes about 3 or 4 minutes.  Why can't
> the query planner use the index in the latter case?

It *can* (and did, in two of the three examples you gave).  It just
doesn't think the indexscan is faster --- note the cost estimates.
Evidently the cost estimates are way off, probably because the estimated
number of selected rows is way off.

Have you done a VACUUM ANALYZE lately?  Not that that will help if the
distribution of timestamps is highly irregular :-(.  See the many past
discussions of these issues.
        regards, tom lane
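
A minimal sketch of the suggested checks, using only what appears in the
thread (date_trunc is assumed to be available in this release):

-- Refresh the statistics, then see whether the row estimate moves:
VACUUM ANALYZE history_entries;
EXPLAIN SELECT time_stamp FROM history_entries WHERE time_stamp < '04-01-2000';

-- Eyeball how irregular the timestamp distribution really is.
-- Note: this query itself walks the whole 3.5-million-row table.
SELECT date_trunc('month', time_stamp), count(*)
FROM history_entries
GROUP BY date_trunc('month', time_stamp)
ORDER BY date_trunc('month', time_stamp);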


Re: Query not using index, please explain.

From: Richard Poole
Date:
On Thu, Mar 08, 2001 at 02:43:54PM -0500, Matthew Hagerty wrote:
> Richard,
> 
> Thanks for the response, I guess I should have included a little more 
> information.  The table contains 3.5 million rows.  The indexes were 
> created after the data was imported into the table and I had just run 
> vacuum and vacuum analyze on the database before trying the queries and 
> sending this question to hackers.
> 
> When I turned the seqscan variable off and ran the query with the 
> '04-01-2000' date the results were literally instantaneous.  Turn the 
> seqscan back on and it takes right around 3 minutes.  Also, the query for 
> any date older than the '04-01-2000' returns zero rows.  The actual number 
> of rows for the '04-01-2000' select is right around 8300.

This is where you need an expert. :) But I'll have a go and someone
will correct me if I'm wrong...

The statistics which are kept aren't fine-grained enough to be right
here. All the optimiser knows is the highest and lowest values of
the attribute, the most common value (not really useful here), the
number of nulls in the column, and the "dispersion" (a sort of
handwavy measure of how bunched-together the values are). So in a
case like this, where effectively the values are all different over
a certain range, all it can do is (more or less) linearly interpolate
within that range to guess how many tuples are going to be returned.
That means it's liable to be completely wrong if your values aren't
evenly distributed over their whole range, which it seems they aren't.
It thinks you're going to hit around 1/28 of the tuples in this table,
presumably because '04/01/2000' is about 1/28 of the way from your
minimum value to your maximum.
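
To see the interpolation at work, one can check the extremes that drive
it. The query below is only a sketch of the idea; the formula in the
comment is the rough shape of the guess, not the planner's exact code:

-- Rough shape of the planner's guess:
--   estimated_rows ~ total_rows * (cutoff - min) / (max - min)
-- The extremes of the column are what drive it:
SELECT min(time_stamp), max(time_stamp), count(*)
FROM history_entries;
-- With ~3.5 million rows and an estimate of 138215, the planner is
-- treating the '04-01-2000' cutoff as a small fraction of the way
-- from min to max, whatever the real distribution looks like.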

This sort of thing will become much better one fine day when we have
richer statistics available, and so many of us want such things that
that fine day will surely come. Until then, I think you're best off
turning off seqscans from your client code when you know the planner
will get it wrong, as sketched below. (That's what we do here in
several similar cases.)
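
In client code that looks something like the following. Note that
enable_seqscan only affects the current session, so other backends keep
the default:

-- Force the index plan just for the query the planner misestimates:
SET enable_seqscan TO off;
SELECT time_stamp FROM history_entries WHERE time_stamp < '04-01-2000';
SET enable_seqscan TO on;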

Can someone who really knows this stuff (Tom?) step in if what I've
just said is completely wrong?

> select domain from history_entries group by domain;
> 
> To me, since there is an index on domain, it seems like this should be a 
> rather fast thing to do?  It takes a *very* long time, no matter if I turn 
> seqscan on or off.

The reason this is slow is that Postgres always has to look at heap
tuples, even when it's been sent there by indexes. This in turn is
because of the way the storage manager works (only by looking in the
heap can you tell for sure whether a tuple is valid for the current
transaction). So a "group by" always has to look at every heap tuple
(that hasn't been eliminated by a where clause). "select distinct"
has the same problem. I don't think there's a way to do what you
want here with your existing schema without a sequential scan over
the table.


Richard


Re: Query not using index, please explain.

From: Tom Lane
Date:
Richard Poole <richard.poole@vi.net> writes:
> [ snip a bunch of commentary about optimizer statistics ]
> Can someone who really knows this stuff (Tom?) step in if what I've
> just said is completely wrong?

Looked good to me.

>> select domain from history_entries group by domain;
>> 
>> To me, since there is an index on domain, it seems like this should be a 
>> rather fast thing to do?  It takes a *very* long time, no matter if I turn 
>> seqscan on or off.

> The reason this is slow is that Postgres always has to look at heap
> tuples, even when it's been sent there by indexes. This in turn is
> because of the way the storage manager works (only by looking in the
> heap can you tell for sure whether a tuple is valid for the current
> transaction). So a "group by" always has to look at every heap tuple
> (that hasn't been eliminated by a where clause). "select distinct"
> has the same problem. I don't think there's a way to do what you
> want here with your existing schema without a sequential scan over
> the table.

But this last I object to.  You certainly could use an index scan here,
it's just that it's not very likely to be faster.  The way that Postgres
presently does GROUP BY and SELECT DISTINCT is to sort the tuples into
order and then combine/eliminate adjacent duplicates.  (If you've ever
used a "sort | uniq" pipeline in Unix then you know the principle.)
So the initial requirement for this query is to scan the history_entries
tuples in order by domain.  We can do this either by a sequential scan
and explicit sort, or by an index scan using an ordered index on domain.
It turns out that unless the physical order of the tuples is pretty
close to their index order, the index-scan method is actually slower,
because it results in a lot of random-access thrashing.  But neither way
is exactly speedy.
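
One possibility not raised in the thread, offered only as a sketch: if
the physical order is the obstacle, CLUSTER can rewrite the table in
index order. In releases of this vintage CLUSTER reportedly rewrites the
table and drops other indexes and some table attributes, so check the
documentation for your version before trying it:

-- Rewrite history_entries in hist_entries_domain order (old syntax):
CLUSTER hist_entries_domain ON history_entries;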

One thing that's on the to-do list is to look at reimplementing these
operations using in-memory hash tables, with one hash entry per distinct
value of the GROUP/DISTINCT columns.  Then you can just do a sequential
scan, with no sort, and as long as there aren't so many distinct values
as to make the hash table overflow memory you're going to win.  However
until we have statistics that can give us some idea how many distinct
values there might be, there's no way for the planner to make an
intelligent choice between this way and the sort/uniq way...
        regards, tom lane
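
For what it's worth, the statistic the planner would need here, the
number of distinct domains, can be computed today, though doing so costs
exactly the scan-and-sort it would help avoid:

-- Chicken-and-egg: the planner would need this number to choose a
-- hash-based plan, but computing it is itself a full scan plus a sort.
SELECT count(DISTINCT domain) FROM history_entries;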