От: Tobias Brox
Тема: Re: timestamp indexing
Дата: ,
Msg-id: 20050531022011.GC26100@tobias.exoweb.net
(см: обсуждение, исходный текст)
Ответ на: Re: timestamp indexing  (Tobias Brox)
Ответы: Re: timestamp indexing  ("Jim C. Nasby")
Список: pgsql-performance

Скрыть дерево обсуждения

timestamp indexing  (Tobias Brox, )
 Re: timestamp indexing  (Michael Fuhr, )
  Re: timestamp indexing  (Tobias Brox, )
   Re: timestamp indexing  (Tom Lane, )
    Re: timestamp indexing  (Tobias Brox, )
     Re: timestamp indexing  (Tobias Brox, )
      Re: timestamp indexing  ("Jim C. Nasby", )
       Re: timestamp indexing  (Tobias Brox, )

[Tobias Brox - Tue at 10:06:25AM +0800]
> [Tom Lane - Mon at 01:57:54PM -0400]
> > Your followup hasn't shown up here yet,
>
> I'll check up on that and resend it.

Hrm ... messed-up mail configuration I suppose.  Here we go:

Paul McGarry unintentionally sent a request for more details off the list,
since it was intended for the list I'll send my reply here.

While writing up the reply, and doing research, I discovered that this is
not a problem with indexing timestamps per se, but more with a query of the
kind "give me 5% of the table"; it seems like it will often prefer to do a
full table scan instead of going via the index.

I think that when I had my university courses on databases, we also learned
about flat indexes, where the whole index has to be rebuilt whenever a field
is updated or inserted in the middle, and I also think we learned that the
table usually would be sorted physically by the primary key on the disk.  As
long as we have strictly incrementing primary keys and timestamps, such a
setup would probably be more efficient for queries of the kind "give me all
activity for the last two weeks"?

Here follows my reply to Paul, including some gory details:

[Paul McGarry - Mon at 07:59:35PM +1000]
> What version of postgresql are you using and what are the exact
> datatypes and queries?

We are still using 7.4.6, but I suppose that if our issues are completely or
partially solved in pg 8, that would make a good case for upgrading :-)

The datatypes I'm indexing are timestamp without time zone.

Actually I may be on the wrong hunting ground now - the production system
froze completely some days ago basically due to heavy iowait and load on the
database server, rendering postgresql completely unresponsive - and back
then we had too poor logging to find out what queries that was causing it to
grind to a halt, and since we've never such a bad problem before, we didn't
know how to handle the situation (we just restarted the entire postgresql;
if we had been just killing the processes running the rogue database
queries, we would have had very good tracks of it in the logs).

I digress.  The last days I've looked through profiling logs, and I'm
checking if the accumulatively worst queries can be tuned somehow.  Most of
them are big joins, but I'm a bit concerned of the amounts of "Seq Scan"
returned by "explain" despite the fact that only a small fraction of the
tables are queried.  I reduced the problem to a simple "select * from table
where created>xxx" and discovered that it still won't use index, and still
will be costly (though of course not much compared to the big joined query).

The "ticket" table have less than a million rows, around 50k made the last
ten days:

NBET=> explain analyze select * from ticket where created>'2005-05-20';
                                                    QUERY PLAN

------------------------------------------------------------------------------------------------------------------
 Seq Scan on ticket  (cost=0.00..19819.91 rows=89553 width=60) (actual time=535.884..1018.268 rows=53060 loops=1)
   Filter: (created > '2005-05-20 00:00:00'::timestamp without time zone)
 Total runtime: 1069.514 ms
(3 rows)

Anyway, it seems to me that "indexing on timestamp" is not the real issue
here, because when restricting by primary key (numeric, sequential ID) the
execution time is the same or worse, still doing a sequence scan:

NBET=> explain analyze select * from ticket  where id>711167;
                                                    QUERY PLAN

------------------------------------------------------------------------------------------------------------------
 Seq Scan on ticket  (cost=0.00..19819.91 rows=92273 width=60) (actual
time=550.855..1059.843 rows=53205 loops=1)
   Filter: (id > 711167)
 Total runtime: 1110.469 ms
(3 rows)


I've tried running equivalent queries on a table with twice as many rows and
width=180, it will pull from the index both when querying by ID and
timestamp, and it will usually spend less time.

Running "select * from ticket" seems to execute ~2x slower than when having
the restriction.

> I have a 7.3 database with a "timestamp with time zone" field and we
> have to be very careful to explicitly cast values as that in queries
> if it is to use the index correctly. I believe it's an issue that is
> cleared up in newer versions though.

I suppose so - as said, restricting by primary key didn't improve the
performance significantly, so I was clearly wrong indicating that this is a
special issue with indexing a timestamp.

--
Tobias Brox, Beijing


В списке pgsql-performance по дате сообщения:

От: Tobias Brox
Дата:
Сообщение: Re: Index on a NULL-value
От: "Praveen Raja"
Дата:
Сообщение: very large table