Обсуждение: Planner statistics, correlations

Поиск
Список
Период
Сортировка

Planner statistics, correlations

От
Tobias Brox
Дата:
We have a table with a timestamp attribute (event_time) and a state flag
which usually changes value around the event_time (it goes to 4).  Now
we have more than two years of events in the database, and around 5k of
future events.

It is important to frequently pick out "overdue events", say:

  select * from events where state<>4 and event_time<now()

This query would usually yield between 0 and 100 rows - however, the
planner doesn't see the correlation betewen state and event_time - since
most of the events have event_time<now, the planner also assumes most of
the events with state<>4 has event_time<now, so the expected number of
rows is closer to 5k.  This matters, because I have a query with joins,
and I would really benefit from nested loops.

(I've tried replacing "now()" above with different timestamps from the
future and the past.  I'm using pg 8.2)

Any suggestions?


Re: Planner statistics, correlations

От
"Peter Childs"
Дата:
On 12/01/07, Tobias Brox <tobias@nordicbet.com> wrote:
> We have a table with a timestamp attribute (event_time) and a state flag
> which usually changes value around the event_time (it goes to 4).  Now
> we have more than two years of events in the database, and around 5k of
> future events.
>
> It is important to frequently pick out "overdue events", say:
>
>   select * from events where state<>4 and event_time<now()
>
> This query would usually yield between 0 and 100 rows - however, the
> planner doesn't see the correlation betewen state and event_time - since
> most of the events have event_time<now, the planner also assumes most of
> the events with state<>4 has event_time<now, so the expected number of
> rows is closer to 5k.  This matters, because I have a query with joins,
> and I would really benefit from nested loops.
>
> (I've tried replacing "now()" above with different timestamps from the
> future and the past.  I'm using pg 8.2)
>
> Any suggestions?
>

Can you say what state might be rather than what it is not. I'm guess
that state is an int but there is only a limited list of possible
states, if you can say what it might be rather than what it is the
index is more liklly to be used.

Peter.

Re: Planner statistics, correlations

От
Tobias Brox
Дата:
[Peter Childs - Fri at 08:56:54AM +0000]
> Can you say what state might be rather than what it is not. I'm guess
> that state is an int but there is only a limited list of possible
> states, if you can say what it might be rather than what it is the
> index is more liklly to be used.

  explain select * from events where state in (1,2,3) and event_time<now()

also estimates almost 5k of rows.  I also tried:

  explain select * from events where state=2 and event_time<now()

but get the same behaviour.

Maybe it would help to partitionate the table every year?

Re: Planner statistics, correlations

От
Richard Huxton
Дата:
Tobias Brox wrote:
> [Peter Childs - Fri at 08:56:54AM +0000]
>> Can you say what state might be rather than what it is not. I'm guess
>> that state is an int but there is only a limited list of possible
>> states, if you can say what it might be rather than what it is the
>> index is more liklly to be used.
>
>   explain select * from events where state in (1,2,3) and event_time<now()
>
> also estimates almost 5k of rows.

Try a partial index:
CREATE INDEX my_new_index ON events (event_time)
WHERE state in (1,2,3);

Now, if that doesn't work you might want to split the query into two...

SELECT * FROM events
WHERE state IN (1,2,3) AND event_time < '2007-01-01'::date
UNION ALL
SELECT * FROM events
WHERE state IN (1,2,3) AND event_time >= '2007-01-01'::date AND
event_time < now();

CREATE INDEX my_new_index ON events (event_time)
WHERE state in (1,2,3) AND event_time < '2007-01-01'::date;

CREATE INDEX event_time_state_idx ON events (event_time, state);

You'll want to replace the index/update the query once a year/month etc.

--
   Richard Huxton
   Archonet Ltd

Re: Planner statistics, correlations

От
Tobias Brox
Дата:
[Richard Huxton - Fri at 09:17:48AM +0000]
> Try a partial index:
> CREATE INDEX my_new_index ON events (event_time)
> WHERE state in (1,2,3);

I have that, the index is used and the query is lightning fast - the
only problem is that the planner is using the wrong estimates.  This
becomes a real problem when doing joins and more complex queries.

> Now, if that doesn't work you might want to split the query into two...

Hm, that's an idea - to use a two-pass query ... first:

  select max(event_time) from events where state in (1,2,3);

and then use the result:

  select * from events
  where event_time>? and event_time<now() and state in (1,2,3)

This would allow the planner to get the estimates in the right ballpark
(given that the events won't stay for too long in the lower states), and
it would in any case not be significantly slower than the straight-ahead
approach - but quite inelegant.


Re: Planner statistics, correlations

От
Heikki Linnakangas
Дата:
Tobias Brox wrote:
> Maybe it would help to partitionate the table every year?

I thought about partitioning the table by state, putting rows with
state=4 into one partition, and all others to another partition.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Planner statistics, correlations

От
Tobias Brox
Дата:
[Heikki Linnakangas - Fri at 10:41:34AM +0000]
> I thought about partitioning the table by state, putting rows with
> state=4 into one partition, and all others to another partition.

That sounds like a good idea - but wouldn't that be costly when changing state?

Re: Planner statistics, correlations

От
Heikki Linnakangas
Дата:
Tobias Brox wrote:
> [Heikki Linnakangas - Fri at 10:41:34AM +0000]
>> I thought about partitioning the table by state, putting rows with
>> state=4 into one partition, and all others to another partition.
>
> That sounds like a good idea - but wouldn't that be costly when changing state?

In PostgreSQL, UPDATE internally inserts a new row and marks the old one
as deleted, so there shouldn't be much of a performance difference.

I'm not very familiar with our partitioning support, so I'm not sure if
there's any problems with an update moving a row from one partition to
another. I think you'll have to create an INSTEAD OF UPDATE rule to do a
DELETE on one partition and an INSERT on the other partition. Depending
on your application, that might be a problem; UPDATE is different from
DELETE+INSERT from transaction isolation point of view.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com