Re: Strange count(*) implementation?

Поиск
Список
Период
Сортировка
От Henk Ernst Blok
Тема Re: Strange count(*) implementation?
Дата
Msg-id 417E5031.8040409@utwente.nl
обсуждение исходный текст
Ответ на Re: Strange count(*) implementation?  (Tino Wildenhain <tino@wildenhain.de>)
Ответы Re: Strange count(*) implementation?
Список pgsql-general
Hi Tino,

Tino Wildenhain wrote:
I assume(d) the more expensive statistics (e.g., value distribution
info) are updated only when outdated too much or on request (manual
vacuum). Usually, other/cheap statistics can easily be maintained
incrementally and thus reflect actual table state after each update.   
I remember some discussion about this. But also here the MVCC and
the general call for performance leads to the current solution 
where statistics are only updated on vacuum. With PG8.0 you have
vacuum strategies with better performance which can run more often
as I understand - so while not giving you exact figures your
count() could be estimated at least. 
I need exact figures at the moment due to the type of some scientific experiments I'm running. Approximations are in the pipeline but I need a base-line run without any tricks that affect the accuracy of the numbers involved.
 Of course, the MVCC principle seems to make things a bit more
complicated I understand now. But tracking whether statistics are
dirty has to be in the system anyway. How does it otherwise decide
when to do its own statistics updates? So  if explain can get the most
recent count, why not use it in the count as well if you know the
statistics are still acurate?    
The point is: you dont know it. There is curently no: mark statistics
dirty if table has new tuples or tuples removed. 
OK. Didn't know that, but got the impression by now it worked that way in Postgres.
By the way, a count(*) without any where does occur very frequently if
you are dealing with an OLAP load, which is the case in my setting.
So, I indeed already 'fixed' the performance problem by precomputing
all counts I can predict to be of any use.   
I'm not familar with OLAP specifics, so what is the meaning of the
count() here? What is done with this information? 
OLAP stands for OnLine Analystical Processing, typically meaning heavy queries with a lot of data (for typical examples, see: http://www.tpc.org/
the TPC-H query set in particular). So decision support and datamining are in that area for instance. My topic of interest is IR (information retrieval) in a database context. My experiments behave like an OLAP load at the moment. My current experiments involve a lot of counting and expensive joins as I have to compute certain estimators in a mathematical model I'm working on, hence the importance of the count... ;)
On MySQL each of the 30 queries I have to run took on average about 24 h. As my queries are getting even complexer I'm now trying to find out whether Postgres can do a better job.

Regards,


Henk Ernst
-- 
address: DB group, Computer Science, EEMCS Dept., University of Twente,        PO Box 217, 7500 AE, ENSCHEDE, THE NETHERLANDS
phone:   ++31 (0)53 489 3754 (if no response: 3690)
email:   h.e.blok@utwente.nl
WWW:     http://www.cs.utwente.nl/~blokh

В списке pgsql-general по дате отправления:

Предыдущее
От: Mike Mascari
Дата:
Сообщение: Re: Any plans on allowing user-defined triggers to be deferrable?
Следующее
От: Tino Wildenhain
Дата:
Сообщение: Re: Strange count(*) implementation?