Re: count(*) and bad design was: Experiences with extensibility

From: Scott Marlowe
Subject: Re: count(*) and bad design was: Experiences with extensibility
Date:
Msg-id: dcc563d10801090830k2033d26fmdb5f6774238ed92f@mail.gmail.com
In reply to: Re: count(*) and bad design was: Experiences with extensibility  (Sim Zacks <sim@compulab.co.il>)
Responses: Re: count(*) and bad design was: Experiences with extensibility  (Ivan Sergio Borgonovo <mail@webthatworks.it>)
List: pgsql-general
On Jan 9, 2008 8:33 AM, Sim Zacks <sim@compulab.co.il> wrote:
> Using count(*) is not bad design, though generally it makes sense to use it with
> a where.
>
> Saying using count(*) is bad design means that the only design that you can
> visualize is the specific one that you are using.
>
> There are tons of real world examples where you need count. That is why so many
> people use it as a benchmark.
>
> Obviously if you have an application where millions of rows are added and
> subtracted every minute, then the value of count is kind of vague.
>
> However, if you are querying a data warehouse that gets populated once a month,
> then count has a huge value. You could use statistics in that case, but it is
> much harder to join against other tables when you are using statistics. It is
> also less intuitive than using the SQL standard for finding the number of rows.

I think part of the problem is people think of count() as something
other than an aggregate function.

If I ran "select avg(i) from table" on a 20M row table, I'd expect it
to take a few seconds; after all, I'm running a function across 20
million rows.

Some databases have the ability to short-circuit count(*) without a
where clause, and some even with one.  But the basic model of
count(*) is that it's an aggregate function, and what you're asking
the db to do is to count every single row in the db that matches your
where clause.

Now, everything's a tradeoff.  If PostgreSQL kept visibility
information in the indexes, it would have to lock both the table and
the index for every write, thus slowing down all the other queries
that are trying to access the table.  It would be a tradeoff that
sacrificed write speed for read speed. In a db used mostly for
reading, it would likely be a fair trade.  In a db that did a lot
of writing, it might slow the whole thing to a crawl.

So, the slow count(*) performance of PostgreSQL, especially count(*)
without a where clause, is a result of the TANSTAAFL principle (there
ain't no such thing as a free lunch).
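As an aside, when an approximate number is good enough, you can read the planner's own statistics instead of counting at all. This is just a sketch; the table name "bigtable" is a placeholder, and the figure is only as fresh as the last ANALYZE (or autovacuum) run:

```sql
-- Approximate row count from the planner statistics, maintained by
-- ANALYZE / autovacuum.  Essentially free, but it is an estimate, not
-- a transaction-consistent count.  'bigtable' is a placeholder name.
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'bigtable';
```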

Now, if there's a where clause that's selective enough, then a
count(*) query may use the index and be a bit faster, but remember, in
pgsql, it's still got to actually hit the table to see if each tuple
really is visible to this transaction, so the index needs to be fairly
selective to be a win.

A possible workaround is to keep a separate, skinny table holding
nothing but the IDs (and whatever columns appear in your where
clause) of the table you're counting, linked to it by a foreign key,
and run count(*) against THAT.  Since the rows are skinnier, the
count(*) will be faster.  Another alternative is to have a trigger
fire that keeps track of the size of the table in a summary table as
rows are added and deleted.
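A minimal sketch of that trigger approach might look like this; the names (rowcount, count_trig, bigtable) are illustrative, not from the thread:

```sql
-- One-row-per-table summary, seeded with the current count.
CREATE TABLE rowcount (tablename text PRIMARY KEY, cnt bigint NOT NULL);
INSERT INTO rowcount SELECT 'bigtable', count(*) FROM bigtable;

-- Bump the counter on every insert/delete on the tracked table.
CREATE OR REPLACE FUNCTION count_trig() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE rowcount SET cnt = cnt + 1 WHERE tablename = TG_TABLE_NAME;
    RETURN NEW;
  ELSE  -- DELETE
    UPDATE rowcount SET cnt = cnt - 1 WHERE tablename = TG_TABLE_NAME;
    RETURN OLD;
  END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER bigtable_count
AFTER INSERT OR DELETE ON bigtable
FOR EACH ROW EXECUTE PROCEDURE count_trig();

-- "count(*)" is now a single-row lookup:
SELECT cnt FROM rowcount WHERE tablename = 'bigtable';
```

Note the cost mentioned above: every writer now contends for the same summary row, so heavy concurrent insert/delete traffic will serialize on it.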

Each of these methods "costs" you something, in time and effort or
performance, and that's why they're not automatic.  For instance, I
have an 80M row stats db that grows by about 1M rows a week.  I
hardly ever need to count the whole thing, and can live with not
knowing exactly how many rows it has at any given time.  When I do a
select count(*) with a where clause, it's usually restricted to less
than one week's data, so it can use an index and come back pretty
quickly.

select count(*) from bigtable where inserttime > now() - interval '1 week';
  count
---------
 1254269
(1 row)

Time: 21422.368 ms

Second run:
select count(*) from bigtable where inserttime > now() - interval '1 week';
  count
---------
 1254320
(1 row)

Time: 2001.204 ms

With the data loaded into shared_buffers / linux kernel cache, that's
not too bad.

Now, I run the same query against our production oracle machine, which
is a MUCH more powerful server...

SQL> select count(*) from bigtable where inserttime > SYSDATE-7;

  COUNT(*)
----------
   1255972

Elapsed: 00:00:18.62

second run:
SQL> select count(*) from bigtable where inserttime > SYSDATE-7;

  COUNT(*)
----------
   1255973
Elapsed: 00:00:00.98

Now, obviously, Oracle's got some optimizations for what it's got in
the buffer there, but the first run isn't really any faster. In fact,
for a much more powerful machine, the performance was, relatively
speaking, pretty bad compared to my little 1-CPU software-RAID-10
reporting server.

So, while PostgreSQL's count(*) performance isn't blindingly fast,
it's not the dog some people make it out to be either.
