Thread: why does count take so long?


why does count take so long?

From: Joseph Shraibman
On a 7.3.4 database:

explain analyse select count(*) from elog;


  Aggregate  (cost=223764.05..223764.05 rows=1 width=0) (actual
time=81372.11..81372.11 rows=1 loops=1)
    ->  Seq Scan on elog  (cost=0.00..203012.24 rows=8300724 width=0)
(actual time=3.91..71542.53 rows=8313762 loops=1)
  Total runtime: 81378.42 msec
(3 rows)


It looks like the aggregate took 10 secs all by itself.  What's taking
so long?


Re: why does count take so long?

From: Bruno Wolff III
On Sun, Sep 07, 2003 at 21:06:26 -0400,
  Joseph Shraibman <jks@selectacast.net> wrote:
> On a 7.3.4 database:
>
> explain analyse select count(*) from elog;
>
>
>  Aggregate  (cost=223764.05..223764.05 rows=1 width=0) (actual
> time=81372.11..81372.11 rows=1 loops=1)
>    ->  Seq Scan on elog  (cost=0.00..203012.24 rows=8300724 width=0)
> (actual time=3.91..71542.53 rows=8313762 loops=1)
>  Total runtime: 81378.42 msec
> (3 rows)
>
>
> It looks like the aggregate took 10 secs all by itself.  What's taking
> so long?

It looks like there are 8 million log records that need to be counted.

There have been some discussions over the last few days about tricks to
get better performance if you need to use count on large tables.
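For the archives, one such trick can be sketched as follows (illustrative only, not quoted from those discussions; all object names are made up, 7.3-era plpgsql quoting is assumed, and plpgsql must be installed in the database):

```sql
-- Sketch: keep a running count of elog rows in a one-row side table,
-- maintained by a trigger.  Note that the single counter row becomes a
-- serialization point for concurrent writers.
create table elog_count (n int8 not null);
insert into elog_count values (0);

create function elog_count_maint() returns trigger as '
begin
    if TG_OP = ''INSERT'' then
        update elog_count set n = n + 1;
    else  -- DELETE
        update elog_count set n = n - 1;
    end if;
    return null;
end;
' language 'plpgsql';

create trigger elog_count_trig
    after insert or delete on elog
    for each row execute procedure elog_count_maint();

-- Now the count is a one-row read instead of a seqscan:
select n from elog_count;
```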

Re: why does count take so long?

From: Tom Lane
Bruno Wolff III <bruno@wolff.to> writes:
>   Joseph Shraibman <jks@selectacast.net> wrote:
>> It looks like the aggregate took 10 secs all by itself.  What's taking
>> so long?

> It looks like there are 8 million log records that need to be counted.

Yeah, but I think he's complaining about the 10sec delta for the
aggregate on top of the 71sec to read the 8 million rows.  That
seems high to me too.  On a 10-mil-row test table, I get

regression=# explain analyze select count(*) from foo;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=22.50..22.50 rows=1 width=0) (actual time=189865.81..189865.81 rows=1 loops=1)
   ->  Seq Scan on foo  (cost=0.00..20.00 rows=1000 width=0) (actual time=18.88..163833.61 rows=10240000 loops=1)
 Total runtime: 189865.91 msec
(3 rows)

in other words 26sec to do the aggregate on top of 163sec to read the
rows.

Unless Joseph's machine has a way better IO-to-CPU ratio than my little
development machine, there's something odd about his numbers.

            regards, tom lane

Re: why does count take so long?

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Yeah, but I think he's complaining about the 10sec delta for the
> aggregate on top of the 71sec to read the 8 million rows.  That
> seems high to me too.  On a 10-mil-row test table, I get
...
> in other words 26sec to do the aggregate on top of 163sec to read the
> rows.
>
> Unless Joseph's machine has a way better IO-to-CPU ratio than my little
> development machine, there's something odd about his numbers.

Why is 10s (a 14% delta) for 8M records suspicious but 26s (16% delta) for 10M
not suspicious? These results seem fairly consistent actually.

I think the original question was "what work does this 10s represent?"
I'm curious too. Is it really just 10 million times the cpu cycles necessary
to dispatch a call to the count() aggregate state function?



PS:

> regression=# explain analyze select count(*) from foo;
>                                                     QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=22.50..22.50 rows=1 width=0) (actual time=189865.81..189865.81 rows=1 loops=1)
>    ->  Seq Scan on foo  (cost=0.00..20.00 rows=1000 width=0) (actual time=18.88..163833.61 rows=10240000 loops=1)
>  Total runtime: 189865.91 msec
> (3 rows)

Hey, you haven't analyzed your table! :)

--
greg

Re: why does count take so long?

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> Why is 10s (a 14% delta) for 8M records suspicious but 26s (16% delta) for 10M
> not suspicious? These results seem fairly consistent actually.

Duh --- must have gotten the decimal point off in my mental arithmetic.

> I think what the original question was is "what work does this 10s represent".
> I'm curious too. Is it really just 10 million times the cpu cycles necessary
> to dispatch a call to the count() aggregate state function?

Well, it's 10 mil cycles of the aggregate plan node, which is going to
involve rather more work than just the aggregate function call.  But the
last time I profiled COUNT(), it seemed that the pallocs/pfrees needed
for the int8 counter were the largest single chunk of CPU time.

Something I've wanted to do for awhile is to look at making int8 and
float8 be pass-by-value datatypes on machines where Datum is naturally
8 bytes (ie, any 8-byte-pointer architecture).  I doubt it would be a
win to widen Datum on 32-bit machines, though; the distributed costs
would swamp the advantage from making these datatypes more efficient.

            regards, tom lane

Re: why does count take so long?

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Something I've wanted to do for awhile is to look at making int8 and
> float8 be pass-by-value datatypes on machines where Datum is naturally
> 8 bytes (ie, any 8-byte-pointer architecture).  I doubt it would be a
> win to widen Datum on 32-bit machines, though; the distributed costs
> would swamp the advantage from making these datatypes more efficient.

Things like count(*) could use int4 until it overflows though.
Is int4 a pass-by-value datatype on 32-bit machines?

--
greg

Re: why does count take so long?

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> Things like count(*) could use int4 until it overflows though.

I don't see a reasonable way for an aggregate to change state datatype
on the fly; otherwise this would be a great solution.

> Is int4 a pass-by-value datatype on 32-bit machines?

Yeah.  Not too long ago, count() used int4, but we got some complaints
about it overflowing on big tables.

            regards, tom lane

Re: why does count take so long?

From: Tom Lane
I said:
> Greg Stark <gsstark@mit.edu> writes:
>> Things like count(*) could use int4 until it overflows though.

> I don't see a reasonable way for an aggregate to change state datatype
> on the fly; otherwise this would be a great solution.

On the other hand, the cost is imposed by the generic aggregate
definition that says the aggregate state transition function is an
ordinary function.  This is fine for user-defined aggregates, but there
is no law that says that all the built-in aggregates must use that same
API.  We could probably invent some API that allows COUNT(*) to keep its
running count someplace where it needn't be re-palloc'd on every cycle.
Something to think about for 7.5 (too late for 7.4 I fear).

            regards, tom lane

Re: why does count take so long?

From: Joseph Shraibman
Tom Lane wrote:

> Unless Joseph's machine has a way better IO-to-CPU ratio than my little
> development machine, there's something odd about his numbers.
>
>             regards, tom lane
Doing some math:

81372.11 / 71542.53 : 1.1373949174008804
189865.81 / 163833.61 : 1.1588941365572059

So I have count() taking 14% over the seqscan, and you have 16%.


Re: why does count take so long?

From: Jean-Luc Lachance
How about keeping counts of inserts, deletes and updates per table per
transaction as part of the live statistics?



Tom Lane wrote:
>
> I said:
> > Greg Stark <gsstark@mit.edu> writes:
> >> Things like count(*) could use int4 until it overflows though.
>
> > I don't see a reasonable way for an aggregate to change state datatype
> > on the fly; otherwise this would be a great solution.
>
> On the other hand, the cost is imposed by the generic aggregate
> definition that says the aggregate state transition function is an
> ordinary function.  This is fine for user-defined aggregates, but there
> is no law that says that all the built-in aggregates must use that same
> API.  We could probably invent some API that allows COUNT(*) to keep its
> running count someplace where it needn't be re-palloc'd on every cycle.
> Something to think about for 7.5 (too late for 7.4 I fear).
>
>                         regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

Re: why does count take so long?

From: Dennis Gearon
Is improving count(*) <a lot if possible> on one of the TODO lists?

Jean-Luc Lachance wrote:

> How about keeping counts of inserts, deletes and updates per table per
> transaction as part of the live statistics?
>
> Tom Lane wrote:
>
>> I said:
>>
>>> Greg Stark <gsstark@mit.edu> writes:
>>>
>>>> Things like count(*) could use int4 until it overflows though.
>>>
>>> I don't see a reasonable way for an aggregate to change state datatype
>>> on the fly; otherwise this would be a great solution.
>>
>> On the other hand, the cost is imposed by the generic aggregate
>> definition that says the aggregate state transition function is an
>> ordinary function.  This is fine for user-defined aggregates, but there
>> is no law that says that all the built-in aggregates must use that same
>> API.  We could probably invent some API that allows COUNT(*) to keep its
>> running count someplace where it needn't be re-palloc'd on every cycle.
>> Something to think about for 7.5 (too late for 7.4 I fear).
>>
>>                         regards, tom lane


Re: why does count take so long?

From: "scott.marlowe"
The problem with that approach is that while there may only be one or two
tables that you want to do max(field) on, it would require that all tables
be kept track of, increasing overhead.  Keep in mind, select id from table
order by id desc limit 1 is a snap, hits the indexes, and runs fast.

Also, I don't think postgresql keeps "live" statistics, only "dead" ones,
i.e. they are only gathered by running analyze.  If you just need an
approximate count, you can already get that from the statistics table.

Now, if there were a simple way of assigning such behaviour to a table
(i.e. with count or something like that) via a trigger, that might be
worth looking into, but keep in mind, it creates a serious bottleneck
during inserts/updates/deletes.

On Tue, 9 Sep 2003, Jean-Luc Lachance wrote:

> How about keeping counts of inserts, deletes and updates per table per
> transaction as part of the live statistics?
>
>
>
> Tom Lane wrote:
> >
> > I said:
> > > Greg Stark <gsstark@mit.edu> writes:
> > >> Things like count(*) could use int4 until it overflows though.
> >
> > > I don't see a reasonable way for an aggregate to change state datatype
> > > on the fly; otherwise this would be a great solution.
> >
> > On the other hand, the cost is imposed by the generic aggregate
> > definition that says the aggregate state transition function is an
> > ordinary function.  This is fine for user-defined aggregates, but there
> > is no law that says that all the built-in aggregates must use that same
> > API.  We could probably invent some API that allows COUNT(*) to keep its
> > running count someplace where it needn't be re-palloc'd on every cycle.
> > Something to think about for 7.5 (too late for 7.4 I fear).
> >
> >                         regards, tom lane
> >
>


Re: why does count take so long?

From: Joseph Shraibman
Tom Lane wrote:

> Something to think about for 7.5 (too late for 7.4 I fear).

What about 7.4.1?  This wouldn't affect how data is stored on disk,
which is what would keep an upgrade out of a minor release, right?


Re: why does count take so long?

From: Doug McNaught
Joseph Shraibman <jks@selectacast.net> writes:

> Tom Lane wrote:
>
> > Something to think about for 7.5 (too late for 7.4 I fear).
>
> What about 7.4.1?  This wouldn't affect how data is stored on disk,
> which is what would keep an upgrade out of a minor release, right?

Minor releases are basically bugfix-only for PG.

-Doug

Re: why does count take so long?

From: Bruce Momjian
Joseph Shraibman wrote:
> Tom Lane wrote:
>
> > Something to think about for 7.5 (too late for 7.4 I fear).
>
> What about 7.4.1?  This wouldn't affect how data is stored on disk,
> which is what would keep an upgrade out of a minor release, right?

We only do bug fixes in minor releases like 7.4.1.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: why does count take so long?

From: Christopher Browne
In the last exciting episode, jllachan@nsd.ca (Jean-Luc Lachance) wrote:
> How about keeping counts of inserts, deletes and updates per table per
> transaction as part of the live statistics?

Aye, there's the rub.

That's exactly what _won't_ work, and that's exactly the case that is
somewhat pathological under MVCC.

With MVCC, data "appears as if by magic" when its transaction COMMITs,
thereby revealing the rows to the world.

Unfortunately, there's no simple way of making updates to counts
"simply appear" when they take effect, not without turning the updates
into a concurrency bottleneck.

Here's a bit of a wild thought...

Assume a table with schema as follows:
create table pg_count (
  xid integer,   ---  Actually, it's not an integer, right, Andrew? :-(
  reltype oid,
  count integer
);

Every transaction, "xid," affects some number of tuples.  So, for a
transaction #2134151 that adds 5 rows to the table with oid 12345 and
deletes 4 rows from the table with oid 45678, part of the transaction
would include inserting these rows:

  insert into pg_count (xid, reltype, count) values (2134151, 12345, 5);
  insert into pg_count (xid, reltype, count) values (2134151, 45678, -4);

In order to get the count for table 12345, you could then go to
pg_count and request:
  select sum(count) from pg_count where reltype = 12345;

The size of this table would increase every time a transaction gets
committed, so presumably part of VACUUM TABLE would be a
collection/summarization, thus...

  -- Collect existing stats into 1 row
  insert into pg_count(xid, reltype, count)
    values (currxid, currtable,
      (select sum(count) from pg_count where reltype = currtable));

  -- Delete the old stats
  delete from pg_count where reltype = currtable and xid <> currxid;

This will cope with concurrency reasonably well (modulo having to make
sure that the "collect/delete" transaction is a serialized one).

Unfortunately, if a table is updated frequently, the summary
  select sum(count) from pg_count where reltype = 12345;
will have to collect together quite a large number of entries, which
makes this "less cheap."

That suggests an optimization; any time the COUNT is selected, the old
stats can and should be collected into 1 row and the old data deleted.
--
wm(X,Y):-write(X),write('@'),write(Y). wm('aa454','freenet.carleton.ca').
http://www3.sympatico.ca/cbbrowne/nonrdbms.html
Sturgeon's Law: 90% of *EVERYTHING* is crud.

Re: why does count take so long?

From: Jean-Luc Lachance
Christopher,

You did not quite understand.

The counts in question are the actual counts (deltas) for the
transactions.

The tuple should be something like:
xid, reltype, insert_count, update_count, delete_count

When a COUNT(*) is issued, the committed tuples are totaled up, stored as
xid 0 or whatever, and the committed tuples deleted (all this with
appropriate locks).

I am not sure if there is a need for the update count.
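As a minimal sketch, the tuple described above might look like this (column names and types are illustrative only; as noted earlier in the thread, the real transaction-id type is not integer):

```sql
create table pg_count (
    xid          integer,  -- really the transaction-id type, not integer
    reltype      oid,
    insert_count integer,
    update_count integer,  -- possibly unnecessary, per the above
    delete_count integer
);
```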

JLL


Christopher Browne wrote:
>
> In the last exciting episode, jllachan@nsd.ca (Jean-Luc Lachance) wrote:
> > How about keeping counts of inserts, deletes and updates per table per
> > transaction as part of the live statistics?
>
> Aye, there's the rub.
>
> That's exactly what _won't_ work, and that's exactly the case that is
> somewhat pathological under MVCC.
>
> With MVCC, data "appears as if by magic" when its transaction COMMITs,
> thereby revealing the rows to the world.
>
> Unfortunately, there's no simple way of making updates to counts
> "simply appear" when they take effect, not without turning the updates
> into a concurrency bottleneck.
>
> Here's a bit of a wild thought...
>
> Assume a table with schema as follows:
> create table pg_count (
>   xid integer,   ---  Actually, it's not an integer, right, Andrew? :-(
>   reltype oid,
>   count integer
> );
>
> Every transaction, "xid," affects some number of tuples.  So that for
> a transaction, #2134151 that adds 5 rows to table with oid 12345 and deletes 4
> rows from table with 45678, part of the transaction would include
> inserting these rows:
>
>   insert into pg_count (xid, reltype, count) values (2134151, 12345, 5);
>   insert into pg_count (xid, reltype, count) values (2134151, 45678, -4);
>
> In order to get the count for table 12345, you could then go to
> pg_count and request:
>   select sum(count) from pg_count where reltype = 12345;
>
> The size of this table would increase every time a transaction gets
> committed, so presumably part of VACUUM TABLE would be a
> collection/summarization, thus...
>
>   -- Collect existing stats into 1 row
>   insert into pg_count(xid, reltype, count)
>     values (currxid, currtable,
>       (select sum(count) from pg_count where reltype = currtable));
>
>   -- Delete the old stats
>   delete from pg_count where reltype = currtable and xid <> currxid;
>
> This will cope with concurrency reasonably well (modulo having to make
> sure that the "collect/delete" transaction is a serialized one).
>
> Unfortunately, if a table is updated frequently, the summary
>   select sum(count) from pg_count where reltype = 12345;
> will have to collect together quite a large number of entries, which
> makes this "less cheap."
>
> That suggests an optimization; any time the COUNT is selected, the old
> stats can and should be collected into 1 row and the old data deleted.
> --
> wm(X,Y):-write(X),write('@'),write(Y). wm('aa454','freenet.carleton.ca').
> http://www3.sympatico.ca/cbbrowne/nonrdbms.html
> Sturgeon's Law: 90% of *EVERYTHING* is crud.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

Re: why does count take so long?

From: "Jim C. Nasby"
On Mon, Sep 08, 2003 at 05:47:48PM -0400, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > Things like count(*) could use int4 until it overflows though.
>
> I don't see a reasonable way for an aggregate to change state datatype
> on the fly; otherwise this would be a great solution.

What about estimating which one you'll need based on statistics? Yeah, it
might fail sometimes, but see my next comment...

> > Is int4 a pass-by-value datatype on 32-bit machines?
>
> Yeah.  Not too long ago, count() used int4, but we got some complaints
> about it overflowing on big tables.

MS SQL (and I believe some other databases) provides COUNT_BIG(), which is
int8, for this very reason.  Many times you *know* you won't have more
rows than an int4 can hold (i.e., if you're using an int4 as a PK), so
there's no reason to pay the overhead associated with an int8.

Maybe have count4() and count8(), and have count() guess which one it
should use based on statistics?
--
Jim C. Nasby, Database Consultant                  jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"