Re: Slow count(*) again...

From: Reid Thompson
Subject: Re: Slow count(*) again...
Date:
Msg-id: 4CB1D588.4080503@ateb.com
In reply to: Re: Slow count(*) again...  (Neil Whelchel <neil.whelchel@gmail.com>)
Responses: Re: Slow count(*) again...
List: pgsql-performance
  On 10/10/2010 6:29 AM, Neil Whelchel wrote:
> On Saturday 09 October 2010 23:56:15 Craig Ringer wrote:
>> On 10/10/2010 11:02 AM, Neil Whelchel wrote:
>>> On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote:
>>>> On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel <neil.whelchel@gmail.com>
>>>> wrote:
>>>>> I know that there have been many discussions on the slowness of
>>>>> count(*) even when an index is involved, because the visibility of the
>>>>> rows has to be checked. In the past I have seen many suggestions about
>>>>> using triggers and tables to keep track of counts, and while this works
>>>>> fine in a situation where you know what the report is going to be
>>>>> ahead of time, it is simply not an option when an unknown WHERE
>>>>> clause is to be used (dynamically generated). I ran into a fine
>>>>> example of this when I was searching this mailing list: "Searching in
>>>>> 856,646 pages took 13.48202 seconds. Site search powered by PostgreSQL
>>>>> 8.3." Obviously at some point count(*) came into play here, because the
>>>>> site made a list of pages (1 2 3 4 5 6 > next). I very commonly make a
>>>>> list of pages from search results, and the biggest time killer here is
>>>>> the count(*) portion. Even worse, I sometimes have to hit the
>>>>> database with two SELECT statements: one with OFFSET and LIMIT to get
>>>>> the page of results I need, and another to get the total number of rows
>>>>> so I can estimate how many pages of results are available. The point I
>>>>> am driving at here is that since building a list of pages of results
>>>>> is such a common thing to do, there needs to be a specific high-speed
>>>>> way to do this in one query. Maybe an estimate(*) that works
>>>>> like count but gives an answer from the index without checking
>>>>> visibility? I am sure that this would be good enough to make a page
>>>>> list; it is really no big deal if it errs on the positive side,
>>>>> maybe giving the list of pages an extra page off the end. I can live
>>>>> with that. What I can't live with is taking 13 seconds to get a page
>>>>> of results from 850,000 rows in a table.
> Good point; maybe this is turning into a discussion of how to generate a
> list of pages of results and one page of results with a single query, so we
> don't have to run the same painfully slow query twice for such a common task.
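
For what it's worth, on PostgreSQL 8.4 and later a window aggregate can return
the full match count alongside one page of results in a single query, since
window functions are evaluated before LIMIT/OFFSET take effect. A minimal
sketch (table and column names here are made up):

SELECT id, col,
       count(*) OVER () AS total_rows  -- total matches, repeated on every row
FROM some_table
WHERE col > 50
ORDER BY col
LIMIT 50 OFFSET 100;

It still has to find every matching row in order to count them, but at least
the filter runs only once instead of twice.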
>
> On the other hand, I copied a table out of one of my production servers that
> has about 60,000 rows with 6 columns (numeric, numeric, bool, bool, timestamp,
> text). The first numeric column has numbers evenly spread between 0 and 100
> and it is indexed. I put the table in a pair of database servers both running
> on the same physical hardware. One server is Postgres, the other is a popular
> server (I am not mentioning names here). On Postgres, SELECT count(*) FROM
> table WHERE column > 50; takes about 8 seconds to run. The other database
> server took less than one second (about 25 ms), as it is (I assume) using
> the index to come up with the result. It is true that this is not a fair
> test, because both servers were tested with their default settings and the
> defaults for Postgres are much more conservative; however, I don't think
> any amount of settings tweaking will bring them even into the same
> ballpark. There has been discussion about the other server returning an
> incorrect count because all of the indexed rows may not be live at the
> time. This is not a problem for the intended use; that is why I suggested
> another function like estimate(*). Its name suggests that the result will
> be close, not 100% correct, which is plenty good enough for generating a
> list of results pages in most cases. I am faced with a very serious problem
> here. If the query to make a list of pages takes, say, 6 seconds and it
> takes another 6 seconds to generate a page of results, the customer is
> waiting 12 seconds. This is not going to work. If count gave a quick
> estimate, in less than a second, and it took 6 seconds to come up with the
> actual results, I could live with that. Or if the query that produces the
> window of results via OFFSET and LIMIT also returned the total number of
> rows that would have matched, then I would still have everything I need to
> render the page in a reasonable time. I really think this needs to be addressed
> somewhere. It's not like I am the only one who does this. You see it nearly
> everywhere a long list of results is (or is expected to be) returned on a
> web site. Among the people I work with, this seems to be the most frequently
> mentioned reason they give for not using Postgres in their projects.
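
An estimate(*) of sorts can already be faked by scraping the planner's row
estimate out of EXPLAIN. A rough sketch in PL/pgSQL (the function name is made
up, and the answer is only as good as the table's statistics):

CREATE OR REPLACE FUNCTION count_estimate(q text) RETURNS bigint AS $$
DECLARE
    rec record;
    n   bigint;
BEGIN
    -- grab the rows= figure from the top line of the plan
    FOR rec IN EXECUTE 'EXPLAIN ' || q LOOP
        n := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN n IS NOT NULL;
    END LOOP;
    RETURN n;
END;
$$ LANGUAGE plpgsql;

-- e.g. SELECT count_estimate('SELECT 1 FROM table WHERE column > 50');

It answers in milliseconds regardless of table size, though it can err in
either direction, not just on the positive side.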
>
> It would be nice to see how the server comes up with the search results and
> list of links to pages of results for this mailing list.
> (http://search.postgresql.org/search?q=slow+count%28%29&m=1&l=&d=365&s=r) I am
> guessing that it probably uses the count and query method I am talking about.
>
>> I guess there could be *some* way to expose the query planner's cost
>> estimates in a manner useful for result count estimation ... but given
>> how coarse its stats are and how wildly out the estimates can be, I kind
>> of doubt it. It's really intended for query planning decisions and more
>> interested in orders of magnitude, "0, 1, or more than that" measures,
>> etc, and seems to consider 30% here or there to be pretty insignificant
>> most of the time.
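
The coarse stats Craig mentions are visible directly in the catalogs; for a
whole-table count, the planner's stored estimate can be read without touching
the heap at all (it is only as fresh as the last VACUUM/ANALYZE):

SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'my_production_table';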
>>
>>> It's bad enough that count(*) is slow; then you have to do it all over
>>> again to get the results you need! I have not dug into this much yet,
>>> but would it be possible to return the number of rows that a WHERE
>>> clause would match if the LIMIT and OFFSET were not applied? I.e.,
>>> when a normal query is executed, the server returns the number of
>>> rows alongside the actual row data. Would it be a big deal to modify
>>> this so it can also return the number of rows before the LIMIT and
>>> OFFSET are applied?
>> It'd force the server to fully execute the query. Then again, it sounds
>> like you're doing that anyway.
How big is your DB?
How fast is your disk access?
Any chance the disks or RAM can be upgraded?

My disk access is pitiful: the first run over 2.3 million rows took 35.38 s;
subsequent (cached) runs take about 2.55 s.

rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
   count
---------
  2340704
(1 row)


real    0m35.38s
user    0m0.25s
sys     0m0.03s

Subsequent runs (the count changes due to ongoing inserts):

rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
   count
---------
  2363707
(1 row)


real    0m2.70s
user    0m0.27s
sys     0m0.02s
rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
   count
---------
  2363707
(1 row)


real    0m2.55s
user    0m0.26s
sys     0m0.02s
rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
   count
---------
  2363707
(1 row)


real    0m2.50s
user    0m0.26s
sys     0m0.02s

reporting=# SELECT pg_size_pretty(pg_total_relation_size('my_production_table'));
  pg_size_pretty
----------------
  1890 MB
(1 row)
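
Whether the drop from 35 s to ~2.5 s is purely a cache effect can be confirmed
with EXPLAIN (ANALYZE, BUFFERS), available from PostgreSQL 9.0: "shared read"
counts blocks fetched from outside PostgreSQL's buffer cache, while "shared
hit" counts blocks that were already cached.

EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM my_production_table;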

