Feature request for count_estimate(samplesize) aggregate or SAMPLE keyword

From: Torge
Subject: Feature request for count_estimate(samplesize) aggregate or SAMPLE keyword
Msg-id: 0e8b1eb9-97bf-0ddf-a487-ac4e4f6e0186@posteo.de
Responses: Re: Feature request for count_estimate(samplesize) aggregate or SAMPLE keyword  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general
Dear Postgres Devs,

I use Postgres everywhere I can, and for me it is by far the best 
database system out there. Great job, thank you!

Now I would like to humbly propose a feature that provides an easy way to 
get a quick count estimate for any condition - index-backed or not - 
based on a random sample of rows, without requiring a custom function 
or a complex SQL statement.

The final syntax could be for example:

     SELECT count_estimate(2000) FROM my_table WHERE ...

or, since peerce in the chat pointed out that aggregate functions are 
only evaluated after the WHERE condition has already been applied:

     SAMPLE count(*) FROM my_table WHERE ... SAMPLESIZE 2000

The idea of this function is to estimate the total number of rows by 
checking a random sample of them against the WHERE condition:

1. Get a quick but as accurate as possible estimate of the total number 
of rows (rowcount)
2. Select a random sample of rows (samplesize)
3. Check how many of them match the WHERE condition (hits)
4. Calculate and return the estimate as rowcount * hits / samplesize

The higher the sample size, the more precise the count, so the 
developer can decide on the tradeoff between accuracy and speed.
Of course this has its limitations if the sample size is so small, or the 
actual matches so few, that only a few hits - if any at all - make it 
into the sample.
So maybe a min_hits value should be considered as well, so the sample 
size is increased until at least that many hits are found or an exact 
count has been produced (or until an optional max_samplesize is reached).
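
For comparison, something in this direction can already be roughly 
approximated with the existing TABLESAMPLE clause, although it takes a 
percentage of the table rather than an absolute sample size and offers 
nothing like min_hits. A minimal sketch, assuming a hypothetical status 
column as the filter:

     -- BERNOULLI (1) reads roughly 1% of the rows, so the matching count
     -- is scaled back up by 100; SYSTEM would be faster but samples whole
     -- pages, which skews the result for clustered data
     SELECT count(*) * 100 AS estimated_count
     FROM my_table TABLESAMPLE BERNOULLI (1)
     WHERE status = 'active'; -- stands in for any WHERE condition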

Typical scenario: highly flexible searching/filtering of data with a 
paging mechanism and a "total results" preview.
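
To illustrate, in that scenario the two queries per page could look 
roughly like this (hypothetical usage, assuming the proposed 
count_estimate() aggregate existed):

     SELECT count_estimate(2000) FROM my_table WHERE ...;               -- "total results" preview
     SELECT * FROM my_table WHERE ... ORDER BY my_id LIMIT 50 OFFSET 0; -- first page of results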

I think this would find a lot of fans. When I hit the famous count(*) 
performance issue once more, I again read a lot of comments, Stack 
Overflow answers and mailing list threads discussing it, along with 
possible workarounds and solutions, each more complex than the last.
I ended up writing my own SQL statement that achieves the above, and I 
think it meets the requirements of many who started those discussions, 
but I would really love to see this become a supported feature so it is 
easier to use and most likely far more performant.


Greetings,
      Torge.




PS:

Here is my current SQL-level statement to achieve this. Due to limitations 
I couldn't work around, it works slightly differently: instead of the row 
count it uses the max ID of the table, which of course has to exist for 
this to work at all.
Using reltuples was suggested, but on my system it is already often off 
by more than 10%, and in that case I don't know how to select random rows 
(efficiently), while I can easily generate random IDs in the space of 
possible IDs.
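
For reference, the reltuples figure I mean is the planner statistic in 
pg_class, which can be read with something like:

     SELECT reltuples::bigint AS estimated_rowcount
     FROM pg_class
     WHERE oid = 'my_table'::regclass;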

My statement returns an estimate in ~1 second on a 100M-row table with a 
WHERE condition that is not supported by an index and matches 10M 
entries; the estimate is roughly 5% off, while an exact count takes over 
20 seconds.

WITH vars AS (
     SELECT (SELECT my_id FROM my_table ORDER BY my_id DESC LIMIT 1) AS max_id, --the maximum ID currently used in this table
            2000 AS sample_size --the number of entries to sample; the higher, the more precise the estimate
),
random_rows AS ( --collect sample_size randomly probed rows
     SELECT *
     FROM my_table
     WHERE my_id = ANY (ARRAY(
         SELECT (random() * (SELECT max_id FROM vars))::int --a random id between 0 and max_id
         FROM GENERATE_SERIES(1, (SELECT sample_size FROM vars)) --generate sample_size rows
     ))
)
SELECT (SELECT max_id FROM vars) * count(*) / (SELECT sample_size FROM vars)::int AS estimated_count
FROM random_rows
WHERE ... --Any where condition possible

There was a concern mentioned in chat that the ID sequence might have 
holes, especially as the sequence is not reset on failed transactions or 
when rows are deleted. But this should not matter much: it reduces the 
accuracy, which can be countered by a bigger sample size. Also, a min_id 
could easily be added for cases where old rows get deleted, as sketched below.
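
A sketch of that min_id variant, assuming the same my_table/my_id names 
as above (only the vars CTE, the random-id expression and the scaling 
factor change):

WITH vars AS (
     SELECT (SELECT my_id FROM my_table ORDER BY my_id ASC LIMIT 1) AS min_id,  --the lowest ID still present
            (SELECT my_id FROM my_table ORDER BY my_id DESC LIMIT 1) AS max_id, --the maximum ID currently used
            2000 AS sample_size
),
random_rows AS (
     SELECT *
     FROM my_table
     WHERE my_id = ANY (ARRAY(
         --probe random ids between min_id and max_id instead of 0 and max_id
         SELECT (SELECT min_id FROM vars)
                + (random() * ((SELECT max_id FROM vars) - (SELECT min_id FROM vars)))::int
         FROM GENERATE_SERIES(1, (SELECT sample_size FROM vars))
     ))
)
--the id space shrinks to (max_id - min_id), so that becomes the scaling factor
SELECT ((SELECT max_id FROM vars) - (SELECT min_id FROM vars)) * count(*)
       / (SELECT sample_size FROM vars) AS estimated_count
FROM random_rows
WHERE ... --Any where condition possible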

Example:

table has 100M entries
Condition would match 10M entries
highest ID is 300.000.000, so only 1/3rd of the IDs really exist in the table
Sample size: 3000

We try to fetch 3000 rows by probing random IDs, resulting in roughly 
1000 actual rows, let's say 1012
We compare those rows against our WHERE condition and 96 of them match

The resulting estimate is maxId * hits / samplesize = 300.000.000 * 96 / 3000 = 9.600.000

The holes, no matter where they are, reduce the precision but are not a 
general problem.
This becomes clearer if we first calculate the estimated number of 
existing rows from the sampled data and then continue from there, which 
in the end returns the same result:

actualRowEstimate = maxId * fetchedRows / samplesize = 300.000.000 * 1012 / 3000 = 101.200.000
estimate = actualRowEstimate * hits / fetchedRows = 101.200.000 * 96 / 1012 = 9.600.000  => same number

Because:
estimate = actualRowEstimate * hits / fetchedRows
= (maxId * fetchedRows / samplesize) * hits / fetchedRows
= maxId * hits / samplesize



