improving GROUP BY estimation

From: Tomas Vondra
Subject: improving GROUP BY estimation
Date:
Msg-id: 56CD0381.5060502@2ndquadrant.com
Responses: Re: improving GROUP BY estimation  (Mark Dilger <hornschnorter@gmail.com>)
List: pgsql-hackers
Hi,

while working on using correlation statistics when estimating GROUP BY
(or generally whatever uses estimate_num_groups), I've noticed that we
apply the selectivity in a rather naive way.

After estimating the number of groups in the whole table - either by
reading pg_stats.n_distinct for a single column, or by multiplying (and
doing some additional magic) for a group of columns - we do this:

    reldistinct *= rel->rows / rel->tuples;

which essentially applies the selectivity to the ndistinct estimate.

So when you have a table with 1000 distinct values in a column, and we
read 10% of the table, we expect to get 100 distinct values.

But consider, for example, a table with 100M rows where the column has
n_distinct=10000 and we're reading 1% of the table (i.e. 1M rows). The
current code will estimate the GROUP BY cardinality to be 100 (1% of
n_distinct), but in reality we're far more likely to see all 10000
distinct values.
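
To illustrate the claim, here's a quick standalone simulation (not
planner code, and assuming uniformly distributed values): it draws 1M
values at random from 10000 distinct ones and counts how many distinct
values actually show up. The naive scaling would predict only 100.

#include <stdio.h>
#include <stdlib.h>

#define NDISTINCT   10000
#define SAMPLE_ROWS 1000000

int
main(void)
{
    static char seen[NDISTINCT];
    long        distinct = 0;

    srand(42);

    for (long i = 0; i < SAMPLE_ROWS; i++)
    {
        int v = rand() % NDISTINCT;

        if (!seen[v])
        {
            seen[v] = 1;
            distinct++;
        }
    }

    /* prints a number very close to 10000, not 100 */
    printf("distinct values seen: %ld\n", distinct);

    return 0;
}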

Example:

CREATE TABLE t (a INT, b FLOAT);
INSERT INTO t SELECT (100000 * random())::int, random()
                 FROM generate_series(1,10000000) s(i);
ANALYZE t;

SELECT n_distinct FROM pg_stats WHERE attname = 'a';

  n_distinct
------------
       98882

-- 1% of the table
EXPLAIN ANALYZE SELECT a FROM t WHERE b < 0.01 GROUP BY a;

                                QUERY PLAN
-------------------------------------------------------------------------
  HashAggregate  (cost=179514.33..179523.45 rows=912 width=4)
                 (actual time=916.224..955.136 rows=63492 loops=1)
    Group Key: a
    ->  Seq Scan on t  (cost=0.00..179053.25 rows=92216 width=4)
                       (actual time=0.103..830.104 rows=100268 loops=1)
          Filter: (b < '0.01'::double precision)
          Rows Removed by Filter: 9899732
  Planning time: 1.237 ms
  Execution time: 982.600 ms
(7 rows)

-- 5% of the table
EXPLAIN ANALYZE SELECT a FROM t WHERE b < 0.05 GROUP BY a;

                                 QUERY PLAN
-------------------------------------------------------------------------
  HashAggregate  (cost=180296.06..180345.22 rows=4916 width=4)
                 (actual time=1379.519..1440.005 rows=99348 loops=1)
    Group Key: a
    ->  Seq Scan on t  (cost=0.00..179053.25 rows=497123 width=4)
                       (actual time=0.096..1000.494 rows=500320 loops=1)
          Filter: (b < '0.05'::double precision)
          Rows Removed by Filter: 9499680
  Planning time: 0.129 ms
  Execution time: 1480.986 ms
(7 rows)
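
For illustration, here's a standalone sketch (not planner code) of the
arithmetic the current code performs for these two plans, plugging in
n_distinct = 98882, rel->tuples = 10M and the planner's estimated scan
row counts:

#include <stdio.h>

int
main(void)
{
    double reldistinct = 98882;         /* n_distinct of column "a" */
    double tuples = 10000000;           /* rel->tuples */
    double rows[] = {92216, 497123};    /* estimated scan rows (1% and 5%) */

    for (int i = 0; i < 2; i++)
    {
        /* reldistinct *= rel->rows / rel->tuples; */
        double groups = reldistinct * rows[i] / tuples;

        printf("rows=%.0f -> estimated groups = %.0f\n", rows[i], groups);
    }

    /*
     * prints ~912 and ~4916, i.e. the HashAggregate estimates above,
     * while the actual numbers of groups are 63492 and 99348
     */
    return 0;
}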

This gets especially bad when reading only a small fraction of a huge
table - it's easy to construct cases with arbitrarily large errors. And
it's worth noting that the error is almost exclusively a massive
under-estimate, which is the "wrong" direction, as HashAggregate is
still vulnerable to OOM (and the larger the table, the worse it gets).

So I think a more appropriate way to estimate this would be to compute
the expected number of distinct values when sampling with replacement,
as explained for example in [1]. With d distinct values and n sampled
rows, the probability that a particular value never gets picked is
((d-1)/d)^n, so the expected number of distinct values in the sample is
d * (1 - ((d-1)/d)^n).

Applied to the code, it means changing the line from

   reldistinct *= rel->rows / rel->tuples;

to

   reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

With this change, the estimates for the examples look like this:

                                QUERY PLAN
-------------------------------------------------------------------------
  HashAggregate  (cost=179283.79..179883.48 rows=59969 width=4)
                 (actual time=897.071..934.764 rows=63492 loops=1)
    Group Key: a
    ->  Seq Scan on t  (cost=0.00..179053.25 rows=92216 width=4)
                       (actual time=0.104..820.276 rows=100268 loops=1)
          Filter: (b < '0.01'::double precision)
          Rows Removed by Filter: 9899732
  Planning time: 1.261 ms
  Execution time: 962.812 ms
(7 rows)

and

                                 QUERY PLAN
-------------------------------------------------------------------------
  Group  (cost=232886.15..235371.76 rows=98234 width=4)
         (actual time=1519.545..2104.447 rows=99348 loops=1)
    Group Key: a
    ->  Sort  (cost=232886.15..234128.95 rows=497123 width=4)
              (actual time=1519.540..1838.575 rows=500320 loops=1)
          Sort Key: a
          Sort Method: external merge  Disk: 6824kB
          ->  Seq Scan on t  (cost=0.00..179053.25 rows=497123 width=4)
                      (actual time=0.090..1044.099 rows=500320 loops=1)
                Filter: (b < '0.05'::double precision)
                Rows Removed by Filter: 9499680
  Planning time: 0.133 ms
  Execution time: 2147.340 ms
(10 rows)
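
As a sanity check (again just a standalone sketch, not planner code),
plugging the same numbers as before into the proposed expression
reproduces the rows= estimates of the two plans above:

#include <stdio.h>
#include <math.h>

int
main(void)
{
    double reldistinct = 98882;         /* n_distinct of column "a" */
    double rows[] = {92216, 497123};    /* estimated scan rows (1% and 5%) */

    for (int i = 0; i < 2; i++)
    {
        /* reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows)); */
        double groups = reldistinct *
            (1 - pow((reldistinct - 1) / reldistinct, rows[i]));

        printf("rows=%.0f -> estimated groups = %.0f\n", rows[i], groups);
    }

    /* prints roughly 59969 and 98234, matching the two plans above */
    return 0;
}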

So much better. Clearly, there are cases where this will over-estimate
the cardinality - for example when the values are somehow correlated.

But I'd argue over-estimating is better because of the OOM issues in
Hash Aggregate.

regards

[1]

http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
