[GENERAL] Combining count() and row_number() as window functions

Поиск
Список
Период
Сортировка
От Thomas Kellerer
Тема [GENERAL] Combining count() and row_number() as window functions
Дата
Msg-id o5qd6k$2ni$2@blaine.gmane.org
обсуждение исходный текст
Ответы Re: [GENERAL] Combining count() and row_number() as window functions  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-general
I was playing around with a query that essentially looked something like this:

    select row_number() over (order by foo_date) as rn,
           count(*) over () as total_count,
           f.*
    from foo f;

(The actual query limits the output based on the row_number() for pagination purposes, but for this question this
turnedout to be irrelevant) 

I assumed that the count() wouldn't increase the runtime of the query as the result of the row_number() can be used to
calculatethat.  

However it turns out that using a scalar subquery to calculate the count() is faster despite the duplicated Seq Scan on
thetable: 

    select row_number() over (order by foo_date) as rn,
           (select count(*) from foo) as total_count,
           f.*
    from foo f

I used the following test setup:

    create table foo
    as
    select i,
           date '2000-01-01' + (random() * 17 * 365)::int as foo_date,
           'Some Text '||i as data
    from generate_series(1,500000) g(i);

    explain (analyze, verbose, buffers)
    select row_number() over (order by foo_date) as rn,
           count(*) over () as total_count,
           f.*
    from foo f;

    explain (analyze, verbose, buffers)
    select row_number() over (order by foo_date) as rn,
           (select count(*) from foo) as total_count,
           f.*
    from foo f;

This is the plan for the first query

WindowAgg  (cost=44191.13..49344.89 rows=429480 width=56) (actual time=1295.152..1491.642 rows=500000 loops=1)
  Output: (row_number() OVER (?)), count(*) OVER (?), i, foo_date, data
  Buffers: shared hit=2048 read=1531 dirtied=1531, temp read=5781 written=5078
  I/O Timings: read=26.022
  ->  WindowAgg  (cost=44191.13..47841.71 rows=429480 width=48) (actual time=611.887..987.440 rows=500000 loops=1)
        Output: foo_date, i, data, row_number() OVER (?)
        Buffers: shared hit=2048 read=1531 dirtied=1531, temp read=2123 written=2123
        I/O Timings: read=26.022
        ->  Sort  (cost=44191.13..45264.83 rows=429480 width=40) (actual time=611.872..723.216 rows=500000 loops=1)
              Output: foo_date, i, data
              Sort Key: f.foo_date
              Sort Method: external merge  Disk: 16976kB
              Buffers: shared hit=2048 read=1531 dirtied=1531, temp read=2123 written=2123
              I/O Timings: read=26.022
              ->  Seq Scan on stuff.foo f  (cost=0.00..4008.48 rows=429480 width=40) (actual time=0.064..158.561
rows=500000loops=1) 
                    Output: foo_date, i, data
                    Buffers: shared hit=2048 read=1531 dirtied=1531
                    I/O Timings: read=26.022
Planning time: 0.711 ms
Execution time: 1523.306 ms


and this is the plan for the second query:

WindowAgg  (cost=49273.31..52923.89 rows=429480 width=56) (actual time=660.543..1036.534 rows=500000 loops=1)
  Output: row_number() OVER (?), $0, f.i, f.foo_date, f.data
  Buffers: shared hit=7158, temp read=2123 written=2123
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=5082.18..5082.18 rows=1 width=8) (actual time=105.307..105.307 rows=1 loops=1)
          Output: count(*)
          Buffers: shared hit=3579
          ->  Seq Scan on stuff.foo  (cost=0.00..4008.48 rows=429480 width=0) (actual time=0.041..54.075 rows=500000
loops=1)
                Output: foo.i, foo.foo_date, foo.data
                Buffers: shared hit=3579
  ->  Sort  (cost=44191.13..45264.83 rows=429480 width=40) (actual time=555.216..663.021 rows=500000 loops=1)
        Output: f.foo_date, f.i, f.data
        Sort Key: f.foo_date
        Sort Method: external merge  Disk: 16976kB
        Buffers: shared hit=3579, temp read=2123 written=2123
        ->  Seq Scan on stuff.foo f  (cost=0.00..4008.48 rows=429480 width=40) (actual time=0.030..107.520 rows=500000
loops=1)
              Output: f.foo_date, f.i, f.data
              Buffers: shared hit=3579
Planning time: 0.134 ms
Execution time: 1065.572 ms

I uploaded both plans in case formatting breaks the above:

First query: https://explain.depesz.com/s/BT8y
Second query: https://explain.depesz.com/s/cbTm

The major contributor to the runtime is obviously the order by which is to be expected
But I am surprised that adding the count(*) in the first query adds additional work as from my perspective the count()
couldbe "derived" from the row_count().  

Is this a case of "just not implemented yet" or a case of "to expensive to optimize"?

This is on 9.6.1

Regards
Thomas





В списке pgsql-general по дате отправления:

Предыдущее
От: PAWAN SHARMA
Дата:
Сообщение: [GENERAL] Moving from 9.5 to 9.6
Следующее
От: Kenneth Marshall
Дата:
Сообщение: Re: [GENERAL] migrate Sql Server database to PostgreSql