HashAggregate slower than sort?

От: Jatinder Sangha
Тема: HashAggregate slower than sort?
Дата: ,
Msg-id: 519DE98D62C54D4A896E22A026CF13227A201F@cdlon-ex.CoalitionDev.local
(см: обсуждение, исходный текст)
Ответы: Re: HashAggregate slower than sort?  ("Kevin Grittner")
Список: pgsql-performance

Скрыть дерево обсуждения

HashAggregate slower than sort?  ("Jatinder Sangha", )
 Re: HashAggregate slower than sort?  ("Kevin Grittner", )
 Re: HashAggregate slower than sort?  ("Jatinder Sangha", )
  Re: HashAggregate slower than sort?  ("Kevin Grittner", )

Hi,

I've noticed something that I find strange with the hash-aggregate feature of Postgres. I'm currently running Postgres v8.4.1 on Debian Linux 64-bit.

I have a simple query that when planned either uses hash-aggregates or a sort depending on the amount of working memory available. The problem is that when it uses the hash-aggregates, the query runs 25% slower than when using the sort method.

The table in question contains about 60 columns, many of which are boolean, 32-bit integers and some are 64-bit integers. Many fields are text - and some of these can be quite long (eg 32Kb).

The SQL is as follows:

explain analyse
select distinct T1.*
  from role T1
 where T1.endDate is null and T1.latest=true and T1.active=true and
       T1.deceased=false and T1.desk in (BIG LIST OF INTEGERS);

select version() --> "PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-10), 64-bit"

show enable_hashagg --> "on"
set work_mem='8MB'
show work_mem --> "8MB"

Explain analyse of the SQL above:
Unique  (cost=47033.71..48410.27 rows=8881 width=1057) (actual time=18.803..38.969 rows=6449 loops=1)
  ->  Sort  (cost=47033.71..47055.91 rows=8881 width=1057) (actual time=18.801..20.560 rows=6449 loops=1)
        Sort Key: id, version, latest, active, deceased, person, formalnotes, informalnotes, description, desk, rolelevel, roletype, promotiondate, primaryrole, headofplace, careergrading, startdate, enddate, percentsalary, deskf, rolelevelf, roletypef, promotiondatef, primaryrolef, headofplacef, careergradingf, startdatef, enddatef, percentsalaryf, descriptionf, deskmv, rolelevelmv, roletypemv, promotiondatemv, primaryrolemv, headofplacemv, careergradingmv, startdatemv, enddatemv, percentsalarymv, descriptionmv, hasattachments, hasrelationships, hasprojects, audwho, audwhen, audcreated, costcentre, reportsto, manages, startdateest, enddateest, hasstarperformers, projectnames, sourcefrom, sourceto, checkedwho, checkedwhen, checkednotes, hasqueries, querytitles

        Sort Method:  quicksort  Memory: 2001kB
        ->  Bitmap Heap Scan on role t1  (cost=4888.59..42321.27 rows=8881 width=1057) (actual time=7.041..12.504 rows=6449 loops=1)

              Recheck Cond: (desk = ANY ('BIG LIST OF INTEGERS'::bigint[]))
              Filter: ((enddate IS NULL) AND latest AND active AND (NOT deceased))
              ->  Bitmap Index Scan on role_ix2  (cost=0.00..4886.37 rows=10984 width=0) (actual time=6.948..6.948 rows=9296 loops=1)

                    Index Cond: ((latest = true) AND (active = true) AND (deceased = false) AND (desk = ANY ('BIG LIST OF INTEGERS'::bigint[])))

Total runtime: 40.777 ms

This execution of the query used a sort to perform the "distinct".

Now for the second run:

select version() --> "PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-10), 64-bit"

show enable_hashagg --> "on"
set work_mem='64MB'
show work_mem --> "64MB"

Explain analyse of the SQL above:
HashAggregate  (cost=43675.63..43764.44 rows=8881 width=1057) (actual time=46.556..55.694 rows=6449 loops=1)
  ->  Bitmap Heap Scan on role t1  (cost=4888.59..42321.27 rows=8881 width=1057) (actual time=7.179..13.023 rows=6449 loops=1)

        Recheck Cond: (desk = ANY ('BIG LIST OF INTEGERS'::bigint[]))
        Filter: ((enddate IS NULL) AND latest AND active AND (NOT deceased))
        ->  Bitmap Index Scan on role_ix2  (cost=0.00..4886.37 rows=10984 width=0) (actual time=7.086..7.086 rows=9296 loops=1)

              Index Cond: ((latest = true) AND (active = true) AND (deceased = false) AND (desk = ANY ('BIG LIST OF INTEGERS'::bigint[])))

Total runtime: 57.536 ms


I've tested this with v8.4.4 as well with the same results. I also tested the same query with our previous production version of Postgres (v8.3.8) and that version only appears to use sorting not hash-aggregates.

Obviously, I can re-write the query to use a "distinct on (...)" clause to improve performance - which is what I've done, but my question is: Why is the hash-aggregate slower than the sort?

Is it something to do with the number of columns? ie. When sorting, the first few columns defined on the table (id, version) make the row unique - but when using the hash-aggregate feature, presumably every column needs to be hashed which takes longer especially for long text fields?

Thanks,
--Jatinder

Coalition Development Ltd 1st  Floor, One Newhams Row, London, United Kingdom, SE1 3UZ

Registration Number - 04328897 Registered Office - Direct Control 3rd Floor, Marvic House, Bishops Road, London, United Kingdom, SW6 7AD


В списке pgsql-performance по дате сообщения:

От: Josh Berkus
Дата:
Сообщение: Re: PostgreSQL as a local in-memory cache
От: Scott Carey
Дата:
Сообщение: Re: requested shared memory size overflows size_t