Re: external sort performance

From: Jon Nelson
Subject: Re: external sort performance
Date:
Msg-id: CAKuK5J3ZTLKG4ozhtnoGpYoZDzXtaMtdXcCuUJj1-X1brb5hyg@mail.gmail.com
In reply to: Re: external sort performance  (Claudio Freire <klaussfreire@gmail.com>)
Responses: Re: external sort performance  (Jon Nelson <jnelson+pgsql@jamponi.net>)
List: pgsql-performance
I'll try to compile multiple questions/answers into a single response.

On Thu, Nov 17, 2011 at 11:16 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Nov 17, 2011 at 2:10 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>> What sorts of things should I be looking at to improve the performance
>> of this query? Is my interpretation of that log line totally off base?

> You'll have to post some more details.
> Like a query and an explain/explain analyze.

Please see below. However, I am also very interested to know whether
I'm interpreting that log line correctly.

> Memory consumption probably skyrockets since you'll need at least one
> sort per table, so if you have 100+, then that's (at least) 100+
> sorts.

Right, that much I had understood.


On Thu, Nov 17, 2011 at 11:28 AM, Craig James
<craig_james@emolecules.com> wrote:
> You don't give any details about how and why you are sorting. Are you
> actually using all of the columns in your aggregated-data table in the sort
> operation?  Or just a few of them?

> You're making the sort operation work with 175 GB of data.  If most of that
> data is only needed for the report (not the sort), then separate it into two
> tables - one of just the data that the sorting/grouping needs, and the other
> with the rest of the data. Then create a view that joins it all back
> together for reporting purposes.

I'm not actually using any ORDER BY at all. This is purely a GROUP BY.
The sort happens because of the group aggregate (vs. hash aggregate).
Two of the columns are used to group, the other two are aggregates (SUM).
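In case it helps to see the mechanics, this is roughly how I have been flipping between the two plan shapes (same placeholder table/column names as below; the single-table form here is just for illustration):

SET trace_sort = on;        -- logs external-sort details to the server log when the query runs
SET enable_hashagg = off;   -- forces the sort-based GroupAggregate plan
EXPLAIN VERBOSE
SELECT columnA, columnD, sum(columnB) AS columnB, sum(columnC) AS columnC
FROM tableA
GROUP BY columnA, columnD;
RESET enable_hashagg;       -- back to letting the planner pick HashAggregate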

On Thu, Nov 17, 2011 at 11:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>> This is on PG 8.4.8 on Linux, 16GB of "real" RAM.
>> Most recently, I enabled trace_sort, disabled hash aggregation[1], and
>> set a large work_mem (normally very small, in this case I tried
>> anything from 8MB to 256MB. I even tried 1GB and 2GB).
>
> FWIW, I think hash aggregation is your best shot at getting reasonable
> performance.  Sorting 175GB of data is going to hurt no matter what.

> If the grouped table amounts to 5GB, I wouldn't have expected the hash
> table to be more than maybe 2-3X that size (although this does depend on
> what aggregates you're running...).  Letting the hash aggregation have
> all your RAM might be the best answer.

I'm re-running the query with work_mem set to 16GB (for just that query).
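(For completeness: I'm raising work_mem for just that statement rather than server-wide, roughly like so:)

BEGIN;
SET LOCAL work_mem = '16GB';   -- reverts automatically at COMMIT or ROLLBACK
-- ... run the big GROUP BY query shown below ...
COMMIT;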

The query (with table and column names changed):

SELECT anon_1.columnA, sum(anon_1.columnB) AS columnB,
sum(anon_1.columnC) AS columnC, anon_1.columnD
FROM (
  SELECT columnA, columnB, columnC, columnD FROM tableA
  UNION ALL
  .... same select/union all pattern but from 90-ish other tables
) AS anon_1
GROUP BY anon_1.columnA, anon_1.columnD
HAVING sum(anon_1.columnB) > 0

The explain verbose with work_mem = 16GB:

 HashAggregate  (cost=54692162.83..54692962.83 rows=40000 width=28)
   Output: columnA, sum(columnB), sum(columnC), columnD
   Filter: (sum(columnB) > 0)
   ->  Append  (cost=0.00..34547648.48 rows=1611561148 width=28)
         ->  Seq Scan on tableA  (cost=0.00..407904.40 rows=19045540 width=28)
               Output: columnA, columnB, columnC, columnD
         .... 90-ish more tables here

Twelve minutes into the query it was consuming 10.1GB of memory; at 21 minutes, 12.9GB.
After just under 34 minutes it completed, with about 15GB of memory in use.
That is a rather impressive improvement. Previously, I had been advised
against using a large work_mem value; I had never thought to try one
three times the size of the resulting table.
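(If anyone wants to size work_mem the same way, I simply compared the resulting table's on-disk size against the peak memory above; the table name here is a placeholder:)

SELECT pg_size_pretty(pg_total_relation_size('aggregated_results'));
-- about 5GB here, so the ~15GB peak is in line with Tom's 2-3X estimate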

The explain verbose with enable_hashagg = false:

 GroupAggregate  (cost=319560040.24..343734257.46 rows=40000 width=28)
   Output: columnA, sum(columnB), sum(columnC), columnD
   Filter: (sum(columnB) > 0)
   ->  Sort  (cost=319560040.24..323588943.11 rows=1611561148 width=28)
         Output: columnA, columnB, columnC, columnD
         Sort Key: columnA, columnD
         ->  Result  (cost=0.00..34547648.48 rows=1611561148 width=28)
               Output: columnA, columnB, columnC, columnD
               ->  Append  (cost=0.00..34547648.48 rows=1611561148 width=28)
                     ->  Seq Scan on tableA  (cost=0.00..407904.40 rows=19045540 width=28)
                           Output: columnA, columnB, columnC, columnD
                     .... 90-ish more tables here



--
Jon
