Discussion: Query memory usage greatly in excess of work_mem * query plan steps

Query memory usage greatly in excess of work_mem * query plan steps

From: Timothy Garnett
Date:
Hi,

I have a query that's pulling data for another system using COPY (query) TO STDOUT CSV on a 9.2.4 db (we're in the process of upgrading to 9.3).  The final CSV file is large (~75GB, 86 million rows).  The query is also large: one table (86 million rows) left joined to a total of 30 other tables (of mixed size), 3 of which are CTEs supplied by a WITH clause and consist of 3 joins each, for a total of 39 joins in the plan. work_mem on the system is set to 256MB.
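
For context, the export is driven by roughly this kind of invocation (the table and column names below are made-up placeholders, not our real schema, and the WITH clause and most of the joins are elided):

    # Rough shape of the export; identifiers are placeholders only.
    psql -d ourdb -c "
      COPY (
        SELECT big.*, t1.col_a, t2.col_b
        FROM big
        LEFT JOIN t1 ON t1.big_id = big.id
        LEFT JOIN t2 ON t2.big_id = big.id
        -- ... plus the three WITH-clause CTEs and the remaining left joins ...
      ) TO STDOUT CSV
    " > export.csv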

We're running into problems with the machine running out of memory: this single query process consumes over 100GB of resident memory before the machine exhausts swap and the Linux OOM handling eventually kills it.  The query plan from EXPLAIN comes to 186 rows, which, even assuming every row requires the full work_mem (a significant overestimate of both the number of memory-using operations and their size), works out to under 50GB, and we're observing substantially more than that. Is it reasonable to expect that a query will use at most roughly work_mem * number of plan operations, or are there other factors in play?
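
To spell out the arithmetic behind that bound (just a back-of-envelope number):

    # 186 plan rows is already an overcount of memory-hungry nodes, and each
    # sort/hash node should be limited to roughly work_mem (256MB here):
    echo "$(( 186 * 256 )) MB"     # 47616 MB, i.e. about 46.5 GB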

The plan looks reasonable (though there are some odd right-join uses, see below) and the row estimates look pretty accurate, with the exception that one of the CTE queries has its row count underestimated by a little over 2 orders of magnitude (260k vs. 86 million rows). That query does a GROUP BY (planned as a sort followed by a GroupAggregate; there are no hash aggregates anywhere in the plan, which is something that might otherwise increase memory), and it's the grouping step that misestimates the final number of rows for that CTE. Unlike the other CTEs, when it's merge joined into the main query there is no Materialize node in the plan (no idea if that's relevant).
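
For what it's worth, this is roughly how I've been comparing the estimate to reality for that CTE on its own (placeholder names again):

    # Run just the problem CTE's body under EXPLAIN ANALYZE and compare the
    # GroupAggregate's estimated rows= with the actual row count it produces.
    psql -d ourdb -c "
      EXPLAIN ANALYZE
      SELECT big_id, count(*)
      FROM some_detail_table
      JOIN t1 USING (big_id)
      JOIN t2 USING (big_id)
      GROUP BY big_id
    "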

As to the right join (used for a few of the joins; most are left or merge joins):
                     ->  Hash Right Join (cost=225541299.19..237399743.38 rows=86681834 width=1108)
                           Hash Cond: (xxx.xxx = yyy.yyy)
                           ->  Seq Scan on xxx (cost=0.00..6188.18 rows=9941 width=20)
                                 Filter: (mode = 'live'::text)
                           ->  Hash  (cost=212606744.27..212606744.27 rows=86681834 width=1096)
                                  ....
I'm not sure if I'm reading it right, but it looks like it's hashing the 86-million-row set and scanning over the 10k-row set, which seems like the opposite of what you'd want to do. I haven't seen many hash right joins in plans, though, so I'm not sure whether that's just how they work.
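
If it helps, I can also try that join in isolation with hash joins disabled to see what the planner would do instead (identifiers below are placeholders based on the obfuscated plan):

    # Compare the chosen Hash Right Join against the fallback plan:
    psql -d ourdb -c "
      SET enable_hashjoin = off;
      EXPLAIN
      SELECT yyy.*, xxx.some_col
      FROM yyy
      LEFT JOIN xxx ON xxx.xxx = yyy.yyy AND xxx.mode = 'live';
    "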

Tim

Re: Query memory usage greatly in excess of work_mem * query plan steps

From: "Franklin, Dan"
Date:
We had a problem in the 8.x series with COPY IN: it did not respect any configured maximums and just kept allocating memory until it could fit the entire COPY contents, down to the \., into RAM.  Could there be a similar issue with COPY OUT?

-----
Dan



Re: Query memory usage greatly in excess of work_mem * query plan steps

From: Tom Lane
Date:
Timothy Garnett <tgarnett@panjiva.com> writes:
> I have a query that's pulling data for another system using COPY (query) to
> STDOUT CSV on a 9.2.4 db (we're in the process of upgrading to 9.3).
> ...
> We're running into problems with the machine running out of memory with
> this single query process consuming over 100GB resident memory before the
> machine exhausts swap and the Linux OOM handling eventually kills it.

I wonder if you're hitting some sort of memory leak.  What I'd suggest
doing to help diagnose that is to show us a memory map.  Do this:

(1) Set a ulimit so that the process will get ENOMEM sometime before
the OOM killer awakens (this is good practice anyway, if you've not
disabled OOM kills).  On Linux systems, ulimit -m or -v generally
does the trick.  The easiest way to enforce this is to add a ulimit
command to the script that launches the postmaster, then restart.
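
For instance, something along these lines in the start script, before the postmaster is launched (the limit and path here are only illustrative):

    # Cap per-process virtual memory so a runaway backend gets ENOMEM instead
    # of waking the OOM killer (value is just an example):
    ulimit -v 8388608          # 8 GB, expressed in kB
    pg_ctl start -D /path/to/data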

(2) Make sure your logging setup will collect anything printed to
stderr by a backend.  If you use logging_collector you're good to go;
if you use syslog you need to check where the postmaster's stderr
was redirected, making sure it's not /dev/null.
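
A quick way to check what you currently have (sketch only, adjust to your setup):

    # Confirm how backend stderr output is being captured:
    psql -c "SHOW logging_collector;"
    psql -c "SHOW log_destination;"
    psql -c "SHOW log_directory;"    # relevant only when logging_collector is on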

(3) Run the failing query.  Collect the memory map it dumps to stderr
when it fails, and send it in.  What you're looking for is a couple
hundred lines looking like this:

TopMemoryContext: 69984 total in 10 blocks; 6152 free (16 chunks); 63832 used
  MessageContext: 8192 total in 1 blocks; 7112 free (1 chunks); 1080 used
  Operator class cache: 8192 total in 1 blocks; 1640 free (0 chunks); 6552 used
  smgr relation table: 24576 total in 2 blocks; 13872 free (3 chunks); 10704 used
  ... lots more in the same vein ...
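
Once the query has failed, something like this will pull the dump out of the server log (substitute whatever log file your setup actually writes):

    # The dump begins at "TopMemoryContext"; grab it along with the error that follows:
    grep -A 300 'TopMemoryContext' /path/to/server.log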


> As to the right join (used for a few of the joins, most are left join or
> merge):
>                      ->  Hash Right Join (cost=225541299.19..237399743.38
> rows=86681834 width=1108)
>                            Hash Cond: (xxx.xxx = yyy.yyy)
>                            ->  Seq Scan on xxx (cost=0.00..6188.18
> rows=9941 width=20)
>                                  Filter: (mode = 'live'::text)
>                            ->  Hash  (cost=212606744.27..212606744.27
> rows=86681834 width=1096)
>                                   ....
> I'm not sure if I'm reading it right, but it looks like it's hashing the 86
> million row set and scanning over the 10k row set which seems to me like
> the opposite of what you'd want to do, but I haven't seen a lot of hash
> right joins in plans and I'm not sure if that's how it works.

That looks pretty odd to me too, though I guess the planner might think it
was sensible if xxx's join column had very low cardinality.  Still, it's
weird.  What have you got work_mem set to exactly?

            regards, tom lane