Discussion: [GENERAL] Query Using Massive Temp Space


[GENERAL] Query Using Massive Temp Space

From:
Cory Tucker
Date:
Hello, 

I have a query that is using a tremendous amount of temp disk space given the overall size of the dataset.  I'd love for someone to try to explain what PG is doing and why it's using so much space for the query.

First off, the system is PG 9.6 on Ubuntu with 4 cores and 28 GB of RAM.  The query in question is a fairly large join of several tables (6) including some aggregations.  The overall dataset size of the 6 tables in question is about 20GB and the largest table is about 15M rows.  The query is essentially a dump of most of the data from these tables joined together to be used in another system.

When the query runs it begins to use an aggressive amount of temp space on the volume over the course of many hours, until it reaches about 95% capacity and then tapers off.  Never completes though.  The temp space it uses is around 1.5TB out of a 2TB volume.  Again, the total size of the relations in question is only 20GB.  

Can anyone explain how the query could possibly use so much temp space?

Query and query plan are attached.  Please let me know if any more info is needed.

thanks
--Cory
Attachments

Re: [GENERAL] Query Using Massive Temp Space

From:
Laurenz Albe
Date:
Cory Tucker wrote:
> I have a query that is using a tremendous amount of temp disk space given the overall size of the dataset.
> I'd love for someone to try to explain what PG is doing and why it's using so much space for the query.

It could be a sort or a hash operation.

To determine what is going on, EXPLAIN (ANALYZE, BUFFERS) output
would be useful.
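
For reference, a minimal self-contained illustration of that output (a toy table, not the poster's schema); the BUFFERS option adds shared/temp block counts per plan node, which is what shows where spilling to disk happens:

CREATE TEMP TABLE explain_demo AS
SELECT g AS id, (g % 1000) AS grp, repeat('x', 100) AS payload
FROM generate_series(1, 100000) AS g;

ANALYZE explain_demo;

EXPLAIN (ANALYZE, BUFFERS)
SELECT grp, count(*)
FROM explain_demo
GROUP BY grp;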

Yours,
Laurenz Albe



Re: [GENERAL] Query Using Massive Temp Space

From:
Semler Miloslav
Date:

Hello,

I have one question: why are you grouping by such a large number of columns? Is there some reason for it? It is definitely not a fast approach. I would first run a named query grouped only by the ids (account_id, candidate_id, parent_id) and then, if possible, left join the candidates table to that named query on those ids to get the rest of the requested columns.

That way you also avoid hashing all nine columns, so you will save a certain amount of memory.
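
A hedged sketch of that restructuring (match_candidate_address and candidates are table names that appear elsewhere in the thread; the column list, the aggregates and c.id are placeholders, not the real schema):

WITH grouped AS (
    SELECT account_id, candidate_id, parent_id
           -- plus whatever aggregates the original query computes (placeholders)
    FROM match_candidate_address
    GROUP BY account_id, candidate_id, parent_id
)
SELECT g.*,
       c.first_name, c.last_name              -- hypothetical columns from candidates
FROM grouped AS g
LEFT JOIN candidates AS c
       ON c.id = g.candidate_id;              -- c.id is an assumed primary key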

 

Miloslav

 

From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Cory Tucker
Sent: Monday, November 20, 2017 7:17 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] Query Using Massive Temp Space

 


Re: [GENERAL] Query Using Massive Temp Space

From:
legrand legrand
Date:
Hi,
There are many Hash and Merge joins that may generate a lot of temp space;
it could also be a problem of a forgotten column in the join.

Could you also provide the index definitions (PK, UK and others)
along with the EXPLAIN (ANALYZE, BUFFERS) output?

(You can limit the scope of the query, to help it finish, by limiting the
number of rows taken from candidates, e.g.:

  ... FROM match_candidate_address WHERE account_id NOT IN (:exclude_ids)
  LIMIT 100
)
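
If it helps, a hedged way to pull those index definitions straight from the catalog (the table names listed are the ones mentioned in the thread; extend the list to all six tables and adjust the schema):

SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'                   -- assumed schema
  AND tablename IN ('match_candidate_address', 'donation', 'candidates')
ORDER BY tablename, indexname;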

Regards
PAscal







Re: [GENERAL] Query Using Massive Temp Space

From:
Tom Lane
Date:
Cory Tucker <cory.tucker@gmail.com> writes:
> I have a query that is using a tremendous amount of temp disk space given
> the overall size of the dataset.  I'd love for someone to try to explain
> what PG is doing and why it's using so much space for the query.

> First off, the system is PG 9.6 on Ubuntu with 4 cores and 28 GB of RAM.

9.6.what exactly?

> The query in question is a fairly large join of several tables (6)
> including some aggregations.  The overall dataset size of the 6 tables in
> question is about 20GB and the largest table is about 15M rows.  The query
> is essentially a dump of *most* of the data from these tables joined
> together to be used in another system.
> When the query runs it begins to use an aggressive amount of temp space on
> the volume over the course of many hours, until it reaches about 95%
> capacity and then tapers off.  Never completes though.  The temp space it
> uses is around 1.5TB out of a 2TB volume.  Again, the *total* size of the
> relations in question is only 20GB.

The only thing I can think of offhand that could create temp files far in
excess of the actual data volume is if a hash join repeatedly decides that
it needs to increase the number of hash batches.  We have seen that happen
in cases where an inner relation contains a vast number of identical hash
key values: they all go into the same bucket, and the executor keeps
splitting the batch trying to reduce its size, but being all identical
values it will never be the case that some rows go into a different batch
than others.  There is logic in there to give up splitting when this
happens, but maybe it's not firing in your case, or maybe somebody broke
it recently :-(.

I find it suspicious that your plan involves two separate hash joins both
of which have a much larger table on the inside than the outside ---
offhand that does not look very sane.  So there's also a question of
why did the planner do that.

What can you tell us about the distributions of the join keys for the two
hash joins --- are those keys unique, or a bit less than unique, or very
much not unique?
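
A hedged sketch of one way to answer that (table and column names are taken from fragments quoted elsewhere in the thread and may not be the exact hash-join keys in this plan):

-- Top duplicated values for a candidate join key
SELECT parent_id, count(*) AS occurrences
FROM match_candidate_address
GROUP BY parent_id
ORDER BY occurrences DESC
LIMIT 20;

-- The planner's statistics give the same picture without a full scan
SELECT attname, n_distinct, most_common_freqs
FROM pg_stats
WHERE tablename = 'match_candidate_address'
  AND attname IN ('account_id', 'candidate_id', 'parent_id');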

It would also be interesting to keep an eye on the temp files themselves.
How many are there, how large are they, how fast are new ones created?
If my theory above is right, it would likely manifest as there being a
lot of zero-size temp files from the wasted split attempts (and other
files being exactly the same size as the ones from the previous
iteration).
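
One hedged way to watch that from SQL while the query runs (these counters are cumulative since the last stats reset, so it is the deltas over time that matter; the files themselves normally live under base/pgsql_tmp in the data directory):

SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_bytes
FROM pg_stat_database
WHERE datname = current_database();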
        regards, tom lane


Re: [GENERAL] Query Using Massive Temp Space

From:
Cory Tucker
Date:


On Mon, Nov 20, 2017 at 9:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 9.6.what exactly?

9.6.5
 

> The only thing I can think of offhand that could create temp files far in
> excess of the actual data volume is if a hash join repeatedly decides that
> it needs to increase the number of hash batches.  We have seen that happen
> in cases where an inner relation contains a vast number of identical hash
> key values: they all go into the same bucket, and the executor keeps
> splitting the batch trying to reduce its size, but being all identical
> values it will never be the case that some rows go into a different batch
> than others.  There is logic in there to give up splitting when this
> happens, but maybe it's not firing in your case, or maybe somebody broke
> it recently :-(.

I think this is exactly the scenario that happened. More below.
 

> I find it suspicious that your plan involves two separate hash joins both
> of which have a much larger table on the inside than the outside ---
> offhand that does not look very sane.  So there's also a question of
> why did the planner do that.
>
> What can you tell us about the distributions of the join keys for the two
> hash joins --- are those keys unique, or a bit less than unique, or very
> much not unique?

We were able to get the query to run, without using much temp space at all, by eliminating this portion of the query:

  LEFT JOIN donation d2
    ON mca.parent_id = d2.candidate_id AND mca.parent_id IS NOT NULL AND mca.account_id = d2.account_id
       AND d2.account_id NOT IN (1, 2, 3, 195, 196, 81)
 
The resultant full query plan is attached (json format this time).

What was happening is that there is a fairly non-unique "parent_id" value (occurring about 66k times) that creates a very large hash bucket that cannot be broken into smaller chunks -- so, essentially what your guess was, Tom.  Perhaps it's worth investigating whether that code is still functioning as intended.

Incidentally, I'd also be interested in any suggestions for refactoring this query for better performance.  It does complete now if we exclude the problematic account but still takes quite a bit of time and we expect our dataset to only get bigger.
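
For context, a hedged sketch of one possible restructuring along the lines already suggested upthread (not the actual query; donation_count is a placeholder aggregate, so adapt it to whatever the outer query really needs from donation): collapse donation to one row per (account_id, candidate_id) before the LEFT JOIN, so a heavily duplicated parent_id no longer multiplies rows inside the hash.

  LEFT JOIN (
      SELECT account_id,
             candidate_id,
             count(*) AS donation_count        -- placeholder aggregate
      FROM donation
      WHERE account_id NOT IN (1, 2, 3, 195, 196, 81)
      GROUP BY account_id, candidate_id
  ) AS d2
    ON mca.parent_id = d2.candidate_id AND mca.parent_id IS NOT NULL
       AND mca.account_id = d2.account_id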

thanks 
--Cory
Attachments

Re: [GENERAL] Query Using Massive Temp Space

From:
Tom Lane
Date:
Cory Tucker <cory.tucker@gmail.com> writes:
> On Mon, Nov 20, 2017 at 9:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The only thing I can think of offhand that could create temp files far in
>> excess of the actual data volume is if a hash join repeatedly decides that
>> it needs to increase the number of hash batches.  We have seen that happen
>> in cases where an inner relation contains a vast number of identical hash
>> key values: they all go into the same bucket, and the executor keeps
>> splitting the batch trying to reduce its size, but being all identical
>> values it will never be the case that some rows go into a different batch
>> than others.  There is logic in there to give up splitting when this
>> happens, but maybe it's not firing in your case, or maybe somebody broke
>> it recently :-(.

> What was happening is that there is a fairly non-unique "parent_id" value
> (occurring about 66k times) that creates a very large hash bucket that cannot
> be broken into smaller chunks -- so, essentially what your guess was, Tom.
> Perhaps it's worth investigating whether that code is still functioning as intended.

I tried to duplicate this result, using a test case constructed like this:

create table hj(f1 int, filler text);
alter table hj alter column filler set storage plain;  -- prevent compression
insert into hj select x, repeat('xyzzy', 4000/5) from generate_series(1,1000000) x;
insert into hj select 42, repeat('xyzzy', 4000/5) from generate_series(1,10000) x;
vacuum analyze hj;

set enable_mergejoin TO 0;
set max_parallel_workers_per_gather TO 0;

explain analyze
select count(h1.filler),count(h2.filler) from hj h1, hj h2 where h1.f1 = h2.f1;

This causes a hash join between two datasets a bit under 4GB each, with
the largest bucket being circa 40MB.  work_mem is at the default 4MB,
so that bucket will be large enough to force split attempts.
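
For reference, the back-of-the-envelope arithmetic behind those numbers (a sketch; 4008 bytes is the planner's reported row width, and per-tuple overhead is ignored, so these are slight underestimates):

SELECT pg_size_pretty(1010000::bigint * 4008) AS per_input_side,   -- a bit under 4 GB per input
       pg_size_pretty(10001::bigint * 4008)   AS bucket_for_f1_42; -- roughly 40 MB, far above the 4 MB work_mem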

What I see is that the temp disk space rises slowly to ~7.5GB and then
declines smoothly to ~40MB, where it stays for awhile before the query
finishes.  The EXPLAIN ANALYZE report says
Aggregate  (cost=10020068069.47..10020068069.48 rows=1 width=16) (actual time=81812.299..81812.299 rows=1 loops=1)
  ->  Hash Join  (cost=10001024835.00..10019362963.81 rows=141021131 width=8008) (actual time=10227.585..67092.480 rows=101020000 loops=1)
        Hash Cond: (h1.f1 = h2.f1)
        ->  Seq Scan on hj h1  (cost=0.00..515100.00 rows=1010000 width=4008) (actual time=0.023..2531.739 rows=1010000 loops=1)
        ->  Hash  (cost=515100.00..515100.00 rows=1010000 width=4008) (actual time=10197.922..10197.922 rows=1010000 loops=1)
              Buckets: 1024 (originally 1024)  Batches: 2097152 (originally 1024)  Memory Usage: 39462kB
              ->  Seq Scan on hj h2  (cost=0.00..515100.00 rows=1010000 width=4008) (actual time=0.003..2582.198 rows=1010000 loops=1)
Planning time: 0.062 ms
Execution time: 81832.867 ms

Now, there's definitely something busted here; it should not have gone as
far as 2 million batches before giving up on splitting.  But that breakage
isn't translating into disk space bloat.  I haven't traced through the
code yet, but I believe what's likely happening is that it's creating new
batch files but freeing the old ones, so that the disk space usage remains
constant during the fruitless split attempts.

So there's some aspect of what you're seeing that this simple case doesn't
capture.  I wonder if you can build a publishable test case that shows the
space bloat you're experiencing.  The behavior should only depend on the
hash join key values --- since those are apparently account numbers,
maybe they're nonsensitive, or you could obscure them without changing
the behavior?
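
A hedged sketch of one way to do that obscuring while preserving the exact duplication pattern of the keys (table and column names are the ones quoted in the thread; adjust to the real join columns):

CREATE TABLE join_keys_anon AS
SELECT dense_rank() OVER (ORDER BY account_id) AS account_id,
       dense_rank() OVER (ORDER BY parent_id)  AS parent_id
FROM match_candidate_address;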
        regards, tom lane


Re: [GENERAL] Query Using Massive Temp Space

From:
Thomas Munro
Date:
On Wed, Nov 22, 2017 at 7:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Now, there's definitely something busted here; it should not have gone as
> far as 2 million batches before giving up on splitting.

I had been meaning to discuss this.  We only give up when we reach the
point when a batch is either entirely kept or entirely sent to a new batch
(i.e. splitting the batch resulted in one batch with the whole contents
and another empty batch).  If you have about 2 million evenly
distributed keys and an ideal hash function, and then you also have 42
billion keys that are the same (and exceed work_mem), we won't detect
extreme skew until the 2 million well behaved keys have been spread so
thin that the 42 billion keys are isolated in a batch on their own,
which we should expect to happen somewhere around 2 million batches.
I have wondered if our extreme skew detector needs to go off sooner.
I don't have a specific suggestion, but it could just be something
like 'you threw out or kept more than X% of the tuples'.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [GENERAL] Query Using Massive Temp Space

From:
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Wed, Nov 22, 2017 at 7:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Now, there's definitely something busted here; it should not have gone as
>> far as 2 million batches before giving up on splitting.

> I had been meaning to discuss this.  We only give up when we reach the
> point when a batch is either entirely kept or entirely sent to a new batch
> (i.e. splitting the batch resulted in one batch with the whole contents
> and another empty batch).  If you have about 2 million evenly
> distributed keys and an ideal hash function, and then you also have 42
> billion keys that are the same (and exceed work_mem), we won't detect
> extreme skew until the 2 million well behaved keys have been spread so
> thin that the 42 billion keys are isolated in a batch on their own,
> which we should expect to happen somewhere around 2 million batches.

Yeah, I suspected it was something like that, but hadn't dug into the
code yet.

> I have wondered if our extreme skew detector needs to go off sooner.
> I don't have a specific suggestion, but it could just be something
> like 'you threw out or kept more than X% of the tuples'.

Doing this, with some threshold like 95% or 99%, sounds plausible to me.
I'd like to reproduce Cory's disk-space issue before we monkey with
related logic, though; fixing the part we understand might obscure
the part we still don't.
        regards, tom lane