Discussion: [GENERAL] Query Using Massive Temp Space
Attachments
Cory Tucker wrote:
> I have a query that is using a tremendous amount of temp disk space given the overall size of the dataset.
> I'd love for someone to try to explain what PG is doing and why it's using so much space for the query.

It could be a sort or a hash operation. To determine what is going on, EXPLAIN (ANALYZE, BUFFERS) output would be useful.

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
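For reference, the form Laurenz is asking for. Note that with the ANALYZE option the statement actually executes, so on this workload it may again run for hours and consume temp space; the query text itself is in the attachment, so only a placeholder appears below:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;   -- the problem query from the attachment goes here
```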
Hello,
I have one question: why are you using such a huge number of grouping columns? Is there some reason for it? It is definitely not a fast method. I would prefer to first do a named query grouped by the ids (account_id, candidate_id, parent_id) and then left join the candidates table (to get the rest of the requested columns) to this named query on those ids, if it is possible.
You would also avoid hashing these nine columns, so you will save a certain amount of memory.
Miloslav
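A sketch of the restructuring Miloslav suggests. Only account_id, candidate_id, parent_id, the candidates table, and match_candidate_address are taken from the thread; the CTE name, the aggregate, the non-key columns, and the join condition back to candidates are assumptions used to illustrate the shape:

```sql
-- Aggregate on the narrow key columns only, then join the wide
-- columns back in afterwards, instead of grouping by nine columns.
WITH grouped AS (
    SELECT account_id, candidate_id, parent_id,
           count(*) AS n_rows          -- stand-in for the real aggregates
    FROM   match_candidate_address
    GROUP  BY account_id, candidate_id, parent_id
)
SELECT g.account_id, g.candidate_id, g.parent_id, g.n_rows,
       c.first_name, c.last_name      -- hypothetical non-key columns
FROM   grouped g
LEFT   JOIN candidates c ON c.id = g.candidate_id;
```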
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Cory Tucker
Sent: Monday, November 20, 2017 7:17 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] Query Using Massive Temp Space
Hello,
I have a query that is using a tremendous amount of temp disk space given the overall size of the dataset. I'd love for someone to try to explain what PG is doing and why it's using so much space for the query.
First off, the system is PG 9.6 on Ubuntu with 4 cores and 28 GB of RAM. The query in question is a fairly large join of several tables (6) including some aggregations. The overall dataset size of the 6 tables in question is about 20GB and the largest table is about 15M rows. The query is essentially a dump of most of the data from these tables joined together to be used in another system.
When the query runs it begins to use an aggressive amount of temp space on the volume over the course of many hours, until it reaches about 95% capacity and then tapers off. It never completes, though. The temp space it uses is around 1.5TB out of a 2TB volume. Again, the total size of the relations in question is only 20GB.
Can anyone explain how the query could possibly use so much temp space?
Query and query plan are attached. Please let me know if any more info is needed.
thanks
--Cory
Hi,

There are many Hash and Merge joins that may generate a lot of temp space; it could also be a problem of a forgotten column in the join. Could you also provide index definitions (pk, uk and others) along with the EXPLAIN (ANALYZE, BUFFERS)? You can limit the scope of the query to help it finish, e.g. by limiting the number of lines from candidates:

... FROM match_candidate_address WHERE account_id NOT IN (:exclude_ids) LIMIT 100

Regards
PAscal

--
Sent from: http://www.postgresql-archive.org/PostgreSQL-general-f1843780.html
Cory Tucker <cory.tucker@gmail.com> writes:
> I have a query that is using a tremendous amount of temp disk space given
> the overall size of the dataset. I'd love for someone to try to explain
> what PG is doing and why it's using so much space for the query.

> First off, the system is PG 9.6 on Ubuntu with 4 cores and 28 GB of RAM.

9.6.what exactly?

> The query in question is a fairly large join of several tables (6)
> including some aggregations. The overall dataset size of the 6 tables in
> question is about 20GB and the largest table is about 15M rows. The query
> is essentially a dump of *most* of the data from these tables joined
> together to be used in another system.

> When the query runs it begins to use an aggressive amount of temp space on
> the volume over the course of many hours, until it reaches about 95%
> capacity and then tapers off. Never completes though. The temp space it
> uses is around 1.5TB out of a 2TB volume. Again, the *total* size of the
> relations in question is only 20GB.

The only thing I can think of offhand that could create temp files far in
excess of the actual data volume is if a hash join repeatedly decides that
it needs to increase the number of hash batches. We have seen that happen
in cases where an inner relation contains a vast number of identical hash
key values: they all go into the same bucket, and the executor keeps
splitting the batch trying to reduce its size, but being all identical
values it will never be the case that some rows go into a different batch
than others. There is logic in there to give up splitting when this
happens, but maybe it's not firing in your case, or maybe somebody broke
it recently :-(.

I find it suspicious that your plan involves two separate hash joins both
of which have a much larger table on the inside than the outside ---
offhand that does not look very sane. So there's also a question of why
the planner did that.

What can you tell us about the distributions of the join keys for the two
hash joins --- are those keys unique, or a bit less than unique, or very
much not unique?

It would also be interesting to keep an eye on the temp files themselves.
How many are there, how large are they, how fast are new ones created? If
my theory above is right, it would likely manifest as there being a lot of
zero-size temp files from the wasted split attempts (and other files being
exactly the same size as the ones from the previous iteration).

			regards, tom lane
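One way to answer the distribution question for a suspected key. The table and column names here come from the thread (parent_id is the column later identified as the culprit); adapt them to each hash join's actual keys:

```sql
-- Most-duplicated values of the join key, plus an overall summary.
SELECT parent_id, count(*) AS occurrences
FROM   match_candidate_address
GROUP  BY parent_id
ORDER  BY occurrences DESC
LIMIT  10;

SELECT count(*)                  AS total_rows,
       count(DISTINCT parent_id) AS distinct_keys
FROM   match_candidate_address;
```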
Cory Tucker <cory.tucker@gmail.com> writes:
> On Mon, Nov 20, 2017 at 9:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The only thing I can think of offhand that could create temp files far in
>> excess of the actual data volume is if a hash join repeatedly decides that
>> it needs to increase the number of hash batches. We have seen that happen
>> in cases where an inner relation contains a vast number of identical hash
>> key values: they all go into the same bucket, and the executor keeps
>> splitting the batch trying to reduce its size, but being all identical
>> values it will never be the case that some rows go into a different batch
>> than others. There is logic in there to give up splitting when this
>> happens, but maybe it's not firing in your case, or maybe somebody broke
>> it recently :-(.

> What was happening is that there is a fairly non-unique "parent_id" value
> (66k times) that is creating a very large hashkey that it cannot break
> into smaller chunks -- so, essentially what your guess was, Tom. Perhaps
> worth investigating whether that code is still functioning as intended.

I tried to duplicate this result, using a test case constructed like this:

create table hj(f1 int, filler text);
alter table hj alter column filler set storage plain;  -- prevent compression
insert into hj select x, repeat('xyzzy', 4000/5)
  from generate_series(1, 1000000) x;
insert into hj select 42, repeat('xyzzy', 4000/5)
  from generate_series(1, 10000) x;
vacuum analyze hj;
set enable_mergejoin TO 0;
set max_parallel_workers_per_gather TO 0;
explain analyze select count(h1.filler), count(h2.filler)
  from hj h1, hj h2 where h1.f1 = h2.f1;

This causes a hash join between two datasets a bit under 4GB each, with the largest bucket being circa 40MB. work_mem is at the default 4MB, so that bucket will be large enough to force split attempts.
What I see is that the temp disk space rises slowly to ~7.5GB and then declines smoothly to ~40MB, where it stays for awhile before the query finishes. The EXPLAIN ANALYZE report says

Aggregate  (cost=10020068069.47..10020068069.48 rows=1 width=16) (actual time=81812.299..81812.299 rows=1 loops=1)
  ->  Hash Join  (cost=10001024835.00..10019362963.81 rows=141021131 width=8008) (actual time=10227.585..67092.480 rows=101020000 loops=1)
        Hash Cond: (h1.f1 = h2.f1)
        ->  Seq Scan on hj h1  (cost=0.00..515100.00 rows=1010000 width=4008) (actual time=0.023..2531.739 rows=1010000 loops=1)
        ->  Hash  (cost=515100.00..515100.00 rows=1010000 width=4008) (actual time=10197.922..10197.922 rows=1010000 loops=1)
              Buckets: 1024 (originally 1024)  Batches: 2097152 (originally 1024)  Memory Usage: 39462kB
              ->  Seq Scan on hj h2  (cost=0.00..515100.00 rows=1010000 width=4008) (actual time=0.003..2582.198 rows=1010000 loops=1)
Planning time: 0.062 ms
Execution time: 81832.867 ms

Now, there's definitely something busted here; it should not have gone as far as 2 million batches before giving up on splitting. But that breakage isn't translating into disk space bloat. I haven't traced through the code yet, but I believe what's likely happening is that it's creating new batch files but freeing the old ones, so that the disk space usage remains constant during the fruitless split attempts.

So there's some aspect of what you're seeing that this simple case doesn't capture. I wonder if you can build a publishable test case that shows the space bloat you're experiencing. The behavior should only depend on the hash join key values --- since those are apparently account numbers, maybe they're nonsensitive, or you could obscure them without changing the behavior?

			regards, tom lane
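For watching the temp files themselves while a query runs, a couple of server-side options (both available in 9.6; pg_ls_dir requires superuser, and the pgsql_tmp directory for the default tablespace only exists while temp files are present):

```sql
-- Cumulative temp-file statistics for the current database.
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_total
FROM   pg_stat_database
WHERE  datname = current_database();

-- List the individual temp files (superuser only).
SELECT pg_ls_dir('base/pgsql_tmp');
```

Setting log_temp_files = 0 also makes the server log each temp file and its size as it is deleted.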
On Wed, Nov 22, 2017 at 7:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Now, there's definitely something busted here; it should not have gone as
> far as 2 million batches before giving up on splitting.

I had been meaning to discuss this. We only give up when we reach the point when a batch is entirely kept or entirely sent to a new batch (ie splitting the batch resulted in one batch with the whole contents and another empty batch). If you have about 2 million evenly distributed keys and an ideal hash function, and then you also have 42 billion keys that are the same (and exceed work_mem), we won't detect extreme skew until the 2 million well-behaved keys have been spread so thin that the 42 billion identical keys are isolated in a batch of their own, which we should expect to happen somewhere around 2 million batches.

I have wondered if our extreme skew detector needs to go off sooner. I don't have a specific suggestion, but it could just be something like 'you threw out or kept more than X% of the tuples'.

--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Wed, Nov 22, 2017 at 7:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Now, there's definitely something busted here; it should not have gone as
>> far as 2 million batches before giving up on splitting.

> I had been meaning to discuss this. We only give up when we reach the
> point when a batch is entirely kept or entirely sent to a new batch
> (ie splitting the batch resulted in one batch with the whole contents
> and another empty batch). If you have about 2 million evenly
> distributed keys and an ideal hash function, and then you also have 42
> billion keys that are the same (and exceed work_mem), we won't detect
> extreme skew until the 2 million well behaved keys have been spread so
> thin that the 42 billion keys are isolated in a batch on their own,
> which we should expect to happen somewhere around 2 million batches.

Yeah, I suspected it was something like that, but hadn't dug into the code yet.

> I have wondered if our extreme skew detector needs to go off sooner.
> I don't have a specific suggestion, but it could just be something
> like 'you threw out or kept more than X% of the tuples'.

Doing this, with some threshold like 95% or 99%, sounds plausible to me. I'd like to reproduce Cory's disk-space issue before we monkey with related logic, though; fixing the part we understand might obscure the part we still don't.

			regards, tom lane