Discussion: Weird CASE WHEN behaviour causing query to be suddenly very slow
Hi,
I have an issue with a rather large CASE WHEN and I cannot figure out why it is so slow...
First, here is my test query:
SELECT CASE WHEN dwh_company_id = 1
THEN CASE
WHEN wv.source ~ '^$' THEN 'Not tracked'
WHEN wv.source ~ '^1$' THEN 'Not tracked1'
WHEN wv.source ~ '^2$' THEN 'Not tracked2'
WHEN wv.source ~ '^3$' THEN 'Not tracked3'
WHEN wv.source ~ '^4$' THEN 'Not tracked4'
WHEN wv.source ~ '^5$' THEN 'Not tracked5'
WHEN wv.source ~ '^6$' THEN 'Not tracked6'
WHEN wv.source ~ '^7$' THEN 'Not tracked7'
WHEN wv.source ~ '^8$' THEN 'Not tracked8'
WHEN wv.source ~ '^9$' THEN 'Not tracked9'
WHEN wv.source ~ '^10$' THEN 'Not tracked10'
WHEN wv.source ~ '^11$' THEN 'Not tracked11'
WHEN wv.source ~ '^12$' THEN 'Not tracked12'
WHEN wv.source ~ '^13$' THEN 'Not tracked13'
WHEN wv.source ~ '^14$' THEN 'Not tracked14'
WHEN wv.source ~ '^15$' THEN 'Not tracked15'
WHEN wv.source ~ '^16$' THEN 'Not tracked16'
WHEN wv.source ~ '^17$' THEN 'Not tracked17'
WHEN wv.source ~ '^18$' THEN 'Not tracked18'
WHEN wv.source ~ '^19$' THEN 'Not tracked19'
WHEN wv.source ~ '^20$' THEN 'Not tracked20'
WHEN wv.source ~ '^21$' THEN 'Not tracked21'
WHEN wv.source ~ '^22$' THEN 'Not tracked22'
WHEN wv.source ~ '^23$' THEN 'Not tracked23'
WHEN wv.source ~ '^24$' THEN 'Not tracked24'
WHEN wv.source ~ '^25$' THEN 'Not tracked25'
WHEN wv.source ~ '^26$' THEN 'Not tracked26'
WHEN wv.source ~ '^27$' THEN 'Not tracked27'
WHEN wv.source ~ '^28$' THEN 'Not tracked28'
--WHEN wv.source ~ '^29$' THEN 'Not tracked29'
WHEN wv.source ~ '^30$' THEN 'Not tracked30'
WHEN wv.source ~ '^31$' THEN 'Not tracked31'
WHEN wv.source ~ '^32$' THEN 'Not tracked32'
END
ELSE
'Others'
END as channel
FROM (
SELECT wv.id,
wv.ga_id,
split_part(wv.ga_source_medium, ' / ', 1) as source,
ga.dwh_source_id,
s.dwh_company_id
FROM marketing.web_visits wv
INNER JOIN dwh_metadata.google_analytics ga ON ga.ga_id = wv.ga_id
INNER JOIN dwh_manager.sources s ON ga.dwh_source_id = s.dwh_source_id
--WHERE s.dwh_company_id = 1
LIMIT 100000
) wv
This is a pretty simple case: my subquery (or CTE, when using a WITH statement) should return 5 fields with more or less this structure:
id : character(32)
ga_id : bigint
source : character(32)
medium : character(32)
dwh_company_id : bigint
On top of which I apply a CASE WHEN statement.
Now the weird thing is that I notice a significant drop in performance as the CASE WHEN gets bigger. If I run the query as is, I get the following execution plan and execution time:
Subquery Scan on wv (cost=6.00..29098.17 rows=100000 width=36) (actual time=0.828..22476.917 rows=100000 loops=1)
Buffers: shared hit=3136
-> Limit (cost=6.00..11598.17 rows=100000 width=58) (actual time=0.209..133.429 rows=100000 loops=1)
Buffers: shared hit=3136
-> Hash Join (cost=6.00..1069811.24 rows=9228690 width=58) (actual time=0.208..119.297 rows=100000 loops=1)
Hash Cond: (wv_1.ga_id = ga.ga_id)
Buffers: shared hit=3136
-> Seq Scan on web_visits wv_1 (cost=0.00..877005.78 rows=20587078 width=50) (actual time=0.004..18.412 rows=100000 loops=1)
Buffers: shared hit=3133
-> Hash (cost=5.50..5.50 rows=40 width=12) (actual time=0.184..0.184 rows=111 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 5kB
Buffers: shared hit=3
-> Hash Join (cost=1.88..5.50 rows=40 width=12) (actual time=0.056..0.148 rows=111 loops=1)
Hash Cond: (ga.dwh_source_id = s.dwh_source_id)
Buffers: shared hit=3
-> Seq Scan on google_analytics ga (cost=0.00..2.89 rows=89 width=8) (actual time=0.007..0.028 rows=111 loops=1)
Buffers: shared hit=2
-> Hash (cost=1.39..1.39 rows=39 width=8) (actual time=0.042..0.042 rows=56 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 3kB
Buffers: shared hit=1
-> Seq Scan on sources s (cost=0.00..1.39 rows=39 width=8) (actual time=0.005..0.020 rows=56 loops=1)
Buffers: shared hit=1
Planning time: 0.599 ms
Execution time: 22486.216 ms
Then try commenting out only one line in the CASE WHEN, and the query runs 10x faster:
Subquery Scan on wv (cost=6.00..28598.17 rows=100000 width=36) (actual time=0.839..2460.002 rows=100000 loops=1)
Buffers: shared hit=3136
-> Limit (cost=6.00..11598.17 rows=100000 width=58) (actual time=0.210..112.043 rows=100000 loops=1)
Buffers: shared hit=3136
-> Hash Join (cost=6.00..1069811.24 rows=9228690 width=58) (actual time=0.209..99.513 rows=100000 loops=1)
Hash Cond: (wv_1.ga_id = ga.ga_id)
Buffers: shared hit=3136
-> Seq Scan on web_visits wv_1 (cost=0.00..877005.78 rows=20587078 width=50) (actual time=0.004..14.048 rows=100000 loops=1)
Buffers: shared hit=3133
-> Hash (cost=5.50..5.50 rows=40 width=12) (actual time=0.184..0.184 rows=111 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 5kB
Buffers: shared hit=3
-> Hash Join (cost=1.88..5.50 rows=40 width=12) (actual time=0.058..0.146 rows=111 loops=1)
Hash Cond: (ga.dwh_source_id = s.dwh_source_id)
Buffers: shared hit=3
-> Seq Scan on google_analytics ga (cost=0.00..2.89 rows=89 width=8) (actual time=0.007..0.025 rows=111 loops=1)
Buffers: shared hit=2
-> Hash (cost=1.39..1.39 rows=39 width=8) (actual time=0.042..0.042 rows=56 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 3kB
Buffers: shared hit=1
-> Seq Scan on sources s (cost=0.00..1.39 rows=39 width=8) (actual time=0.006..0.021 rows=56 loops=1)
Buffers: shared hit=1
Planning time: 0.583 ms
Execution time: 2467.484 ms
Why this drop in performance for only one (in this simple example) condition? I do not really understand it. If I add a few more conditions (say 1 or 2), the query gets slower still, and not by a few ms: it is around 5 seconds or so, which is huge considering my example only takes 1/500 of my data with the LIMIT.
Before we deviate from the problem I have (namely, the sudden drop in performance), let me clarify a few things about this query:
- The purpose is not to rewrite it, with a join or whatever; the CASE WHEN actually comes from a function which is auto-generated by another app we have.
- My example is pretty simple, and the regexes could be replaced by equality tests; the real CASE WHEN contains far more complicated regexes.
- This is a subset of my CASE WHEN; it is much bigger, and I cut it at the "bottleneck" point for this post.
Thanks a lot.
Best Regards,
Kevin
Hi Pavel,
Thanks for your answer.
Yes, sure, I could do that, but as I wrote, the purpose is not to find a way to rewrite it, but to understand why, at a certain point, it goes totally off. I'm aware that the longer my CASE WHEN is, the longer the query will run. But 10x slower for adding one condition? Something feels wrong here.
Plus, the CASE WHEN is part of a function, so basically I use it this way:
SELECT col1, col2, get_channel(company_id, source_id, …)
FROM mytable;
get_channel is coming from another app. And even though I do have control over it, I need to assume that I don't, and that I'm using it as is.
This is only my debugging query.
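(For readers following along: such a generated function presumably looks something like the minimal sketch below. The signature is guessed from the call above and the body is the CASE WHEN from the first message; everything else is invented for illustration.)

```sql
-- Hypothetical sketch of an auto-generated get_channel(); the real one
-- comes from another app and contains far more branches.
CREATE OR REPLACE FUNCTION get_channel(p_company_id bigint, p_source text)
RETURNS text
LANGUAGE sql IMMUTABLE AS
$$
  SELECT CASE WHEN p_company_id = 1
              THEN CASE
                   WHEN p_source ~ '^$'  THEN 'Not tracked'
                   WHEN p_source ~ '^1$' THEN 'Not tracked1'
                   -- ... many more generated regex arms ...
                   END
              ELSE 'Others'
         END;
$$;
```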
Best regards,
Kevin
From: Pavel Stehule [mailto:pavel.stehule@gmail.com]
Sent: Dienstag, 31. März 2015 11:09
To: Kevin Viraud
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Weird CASE WHEN behaviour causing query to be suddenly very slow
Hi
A long CASE can be a problem. Why don't you use a dictionary table and a join?
Regards
Pavel
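(Pavel's dictionary-table approach might look something like this sketch; the channel_map table and its columns are invented for illustration, not from the thread.)

```sql
-- Hypothetical mapping table replacing the generated CASE arms.
CREATE TABLE channel_map (
    dwh_company_id bigint NOT NULL,
    source_pattern text   NOT NULL,  -- regex applied to wv.source
    priority       int    NOT NULL,  -- evaluation order, like CASE arm order
    channel        text   NOT NULL
);

-- Pick the first matching pattern per row, defaulting to 'Others'.
SELECT wv.id,
       COALESCE(
         (SELECT m.channel
          FROM channel_map m
          WHERE m.dwh_company_id = wv.dwh_company_id
            AND wv.source ~ m.source_pattern
          ORDER BY m.priority
          LIMIT 1),
         'Others') AS channel
FROM wv;
```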
2015-03-31 10:53 GMT+02:00 Kevin Viraud <kevin.viraud@rocket-internet.de>:
> I have an issue with a rather large CASE WHEN and I cannot figure out why it is so slow... [...]
"Kevin Viraud" <kevin.viraud@rocket-internet.de> writes: > I have an issue with a rather large CASE WHEN and I cannot figure out why > it is so slow... Do all the arms of the CASE usually fail, leaving you at the ELSE? I suspect what's happening is that you're running into the MAX_CACHED_RES limit in src/backend/utils/adt/regexp.c, so that instead of just compiling each regexp once and then re-using 'em, the regexps are constantly falling out of cache and then having to be recompiled. They'd have to be used in a nearly perfect round robin in order for the behavior to have such a big cliff as you describe, though. In this CASE structure, that suggests that you're nearly always testing every regexp because they're all failing. I have to think there's probably a better way to do whatever you're trying to do, but there's not enough info here about your underlying goal to suggest a better approach. At the very least, if you need a many-armed CASE, it behooves you to make sure the common cases appear early. regards, tom lane
Touche! Thanks a lot.

Looking more at the data, yes, it goes very often to the ELSE clause, and therefore reaches MAX_CACHED_RES.

Is there any way to increase that value?

Basically, I have several tables containing millions of rows and, let's say, 5 columns. Those five columns, depending on their combination, give me a 6th value. We have complex patterns to match, and simple LIKE / EQUAL and so on wouldn't be enough. This can be applied to any number of tables, so we refactored this process into a function that we can use in the SELECT statement, giving only the 5 values each time.

I wouldn't mind using a table and mapping it through a join if it were for my own use. But the final query has to be readable and usable for an almost-non-initiated SQL user... So a function with an encapsulated CASE WHEN seemed to be a good idea and has so far worked nicely.

But we might consider changing it if we have no other choice...

Regards,
Kevin

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Dienstag, 31. März 2015 15:59
To: Kevin Viraud
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Weird CASE WHEN behaviour causing query to be suddenly very slow

> Do all the arms of the CASE usually fail, leaving you at the ELSE?
> I suspect what's happening is that you're running into the MAX_CACHED_RES
> limit in src/backend/utils/adt/regexp.c [...]
On 3/31/15 10:58 AM, Kevin Viraud wrote:
> Looking more at the data yes it goes very often to ELSE Clause. And
> therefore reaching the MAX_CACHED_RES.
>
> In there anyway to increase that value ?

Sure, change it and re-compile. But be aware that increasing it will probably increase the cost of some other stuff, so it's a tradeoff.

If this is that complex, though, you would very likely do better in plperl, especially if you could pre-compile the REs. AFAIK there's no way to do that in Postgres, though it might be interesting to add that ability.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
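(Jim's plperl idea could look something like the hypothetical sketch below; the function name and rule list are invented for illustration. Perl's qr// objects are compiled once and kept in the session-level %_SHARED hash, so the rule list can grow well past the 32-entry SQL regex cache without recompilation.)

```sql
-- Hypothetical PL/Perl variant: patterns are compiled once per backend.
CREATE OR REPLACE FUNCTION get_channel_plperl(text) RETURNS text
LANGUAGE plperl AS
$$
    my ($source) = @_;
    $_SHARED{rules} ||= [            # built and compiled once per session
        [ qr/^$/,  'Not tracked'  ],
        [ qr/^1$/, 'Not tracked1' ],
        # ... more generated rules ...
    ];
    for my $r (@{ $_SHARED{rules} }) {
        return $r->[1] if $source =~ $r->[0];
    }
    return 'Others';
$$;
```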
We have to confirm the theory first: a 'perf top' sampling during the two runs shall give enough information.

Regards,
Qingqing

On Tue, Mar 31, 2015 at 8:58 AM, Kevin Viraud <kevin.viraud@rocket-internet.de> wrote:
> Touche ! Thanks a lot.
>
> Looking more at the data yes it goes very often to ELSE Clause. And
> therefore reaching the MAX_CACHED_RES. [...]