Re: Weird CASE WHEN behaviour causing query to be suddenly very slow

Поиск
Список
Период
Сортировка
От Pavel Stehule
Тема Re: Weird CASE WHEN behaviour causing query to be suddenly very slow
Дата
Msg-id CAFj8pRDMZe1ubWdC76dWX1-gMF-LzvBU2O=NjHonKMcVecoetg@mail.gmail.com
обсуждение исходный текст
Ответ на Weird CASE WHEN behaviour causing query to be suddenly very slow  ("Kevin Viraud" <kevin.viraud@rocket-internet.de>)
Ответы Re: Weird CASE WHEN behaviour causing query to be suddenly very slow
Список pgsql-performance
Hi

long CASE can be problem. Why you don't use a dictionary table and join?

Regards

Pavel

2015-03-31 10:53 GMT+02:00 Kevin Viraud <kevin.viraud@rocket-internet.de>:

Hi,

 

I have an issue with a rather large CASE WHEN and I cannot figure out why  it is so slow...

 

First, here is my test query :

 

SELECT CASE WHEN dwh_company_id = 1

                                                                              THEN CASE

 

                                                                                   WHEN   wv.source ~ '^$' THEN 'Not tracked'

                                                                                              WHEN  wv.source ~ '^1$' THEN 'Not tracked1'

                                                                                              WHEN  wv.source ~ '^2$' THEN 'Not tracked2'

                                                                                              WHEN  wv.source ~ '^3$' THEN 'Not tracked3'

                                                                                              WHEN  wv.source ~ '^4$' THEN 'Not tracked4'

                                                                                              WHEN  wv.source ~ '^5$' THEN 'Not tracked5'

                                                                                              WHEN  wv.source ~ '^6$' THEN 'Not tracked6'

                                                                                              WHEN  wv.source ~ '^7$' THEN 'Not tracked7'

                                                                                              WHEN  wv.source ~ '^8$' THEN 'Not tracked8'

                                                                                              WHEN  wv.source ~ '^9$' THEN 'Not tracked9'

                                                                                              WHEN  wv.source ~ '^10$' THEN 'Not tracked10'

                                                                                              WHEN  wv.source ~ '^11$' THEN 'Not tracked11'

                                                                                              WHEN  wv.source ~ '^12$' THEN 'Not tracked12'

                                                                                              WHEN  wv.source ~ '^13$' THEN 'Not tracked13'

                                                                                              WHEN  wv.source ~ '^14$' THEN 'Not tracked14'

                                                                                              WHEN  wv.source ~ '^15$' THEN 'Not tracked15'

                                                                                              WHEN  wv.source ~ '^16$' THEN 'Not tracked16'

                                                                                              WHEN  wv.source ~ '^17$' THEN 'Not tracked17'

                                                                                              WHEN  wv.source ~ '^18$' THEN 'Not tracked18'

                                                                                              WHEN  wv.source ~ '^19$' THEN 'Not tracked19'

                                                                                              WHEN  wv.source ~ '^20$' THEN 'Not tracked20'

                                                                                              WHEN  wv.source ~ '^21$' THEN 'Not tracked21'

                                                                                              WHEN  wv.source ~ '^22$' THEN 'Not tracked22'

                                                                                              WHEN  wv.source ~ '^23$' THEN 'Not tracked23'

                                                                                              WHEN  wv.source ~ '^24$' THEN 'Not tracked24'

                                                                                              WHEN  wv.source ~ '^25$' THEN 'Not tracked25'

                                                                                              WHEN  wv.source ~ '^26$' THEN 'Not tracked26'

                                                                                              WHEN  wv.source ~ '^27$' THEN 'Not tracked27'

                                                                                              WHEN  wv.source ~ '^28$' THEN 'Not tracked28'

                                                                                              --WHEN  wv.source ~ '^29$' THEN 'Not tracked29'

                                                                                              WHEN  wv.source ~ '^30$' THEN 'Not tracked30'

                                                                                              WHEN  wv.source ~ '^31$' THEN 'Not tracked31'

                                                                                              WHEN  wv.source ~ '^32$' THEN 'Not tracked32'

                                                                                              END

                               ELSE

                               'Others'

                               END as channel

FROM (

                               SELECT wv.id,

                                      wv.ga_id,

                                      split_part(wv.ga_source_medium, ' / ', 1) as source,

                                      ga.dwh_source_id,

                                      s.dwh_company_id

                               FROM marketing.web_visits wv

                               INNER JOIN dwh_metadata.google_analytics ga ON ga.ga_id = wv.ga_id

                               INNER JOIN dwh_manager.sources s ON ga.dwh_source_id =s.dwh_source_id

                               --WHERE s.dwh_company_id = 1

                               LIMIT 100000

                ) wv

 

 

This is a pretty simple case,  my subquery (or CTE when using WITH statement) should return 5 fields with more or less this structure :

Id : character(32)

Ga_id : bigint

Source : character(32)

Medium : character(32)

dwh_company_id : bigint

 

On top of which I apply a case when statement…

 

Now the weird thing is, using this query I notice a significant drop in performance as the “case when” is getting bigger. If I run the query as if, I get the following exec plain and execution time:

Subquery Scan on wv  (cost=6.00..29098.17 rows=100000 width=36) (actual time=0.828..22476.917 rows=100000 loops=1)

   Buffers: shared hit=3136

   ->  Limit  (cost=6.00..11598.17 rows=100000 width=58) (actual time=0.209..133.429 rows=100000 loops=1)

         Buffers: shared hit=3136

         ->  Hash Join  (cost=6.00..1069811.24 rows=9228690 width=58) (actual time=0.208..119.297 rows=100000 loops=1)

               Hash Cond: (wv_1.ga_id = ga.ga_id)

               Buffers: shared hit=3136

               ->  Seq Scan on web_visits wv_1  (cost=0.00..877005.78 rows=20587078 width=50) (actual time=0.004..18.412 rows=100000 loops=1)

                     Buffers: shared hit=3133

               ->  Hash  (cost=5.50..5.50 rows=40 width=12) (actual time=0.184..0.184 rows=111 loops=1)

                     Buckets: 1024  Batches: 1  Memory Usage: 5kB

                     Buffers: shared hit=3

                     ->  Hash Join  (cost=1.88..5.50 rows=40 width=12) (actual time=0.056..0.148 rows=111 loops=1)

                           Hash Cond: (ga.dwh_source_id = s.dwh_source_id)

                           Buffers: shared hit=3

                           ->  Seq Scan on google_analytics ga  (cost=0.00..2.89 rows=89 width=8) (actual time=0.007..0.028 rows=111 loops=1)

                                 Buffers: shared hit=2

                           ->  Hash  (cost=1.39..1.39 rows=39 width=8) (actual time=0.042..0.042 rows=56 loops=1)

                                 Buckets: 1024  Batches: 1  Memory Usage: 3kB

                                 Buffers: shared hit=1

                                 ->  Seq Scan on sources s  (cost=0.00..1.39 rows=39 width=8) (actual time=0.005..0.020 rows=56 loops=1)

                                       Buffers: shared hit=1

 Planning time: 0.599 ms

 Execution time: 22486.216 ms

 

Then try commenting out only one line in the case when and the query run 10x faster :

 

Subquery Scan on wv  (cost=6.00..28598.17 rows=100000 width=36) (actual time=0.839..2460.002 rows=100000 loops=1)

   Buffers: shared hit=3136

   ->  Limit  (cost=6.00..11598.17 rows=100000 width=58) (actual time=0.210..112.043 rows=100000 loops=1)

         Buffers: shared hit=3136

         ->  Hash Join  (cost=6.00..1069811.24 rows=9228690 width=58) (actual time=0.209..99.513 rows=100000 loops=1)

               Hash Cond: (wv_1.ga_id = ga.ga_id)

               Buffers: shared hit=3136

               ->  Seq Scan on web_visits wv_1  (cost=0.00..877005.78 rows=20587078 width=50) (actual time=0.004..14.048 rows=100000 loops=1)

                     Buffers: shared hit=3133

               ->  Hash  (cost=5.50..5.50 rows=40 width=12) (actual time=0.184..0.184 rows=111 loops=1)

                     Buckets: 1024  Batches: 1  Memory Usage: 5kB

                     Buffers: shared hit=3

                     ->  Hash Join  (cost=1.88..5.50 rows=40 width=12) (actual time=0.058..0.146 rows=111 loops=1)

                           Hash Cond: (ga.dwh_source_id = s.dwh_source_id)

                           Buffers: shared hit=3

                           ->  Seq Scan on google_analytics ga  (cost=0.00..2.89 rows=89 width=8) (actual time=0.007..0.025 rows=111 loops=1)

                                 Buffers: shared hit=2

                           ->  Hash  (cost=1.39..1.39 rows=39 width=8) (actual time=0.042..0.042 rows=56 loops=1)

                                 Buckets: 1024  Batches: 1  Memory Usage: 3kB

                                 Buffers: shared hit=1

                                 ->  Seq Scan on sources s  (cost=0.00..1.39 rows=39 width=8) (actual time=0.006..0.021 rows=56 loops=1)

                                       Buffers: shared hit=1

 Planning time: 0.583 ms

 Execution time: 2467.484 ms

 

Why this drop in performance for only one (in this simple example) condition ? I do not really understand it. If I add more conditions to the query (let say 1 or 2) it is also getting slower. And it’s not a few ms, it is around 5 sec or so. (which is huge considering I only take in my example 1/500 of my data with LIMIT.

 

Before we deviate from the problem I have (which is why the sudden drop of performance) let me clarify a few things about this query :

-          The purpose is not to rewrite it, with a join or whatever, the case when actually comes from a function which is auto-generated by another app we have

-          My example is pretty simple and regex expressions could be replaced by equals, the real case when query contains way more complicated regex

-          This is subset of my CASE WHEN, it is much bigger, I cut it at the “bottleneck” point for this post.

 

Thanks a lot.

 

Best Regards,

 

Kevin

 


В списке pgsql-performance по дате отправления:

Предыдущее
От: "Kevin Viraud"
Дата:
Сообщение: Weird CASE WHEN behaviour causing query to be suddenly very slow
Следующее
От: "Kevin Viraud"
Дата:
Сообщение: Re: Weird CASE WHEN behaviour causing query to be suddenly very slow