Weird CASE WHEN behaviour causing query to be suddenly very slow

Поиск
Список
Период
Сортировка
От Kevin Viraud
Тема Weird CASE WHEN behaviour causing query to be suddenly very slow
Дата
Msg-id 00a001d06b90$2d1bbe80$87533b80$@rocket-internet.de
обсуждение исходный текст
Ответы Re: Weird CASE WHEN behaviour causing query to be suddenly very slow  (Pavel Stehule <pavel.stehule@gmail.com>)
Re: Weird CASE WHEN behaviour causing query to be suddenly very slow  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-performance

Hi,

 

I have an issue with a rather large CASE WHEN and I cannot figure out why  it is so slow...

 

First, here is my test query :

 

SELECT CASE WHEN dwh_company_id = 1

                                                                              THEN CASE

 

                                                                                   WHEN   wv.source ~ '^$' THEN 'Not tracked'

                                                                                              WHEN  wv.source ~ '^1$' THEN 'Not tracked1'

                                                                                              WHEN  wv.source ~ '^2$' THEN 'Not tracked2'

                                                                                              WHEN  wv.source ~ '^3$' THEN 'Not tracked3'

                                                                                              WHEN  wv.source ~ '^4$' THEN 'Not tracked4'

                                                                                              WHEN  wv.source ~ '^5$' THEN 'Not tracked5'

                                                                                              WHEN  wv.source ~ '^6$' THEN 'Not tracked6'

                                                                                              WHEN  wv.source ~ '^7$' THEN 'Not tracked7'

                                                                                              WHEN  wv.source ~ '^8$' THEN 'Not tracked8'

                                                                                              WHEN  wv.source ~ '^9$' THEN 'Not tracked9'

                                                                                              WHEN  wv.source ~ '^10$' THEN 'Not tracked10'

                                                                                              WHEN  wv.source ~ '^11$' THEN 'Not tracked11'

                                                                                              WHEN  wv.source ~ '^12$' THEN 'Not tracked12'

                                                                                              WHEN  wv.source ~ '^13$' THEN 'Not tracked13'

                                                                                              WHEN  wv.source ~ '^14$' THEN 'Not tracked14'

                                                                                              WHEN  wv.source ~ '^15$' THEN 'Not tracked15'

                                                                                              WHEN  wv.source ~ '^16$' THEN 'Not tracked16'

                                                                                              WHEN  wv.source ~ '^17$' THEN 'Not tracked17'

                                                                                              WHEN  wv.source ~ '^18$' THEN 'Not tracked18'

                                                                                              WHEN  wv.source ~ '^19$' THEN 'Not tracked19'

                                                                                              WHEN  wv.source ~ '^20$' THEN 'Not tracked20'

                                                                                              WHEN  wv.source ~ '^21$' THEN 'Not tracked21'

                                                                                              WHEN  wv.source ~ '^22$' THEN 'Not tracked22'

                                                                                              WHEN  wv.source ~ '^23$' THEN 'Not tracked23'

                                                                                              WHEN  wv.source ~ '^24$' THEN 'Not tracked24'

                                                                                              WHEN  wv.source ~ '^25$' THEN 'Not tracked25'

                                                                                              WHEN  wv.source ~ '^26$' THEN 'Not tracked26'

                                                                                              WHEN  wv.source ~ '^27$' THEN 'Not tracked27'

                                                                                              WHEN  wv.source ~ '^28$' THEN 'Not tracked28'

                                                                                              --WHEN  wv.source ~ '^29$' THEN 'Not tracked29'

                                                                                              WHEN  wv.source ~ '^30$' THEN 'Not tracked30'

                                                                                              WHEN  wv.source ~ '^31$' THEN 'Not tracked31'

                                                                                              WHEN  wv.source ~ '^32$' THEN 'Not tracked32'

                                                                                              END

                               ELSE

                               'Others'

                               END as channel

FROM (

                               SELECT wv.id,

                                      wv.ga_id,

                                      split_part(wv.ga_source_medium, ' / ', 1) as source,

                                      ga.dwh_source_id,

                                      s.dwh_company_id

                               FROM marketing.web_visits wv

                               INNER JOIN dwh_metadata.google_analytics ga ON ga.ga_id = wv.ga_id

                               INNER JOIN dwh_manager.sources s ON ga.dwh_source_id =s.dwh_source_id

                               --WHERE s.dwh_company_id = 1

                               LIMIT 100000

                ) wv

 

 

This is a pretty simple case,  my subquery (or CTE when using WITH statement) should return 5 fields with more or less this structure :

Id : character(32)

Ga_id : bigint

Source : character(32)

Medium : character(32)

dwh_company_id : bigint

 

On top of which I apply a case when statement…

 

Now the weird thing is, using this query I notice a significant drop in performance as the “case when” is getting bigger. If I run the query as if, I get the following exec plain and execution time:

Subquery Scan on wv  (cost=6.00..29098.17 rows=100000 width=36) (actual time=0.828..22476.917 rows=100000 loops=1)

   Buffers: shared hit=3136

   ->  Limit  (cost=6.00..11598.17 rows=100000 width=58) (actual time=0.209..133.429 rows=100000 loops=1)

         Buffers: shared hit=3136

         ->  Hash Join  (cost=6.00..1069811.24 rows=9228690 width=58) (actual time=0.208..119.297 rows=100000 loops=1)

               Hash Cond: (wv_1.ga_id = ga.ga_id)

               Buffers: shared hit=3136

               ->  Seq Scan on web_visits wv_1  (cost=0.00..877005.78 rows=20587078 width=50) (actual time=0.004..18.412 rows=100000 loops=1)

                     Buffers: shared hit=3133

               ->  Hash  (cost=5.50..5.50 rows=40 width=12) (actual time=0.184..0.184 rows=111 loops=1)

                     Buckets: 1024  Batches: 1  Memory Usage: 5kB

                     Buffers: shared hit=3

                     ->  Hash Join  (cost=1.88..5.50 rows=40 width=12) (actual time=0.056..0.148 rows=111 loops=1)

                           Hash Cond: (ga.dwh_source_id = s.dwh_source_id)

                           Buffers: shared hit=3

                           ->  Seq Scan on google_analytics ga  (cost=0.00..2.89 rows=89 width=8) (actual time=0.007..0.028 rows=111 loops=1)

                                 Buffers: shared hit=2

                           ->  Hash  (cost=1.39..1.39 rows=39 width=8) (actual time=0.042..0.042 rows=56 loops=1)

                                 Buckets: 1024  Batches: 1  Memory Usage: 3kB

                                 Buffers: shared hit=1

                                 ->  Seq Scan on sources s  (cost=0.00..1.39 rows=39 width=8) (actual time=0.005..0.020 rows=56 loops=1)

                                       Buffers: shared hit=1

 Planning time: 0.599 ms

 Execution time: 22486.216 ms

 

Then try commenting out only one line in the case when and the query run 10x faster :

 

Subquery Scan on wv  (cost=6.00..28598.17 rows=100000 width=36) (actual time=0.839..2460.002 rows=100000 loops=1)

   Buffers: shared hit=3136

   ->  Limit  (cost=6.00..11598.17 rows=100000 width=58) (actual time=0.210..112.043 rows=100000 loops=1)

         Buffers: shared hit=3136

         ->  Hash Join  (cost=6.00..1069811.24 rows=9228690 width=58) (actual time=0.209..99.513 rows=100000 loops=1)

               Hash Cond: (wv_1.ga_id = ga.ga_id)

               Buffers: shared hit=3136

               ->  Seq Scan on web_visits wv_1  (cost=0.00..877005.78 rows=20587078 width=50) (actual time=0.004..14.048 rows=100000 loops=1)

                     Buffers: shared hit=3133

               ->  Hash  (cost=5.50..5.50 rows=40 width=12) (actual time=0.184..0.184 rows=111 loops=1)

                     Buckets: 1024  Batches: 1  Memory Usage: 5kB

                     Buffers: shared hit=3

                     ->  Hash Join  (cost=1.88..5.50 rows=40 width=12) (actual time=0.058..0.146 rows=111 loops=1)

                           Hash Cond: (ga.dwh_source_id = s.dwh_source_id)

                           Buffers: shared hit=3

                           ->  Seq Scan on google_analytics ga  (cost=0.00..2.89 rows=89 width=8) (actual time=0.007..0.025 rows=111 loops=1)

                                 Buffers: shared hit=2

                           ->  Hash  (cost=1.39..1.39 rows=39 width=8) (actual time=0.042..0.042 rows=56 loops=1)

                                 Buckets: 1024  Batches: 1  Memory Usage: 3kB

                                 Buffers: shared hit=1

                                 ->  Seq Scan on sources s  (cost=0.00..1.39 rows=39 width=8) (actual time=0.006..0.021 rows=56 loops=1)

                                       Buffers: shared hit=1

 Planning time: 0.583 ms

 Execution time: 2467.484 ms

 

Why this drop in performance for only one (in this simple example) condition ? I do not really understand it. If I add more conditions to the query (let say 1 or 2) it is also getting slower. And it’s not a few ms, it is around 5 sec or so. (which is huge considering I only take in my example 1/500 of my data with LIMIT.

 

Before we deviate from the problem I have (which is why the sudden drop of performance) let me clarify a few things about this query :

-          The purpose is not to rewrite it, with a join or whatever, the case when actually comes from a function which is auto-generated by another app we have

-          My example is pretty simple and regex expressions could be replaced by equals, the real case when query contains way more complicated regex

-          This is subset of my CASE WHEN, it is much bigger, I cut it at the “bottleneck” point for this post.

 

Thanks a lot.

 

Best Regards,

 

Kevin

 

В списке pgsql-performance по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: views much slower in 9.3 than 8.4
Следующее
От: Pavel Stehule
Дата:
Сообщение: Re: Weird CASE WHEN behaviour causing query to be suddenly very slow