Re: BUG #17330: EXPLAIN hangs and very long query plans

From: Pavel Stehule
Subject: Re: BUG #17330: EXPLAIN hangs and very long query plans
Date:
Msg-id CAFj8pRBhVvmPtyT5HLEF8RW=g+K1g2D-RMcPuHzqeHYbLK8Oqw@mail.gmail.com
In reply to: Re: BUG #17330: EXPLAIN hangs and very long query plans  (Strahinja Kustudić <strahinjak@nordeus.com>)
List: pgsql-bugs


On Fri, Dec 10, 2021 at 4:52 PM Strahinja Kustudić <strahinjak@nordeus.com> wrote:
On Fri, Dec 10, 2021 at 10:53 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
Hi

On Fri, Dec 10, 2021 at 10:25 AM Strahinja Kustudić <strahinjak@nordeus.com> wrote:
On Fri, Dec 10, 2021 at 2:36 AM Peter Geoghegan <pg@bowt.ie> wrote:
Sounds like a problem with get_actual_variable_range(), which can scan
indexes at plan time to determine minimum or maximum values.

This actually has been improved quite a bit since Postgres 10. So as
Jeff said, seems like you might benefit from upgrading to a newer
major version. v11 has improved things in this exact area.
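
As an illustration of that mechanism, here is a minimal sketch (table and column names are hypothetical): a large DELETE leaves dead tuples at one end of an index, and until VACUUM removes them, planning a query with a range predicate near that end makes the planner step over all of them while looking up the actual minimum or maximum.

-- hypothetical repro of slow planning after a mass delete
CREATE TABLE events (id bigint PRIMARY KEY, created_at timestamptz NOT NULL);
CREATE INDEX ON events (created_at);
INSERT INTO events
SELECT g, now() - g * interval '1 second'
FROM generate_series(1, 5000000) g;
ANALYZE events;

-- delete a large chunk of the oldest rows; their index entries stay until VACUUM
DELETE FROM events WHERE created_at < now() - interval '40 days';

-- planning this makes the planner look up the actual minimum of created_at
-- (get_actual_variable_range), walking over the dead entries at the low end
EXPLAIN SELECT * FROM events WHERE created_at < now() - interval '50 days';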

On my Docker instance, when I execute EXPLAIN it starts reading a lot of data. The indexes of the biggest table the query touches total 50GB, so my guess is that it is reading those indexes.

I allowed EXPLAIN in Docker to finish; it took almost 500 seconds and was reading data the whole time. After I reindexed the biggest table, EXPLAIN finished instantly. Can index corruption cause this?

The index can be bloated.

Is bloat alone enough to make it this slow, or does the index need to be corrupted?

Corruption and bloat are two very different things. A bloated index is slower; a corrupted index returns wrong results.
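
Both can be checked from SQL if that helps (a sketch; the index name is only an example):

-- bloat check: low avg_leaf_density / high leaf_fragmentation suggests a bloated index
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT avg_leaf_density, leaf_fragmentation FROM pgstatindex('mail_pkey');

-- corruption check: raises an error if the B-tree structure is broken
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('mail_pkey');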

 

Note that this started happening in production after we deleted a few million rows from the biggest table.

It may be the same issue.


It looks similar, because there were deletes, after which planning a query started taking a few seconds.

I just upgraded Postgres to 12 with pg_upgrade and ran ANALYZE on the whole DB. The behavior is similar, but EXPLAIN now finishes in 19 seconds instead of 500 seconds, so the issue is still there. Here is the whole EXPLAIN ANALYZE (I added ANALYZE so that it displays the planning duration):

Sort  (cost=1946.04..1946.16 rows=46 width=642) (actual time=10.929..10.932 rows=20 loops=1)
   Sort Key: m.delivery_start DESC
   Sort Method: quicksort  Memory: 44kB
   ->  Nested Loop Left Join  (cost=10.19..1944.77 rows=46 width=642) (actual time=2.305..10.401 rows=20 loops=1)
         Join Filter: ((c.game_name = csu.game_name) AND (um.user_id = csu.user_id) AND (c.id = csu.campaign_id))
         Filter: ((c.id IS NULL) OR (c.active AND (now() >= c.campaign_start) AND ((now() <= c.campaign_end) OR (c.campaign_end IS NULL)) AND ((c.segment_ids <@ '{5632,-2,2,5634,9731,5640,5641,9738,13,3599,5648,5655,13343,5664,13344,13345,13346,13347,13348,13350,13351,13352,5673,13353,13354,1067,13355,13356,13357,13358,13359,5680,13360,6705,13361,13362,13363,13364,13365,13366,5687,13367,13368,13369,20538,13370,13371,13372,13373,6718,13374,19007,19008,13378,5699,13379,13380,19013,3142,3143,19015,13385,5707,6731,20555,19019,3149,19021,3150,591,19023,592,19024,6737,19027,19028,5717,6743,17496,19033,13402,19035,6750,17502,19039,17504,19041,19043,19044,17508,6757,17510,19047,104,19048,17512,17514,7788,6764,621,19053,623,19055,7792,625,19059,7796,1141,19061,19063,7800,19064,5241,19067,19068,3713,19073,1667,19075,1668,19079,19081,19083,19084,5262,19087,19088,8338,8339,7828,19093,20631,19095,20632,20633,12955,19099,19101,19103,19104,19107,19108,19113,19115,8876,19119,3760,19121,19123,19124,3785,4814,16605,4324,16622,16623,21253,21254,4359,21258,5901,5915,4900,22832,22835,22836,9013,3895,3902,1865,1866,1379,1380,4978,7027,13685,14203,2428,2429,21887,21888,4485,22932,22933,21911,21417,432,434,20418,968,969,470,7642,991,992,10209,1512,1513,23024,9201,9207,23032,9208,9214}'::integer[]) OR (csu.user_id = '-XXXXXXXXXXXXXXXXXX'::bigint))))
         Rows Removed by Filter: 11
         ->  Nested Loop  (cost=10.03..1910.89 rows=163 width=696) (actual time=2.029..10.092 rows=31 loops=1)
               ->  Hash Left Join  (cost=9.46..695.80 rows=61 width=684) (actual time=0.729..5.991 rows=55 loops=1)
                     Hash Cond: ((m.game_name = c.game_name) AND (m.campaign_id = c.id))
                     ->  Index Scan using mail_pkey on mail m  (cost=0.27..686.29 rows=61 width=552) (actual time=0.297..5.524 rows=55 loops=1)
                           Filter: (active AND (game_name = 'topeleven'::game_name) AND (now() >= delivery_start) AND ((now() <= delivery_end) OR (delivery_end IS NULL)))
                           Rows Removed by Filter: 97
                     ->  Hash  (cost=7.83..7.83 rows=91 width=136) (actual time=0.411..0.412 rows=91 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 26kB
                           ->  Seq Scan on campaign c  (cost=0.00..7.83 rows=91 width=136) (actual time=0.265..0.371 rows=91 loops=1)
                                 Filter: (game_name = 'topeleven'::game_name)
                                 Rows Removed by Filter: 55
               ->  Index Scan using user_mail_pkey on user_mail um  (cost=0.57..19.88 rows=4 width=24) (actual time=0.074..0.074 rows=1 loops=55)
                     Index Cond: ((user_id = '-XXXXXXXXXXXXXXXXXX'::bigint) AND (game_name = 'topeleven'::game_name) AND (mail_id = m.id))
                     Filter: ((m.mail_type <> 'question'::mail_type) OR (mail_status <> 'completed'::mail_status))
         ->  Materialize  (cost=0.15..28.18 rows=1 width=16) (actual time=0.009..0.009 rows=0 loops=31)
               ->  Index Only Scan using campaign_static_users_pkey on campaign_static_users csu  (cost=0.15..28.17 rows=1 width=16) (actual time=0.266..0.266 rows=0 loops=1)
                     Index Cond: ((user_id = '-XXXXXXXXXXXXXXXXXX'::bigint) AND (game_name = 'topeleven'::game_name))
                     Heap Fetches: 0
 Planning Time: 19408.862 ms
 Execution Time: 10.978 ms
(27 rows)
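
A sketch of how the planning time can also be measured without executing the query at all, using EXPLAIN's SUMMARY option (the query here is just an example against the mail table):

-- plain EXPLAIN with SUMMARY prints "Planning Time" without running the query
EXPLAIN (SUMMARY) SELECT * FROM mail WHERE delivery_start <= now();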


Just to mention again, reindexing the biggest table fixes the issue, but I'm more curious why this happened and how we can prevent it from happening again. Or is it already fixed so that it cannot happen in the future?

Maybe VACUUM can solve this issue too.
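
For example (a sketch; the table and index names from the plan are used only as examples), a plain VACUUM removes the dead index entries the planner has to step over, and on v12 the index can also be rebuilt without blocking writes:

-- clean up dead tuples (and their index entries) left behind by the big DELETE
VACUUM (VERBOSE) mail;

-- or rebuild a bloated index without taking a write-blocking lock (new in v12)
REINDEX INDEX CONCURRENTLY mail_pkey;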

A massive delete is a common issue - you can try partitioning, and then instead of a massive DELETE, drop the old partition.
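
A minimal sketch of that approach with hypothetical names: range-partition the big table, then remove old data by dropping a partition instead of running a large DELETE.

-- hypothetical example: partition by month
CREATE TABLE logs (id bigint NOT NULL, created_at timestamptz NOT NULL)
    PARTITION BY RANGE (created_at);

CREATE TABLE logs_2021_11 PARTITION OF logs
    FOR VALUES FROM ('2021-11-01') TO ('2021-12-01');
CREATE TABLE logs_2021_12 PARTITION OF logs
    FOR VALUES FROM ('2021-12-01') TO ('2022-01-01');

-- removing a month of data is nearly instant and leaves no dead tuples behind
ALTER TABLE logs DETACH PARTITION logs_2021_11;
DROP TABLE logs_2021_11;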

Regards

Pavel



Regards,
Strahinja
