Thread: BUG #17710: Slow queries (100% CPU) after auto-vacuum


BUG #17710: Slow queries (100% CPU) after auto-vacuum

From:
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17710
Logged by:          Jean-Francois Levesque
Email address:      jf.levesque@gmail.com
PostgreSQL version: 11.18
Operating system:   Docker postgres:11.18-bullseye on macOS 13.0.1
Description:

Hi!

I am having an issue that started with 11.0 and is still present with 11.18.
The issue is not present with 10.23.

The project uses Django with PostgreSQL as the backend DB. I have one
pytest test that does a lot of work (~4500 SQL queries) and normally
takes about 17 seconds to run. The same test has been running for
years without problems.

When I run this test in a loop (with a 10 s sleep between executions),
after 10-20 runs the test starts to take ~300 seconds, and all subsequent
executions are that slow. At that point, one PostgreSQL process sits at
100% CPU for the duration of the test.

When the issue occurs, the SQL query [1] (see below) takes a long time
(between 25 and 50 seconds).

During my troubleshooting, I realized that the slow queries start to happen
after an autovacuum on the tables (see logs [2]). If I disable autovacuum
(autovacuum=off), I am not able to reproduce the problem.
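
For reference, a sketch of one way to turn autovacuum off globally on a throwaway test instance (assuming superuser access; the setting can also be passed on the server command line):

-- Sketch: disable autovacuum cluster-wide; it only needs a config reload, not a restart.
ALTER SYSTEM SET autovacuum = off;
SELECT pg_reload_conf();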

The pytest execution is done inside a transaction that is rolled back after
the test completes. Could this be related?
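
Roughly, each test execution looks like this from the database's point of view (a sketch, not the actual test code):

BEGIN;
-- ~4500 queries issued by the test, including the UPDATE shown in [1] below
-- (autovacuum/autoanalyze can kick in on the touched tables while this is open)
ROLLBACK;  -- nothing the test did is ever committed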

I am testing mostly on macOS 13.0.1 (MacBook Pro, M1 Pro) with Docker
4.12.0. I have been able to reproduce the same issue on Bitbucket Pipelines,
again using Docker containers. The official PostgreSQL containers are used
without configuration changes.

Is this a known issue? Is there something I need to tune? How can I
troubleshoot this issue?

Thank you for your help.

Jean-Francois


[1]: UPDATE "shared_models_session" SET "plan_disabled" = true WHERE
"shared_models_session"."id" IN (SELECT V0."id" FROM "shared_models_session"
V0 INNER JOIN "shared_models_sessionmonitoring" V1 ON (V0."id" =
V1."session_id") WHERE V1."monitoring_id" IN (SELECT U0."id" FROM
"shared_models_monitoring" U0 WHERE U0."owner_id" = 441) FOR UPDATE OF V0)


[2] Auto vacuum logs:
obkio_db   | 2022-12-09 15:02:43.323 UTC [842] LOG:  00000: automatic vacuum
of table "test.public.shared_models_sessionmonitoring": index scans: 1
obkio_db   |    pages: 0 removed, 15 remain, 0 skipped due to pins, 0
skipped frozen
obkio_db   |    tuples: 750 removed, 0 remain, 0 are dead but not yet
removable, oldest xmin: 1794
obkio_db   |    buffer usage: 337 hits, 1 misses, 3 dirtied
obkio_db   |    avg read rate: 0.002 MB/s, avg write rate: 0.005 MB/s
obkio_db   |    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed:
5.17 s
obkio_db   | 2022-12-09 15:02:43.323 UTC [842] LOCATION:  lazy_vacuum_rel,
vacuumlazy.c:407
obkio_db   | 2022-12-09 15:02:43.735 UTC [842] LOG:  00000: automatic vacuum
of table "test.public.shared_models_session": index scans: 1
obkio_db   |    pages: 0 removed, 442 remain, 0 skipped due to pins, 0
skipped frozen
obkio_db   |    tuples: 479 removed, 0 remain, 0 are dead but not yet
removable, oldest xmin: 1794
obkio_db   |    buffer usage: 3357 hits, 1 misses, 2 dirtied
obkio_db   |    avg read rate: 0.022 MB/s, avg write rate: 0.044 MB/s
obkio_db   |    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed:
0.35 s
obkio_db   | 2022-12-09 15:02:43.735 UTC [842] LOCATION:  lazy_vacuum_rel,
vacuumlazy.c:407


Re: BUG #17710: Slow queries (100% CPU) after auto-vacuum

From:
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> [ query gets slower after auto-vacuum ]

This sounds less like an auto-vacuum problem than an auto-analyze
problem.  That is, probably a background update of table statistics
is happening and that is triggering the planner to pick a different
plan.  In an ideal world, more up-to-date stats would always result
in a better plan, but reality is not always so kind.

There is not enough info here to diagnose the problem though.
At minimum we need to see EXPLAIN ANALYZE output for the problem
query in both the fast and slow states.  We have a wiki page about
reporting slow-query problems [1] which asks for quite a bit of
info, but for starters maybe the EXPLAIN ANALYZE results would
be enough.
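
For instance, something along these lines for the statement shown in the report (a sketch; BUFFERS is optional but usually worth including, and note that EXPLAIN ANALYZE really executes the UPDATE, so wrap it in a transaction you roll back):

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE "shared_models_session" SET "plan_disabled" = true
WHERE "shared_models_session"."id" IN (
    SELECT V0."id"
    FROM "shared_models_session" V0
    INNER JOIN "shared_models_sessionmonitoring" V1 ON (V0."id" = V1."session_id")
    WHERE V1."monitoring_id" IN (
        SELECT U0."id" FROM "shared_models_monitoring" U0 WHERE U0."owner_id" = 441
    )
    FOR UPDATE OF V0
);
ROLLBACK;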

> I am having an issue that started with 11.0 and is still present with 11.18.
> The issue is not present with 10.23.

If you could get EXPLAIN ANALYZE results from v10 as well as v11,
that would be useful too.

            regards, tom lane

[1] https://wiki.postgresql.org/wiki/Slow_Query_Questions



Re: BUG #17710: Slow queries (100% CPU) after auto-vacuum

From:
Jean-Francois Levesque
Date:
Hi Tom,

Thanks for the wiki page. Here are the EXPLAIN ANALYZE results for the query.

The EXPLAINs were executed inside the transaction, at the end of the test, just before the rollback. Initially I ran them outside the transaction, but there the plans are consistently clean.

FYI, I also have this issue with v12, v13, and v14. I have not been able to test v15 yet.

Thanks,

Jean-Francois

------- v10 -------

v1/shared_models/tests/test_session.py Update on shared_models_session  (cost=46.88..175.60 rows=16 width=530) (actual time=0.003..0.004 rows=0 loops=1)
  ->  Nested Loop  (cost=46.88..175.60 rows=16 width=530) (actual time=0.003..0.003 rows=0 loops=1)
        ->  HashAggregate  (cost=46.60..46.76 rows=16 width=32) (actual time=0.003..0.003 rows=0 loops=1)
              Group Key: "ANY_subquery".id
              ->  Subquery Scan on "ANY_subquery"  (cost=4.79..46.56 rows=16 width=32) (actual time=0.002..0.002 rows=0 loops=1)
                    ->  LockRows  (cost=4.79..46.40 rows=16 width=22) (actual time=0.002..0.002 rows=0 loops=1)
                          ->  Nested Loop  (cost=4.79..46.24 rows=16 width=22) (actual time=0.002..0.002 rows=0 loops=1)
                                ->  Nested Loop  (cost=4.51..27.86 rows=16 width=16) (actual time=0.002..0.002 rows=0 loops=1)
                                      ->  Index Scan using shared_models_monitoring_owner_id_aedf0b23 on shared_models_monitoring u0  (cost=0.14..8.16 rows=1 width=10) (actual time=0.002..0.002 rows=0 loops=1)
                                            Index Cond: (owner_id = 28)
                                      ->  Bitmap Heap Scan on shared_models_sessionmonitoring v1  (cost=4.37..19.59 rows=11 width=14) (never executed)
                                            Recheck Cond: (monitoring_id = u0.id)
                                            ->  Bitmap Index Scan on shared_models_sessionmonitoring_monitoring_id_e9953e3e  (cost=0.00..4.36 rows=11 width=0) (never executed)
                                                  Index Cond: (monitoring_id = u0.id)
                                ->  Index Scan using shared_models_session_pkey on shared_models_session v0  (cost=0.28..1.15 rows=1 width=10) (never executed)
                                      Index Cond: (id = v1.session_id)
        ->  Index Scan using shared_models_session_pkey on shared_models_session  (cost=0.28..8.05 rows=1 width=453) (never executed)
              Index Cond: (id = "ANY_subquery".id)
Planning time: 0.116 ms
Execution time: 0.032 ms


------- v11 - normal speed -------

v1/shared_models/tests/test_session.py Update on shared_models_session  (cost=45.54..170.25 rows=16 width=530) (actual time=0.006..0.007 rows=0 loops=1)
  ->  Nested Loop  (cost=45.54..170.25 rows=16 width=530) (actual time=0.006..0.006 rows=0 loops=1)
        ->  HashAggregate  (cost=45.25..45.41 rows=16 width=32) (actual time=0.005..0.006 rows=0 loops=1)
              Group Key: "ANY_subquery".id
              ->  Subquery Scan on "ANY_subquery"  (cost=4.79..45.21 rows=16 width=32) (actual time=0.004..0.005 rows=0 loops=1)
                    ->  LockRows  (cost=4.79..45.05 rows=16 width=22) (actual time=0.004..0.005 rows=0 loops=1)
                          ->  Nested Loop  (cost=4.79..44.89 rows=16 width=22) (actual time=0.004..0.004 rows=0 loops=1)
                                ->  Nested Loop  (cost=4.51..27.86 rows=16 width=16) (actual time=0.004..0.004 rows=0 loops=1)
                                      ->  Index Scan using shared_models_monitoring_owner_id_aedf0b23 on shared_models_monitoring u0  (cost=0.14..8.16 rows=1 width=10) (actual time=0.004..0.004 rows=0 loops=1)
                                            Index Cond: (owner_id = 2)
                                      ->  Bitmap Heap Scan on shared_models_sessionmonitoring v1  (cost=4.37..19.59 rows=11 width=14) (never executed)
                                            Recheck Cond: (monitoring_id = u0.id)
                                            ->  Bitmap Index Scan on shared_models_sessionmonitoring_monitoring_id_e9953e3e  (cost=0.00..4.36 rows=11 width=0) (never executed)
                                                  Index Cond: (monitoring_id = u0.id)
                                ->  Index Scan using shared_models_session_pkey on shared_models_session v0  (cost=0.28..1.06 rows=1 width=10) (never executed)
                                      Index Cond: (id = v1.session_id)
        ->  Index Scan using shared_models_session_pkey on shared_models_session  (cost=0.28..7.80 rows=1 width=453) (never executed)
              Index Cond: (id = "ANY_subquery".id)
Planning Time: 0.162 ms
Execution Time: 0.061 ms


------- v11 - slow query -------

v1/shared_models/tests/test_session.py Update on shared_models_session  (cost=0.90..33.02 rows=1 width=530) (actual time=86918.598..86918.600 rows=0 loops=1)
  ->  Nested Loop Semi Join  (cost=0.90..33.02 rows=1 width=530) (actual time=86918.597..86918.598 rows=0 loops=1)
        Join Filter: (shared_models_session.id = "ANY_subquery".id)
        ->  Index Scan using shared_models_session_pkey on shared_models_session  (cost=0.25..8.26 rows=1 width=453) (actual time=0.005..1.762 rows=488 loops=1)
        ->  Subquery Scan on "ANY_subquery"  (cost=0.65..24.74 rows=1 width=32) (actual time=178.107..178.107 rows=0 loops=488)
              ->  LockRows  (cost=0.65..24.73 rows=1 width=22) (actual time=178.107..178.107 rows=0 loops=488)
                    ->  Nested Loop  (cost=0.65..24.72 rows=1 width=22) (actual time=178.106..178.106 rows=0 loops=488)
                          Join Filter: (v1.monitoring_id = u0.id)
                          ->  Nested Loop  (cost=0.50..16.54 rows=1 width=20) (actual time=0.537..177.918 rows=488 loops=488)
                                Join Filter: (v0.id = v1.session_id)
                                Rows Removed by Join Filter: 118828
                                ->  Index Scan using shared_models_sessionmonitoring_monitoring_id_e9953e3e on shared_models_sessionmonitoring v1  (cost=0.25..8.26 rows=1 width=14) (actual time=0.004..0.116 rows=488 loops=488)
                                ->  Index Scan using shared_models_session_pkey on shared_models_session v0  (cost=0.25..8.26 rows=1 width=10) (actual time=0.002..0.354 rows=244 loops=238144)
                          ->  Index Scan using shared_models_monitoring_owner_id_aedf0b23 on shared_models_monitoring u0  (cost=0.14..8.16 rows=1 width=10) (actual time=0.000..0.000 rows=0 loops=238144)
                                Index Cond: (owner_id = 6)
Planning Time: 0.111 ms
Execution Time: 86918.632 ms



Re: BUG #17710: Slow queries (100% CPU) after auto-vacuum

From:
Jean-Francois Levesque
Date:
FYI, I have been able to execute the same queries on a local database without the transaction that pytest adds, and I do not see the issue there.


Re: BUG #17710: Slow queries (100% CPU) after auto-vacuum

From:
David Rowley
Date:
On Tue, 13 Dec 2022 at 04:04, Jean-Francois Levesque
<jf.levesque@gmail.com> wrote:
> ------- v11 - normal speed -------
>
>                                             ->  Bitmap Index Scan on shared_models_sessionmonitoring_monitoring_id_e9953e3e  (cost=0.00..4.36 rows=11 width=0) (never executed)
>
> ------- v11 - slow query -------
>
>                                 ->  Index Scan using shared_models_sessionmonitoring_monitoring_id_e9953e3e on shared_models_sessionmonitoring v1  (cost=0.25..8.26 rows=1 width=14) (actual time=0.004..0.116 rows=488
>                                ->  Index Scan using shared_models_session_pkey on shared_models_session v0  (cost=0.25..8.26 rows=1 width=10) (actual time=0.002..0.354 rows=244 loops=238144)

It looks like the problem might be due to auto-vacuum only finding 1
live tuple in the table and setting pg_class.reltuples to 1.0.  It looks
like by the time the query runs there are 488 rows in that table,
not 1. That might be tricky to work around. The planner seems to
prefer a non-parameterized nested loop, thinking that only 1 row
will exist.  That choice turns out not to be a good one, as 488 rows
are found and the subquery is executed 488 times.
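
If that is what is happening, the mismatch should be visible directly (a sketch, using the table name from the plans above):

-- Sketch: compare the planner's stored estimate (reltuples) with the live row count.
SELECT c.relname, c.reltuples, c.relpages,
       s.n_live_tup, s.last_autovacuum, s.last_autoanalyze
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.relname = 'shared_models_sessionmonitoring';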

It would be good to see the SQL that produces this.  What's of most
interest is you seem to have a FOR UPDATE in the subquery.  It's
possible you could just rewrite the query using the UPDATE FROM syntax
and avoid this entire problem. If you're doing something like FOR
UPDATE (SKIP LOCKED | NOWAIT), then that's not going to be possible.
It's hard to speculate without seeing the SQL.

David



Re: BUG #17710: Slow queries (100% CPU) after auto-vacuum

From:
Jean-Francois Levesque
Date:
Here is the SQL query:

UPDATE "shared_models_session" SET "plan_disabled" = true WHERE
"shared_models_session"."id" IN (SELECT V0."id" FROM "shared_models_session"
V0 INNER JOIN "shared_models_sessionmonitoring" V1 ON (V0."id" =
V1."session_id") WHERE V1."monitoring_id" IN (SELECT U0."id" FROM
"shared_models_monitoring" U0 WHERE U0."owner_id" = 441) FOR UPDATE OF V0)



Re: BUG #17710: Slow queries (100% CPU) after auto-vacuum

From:
David Rowley
Date:
On Tue, 13 Dec 2022 at 14:46, Jean-Francois Levesque
<jf.levesque@gmail.com> wrote:
> UPDATE "shared_models_session" SET "plan_disabled" = true WHERE
> "shared_models_session"."id" IN (SELECT V0."id" FROM "shared_models_session"
> V0 INNER JOIN "shared_models_sessionmonitoring" V1 ON (V0."id" =
> V1."session_id") WHERE V1."monitoring_id" IN (SELECT U0."id" FROM
> "shared_models_monitoring" U0 WHERE U0."owner_id" = 441) FOR UPDATE OF V0)

There may be some subtle differences around when the records are
locked, but I wonder if you couldn't just write:

UPDATE "shared_models_session" SET "plan_disabled" = true
FROM shared_models_sessionmonitoring V1
WHERE "shared_models_session"."id" = V1."session_id"
  AND V1."monitoring_id" IN (SELECT U0."id" FROM
"shared_models_monitoring" U0 WHERE U0."owner_id" = 441);

which I think should get around the issue of the subquery in the
original version being executed once per matching row. I'm unsure if
there might have been some good reason to form the query in the way it
was formed. Getting around some deadlocking issue with another query
seems unlikely as there is no ORDER BY clause.

I'm not seeing why the UPDATE FROM syntax can't be used. You don't
seem to have mentioned anything about the schema design, but even if
the join produces multiple rows per shared_models_session row, since
you're just setting plan_disabled to true it does not really matter:
any subsequent matches will simply not be applied, because the current
command has already updated the row. The same wouldn't be true if you
were setting plan_disabled to some value taken from one of the other tables.

David



Re: BUG #17710: Slow queries (100% CPU) after auto-vacuum

From:
Jean-Francois Levesque
Date:
David, Tom,

We have been able to refactor the query to reduce the number of nested loops. With the new query, the issue is less frequent in our tests and takes less time when it does occur.

However, the issue still happens from time to time after an auto-vacuum/auto-analyze run when running inside a transaction.
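
One mitigation that may help for test runs like this (a sketch, not a fix) is to keep autovacuum away from the tables the test churns, so their stored estimates are not reset mid-run:

-- Sketch: per-table autovacuum opt-out, intended for the test database only.
ALTER TABLE shared_models_sessionmonitoring SET (autovacuum_enabled = false);
ALTER TABLE shared_models_session SET (autovacuum_enabled = false);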

Since this was working fine before v11 and something changed in v11, do you think something needs to be fixed on the PostgreSQL side?

Let me know if I can help in any way.

Thanks,

JF


