Thread: Simple DELETE on modest-size table runs 100% CPU forever

Simple DELETE on modest-size table runs 100% CPU forever

From: Craig James
Date:
I'm completely baffled by this problem: I'm doing a delete that joins three modest-sized tables, and it gets completely stuck: 100% CPU use forever.  Here's the query:

explain analyze
 select count(1) from registry.categories
  where category_id = 15 and id in
    (select c.id from registry.categories c
     left join registry.category_staging_15 st on (c.id = st.id)  where c.category_id = 15 and st.id is null);

If I leave out the "analyze", here's what I get (note that the categories_staging_N table's name changes every time; it's
created on demand as "create table categories_staging_n(id integer)").

Aggregate  (cost=193.54..193.55 rows=1 width=8)
  ->  Nested Loop Semi Join  (cost=0.84..193.54 rows=1 width=0)
        Join Filter: (categories.id = c.id)
        ->  Index Scan using i_categories_category_id on categories  (cost=0.42..2.44 rows=1 width=4)
              Index Cond: (category_id = 23)
        ->  Nested Loop Anti Join  (cost=0.42..191.09 rows=1 width=4)
              Join Filter: (c.id = st.id)
              ->  Index Scan using i_categories_category_id on categories c  (cost=0.42..2.44 rows=1 width=4)
                    Index Cond: (category_id = 23)
              ->  Seq Scan on category_staging_23 st  (cost=0.00..99.40 rows=7140 width=4)


The tables are small. From a debugging printout:
  7997 items in table registry.category_staging_15
228292 items in table registry.categories
309398 items in table registry.smiles
  7997 items in joined registry.category_staging_15 / registry.categories

What on Earth could be causing this simple query to be running 100% CPU for hours?

Postgres: 10.10
Ubuntu 16.04
This is a VirtualBox virtual machine running on a Mac host.

Everything else seems to work as expected; just this one query does this.

Thanks,
Craig

Re: Simple DELETE on modest-size table runs 100% CPU forever

From: Alvaro Herrera
Date:
On 2019-Nov-14, Craig James wrote:

> I'm completely baffled by this problem: I'm doing a delete that joins three
> modest-sized tables, and it gets completely stuck: 100% CPU use forever.

Do you have any FKs there?  If any delete is cascading, and you don't
have an index on the other side, it'd do tons of seqscans.
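For example, something like this lists the FK constraints that point at the
table you're deleting from (a sketch against the system catalogs; the table
name is taken from your query):

-- Each table listed here needs an index on its referencing column(s),
-- or every deleted categories row triggers a seqscan of that table.
select conrelid::regclass as referencing_table,
       conname,
       pg_get_constraintdef(oid) as fk_definition
  from pg_constraint
 where contype = 'f'
   and confrelid = 'registry.categories'::regclass;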

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Simple DELETE on modest-size table runs 100% CPU forever

From: Michael Lewis
Date:
> If I leave out the "analyze", here's what I get (note that the categories_staging_N table's name changes every time; it's
> created on demand as "create table categories_staging_n(id integer)").

How/when are they created? In the same statement? After creating them, are you analyzing these tables? If not, the optimizer is blind and may be choosing a bad plan by chance.
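For example (a minimal sketch of the pattern in question; the load step is
elided), an explicit analyze right after populating the staging table gives
the planner real row counts to work with:

create table registry.category_staging_15 (id integer);
-- ... populate the staging table here ...
analyze registry.category_staging_15;  -- without this, the planner guesses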

Re: Simple DELETE on modest-size table runs 100% CPU forever

From: Justin Pryzby
Date:
On Thu, Nov 14, 2019 at 02:19:51PM -0800, Craig James wrote:
> I'm completely baffled by this problem: I'm doing a delete that joins three
> modest-sized tables, and it gets completely stuck: 100% CPU use forever.
> Here's the query:
> 
> explain analyze
>  select count(1) from registry.categories
>   where category_id = 15 and id in
>     (select c.id from registry.categories c
>      left join registry.category_staging_15 st on (c.id = st.id)  where
> c.category_id = 15 and st.id is null);
> 
> If I leave out the "analyze", here's what I get (note that the

Do you mean that you're doing DELETE..USING, and that's an explain for SELECT
COUNT() with the same join conditions?  Can you show the explain for the DELETE,
and \d for the tables?

If there are FKs, do the other tables have indices on their referencing columns?

https://www.postgresql.org/docs/devel/static/ddl-constraints.html#DDL-CONSTRAINTS-FK
"Since a DELETE of a row from the referenced table [...] will require a scan of
the referencing table for rows matching the old value, it is often a good idea
to index the referencing columns too."
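For example (hypothetical names; this assumes registry.smiles has a column
categories_id with an FK referencing registry.categories):

-- Without this index, every row deleted from categories forces a
-- sequential scan of smiles for the FK check.
create index i_smiles_categories_id on registry.smiles (categories_id);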

Justin



Re: Simple DELETE on modest-size table runs 100% CPU forever

From: Andres Freund
Date:
Hi,

On 2019-11-14 14:19:51 -0800, Craig James wrote:
> I'm completely baffled by this problem: I'm doing a delete that joins three
> modest-sized tables, and it gets completely stuck: 100% CPU use forever.
> Here's the query:

I assume this is intended to be an equivalent SELECT? You did mention
DELETE, but I'm not seeing one here.  Could you actually show that query -
surely that didn't include a count() etc...  You can EXPLAIN DELETEs too.



> explain analyze
>  select count(1) from registry.categories
>   where category_id = 15 and id in
>     (select c.id from registry.categories c
>      left join registry.category_staging_15 st on (c.id = st.id)  where
> c.category_id = 15 and st.id is null);
> 
> If I leave out the "analyze", here's what I get (note that the
> categories_staging_N table's name changes every time; it's
> created on demand as "create table categories_staging_n(id integer)").

> Aggregate  (cost=193.54..193.55 rows=1 width=8)
>   ->  Nested Loop Semi Join  (cost=0.84..193.54 rows=1 width=0)
>         Join Filter: (categories.id = c.id)
>         ->  Index Scan using i_categories_category_id on categories
>  (cost=0.42..2.44 rows=1 width=4)
>               Index Cond: (category_id = 23)
>         ->  Nested Loop Anti Join  (cost=0.42..191.09 rows=1 width=4)
>               Join Filter: (c.id = st.id)
>               ->  Index Scan using i_categories_category_id on categories c
>  (cost=0.42..2.44 rows=1 width=4)
>                     Index Cond: (category_id = 23)
>               ->  Seq Scan on category_staging_23 st  (cost=0.00..99.40
> rows=7140 width=4)
> 
> The tables are small. From a debugging printout:


Is categories.category_id unique?  Does the plan change if you ANALYZE
the tables?


This plan doesn't look like it'd actually take long, if the estimates
are correct.


> What on Earth could be causing this simple query to be running 100% CPU for
> hours?

Is the DELETE actually taking that long, or the query you showed the
explain for, or both?

Greetings,

Andres Freund



Re: Simple DELETE on modest-size table runs 100% CPU forever

From: Craig James
Date:
On Thu, Nov 14, 2019 at 2:29 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2019-11-14 14:19:51 -0800, Craig James wrote:
> > I'm completely baffled by this problem: I'm doing a delete that joins three
> > modest-sized tables, and it gets completely stuck: 100% CPU use forever.
> > Here's the query:
>
> I assume this is intended to be an equivalent SELECT? You did mention
> DELETE, but I'm not seeing one here.  Could you actually show that query -
> surely that didn't include a count() etc...  You can EXPLAIN DELETEs too.

Sorry, my explanation was misleading. It is a "delete ... where id in (select ...)". But I discovered that the select part itself never completes, whether you include it in the delete or not. So I only showed the select, which I converted to a "select count(1) ..." for simplicity.
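For reference, the full statement has this shape (a sketch reconstructed from
the select below; the actual DELETE isn't shown in this thread):

delete from registry.categories
 where category_id = 15
   and id in
     (select c.id from registry.categories c
      left join registry.category_staging_15 st on (c.id = st.id)
      where c.category_id = 15 and st.id is null);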
 
> explain analyze
>  select count(1) from registry.categories
>   where category_id = 15 and id in
>     (select c.id from registry.categories c
>      left join registry.category_staging_15 st on (c.id = st.id)  where
> c.category_id = 15 and st.id is null);
>
> If I leave out the "analyze", here's what I get (note that the
> categories_staging_N table's name changes every time; it's
> created on demand as "create table categories_staging_n(id integer)").

> Aggregate  (cost=193.54..193.55 rows=1 width=8)
>   ->  Nested Loop Semi Join  (cost=0.84..193.54 rows=1 width=0)
>         Join Filter: (categories.id = c.id)
>         ->  Index Scan using i_categories_category_id on categories
>  (cost=0.42..2.44 rows=1 width=4)
>               Index Cond: (category_id = 23)
>         ->  Nested Loop Anti Join  (cost=0.42..191.09 rows=1 width=4)
>               Join Filter: (c.id = st.id)
>               ->  Index Scan using i_categories_category_id on categories c
>  (cost=0.42..2.44 rows=1 width=4)
>                     Index Cond: (category_id = 23)
>               ->  Seq Scan on category_staging_23 st  (cost=0.00..99.40
> rows=7140 width=4)
>
> The tables are small. From a debugging printout:


> Is categories.category_id unique?

No, categories.category_id is not unique. It has a b-tree index.
 
> Does the plan change if you ANALYZE
> the tables?

No. No difference.

But interestingly, it changes as the process goes forward. And it's inconsistent. Here's an example: it's going through several "categories" to update each. The first plan works, and it typically uses this plan a few times. But when it selects the second plan, it gets stuck.

----------------
15994 items in table registry.category_staging_15
245598 items in table registry.categories
309398 items in table registry.smiles
15994 items in joined registry.category_staging_15 / registry.categories
0 items to be inserted
inserted: 0E0
EXPLAIN: Aggregate  (cost=3464.82..3464.83 rows=1 width=8)
EXPLAIN:   ->  Hash Semi Join  (cost=2029.16..3464.05 rows=311 width=0)
EXPLAIN:         Hash Cond: (categories.id = c.id)
EXPLAIN:         ->  Index Scan using i_categories_category_id on categories  (cost=0.42..1405.28 rows=7900 width=4)
EXPLAIN:               Index Cond: (category_id = 15)
EXPLAIN:         ->  Hash  (cost=1933.44..1933.44 rows=7624 width=4)
EXPLAIN:               ->  Hash Anti Join  (cost=431.28..1933.44 rows=7624 width=4)
EXPLAIN:                     Hash Cond: (c.id = st.id)
EXPLAIN:                     ->  Index Scan using i_categories_category_id on categories c  (cost=0.42..1405.28 rows=7900 width=4)
EXPLAIN:                           Index Cond: (category_id = 15)
EXPLAIN:                     ->  Hash  (cost=230.94..230.94 rows=15994 width=4)
EXPLAIN:                           ->  Seq Scan on category_staging_15 st  (cost=0.00..230.94 rows=15994 width=4)
0 items deleted
7997 items inserted
----------------
6250 items in table registry.category_staging_25
245598 items in table registry.categories
309398 items in table registry.smiles
6250 items in joined registry.category_staging_25 / registry.categories
6250 items to be inserted
inserted: 3125
EXPLAIN: Aggregate  (cost=173.51..173.52 rows=1 width=8)
EXPLAIN:   ->  Nested Loop Semi Join  (cost=0.84..173.51 rows=1 width=0)
EXPLAIN:         Join Filter: (categories.id = c.id)
EXPLAIN:         ->  Index Scan using i_categories_category_id on categories  (cost=0.42..2.44 rows=1 width=4)
EXPLAIN:               Index Cond: (category_id = 25)
EXPLAIN:         ->  Nested Loop Anti Join  (cost=0.42..171.06 rows=1 width=4)
EXPLAIN:               Join Filter: (c.id = st.id)
EXPLAIN:               ->  Index Scan using i_categories_category_id on categories c  (cost=0.42..2.44 rows=1 width=4)
EXPLAIN:                     Index Cond: (category_id = 25)
EXPLAIN:               ->  Seq Scan on category_staging_25 st  (cost=0.00..90.50 rows=6250 width=4)


> This plan doesn't look like it'd actually take long, if the estimates
> are correct.

Another data point: during this query, Postgres is burning 100% CPU and doing no I/O. Pretty much for hours if I let it go. 
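One way to confirm that from inside Postgres (a sketch using pg_stat_activity;
an 'active' backend with a null wait_event is executing on CPU rather than
waiting on a lock or I/O):

select pid, state, wait_event_type, wait_event,
       now() - query_start as runtime, query
  from pg_stat_activity
 where state = 'active';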
 
Thanks for your help,
Craig

RE: Simple DELETE on modest-size table runs 100% CPU forever

From: Ravi Rai
Date:

Hey James,

 

Looking at your select query: if that's STILL taking forever, could it be because of the IN clause? If so, try using EXISTS instead of IN; it should give much better results. Something like the sketch below.
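A sketch of the rewrite: because the anti-join in the subquery depends only
on id, the IN plus the self-join on categories collapses into a single
NOT EXISTS:

select count(1)
  from registry.categories c
 where c.category_id = 15
   and not exists (select 1
                     from registry.category_staging_15 st
                    where st.id = c.id);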

 


Re: Simple DELETE on modest-size table runs 100% CPU forever

From: Jeff Janes
Date:
On Thu, Nov 14, 2019 at 5:20 PM Craig James <cjames@emolecules.com> wrote:
> I'm completely baffled by this problem: I'm doing a delete that joins three modest-sized tables, and it gets completely stuck: 100% CPU use forever.  Here's the query:
>
> Aggregate  (cost=193.54..193.55 rows=1 width=8)
>   ->  Nested Loop Semi Join  (cost=0.84..193.54 rows=1 width=0)
>         Join Filter: (categories.id = c.id)
>         ->  Index Scan using i_categories_category_id on categories  (cost=0.42..2.44 rows=1 width=4)
>               Index Cond: (category_id = 23)
>         ->  Nested Loop Anti Join  (cost=0.42..191.09 rows=1 width=4)
>               Join Filter: (c.id = st.id)
>               ->  Index Scan using i_categories_category_id on categories c  (cost=0.42..2.44 rows=1 width=4)
>                     Index Cond: (category_id = 23)
>               ->  Seq Scan on category_staging_23 st  (cost=0.00..99.40 rows=7140 width=4)


If the estimates were correct, this shouldn't be slow.  But how can it screw up the estimate for this by so much, when the conditions are so simple?  How many rows are there actually in categories where category_id=23?

What do you see in `select * from pg_stats where tablename='categories' and attname='category_id' \x\g\x`?

Since it thinks the seq scan of category_staging_23 is only going to happen once (at the bottom of two nested loops, but each executing just once), it sees no benefit in hashing that table.  Of course it is actually happening a lot more than once.

Cheers,

Jeff

Re: Simple DELETE on modest-size table runs 100% CPU forever

From: Craig James
Date:

On Fri, Nov 15, 2019 at 2:45 PM Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Nov 14, 2019 at 5:20 PM Craig James <cjames@emolecules.com> wrote:
> > I'm completely baffled by this problem: I'm doing a delete that joins three modest-sized tables, and it gets completely stuck: 100% CPU use forever.  Here's the query:
> >
> > Aggregate  (cost=193.54..193.55 rows=1 width=8)
> >   ->  Nested Loop Semi Join  (cost=0.84..193.54 rows=1 width=0)
> >         Join Filter: (categories.id = c.id)
> >         ->  Index Scan using i_categories_category_id on categories  (cost=0.42..2.44 rows=1 width=4)
> >               Index Cond: (category_id = 23)
> >         ->  Nested Loop Anti Join  (cost=0.42..191.09 rows=1 width=4)
> >               Join Filter: (c.id = st.id)
> >               ->  Index Scan using i_categories_category_id on categories c  (cost=0.42..2.44 rows=1 width=4)
> >                     Index Cond: (category_id = 23)
> >               ->  Seq Scan on category_staging_23 st  (cost=0.00..99.40 rows=7140 width=4)
>
> If the estimates were correct, this shouldn't be slow.  But how can it screw up the estimate for this by so much, when the conditions are so simple?  How many rows are there actually in categories where category_id=23?

I actually waited long enough for this to finish an "explain analyze". Here it is, preceded by stats about the table that I added to the program:

10000 items in table registry.category_staging_8
274602 items in table registry.categories
309398 items in table registry.smiles
10000 items in joined registry.category_staging_8 / registry.categories
Aggregate  (cost=274.90..274.91 rows=1 width=8) (actual time=7666916.832..7666916.832 rows=1 loops=1)
  ->  Nested Loop Semi Join  (cost=0.84..274.89 rows=1 width=0) (actual time=7666916.829..7666916.829 rows=0 loops=1)
        Join Filter: (categories.id = c.id)
        ->  Index Scan using i_categories_category_id on categories  (cost=0.42..2.44 rows=1 width=4) (actual time=0.015..6.192 rows=5000 loops=1)
              Index Cond: (category_id = 8)
        ->  Nested Loop Anti Join  (cost=0.42..272.44 rows=1 width=4) (actual time=1533.380..1533.380 rows=0 loops=5000)
              Join Filter: (c.id = st.id)
              Rows Removed by Join Filter: 12497500
              ->  Index Scan using i_categories_category_id on categories c  (cost=0.42..2.44 rows=1 width=4) (actual time=0.017..1.927 rows=5000 loops=5000)
                    Index Cond: (category_id = 8)
              ->  Seq Scan on category_staging_8 st  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.003..0.153 rows=2500 loops=25000000)
Planning time: 0.311 ms
Execution time: 7666916.865 ms


BTW, I'll note at this point that "analyze category_staging_8" prior to this query made no difference.
 
> What do you see in `select * from pg_stats where tablename='categories' and attname='category_id' \x\g\x`?

db=> select * from pg_stats where tablename='categories' and attname='category_id' \x\g\x;
Expanded display is on.
-[ RECORD 1 ]----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
schemaname             | registry
tablename              | categories
attname                | category_id
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 21
most_common_vals       | {4,3,2,10,11,13,12,16,9,6,7,5,15,23,14,25,24,1,26,28,27}
most_common_freqs      | {0.2397,0.159933,0.0926667,0.0556,0.0555667,0.0546333,0.0525333,0.0439,0.0426667,0.0346333,0.0331,0.0302333,0.0288333,0.0240667,0.0224,0.0122333,0.011,0.0035,0.00233333,0.000366667,0.0001}
histogram_bounds       |
correlation            | -0.0200765
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
-[ RECORD 2 ]----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
schemaname             | test
tablename              | categories
attname                | category_id
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 11
most_common_vals       | {10,30,50,1,2,0,3,9,6,7,5}
most_common_freqs      | {0.132051,0.132051,0.132051,0.10641,0.0935897,0.0807692,0.0807692,0.0807692,0.0769231,0.0551282,0.0294872}
histogram_bounds       |
correlation            | -0.435298
most_common_elems      |
most_common_elem_freqs |
 

> Since it thinks the seq scan of category_staging_23 is only going to happen once (at the bottom of two nested loops, but each executing just once), it sees no benefit in hashing that table.  Of course it is actually happening a lot more than once.

Yeah, 25 million times to be exact (5,000 outer rows × 5,000 inner rows).
 


Thanks,
Craig


--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
3430 Carmel Mountain Road, Suite 250
San Diego, CA 92121
---------------------------------

Re: Simple DELETE on modest-size table runs 100% CPU forever

From: Jeff Janes
Date:
On Fri, Nov 15, 2019 at 7:27 PM Craig James <cjames@emolecules.com> wrote:
> On Fri, Nov 15, 2019 at 2:45 PM Jeff Janes <jeff.janes@gmail.com> wrote:
> BTW, I'll note at this point that "analyze category_staging_8" prior to this query made no difference.

Isn't that the wrong table to have analyzed? The offender here is "categories", not  "category_staging_8".  Is this some sort of inheritance situation?

 
> > What do you see in `select * from pg_stats where tablename='categories' and attname='category_id' \x\g\x`?
>
> db=> select * from pg_stats where tablename='categories' and attname='category_id' \x\g\x;
> Expanded display is on.
>
> ...
> n_distinct             | 21
> most_common_vals       | {4,3,2,10,11,13,12,16,9,6,7,5,15,23,14,25,24,1,26,28,27}
> most_common_freqs      | {0.2397,0.159933,0.0926667,0.0556,0.0555667,0.0546333,0.0525333,0.0439,0.0426667,0.0346333,0.0331,0.0302333,0.0288333,0.0240667,0.0224,0.0122333,0.011,0.0035,0.00233333,0.000366667,0.0001}

There is a path in the analyze code where, if the least-seen value in the sample was seen more than once (i.e. no value was seen exactly once), it assumes that the seen values are all the values that exist.  I think the logic behind that is dubious, but it is pretty clear that it is kicking in here.  But why?  I think the simple answer is that you analyzed the wrong table, and the statistics shown here might have been accurate at some point in the past but are no longer.  It is hard to see how a value present 5000 times in a table of 274602 rows could have evaded sampling if it was present when the sample was taken: with the default statistics target of 100, ANALYZE samples 30,000 rows, so a value that common would essentially always be seen.

Cheers,

Jeff

Re: Simple DELETE on modest-size table runs 100% CPU forever

From: Craig James
Date:
Problem solved ... see below. Thanks everyone for your suggestions and insights!

On Sat, Nov 16, 2019 at 7:16 AM Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Nov 15, 2019 at 7:27 PM Craig James <cjames@emolecules.com> wrote:
> > On Fri, Nov 15, 2019 at 2:45 PM Jeff Janes <jeff.janes@gmail.com> wrote:
> > BTW, I'll note at this point that "analyze category_staging_8" prior to this query made no difference.
>
> Isn't that the wrong table to have analyzed? The offender here is "categories", not "category_staging_8".  Is this some sort of inheritance situation?
>
> > > What do you see in `select * from pg_stats where tablename='categories' and attname='category_id' \x\g\x`?
> >
> > db=> select * from pg_stats where tablename='categories' and attname='category_id' \x\g\x;
> > Expanded display is on.
> >
> > ...
> > n_distinct             | 21
> > most_common_vals       | {4,3,2,10,11,13,12,16,9,6,7,5,15,23,14,25,24,1,26,28,27}
> > most_common_freqs      | {0.2397,0.159933,0.0926667,0.0556,0.0555667,0.0546333,0.0525333,0.0439,0.0426667,0.0346333,0.0331,0.0302333,0.0288333,0.0240667,0.0224,0.0122333,0.011,0.0035,0.00233333,0.000366667,0.0001}
>
> There is a path in the analyze code where if the least-seen value in the sample was seen more than once (i.e. no value was seen exactly once) then it assumes that the seen values are all the values that exist.  I think the logic behind that is dubious.  I think it is pretty clear that that is kicking in here.  But why?  I think the simple answer is that you analyzed the wrong table, and the statistics shown here might be accurate for some time in the past but are no longer accurate.  It is hard to see how a value present 5000 times in a table of 274602 rows could have evaded sampling if they were present at the time the sample was done.

As I mentioned in a reply to Andres, I also added an "analyze ..." to the other two tables as an experiment. It made no difference. However ...

Your comment about the missing 5000 values solved the problem: those values had only just been inserted by the previous SQL statement, inside a transaction. The code reconciles two collections across two different servers: first it inserts all new values, then it deletes obsolete values. So the "select ..." in question includes the very 5000 rows that were just inserted.

I added an "analyze" between the insert and the delete. Instant fix.
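In outline, the reconciliation sequence now looks like this (a sketch; the
insert and delete are schematic and the column list is assumed, the added
analyze being the point):

-- insert values present in the staging table but not yet in categories
insert into registry.categories (id, category_id)
     select st.id, 15 from registry.category_staging_15 st
      where not exists (select 1 from registry.categories c
                         where c.id = st.id and c.category_id = 15);

analyze registry.categories;  -- the fix: refresh stats so the planner
                              -- sees the rows just inserted

-- then delete categories rows that are no longer in the staging table
delete from registry.categories cat
 where cat.category_id = 15
   and not exists (select 1 from registry.category_staging_15 st
                    where st.id = cat.id);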

It also solves one other mystery: This query only caused problems on the small test system, and has been working well on a production database with about 100x more data. In production, each "category" is already populated with a significant amount of data. The production system already has good statistics, so this one insert/delete doesn't change the statistics.
 

Thanks!
Craig