Discussion: n_dead_tup could be way off just after a vacuum


n_dead_tup could be way off just after a vacuum

From:
Frédéric Yhuel
Date:
My colleague Christophe Courtois and I came across some surprising 
behavior. After executing the following script, n_live_tup and 
n_dead_tup are way off. We know they are not meant to be exact, but the 
table will be autovacuumed for no reason afterwards.

DROP TABLE IF EXISTS foo ;
CREATE TABLE foo (id int);
INSERT INTO foo (id) SELECT * FROM generate_series(1,1000);
DELETE FROM foo WHERE id <= 500;

VACUUM foo ;

SELECT pg_sleep(1);

SELECT relname, n_tup_ins, n_tup_del,
n_live_tup, n_dead_tup,
last_autovacuum, last_autoanalyze,
last_vacuum, last_analyze
FROM pg_stat_user_tables WHERE relname='foo' \gx

-[ RECORD 1 ]----+------------------------------
relname          | foo
n_tup_ins        | 1000
n_tup_del        | 500
n_live_tup       | 1000 <--- should be 500 after a VACUUM
n_dead_tup       | 500 <--- should be 0 after a VACUUM
last_autovacuum  | ∅
last_autoanalyze | ∅
last_vacuum      | 2026-03-10 18:11:09.893913+01
last_analyze     | ∅

This issue appears on all tested versions from 9.6 to 18.

If we sleep one second between the DELETE and the VACUUM, the problem 
disappears, because (IIUC) pgstat_relation_flush_cb() gets executed 
before the VACUUM.
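
For reference, a sketch of the variant we tested that avoids the problem: the pause is simply moved before the VACUUM instead of after it.

```sql
DROP TABLE IF EXISTS foo;
CREATE TABLE foo (id int);
INSERT INTO foo (id) SELECT * FROM generate_series(1, 1000);
DELETE FROM foo WHERE id <= 500;

-- Waiting here gives the backend a chance to flush its pending
-- stats (via pgstat_relation_flush_cb) before the VACUUM runs,
-- so the VACUUM sees the DELETE already accounted for.
SELECT pg_sleep(1);

VACUUM foo;
```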

I wonder if this is a known issue, and whether pgstat_report_vacuum() should 
clear the pending stats, or something along those lines.



Re: n_dead_tup could be way off just after a vacuum

From:
Andres Freund
Date:
On 2026-03-10 18:24:18 +0100, Frédéric Yhuel wrote:
> My colleague Christophe Courtois and I came across some surprising behavior.
> After executing the following script, n_live_tup and n_dead_tup are way off.
> We know they are not meant to be exact, but the table will be autovacuumed
> for no reason afterwards.
> 
> DROP TABLE IF EXISTS foo ;
> CREATE TABLE foo (id int);
> INSERT INTO foo (id) SELECT * FROM generate_series(1,1000);
> DELETE FROM foo WHERE id <= 500;
> 
> VACUUM foo ;
> 
> SELECT pg_sleep(1);
> 
> SELECT relname, n_tup_ins, n_tup_del,
> n_live_tup, n_dead_tup,
> last_autovacuum, last_autoanalyze,
> last_vacuum, last_analyze
> FROM pg_stat_user_tables WHERE relname='foo' \gx

> -[ RECORD 1 ]----+------------------------------
> relname          | foo
> n_tup_ins        | 1000
> n_tup_del        | 500
> n_live_tup       | 1000 <--- should be 500 after a VACUUM
> n_dead_tup       | 500 <--- should be 0 after a VACUUM
> last_autovacuum  | ∅
> last_autoanalyze | ∅
> last_vacuum      | 2026-03-10 18:11:09.893913+01
> last_analyze     | ∅
> 
> This issue appears on all tested versions from 9.6 to 18.
> 
> If we sleep one second between the DELETE and the VACUUM, the problem
> disappears, because (IIUC) pgstat_relation_flush_cb() gets executed before
> the VACUUM.
> 
> I wonder if this is a known issue, and whether pgstat_report_vacuum() should
> clear the pending stats, or something along those lines.

This is just because of the small gap between the DELETE and the VACUUM. The
stats from queries are only merged into the shared state every now and then,
as it'd be way too expensive to do so all the time.  Because you issue the
statements in quick succession, the reports of the row insertions and deletions
are only taken into account after the VACUUM.

If you put a
  SELECT pg_stat_force_next_flush();

after the DELETE, you get the stats you expect:
┌─[ RECORD 1 ]─────┬───────────────────────────────┐
│ relname          │ foo                           │
│ n_tup_ins        │ 1000                          │
│ n_tup_del        │ 500                           │
│ n_live_tup       │ 500                           │
│ n_dead_tup       │ 0                             │
│ last_autovacuum  │ (null)                        │
│ last_autoanalyze │ (null)                        │
│ last_vacuum      │ 2026-03-11 11:13:11.936146-04 │
│ last_analyze     │ (null)                        │
└──────────────────┴───────────────────────────────┘
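
Putting the pieces together, the full script with the forced flush might look like the following (same table as in the original report; pg_stat_force_next_flush() exists only since the shared-memory stats rework, i.e. v15, IIRC):

```sql
DROP TABLE IF EXISTS foo;
CREATE TABLE foo (id int);
INSERT INTO foo (id) SELECT * FROM generate_series(1, 1000);
DELETE FROM foo WHERE id <= 500;

-- Request that this backend's pending stats be flushed to the
-- shared stats at the next reporting opportunity, so the VACUUM
-- starts from counters that already include the DELETE.
SELECT pg_stat_force_next_flush();

VACUUM foo;

SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'foo';
```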


Note that you'd *also* get the correct stats if you didn't do the pg_sleep(1)
after the VACUUM, because right after the VACUUM the stats actually *are*
accurate. It's just that the stats from the DELETE are merged later.


I don't think this is a bug. It's just an engineering compromise between cost
and accuracy.  In real scenarios the effect of this is much smaller, because
either the DELETE affects only a small portion of the rows, or it takes longer
to run and thus will trigger a stats flush during transaction commit.

Greetings,

Andres Freund



Re: n_dead_tup could be way off just after a vacuum

From:
Frédéric Yhuel
Date:

On 11/03/2026 at 16:17, Andres Freund wrote:
> This is just because of the small gap between the DELETE and the VACUUM. The
> stats from queries are only merged into the shared state every now and then,
> as it'd be way too expensive to do so all the time.  Because you issue the
> statements in quick succession, the reports of the row insertions and deletions
> are only taken into account after the VACUUM.
> 
> If you put a
>    SELECT pg_stat_force_next_flush();
> 
> after the DELETE, you get the stats you expect:
> ┌─[ RECORD 1 ]─────┬───────────────────────────────┐
> │ relname          │ foo                           │
> │ n_tup_ins        │ 1000                          │
> │ n_tup_del        │ 500                           │
> │ n_live_tup       │ 500                           │
> │ n_dead_tup       │ 0                             │
> │ last_autovacuum  │ (null)                        │
> │ last_autoanalyze │ (null)                        │
> │ last_vacuum      │ 2026-03-11 11:13:11.936146-04 │
> │ last_analyze     │ (null)                        │
> └──────────────────┴───────────────────────────────┘
> 
> 
> Note that you'd *also* get the correct stats if you didn't do the pg_sleep(1)
> after the VACUUM, because right after the VACUUM the stats actually *are*
> accurate. It's just that the stats from the DELETE are merged later.
> 

Yes. I was wondering if pgstat_report_vacuum() should clear the pending 
stats... but upon further thought, that wouldn't be the right thing to 
do, as long as more than one process can run in parallel (even if it's 
just one client backend and one autovacuum worker). Not to mention that 
it may be technically unfeasible, or overkill.

> 
> I don't think this is a bug. It's just an engineering compromise between cost
> and accuracy.  In real scenarios the effect of this is much smaller, because
> either the DELETE is only a smaller portion of the rows, or it takes longer to
> run and thus will trigger a stats flush during transaction commit.

OK, thank you! That makes sense indeed.