Обсуждение: BUG #18351: VACUUM FULL fails with error: missing chunk number 0 for toast value XXX

Поиск
Список
Период
Сортировка

BUG #18351: VACUUM FULL fails with error: missing chunk number 0 for toast value XXX

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      18351
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 16.2
Operating system:   Ubuntu 22.04
Description:

The following script:
for ((c=1;c<=20;c++)); do createdb db$c; done

for ((i=1;i<=100;i++)); do
echo "iteration $i"
  for ((c=1;c<=20;c++)); do
echo "\\d sometable" | psql -d db$c >psql-1-$c.log 2>&1 &
echo "DROP TABLE IF EXISTS tbl;
CREATE TABLE tbl (i int, t text);
ALTER TABLE tbl ALTER COLUMN t SET STORAGE EXTERNAL;
INSERT INTO tbl(i, t) VALUES (1, repeat('1234567890', 300));
DELETE FROM tbl;

VACUUM tbl;
VACUUM FULL tbl;
" | psql -d db$c >psql-2-$c.log 2>&1 &
  done
wait
grep 'missing chunk number' server.log  && break;
done

produces errors like:
...
iteration 6
2024-02-17 18:12:41.160 UTC [2305730] ERROR:  missing chunk number 0 for
toast value 17559 in pg_toast_17548

with the stack trace:
...
#5  0x0000555d4dae19e1 in errfinish (filename=0x555d4db77969 "heaptoast.c",
lineno=784, funcname=0x555d4db77bd0 <__func__.1> "heap_fetch_toast_slice")
at elog.c:604
#6  0x0000555d4d3cd25f in heap_fetch_toast_slice (toastrel=0x7f9b400aa740,
valueid=17559, attrsize=3000, sliceoffset=0, slicelength=3000,
result=0x555d4efd7028) at heaptoast.c:784
#7  0x0000555d4d35f0a3 in table_relation_fetch_toast_slice
(toastrel=0x7f9b400aa740, valueid=17559, attrsize=3000, sliceoffset=0,
slicelength=3000, result=0x555d4efd7028) at
../../../../src/include/access/tableam.h:1917
#8  0x0000555d4d35ff52 in toast_fetch_datum (attr=0x555d4efce51c) at
detoast.c:375
#9  0x0000555d4d35f119 in detoast_external_attr (attr=0x555d4efce51c) at
detoast.c:54
#10 0x0000555d4d422046 in toast_tuple_init (ttc=0x7fff6cb47400) at
toast_helper.c:144
#11 0x0000555d4d3cbb3c in heap_toast_insert_or_update (rel=0x7f9b400a2828,
newtup=0x555d4efce4e8, oldtup=0x0, options=10) at heaptoast.c:157
#12 0x0000555d4d3d1df2 in raw_heap_insert (state=0x555d4efd6f08,
tup=0x555d4efce4e8) at rewriteheap.c:645
#13 0x0000555d4d3d19d0 in rewrite_heap_tuple (state=0x555d4efd6f08,
old_tuple=0x555d4efcf360, new_tuple=0x555d4efce4e8) at rewriteheap.c:473
#14 0x0000555d4d3c83e4 in reform_and_rewrite_tuple (tuple=0x555d4efcf360,
OldHeap=0x7f9b400a2618, NewHeap=0x7f9b400a2828, values=0x555d4efce6a8,
isnull=0x555d4efcf2f0, rwstate=0x555d4efd6f08) at heapam_handler.c:2487
#15 0x0000555d4d3c5a44 in heapam_relation_copy_for_cluster
(OldHeap=0x7f9b400a2618, NewHeap=0x7f9b400a2828, OldIndex=0x0,
use_sort=false, OldestXmin=1545, xid_cutoff=0x7fff6cb54c70,
multi_cutoff=0x7fff6cb54c74, 
    num_tuples=0x7fff6cb54c00, tups_vacuumed=0x7fff6cb54c08,
tups_recently_dead=0x7fff6cb54c10) at heapam_handler.c:932
#16 0x0000555d4d5588f4 in table_relation_copy_for_cluster
(OldTable=0x7f9b400a2618, NewTable=0x7f9b400a2828, OldIndex=0x0,
use_sort=false, OldestXmin=1545, xid_cutoff=0x7fff6cb54c70,
multi_cutoff=0x7fff6cb54c74, 
    num_tuples=0x7fff6cb54c00, tups_vacuumed=0x7fff6cb54c08,
tups_recently_dead=0x7fff6cb54c10) at
../../../src/include/access/tableam.h:1683
#17 0x0000555d4d55a19f in copy_table_data (OIDNewHeap=17572,
OIDOldHeap=17548, OIDOldIndex=0, verbose=false,
pSwapToastByContent=0x7fff6cb54d9d, pFreezeXid=0x7fff6cb54da0,
pCutoffMulti=0x7fff6cb54da4) at cluster.c:971
#18 0x0000555d4d5598fc in rebuild_relation (OldHeap=0x7f9b400a2618,
indexOid=0, verbose=false) at cluster.c:663
#19 0x0000555d4d5592cd in cluster_rel (tableOid=17548, indexOid=0,
params=0x7fff6cb54e60) at cluster.c:477
#20 0x0000555d4d61b110 in vacuum_rel (relid=17548, relation=0x555d4eee41f8,
params=0x7fff6cb550c0, bstrategy=0x0) at vacuum.c:2214
#21 0x0000555d4d618bf8 in vacuum (relations=0x555d4efcaf08,
params=0x7fff6cb550c0, bstrategy=0x0, vac_context=0x555d4efcad70,
isTopLevel=true) at vacuum.c:622
#22 0x0000555d4d6186f5 in ExecVacuum (pstate=0x555d4efb86f8,
vacstmt=0x555d4eee42c8, isTopLevel=true) at vacuum.c:449
#23 0x0000555d4d8fabf9 in standard_ProcessUtility (pstmt=0x555d4eee4418,
queryString=0x555d4eee3818 "VACUUM FULL tbl;", readOnlyTree=false,
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x555d4eee46d8, 
    qc=0x7fff6cb55570) at utility.c:866
#24 0x0000555d4d8fa194 in ProcessUtility (pstmt=0x555d4eee4418,
queryString=0x555d4eee3818 "VACUUM FULL tbl;", readOnlyTree=false,
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x555d4eee46d8, qc=0x7fff6cb55570)
    at utility.c:530
#25 0x0000555d4d8f8a66 in PortalRunUtility (portal=0x555d4ef5ddd8,
pstmt=0x555d4eee4418, isTopLevel=true, setHoldSnapshot=false,
dest=0x555d4eee46d8, qc=0x7fff6cb55570) at pquery.c:1158
#26 0x0000555d4d8f8cdd in PortalRunMulti (portal=0x555d4ef5ddd8,
isTopLevel=true, setHoldSnapshot=false, dest=0x555d4eee46d8,
altdest=0x555d4eee46d8, qc=0x7fff6cb55570) at pquery.c:1315
#27 0x0000555d4d8f8127 in PortalRun (portal=0x555d4ef5ddd8,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x555d4eee46d8, altdest=0x555d4eee46d8, qc=0x7fff6cb55570) at
pquery.c:791
#28 0x0000555d4d8f0e36 in exec_simple_query (query_string=0x555d4eee3818
"VACUUM FULL tbl;") at postgres.c:1274
#29 0x0000555d4d8f5ea4 in PostgresMain (dbname=0x555d4ef1bc78 "db9",
username=0x555d4eedf808 "law") at postgres.c:4637
#30 0x0000555d4d816cce in BackendRun (port=0x555d4ef0c500) at
postmaster.c:4464
#31 0x0000555d4d81655a in BackendStartup (port=0x555d4ef0c500) at
postmaster.c:4192
#32 0x0000555d4d81289f in ServerLoop () at postmaster.c:1782
#33 0x0000555d4d812149 in PostmasterMain (argc=3, argv=0x555d4eedd750) at
postmaster.c:1466
#34 0x0000555d4d6c5f7c in main (argc=3, argv=0x555d4eedd750) at main.c:198

Additional logging shows that this tuple check in
heapam_relation_copy_for_cluster():
        switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
        {
            case HEAPTUPLE_DEAD:
                /* Definitely dead */
                isdead = true;
                break;
            case HEAPTUPLE_RECENTLY_DEAD:
                *tups_recently_dead += 1;
                /* fall through */
            case HEAPTUPLE_LIVE:
                /* Live or recently dead, must copy it */
                isdead = false;
                break;

gives HEAPTUPLE_RECENTLY_DEAD when the error occurs.
So it looks like recently dead tuples might have no correspondent toast
values to detoast attributes of such tuples.

Reproduced on REL_12_STABLE .. master.
Probably, the same issue was reported some time ago:
https://www.postgresql.org/message-id/flat/CALdSSPhmqoN02ciT4UxS6ax0N84NpRwPWm87nKJ_%2B0G-Na8qOQ%40mail.gmail.com


Re: BUG #18351: VACUUM FULL fails with error: missing chunk number 0 for toast value XXX

От
Alexander Lakhin
Дата:
17.02.2024 23:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      18351
>
> The following script:
> ...
> VACUUM tbl;
> VACUUM FULL tbl;
> " | psql -d db$c >psql-2-$c.log 2>&1 &
>    done
> wait
> grep 'missing chunk number' server.log  && break;
> done
>

> So it looks like recently dead tuples might have no correspondent toast
> values to detoast attributes of such tuples.
>
> Reproduced on REL_12_STABLE .. master.
> Probably, the same issue was reported some time ago:
> https://www.postgresql.org/message-id/flat/CALdSSPhmqoN02ciT4UxS6ax0N84NpRwPWm87nKJ_%2B0G-Na8qOQ%40mail.gmail.com

I've discovered the old bug #5998
https://www.postgresql.org/message-id/flat/201104291506.p3TF6dA9003698%40wwwmaster.postgresql.org
with similar symptoms and a very relevant question:
 > So, how is it that the vacuum removed the underlying toast data but not
 > the putatively-recently-dead tuple?

In fact, the script presented here, produces the error even on 83b758494,
so this anomaly is pretty old.
But for the given case, the answer to that question is different.
With additional debug and verbose logging (on REL_13_STABLE, as
lazy_scan_heap() was more flat back then) I see:

2024-02-27 08:22:04.859 UTC|law|db9|65dd9bac.12914b|LOG:  statement: VACUUM (VERBOSE) tbl;

2024-02-27 08:22:04.859 UTC|law|db9|65dd9bac.12914b|LOG: !!!vacuum_rel: relid: 16980, snapshot->xmin: 845, 
snapshot->xmax: 865

2024-02-27 08:22:04.859 UTC|law|db9|65dd9bac.12914b|LOG: !!!lazy_scan_heap| relname: tbl, blkno: 0, offnum: 1, 
ItemIdIsDead(itemid): 0

2024-02-27 08:22:04.859 UTC|law|db9|65dd9bac.12914b|LOG: !!!lazy_scan_heap| relname: tbl, blkno: 0, offnum: 1, 
tuple.t_data->t_choice.t_heap.t_xmin: 859, tuple.t_data->t_choice.t_heap.t_xmax: 864, OldestXmin: 845, 
HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf): 2
### The tbl's tuple is RECENTLY_DEAD ###

2024-02-27 08:22:04.860 UTC|law|db9|65dd9bac.12914b|INFO:  "tbl": found 0 removable, 1 nonremovable row versions in 1 
out of 1 pages
2024-02-27 08:22:04.860 UTC|law|db9|65dd9bac.12914b|DETAIL:  1 dead row versions cannot be removed yet, oldest xmin:
845
     There were 0 unused item identifiers.
     Skipped 0 pages due to buffer pins, 0 frozen pages.
     0 pages are entirely empty.
     CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.

2024-02-27 08:22:04.860 UTC|law|db9|65dd9bac.12914b|LOG: !!!lazy_scan_heap| relname: pg_toast_16980, blkno: 0, offnum:

1, ItemIdIsDead(itemid): 1
2024-02-27 08:22:04.860 UTC|law|db9|65dd9bac.12914b|CONTEXT:  while scanning block 0 of relation
"pg_toast.pg_toast_16980"

2024-02-27 08:22:04.860 UTC|law|db9|65dd9bac.12914b|LOG: !!!lazy_scan_heap| relname: pg_toast_16980, blkno: 0, offnum:

2, ItemIdIsDead(itemid): 1
2024-02-27 08:22:04.860 UTC|law|db9|65dd9bac.12914b|CONTEXT:  while scanning block 0 of relation
"pg_toast.pg_toast_16980"
### But the toast's tuples are DEAD according to their lp_flags ###

2024-02-27 08:22:04.860 UTC|law|db9|65dd9bac.12914b|INFO: "pg_toast_16980": found 2 removable, 0 nonremovable row 
versions in 1 out of 1 pages
2024-02-27 08:22:04.860 UTC|law|db9|65dd9bac.12914b|DETAIL:  0 dead row versions cannot be removed yet, oldest xmin:
866
     There were 0 unused item identifiers.
     Skipped 0 pages due to buffer pins, 0 frozen pages.
     0 pages are entirely empty.
     CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.

2024-02-27 08:22:04.861 UTC|law|db9|65dd9bac.12914b|STATEMENT: VACUUM FULL tbl;

2024-02-27 08:22:04.862 UTC|law|db9|65dd9bac.12914b|ERROR:  missing chunk number 0 for toast value 17045 in
pg_toast_16980

So in this case, the heap tuple and the pg_toast_xxx tuples are out-of-sync
in regard to their DEAD status after DELETE, then VACUUM removes dead
tuples from pg_toast_xxx, and the following VACUUM FULL
(heapam_relation_copy_for_cluster()) fails to copy the heap tuple in
absence of corresponding toast tuples.

Best regards,
Alexander



27.02.2024 12:00, Alexander Lakhin wrote:
> 17.02.2024 23:00, PG Bug reporting form wrote:
>> The following bug has been logged on the website:
>>
>> Bug reference:      18351
>>
>> The following script:
>> ...
>> VACUUM tbl;
>> VACUUM FULL tbl;
>> " | psql -d db$c >psql-2-$c.log 2>&1 &
>>    done
>> wait
>> grep 'missing chunk number' server.log  && break;
>> done
>>
>

I've discovered that not only VACUUM FULL can stumble over such missing
toast values. CREATE INDEX behaves similarly, as the following script
shows:
for ((c=1;c<=20;c++)); do createdb db$c; done

for ((i=1;i<=100;i++)); do
echo "iteration $i"
   for ((c=1;c<=20;c++)); do
echo "\\d sometable" | psql -d db$c >psql-1-$c.log 2>&1 &
echo "DROP TABLE IF EXISTS tbl;
CREATE TABLE tbl (i int, t text);
ALTER TABLE tbl ALTER COLUMN t SET STORAGE EXTERNAL;
INSERT INTO tbl(i, t) VALUES (1, repeat('1234567890', 250));
DELETE FROM tbl;

VACUUM (VERBOSE) tbl;
CREATE INDEX t_idx ON tbl(t);
" | psql -d db$c >psql-2-$c.log 2>&1 &
   done
wait
grep 'missing chunk number' server.log  && break;
done

The backtrace of the heap_fetch_toast_slice() call emitting error is:
2024-03-05 17:41:07.786 UTC|law|db10|65e75933.208c6b|ERROR:  missing chunk number 0 for toast value 17314 in
pg_toast_17289
2024-03-05 17:41:07.786 UTC|law|db10|65e75933.208c6b|BACKTRACE:
heap_fetch_toast_slice at heaptoast.c:784:3
toast_fetch_datum at detoast.c:379:2
detoast_external_attr at detoast.c:54:12
index_form_tuple_context at indextuple.c:111:21
tuplesort_putindextuplevalues at tuplesortvariants.c:678:15
_bt_spool at nbtsort.c:537:1
_bt_build_callback at nbtsort.c:610:24
heapam_index_build_range_scan at heapam_handler.c:1329:22
table_index_build_scan at tableam.h:1781:9
  (inlined by) _bt_spools_heapscan at nbtsort.c:483:15
btbuild at nbtsort.c:329:14
index_build at index.c:3042:10
index_create at index.c:1265:3
DefineIndex at indexcmds.c:1166:3
...

Best regards,
Alexander



Hello, Alexander!

I suspect that the main problem here is that we calculate

data_oldest_nonremovable for main and toast-tables.

1) 

I took your example and printed xmin and xid for all entries in PGPROC,

the following was produced:


pgprocno 79, dbid 16402, xid 0, xmin 1469, statusFlags 2

pgprocno 88, dbid 16402, xid 0, xmin 1469, statusFlags 0

MyDatabaseId 16402, my_pgprocno 79, latest_completed 1479, 

oldest_considered_running 1449, shared_oldest_nonremovable 1449, 

catalog_oldest_nonremovable 1469, data_oldest_nonremovable 1469

GlobalVisDataRels.maybe_needed = 1469

GlobalVisDataRels.definitely_needed = 1469


The first proc was doing VACUUM (VERBOSE) tbl.

Since it was lazy vacuum we didn't take it into account.

The second proc was doing \d sometable.

It's xmin became data_oldest_nonremovable.


2) 

Let's have a look into (0,1) tuple from the tbl relation during vacuum:

xmin 1475, xmax 1479, t_infomask 1286

The tuple couldn't be removed since xmax > definitely_needed

1479 > 1469

3) 

When the toast table was vacuuming, the second proc has

already finished its query.


pgprocno 79, dbid 16402, xid 0, xmin 1478, statusFlags 2

pgprocno 88, dbid 16402, xid 0, xmin 0, statusFlags 0


Therefore, we got the following horizons:

MyDatabaseId 16402, my_pgprocno 79, latest_completed 1479

oldest_considered_running 1468, shared_oldest_nonremovable 1468

catalog_oldest_nonremovable 1480, data_oldest_nonremovable 1480

GlobalVisDataRels.maybe_needed = 1480

GlobalVisDataRels.definitely_needed = 1480



4) 

Let's have a look into (0, 1) and (0,2) tuples from the toast table

They were related with the (0,1) tuple from the tbl relation:

blockno 0, offnum 2, xmin 1475, xmax 1479, t_infomask 1282

blockno 0, offnum 1, xmin 1475, xmax 1479, t_infomask 1282


Since their xmax < maybe_needed then they were removed.

As a result, we got the situation when the tuple in the tbl relation

referenced to non-existing tuples from the toast relation.


Therefore, there are two questions:

1) 

Is it possible to process toast table by using the same horizons for the main table?

2)

Is maybe_needed = definitely_needed a corner case?

Should it be handled differently?

Thank you in advance.

Sincerely yours.

Peter Petrov.

On 05.03.2024 21:00, Alexander Lakhin wrote:

I've discovered that not only VACUUM FULL can stumble over such missing
toast values. CREATE INDEX behaves similarly, as the following script
shows: