Re: BUG #7819: missing chunk number 0 for toast value 1235919 in pg_toast_35328

From Pius Chan
Subject Re: BUG #7819: missing chunk number 0 for toast value 1235919 in pg_toast_35328
Date
Msg-id C62EC84B2D3CF847899CCF4B589CCF70B20AA08F@BBMBX.backbone.local
In response to Re: BUG #7819: missing chunk number 0 for toast value 1235919 in pg_toast_35328  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #7819: missing chunk number 0 for toast value 1235919 in pg_toast_35328  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
Hi Tom,

The ERROR happened again. After several days of investigation and testing, I can now reproduce the ERROR in my testing environment. The reason the ERROR used to happen in a certain time period is that our report batch jobs run in that period, and those batch jobs make the TOAST area grow. I can repeat the ERROR with the following set up and testing procedure.

Set up
=====
(1) check that the configuration parameter log_autovacuum_min_duration = 0 (a sample psql check is shown after the test table definition below).
(2) in a 9.1.7 database cluster, create a database called "TEST_LOCK".
(3) In the database "TEST_LOCK", create one table called LOCK with one row in it:
     create table lock (id bigint, constraint lock_pk primary key (id));
     insert into lock values (1);

(4) in the same database cluster, create another database called "TEST".
(5) in the database "TEST", create one table test with the following columns:

create table test (
     id                          bigint not null,
     txop                     char(1),
     messageblob    bytea,
     constraint test_pk primary key (id)
);
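
The setting in step (1) can be verified from any psql session, for example:

     show log_autovacuum_min_duration;   -- should report 0, so that every autovacuum run is logged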


Pre-testing set up:
===========
(1) use psql to connect to TEST_LOCK and issue the following SQLs, and leave the session open during the test:
      begin;
      select * from lock where id = 1 for update;

(2) use another psql to connect to TEST_LOCK and issue the same SQLs as in (1). This session will be blocked by (1).
(3) this is to simulate the situation in our production environment, where our middle-tier service uses the TEST_LOCK.lock table for master/slave fail-over. We run the middle-tier service on several servers. When the middle-tier service starts up, it tries to put a lock on that ROW. The server that successfully holds this LOCK becomes the master.


Testing set-up:
=========
Part 1 - traffic that will consume the TOAST area of the table "test" in the TEST database
(1) use a program to INSERT INTO test with a messageblob longer than 1 MB and txop = 'A'. This is to ensure the TOAST area of the test table is used. Then update the row's txop to 'D'. Then delete the same row. These operations are each in their own transaction. (A SQL sketch of one such iteration is given after Part 2 below.)
(2) this simulates the batch report traffic we see from 00:15 - 00:45 every night.


Part 2 - traffic that will not consume the TOAST area of the table "test"
(1) use a program to INSERT INTO test with a messageblob of about 3000 bytes and txop = 'A'. Then update the row's txop to 'D'. Then delete the same row. These operations are each in their own transaction.
(2) this simulates the background traffic we see during the whole day.
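
For illustration, one iteration of the Part 1 traffic could look like the SQL below; Part 2 is the same pattern with a payload of only about 3000 bytes. The id value and the md5()-based payload generator are illustrative only (a poorly compressible value of roughly 1 MB is used so that it is actually stored out-of-line in the TOAST table); each statement runs as its own transaction under autocommit:

     insert into test (id, txop, messageblob)
     select 1, 'A', decode(string_agg(md5(i::text), ''), 'hex')  -- roughly 1.1 MB bytea
     from generate_series(1, 70000) as i;

     update test set txop = 'D' where id = 1;

     delete from test where id = 1;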


For part 1 and part 2, the traffic must be high enough so that, from the database log:
(1) autovacuum for the table TEST.public.test runs every minute
(2) autovacuum for the table's TOAST area runs every minute
(3) from the result of analyze verbose test, the # of pages scanned keeps on increasing, even with autovacuum running on the table
(4) similarly, when querying the size of the TOAST area, its size keeps on increasing, even with autovacuum running on the TOAST area every minute
(5) clustering the table test cannot reduce the table size or the TOAST area size either
(6) At any time, the # of live rows is only a few but the # of dead rows is high (see the sample query below).
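
Point (6) can be checked with a query against the statistics views, for example:

     select n_live_tup, n_dead_tup
     from pg_stat_user_tables
     where relname = 'test';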




Testing procedures:
============
(1) use psql to connect to TEST and frequently analyze the table so as to monitor its size. Run SQL to check the size of the table's TOAST area as well (sample size-check queries are shown after the CLUSTER command below).
(2) tail the database log and check that the automatic vacuum of the table and its TOAST area happens every minute
(3) Wait until the size of the table is at least 400 pages and the TOAST area is at least 20MB.
(4) wait for the database log to emit the line "automatic vacuum of table "TEST.pg_toast.pg_toast_xxxx": index scans: ..." (pg_toast_xxxx is the TOAST area of the table test), then IMMEDIATELY issue the command to cluster the table test in the psql session:

    cluster test using test_pk;
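
The size checks in steps (1) and (3) can be done with queries along these lines (relpages is only refreshed by ANALYZE or VACUUM):

     analyze test;
     select relpages from pg_class where relname = 'test';
     select pg_size_pretty(pg_total_relation_size(reltoastrelid))
       from pg_class where relname = 'test';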


Testing results:
=========
(1) If the timing is right, I occasionally get the ERROR (about once out of ten runs):

ERROR:  missing chunk number 0 for toast value 13113341 in pg_toast_13084520

(2) Once I get the ERROR, subsequent CLUSTER commands get the same ERROR until the next autovacuum of the TOAST area is done.

(3) I tested in PostgreSQL 9.1.7 and 9.1.3 and can get the ERROR in both.

So this is not a 9.1.7-specific bug.


Thanks a lot for your help.

Pius
________________________________________
From: Tom Lane [tgl@sss.pgh.pa.us]
Sent: Tuesday, January 22, 2013 8:14 PM
To: Pius Chan
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #7819: missing chunk number 0 for toast value 1235919 in pg_toast_35328

Pius Chan <pchan@contigo.com> writes:
> Hi Tom,
> Yes, we start seeing this ERROR after upgrade to 9.1.7. The ERROR is from the "cluster jms_messages" command.  Last night, the ERROR happened three times:

> (1) at 00:20:01
>  ERROR:  missing chunk number 0 for toast value 1239124 in pg_toast_35328

> (2) at 00:25:01
> ERROR:  missing chunk number 0 for toast value 1239124 in pg_toast_35328

> (3) at 00:35:01
>  ERROR:  missing chunk number 0 for toast value 1241022 in pg_toast_35328

> The "cluster jms_messages" runs every 5 minutes. However, so far, it seem=
s that the ERROR happens at about mid-night and 35328 is the toast area of =
the "jms_message" table:

So what else is this application doing around midnight?  In particular,
it seems like something must've happened between 00:15 and 00:20 to
create the problem with OID 1239124, and then something else happened
between 00:25 and 00:30 to get rid of it.  And then a similar sequence
happened between 00:30 and 00:40 involving OID 1241022.  Most likely the
trigger events are application actions against jms_messages, but we have
no info about what.  (It's also possible that autovacuums of
jms_messages are involved, so you might want to crank up your logging
enough so you can see those in the postmaster log.)

I had first thought that this might have something to do with the
toast-value-processing changes we made in CLUSTER a year or so ago,
but a look in the commit logs confirms that those patches were in 9.1.3.
So I have no idea what might've broken between 9.1.3 and 9.1.7.  We
need data, or even better an actual test case.

                        regards, tom lane
