Re: BUG #7819: missing chunk number 0 for toast value 1235919 in pg_toast_35328
From | Pius Chan |
---|---|
Subject | Re: BUG #7819: missing chunk number 0 for toast value 1235919 in pg_toast_35328 |
Date | |
Msg-id | C62EC84B2D3CF847899CCF4B589CCF70B20AA08F@BBMBX.backbone.local |
In reply to | Re: BUG #7819: missing chunk number 0 for toast value 1235919 in pg_toast_35328 (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: BUG #7819: missing chunk number 0 for toast value 1235919 in pg_toast_35328 (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-bugs |
Hi Tom,

The ERROR happened again. After several days of investigation and testing, I can now reproduce the ERROR in my testing environment. The reason the ERROR used to happen in a certain time window is that our report batch jobs run in that window, and a batch job can make the TOAST area grow. I can repeat the ERROR with the following set-up and testing procedure.

Set up
=====
(1) Check that the configuration parameter log_autovacuum_min_duration = 0.
(2) In a 9.1.7 database cluster, create a database called "TEST_LOCK".
(3) In the database "TEST_LOCK", create one table called "lock" with one row in it:

    create table lock (id bigint, constraint lock_pk primary key (id));
    insert into lock values (1);

(4) In the same database cluster, create another database called "TEST".
(5) In the database "TEST", create one table "test" with the following columns:

    create table test (
        id bigint not null,
        txop char(1),
        messageblob bytea,
        constraint test_pk primary key (id)
    );

Pre-testing set up:
===========
(1) Use psql to connect to TEST_LOCK, issue the following SQL, and leave the session open during the test:

    begin;
    select * from lock where id = 1 for update;

(2) Use another psql to connect to TEST_LOCK and issue the same SQL as in (1). This session will be blocked by (1).
(3) This simulates our production environment, where our middle-tier service uses the TEST_LOCK.lock table for master/slave fail-over. We run the middle-tier service on several servers. When a middle-tier service starts up, it tries to lock that row; the server that successfully holds the lock becomes the master.

Testing set-up:
=========
Part 1 - traffic that consumes the TOAST area of the table "test" in the TEST database
(1) Use a program to INSERT INTO test with a messageblob longer than 1 MB and txop = 'A'; this ensures the TOAST area of the test table is used. Then update the row's txop to 'D'.
Then delete the same row. Each of these operations runs in its own transaction.
(2) This simulates the batch report traffic we see from 00:15 to 00:45 every night.

Part 2 - traffic that does not consume the TOAST area of the table "test"
(1) Use a program to INSERT INTO test with a messageblob of about 3000 bytes and txop = 'A'. Then update the row's txop to 'D'. Then delete the same row. Each of these operations runs in its own transaction.
(2) This simulates the background traffic we see throughout the day.

For part 1 and part 2, the traffic must be high enough that, from the database log:
(1) autovacuum of the table TEST.public.test runs every minute;
(2) autovacuum of the table's TOAST area runs every minute;
(3) in the output of ANALYZE VERBOSE test, the number of pages scanned keeps increasing, even with autovacuum running on the table;
(4) similarly, when queried, the size of the TOAST area keeps increasing, even with autovacuum running on it every minute;
(5) clustering the table test cannot reduce the table size or the TOAST area size;
(6) at any time, the number of live rows is only a few but the number of dead rows is high.

Testing procedures:
============
(1) Use psql to connect to TEST and ANALYZE the table frequently so as to monitor its size. Run SQL to check the size of the table's TOAST area as well.
(2) Tail the database log and check that automatic vacuum of the table and its TOAST area happens every minute.
(3) Wait until the table is at least 400 pages and the TOAST area is at least 20 MB.
(4) Wait for the database log to emit the line "automatic vacuum of table "TEST.pg_toast.pg_toast_xxxx": index scans: ..." (pg_toast_xxxx is the TOAST area of the table test), then IMMEDIATELY issue the CLUSTER command in the psql session:

    cluster test using test_pk;

Testing results:
=========
(1) If the timing is right, I occasionally get the ERROR (about once out of ten tries):

    ERROR: missing chunk number 0 for toast value 13113341 in pg_toast_13084520

(2) Once I get the ERROR, subsequent CLUSTER commands get the same ERROR until the next autovacuum of the TOAST area is done.
(3) I tested on PostgreSQL 9.1.7 and 9.1.3 and can get the ERROR on both, so this is not a 9.1.7-specific bug.

Thanks a lot for your help.

Pius
________________________________________
From: Tom Lane [tgl@sss.pgh.pa.us]
Sent: Tuesday, January 22, 2013 8:14 PM
To: Pius Chan
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #7819: missing chunk number 0 for toast value 1235919 in pg_toast_35328

Pius Chan <pchan@contigo.com> writes:
> Hi Tom,
> Yes, we started seeing this ERROR after upgrading to 9.1.7. The ERROR is from the "cluster jms_messages" command. Last night, the ERROR happened three times:
> (1) at 00:20:01
> ERROR: missing chunk number 0 for toast value 1239124 in pg_toast_35328
> (2) at 00:25:01
> ERROR: missing chunk number 0 for toast value 1239124 in pg_toast_35328
> (3) at 00:35:01
> ERROR: missing chunk number 0 for toast value 1241022 in pg_toast_35328
> The "cluster jms_messages" command runs every 5 minutes. However, so far, it seems that the ERROR happens around midnight, and 35328 is the TOAST area of the "jms_messages" table.

So what else is this application doing around midnight?  In particular,
it seems like something must've happened between 00:15 and 00:20 to
create the problem with OID 1239124, and then something else happened
between 00:25 and 00:30 to get rid of it.  And then a similar sequence
happened between 00:30 and 00:40 involving OID 1241022.
Most likely the trigger events are application actions against
jms_messages, but we have no info about what.  (It's also possible that
autovacuums of jms_messages are involved, so you might want to crank up
your logging enough so you can see those in the postmaster log.)

I had first thought that this might have something to do with the
toast-value-processing changes we made in CLUSTER a year or so ago, but
a look in the commit logs confirms that those patches were in 9.1.3.
So I have no idea what might've broken between 9.1.3 and 9.1.7.  We need
data, or even better an actual test case.

                        regards, tom lane
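For reference, the per-row churn traffic described in the testing set-up above (Parts 1 and 2: INSERT, then UPDATE txop to 'D', then DELETE, each statement in its own transaction) could be driven by a small script along these lines. This is only a sketch of what such a driver might look like, not the reporter's actual program; the payload encoding and the helper name `churn_statements` are illustrative assumptions, and the statements would be executed via psql or a client library in autocommit mode so that each one is its own transaction.

```python
# Sketch of the churn traffic from Parts 1 and 2 of the testing set-up:
# each row is INSERTed, UPDATEd (txop 'A' -> 'D'), then DELETEd, each
# statement in its own transaction (e.g. run in autocommit mode).
# Table and column names come from the set-up; the rest is illustrative.

def churn_statements(row_id, payload_bytes):
    """Build the three single-statement transactions for one row.

    A payload over 1 MB forces out-of-line TOAST storage (Part 1);
    a ~3000-byte payload stays inline in the heap (Part 2).
    """
    # bytea hex-format literal: '\x' followed by two hex digits per byte
    payload = "\\x" + "00" * payload_bytes
    return [
        f"INSERT INTO test (id, txop, messageblob) "
        f"VALUES ({row_id}, 'A', '{payload}'::bytea);",
        f"UPDATE test SET txop = 'D' WHERE id = {row_id};",
        f"DELETE FROM test WHERE id = {row_id};",
    ]

# Part 1: TOAST-consuming traffic (payload > 1 MB)
toast_txns = churn_statements(1, 1_100_000)
# Part 2: background traffic (~3000-byte payload, stored inline)
inline_txns = churn_statements(2, 3000)
```

To reproduce the log conditions the report lists (autovacuum every minute, dead rows accumulating), a real driver would run these statement triples in a tight loop from several concurrent connections.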