Re: [GENERAL] ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

From: Harry Ambrose
Subject: Re: [GENERAL] ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100
Date:
Msg-id: EBB819BB-5660-4D9A-BD87-10ED5D7B1DE0@gmail.com
In reply to: Re: [GENERAL] ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100  (Achilleas Mantzios <achill@matrix.gatewaynet.com>)
Responses: Re: [GENERAL] ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100  (Achilleas Mantzios <achill@matrix.gatewaynet.com>)
Re: [GENERAL] ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general
Hi,

I too have been experiencing this with a busy PostgreSQL instance.

I have been following the updates to the 9.4 branch hoping a fix will appear, but sadly no luck yet. I have manually replicated the issue on 9.4.4, 9.4.10 and 9.4.12. My replication steps are:

BEGIN;
CREATE TABLE x (id BIGSERIAL PRIMARY KEY, payload1 VARCHAR, payload2 VARCHAR, payload3 VARCHAR, payload4 BIGINT, payload5 BIGINT);
/* Repeat until 2,000,000 rows are inserted */
INSERT INTO x (id, payload1, payload2, payload3, payload4, payload5) VALUES (random values of varying length/size to force random toast usage);
COMMIT;

VACUUM (ANALYZE, FULL);

BEGIN;
/* Repeat until all 2,000,000 rows are updated */
UPDATE x SET payload1 = ..., payload2 = ..., payload3 = ..., payload4 = ..., payload5 = ...; /* again random values of varying length/size to force random toast usage */
COMMIT;

VACUUM (ANALYZE, FULL);

The second vacuum causes an ERROR identical to the one you are reporting below (unexpected chunk number n (expected n) for toast value...). However, it may take up to ten attempts to reproduce it.
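
For what it's worth, the random payloads in my script are generated along these lines (an illustrative sketch rather than my exact script; the lengths are arbitrary, just large enough to force toasting):

/* relies on the BIGSERIAL default to populate id */
INSERT INTO x (payload1, payload2, payload3, payload4, payload5)
SELECT repeat(md5(random()::text), 1 + (random() * 100)::int),
       repeat(md5(random()::text), 1 + (random() * 100)::int),
       repeat(md5(random()::text), 1 + (random() * 100)::int),
       (random() * 1000000000)::bigint,
       (random() * 1000000000)::bigint
FROM generate_series(1, 2000000);

/* the update step follows the same pattern */
UPDATE x
SET payload1 = repeat(md5(random()::text), 1 + (random() * 100)::int),
    payload2 = repeat(md5(random()::text), 1 + (random() * 100)::int),
    payload3 = repeat(md5(random()::text), 1 + (random() * 100)::int),
    payload4 = (random() * 1000000000)::bigint,
    payload5 = (random() * 1000000000)::bigint;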

Out of interest, are you using any tablespaces other than pg_default? I can only replicate the issue when using separately mounted tablespaces.
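
For reference, the separately mounted tablespace in my tests is set up roughly like this (the path is only an example):

CREATE TABLESPACE ts_test LOCATION '/mnt/testvol/pg_ts';
/* either append TABLESPACE ts_test to the CREATE TABLE above, or move the table afterwards: */
ALTER TABLE x SET TABLESPACE ts_test;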

I have been investigating this quite extensively and everything I can find on the web suggests data corruption. However, running the following DO block reports no errors, and I can dump the database without issue.

DO $$
DECLARE
        curid INT := 0;
        vcontent RECORD;
        badid BIGINT;

        payload1_sub VARCHAR;
        payload2_sub VARCHAR;
        payload3_sub VARCHAR;
        payload4_sub VARCHAR;
        payload5_sub VARCHAR;
BEGIN
        FOR badid IN SELECT id FROM x
        LOOP
            curid := curid + 1;

            IF curid % 100000 = 0
            THEN
                RAISE NOTICE '% rows inspected', curid;
            END IF;

            BEGIN
                -- Fetch the row and force every column to be detoasted;
                -- a damaged toast entry raises an error here.
                SELECT *
                INTO vcontent
                FROM x
                WHERE id = badid;

                payload1_sub := SUBSTR(vcontent.payload1, 2000, 5000);
                payload2_sub := SUBSTR(vcontent.payload2, 2000, 5000);
                payload3_sub := SUBSTR(vcontent.payload3, 2000, 5000);
                payload4_sub := SUBSTR(vcontent.payload4::VARCHAR, 2000, 5000);
                payload5_sub := SUBSTR(vcontent.payload5::VARCHAR, 2000, 5000);
            EXCEPTION WHEN OTHERS THEN
                RAISE NOTICE 'Data for id % is corrupt', badid;
                CONTINUE;
            END;
        END LOOP;
END;
$$;


Best wishes,
Harry

On 7 Jun 2017, at 15:22, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

On 07/06/2017 16:33, ADSJ (Adam Sjøgren) wrote:
Our database has started reporting errors like this:

  2017-05-31 13:48:10 CEST ERROR:  unexpected chunk number 0 (expected 1) for toast value 14242189 in pg_toast_10919630
      ...
  2017-06-01 11:06:56 CEST ERROR:  unexpected chunk number 0 (expected 1) for toast value 19573520 in pg_toast_10919630

(157 times, for different toast values, same pg_toast_nnn). pg_toast_10919630
corresponds to a table with around 168 million rows.

These went away, but the next day we got similar errors from another
table:

  2017-06-02 05:59:50 CEST ERROR:  unexpected chunk number 0 (expected 1) for toast value 47060150 in pg_toast_10920100
      ...
  2017-06-02 06:14:54 CEST ERROR:  unexpected chunk number 0 (expected 1) for toast value 47226455 in pg_toast_10920100

(Only 4 this time.) pg_toast_10920100 corresponds to a table which holds
around 320 million rows (these are our two large tables).

The next day we got 6 such errors and the day after 10 such errors. On
June 5th we got 94, yesterday we got 111, of which one looked a little
different:

  2017-06-06 17:32:21 CEST ERROR:  unexpected chunk size 1996 (expected 1585) in final chunk 0 for toast value 114925100 in pg_toast_10920100

and today the logs have 65 lines, ending with these:

  2017-06-07 14:49:53 CEST ERROR:  unexpected chunk number 0 (expected 1) for toast value 131114834 in pg_toast_10920100
  2017-06-07 14:53:41 CEST ERROR:  unexpected chunk number 0 (expected 1) for toast value 131149566 in pg_toast_10920100
First, try to find which tables those toast relations refer to:
select 10919630::regclass, 10920100::regclass;
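Or equivalently, via the reltoastrelid link in pg_class (a sketch; substitute the toast table name as needed):

select n.nspname, c.relname
from pg_class c
join pg_namespace n on n.oid = c.relnamespace
where c.reltoastrelid = 'pg_toast.pg_toast_10920100'::regclass;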
Are those critical tables? Can you restore them somehow?

Also you may consider
REINDEX TABLE pg_toast.pg_toast_10920100;
REINDEX TABLE pg_toast.pg_toast_10919630;
REINDEX TABLE <name of table 10920100>;
REINDEX TABLE <name of table 10919630>;

also VACUUM the above tables.

You might want to write a function which iterates over the damaged table's rows in order to identify the damaged row(s), and then run an update on those rows with known-good data so that a fresh row version (and fresh toast entries) gets written.
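
For example, something along these lines (damaged_table, big_column, pk_column and the id are all just placeholders):

UPDATE damaged_table
SET big_column = 'replacement value'  -- known-good data for that row
WHERE pk_column = 12345;              -- id identified by the scan above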

The database is 10 TB on disk (SSDs) and runs on a 48 core server with 3
TB RAM on Ubuntu 14.04 (Linux 3.18.13).

We are updating rows in the database a lot/continuously.

There are no apparent indications of hardware errors (like ECC) in
dmesg, nor any error messages logged by the LSI MegaRAID controller, as
far as I can tell.

We are running PostgreSQL 9.3.14 currently.

The only thing I could see in the release notes since 9.3.14 that might
be related is this:

 "* Avoid very-low-probability data corruption due to testing tuple
    visibility without holding buffer lock (Thomas Munro, Peter Geoghegan,
    Tom Lane)"

Although reading more about it, it doesn't sound like it would exhibit
the symptoms we see?

We have recently increased the load (to around twice the number of
cores), though, which made me think we could be triggering corner cases
we haven't hit before.

We will be upgrading to PostgreSQL 9.3.17 during the weekend, but I'd like to hear
if anyone has seen something like this, or has some ideas of how to
investigate/what the cause might be.


  Best regards,

    Adam


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt



