Thread: cannot delete corrupted rows after DB corruption: tuple concurrently updated


cannot delete corrupted rows after DB corruption: tuple concurrently updated

From: john gale
Date:
We ran into an open file limit on the DB host (Mac OS X 10.9.0, Postgres 9.3.2) and caused the familiar "ERROR:
unexpected chunk number 0 (expected 1) for toast value 155900302 in pg_toast_16822" when selecting data.

Previously when we've run into this kind of corruption we could find the specific corrupted rows in the table and
delete by ctid.  However, this time we're running into a persistent "ERROR:  tuple concurrently updated" when deleting
by ctid.

munin2=# select ctid from testruns where id = 141889653;
     ctid
--------------
 (37069816,3)
(1 row)

munin2=# delete from testruns where ctid = '(37069816,3)';
ERROR:  tuple concurrently updated

This always occurs and seems to prevent us from cleaning up the database by removing the corrupted rows.

Before attempting more drastic things like restarting the postgres instance, is there some known way of getting
around this error and cleaning up the corruption (other than the full replicate / reindex suggestions from around the
web that are more involved than deleting corrupted rows by ctid)?
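
(For reference, the way we locate the bad rows is a brute-force scan along these lines -- a simplified sketch, not our exact script. Casting each row to text forces every TOASTed column to be detoasted, so a corrupted row raises the chunk error, which we catch and log:)

```sql
-- Simplified sketch (not our exact script): walk the table row by row,
-- forcing full detoasting of each row; corrupted rows raise the
-- "unexpected chunk number" error, which we catch and log with their ctid.
DO $$
DECLARE
    rec record;
BEGIN
    FOR rec IN SELECT ctid FROM testruns LOOP
        BEGIN
            -- Casting the whole row to text detoasts all columns.
            PERFORM t::text FROM testruns t WHERE t.ctid = rec.ctid;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'corrupted row at ctid %: %', rec.ctid, SQLERRM;
        END;
    END LOOP;
END $$;
```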

thanks,

    ~ john


Re: cannot delete corrupted rows after DB corruption: tuple concurrently updated

From: john gale
Date:
Does anybody have any ideas about this?

We restarted the postmaster and the issue persists.  So whereas in 9.0.4 we could clean up corruption, it seems
in 9.3.2 we can no longer do so.  I'm assuming this because our data insert environment has not changed, so
we shouldn't be hitting any different transaction concurrency / isolation problems than we did before.

Is there a way to force deletion of a row, ignoring concurrency?  It looks like changing
default_transaction_isolation did not affect this:

munin2=# delete from testruns where ctid = '(37069305,4)';
ERROR:  tuple concurrently updated

2014-02-26 07:42:46 GMT LOG:  received SIGHUP, reloading configuration files
2014-02-26 07:42:46 GMT LOG:  parameter "default_transaction_isolation" changed to "read uncommitted"
2014-02-26 07:42:53 GMT ERROR:  tuple concurrently updated
2014-02-26 07:42:53 GMT STATEMENT:  delete from testruns where ctid = '(37069305,4)';

thanks,

    ~ john





Re: cannot delete corrupted rows after DB corruption: tuple concurrently updated

From: "Tomas Vondra"
Date:
On 26 Feb 2014, 8:45, john gale wrote:
>
> Does anybody have any ideas about this.
>
> We restarted the postmaster and the issue persists.  So previously in
> 9.0.4 where we could clean corruption, it seems in 9.3.2 we can no longer
> clean corruption.  I'm assuming this because our data insert environment
> has not changed, so we shouldn't be hitting any different transaction
> concurrency / isolation problems than we did before.
>
> Is there a way to force deletion of a row, ignoring concurrency, similar
> to concurrent updates.  It looks like changing
> default_transaction_isolation did not affect this:
>
> munin2=# delete from testruns where ctid = '(37069305,4)';
> ERROR:  tuple concurrently updated

AFAIK this error is raised when a before trigger modifies the row that is
being deleted. Imagine a trigger that does this

   UPDATE testruns SET mycolumn = 1 WHERE id = OLD.id;
   RETURN OLD;

Given the way MVCC in postgres works (copying row when updating), the
error makes sense. In 9.0 this worked by silently skipping the DELETE
(incidentally, I had a few reports about tables that can't be deleted
because of this in the past month).
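
Spelled out as a complete trigger (hypothetical function and column names, just to illustrate the failure mode), the pattern would be:

```sql
-- Hypothetical illustration only: a BEFORE DELETE trigger that updates
-- the very row being deleted, which is the pattern described above.
CREATE FUNCTION touch_row() RETURNS trigger AS $$
BEGIN
    -- This UPDATE creates a new version of the row that the DELETE
    -- is already targeting, so the DELETE sees a concurrent update.
    UPDATE testruns SET mycolumn = 1 WHERE id = OLD.id;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER touch_before_delete
    BEFORE DELETE ON testruns
    FOR EACH ROW EXECUTE PROCEDURE touch_row();
```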

Anyway, do you have any triggers on the table? If yes, try to disable
them. I suspect the data are corrupted in a way that causes update on the
deleted row - either directly, or maybe because of a cascading effect.

I'm wondering if it might be caused by RI triggers - maybe yes, but I'm
not aware of any RI trigger doing updates.

That being said, I think that what you're doing is wrong. If you think you
have a corrupted database, I'd strongly suggest doing dump/restore. Or how
do you know there's no other corruption lurking in the files, slowly
spreading to other parts of the database?

Tomas



Re: cannot delete corrupted rows after DB corruption: tuple concurrently updated

From: john gale
Date:

On Feb 26, 2014, at 2:59 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

> On 26 Feb 2014, 8:45, john gale wrote:
>
>> munin2=# delete from testruns where ctid = '(37069305,4)';
>> ERROR:  tuple concurrently updated
>
> AFAIK this error is raised when a before trigger modifies the row that is
> being deleted. Imagine a trigger that does this
>
>    UPDATE testruns SET mycolumn = 1 WHERE id = OLD.id;
>    RETURN OLD;
>
> Given the way MVCC in postgres works (copying row when updating), the
> error makes sense. In 9.0 this worked by silently skipping the DELETE
> (incidentally, I had a few reports about tables that can't be deleted
> because of this in the past month).
>
> Anyway, do you have any triggers on the table? If yes, try to disable
> them. I suspect the data are corrupted in a way that causes update on the
> deleted row - either directly, or maybe because of a cascading effect.


There were a few triggers auto-created by a foreign key constraint but we removed the constraint, which removed the triggers:

munin2=# select * from pg_trigger;
 tgrelid | tgname | tgfoid | tgtype | tgenabled | tgisinternal | tgconstrrelid | tgconstrindid | tgconstraint | tgdeferrable | tginitdeferred | tgnargs | tgattr | tgargs | tgqual 
---------+--------+--------+--------+-----------+--------------+---------------+---------------+--------------+--------------+----------------+---------+--------+--------+--------
(0 rows)

The tuple error still exists, however:

munin2=# delete from testruns where ctid = '(37069305,4)';
ERROR:  tuple concurrently updated
munin2=# select id from testruns where ctid = '(37069305,4)';
    id     
-----------
 141908486
(1 row)

munin2=# delete from testruns where id = 141908486;
ERROR:  tuple concurrently updated
munin2=# select * from testruns where id = 141908486;
ERROR:  unexpected chunk number 0 (expected 1) for toast value 155900302 in pg_toast_16822


> I'm wondering if it might be caused by RI triggers - maybe yes, but I'm
> not aware of any RI trigger doing updates.
>
> That being said, I think that what you're doing is wrong. If you think you
> have a corrupted database, I'd strongly suggest doing dump/restore. Or how
> do you know there's no other corruption lurking in the files, slowly
> spreading to other parts of the database?


We're aware that we're shoveling dirt to patch a crack in a large highway.  However, at the moment we value uptime over strict integrity of the data (141 million rows allows some wiggle room), and since we don't modify these rows after they're inserted, I can't imagine how this kind of row corruption could "slowly spread to other parts of the database".

~ john