Обсуждение: 7.0.3 database corruption
I know you guys want to focus on 7.1 or 7.2, and yes I am trying to move to
7.1, but it won't happen overnight.
We have a serious problem with 7.0.3 and data corruption. We have a program
that compares records in two tables. It creates a set of SQL scripts that
either update, insert, or delelte records based on the comparison of the two
tables. These scripts are then run against multiple database servers, which are
slaves.
Some of these scripts get pretty big, and take a while to run (10s of thousands
of records are affected). After we run the scripts the database seems fine.
Then we run two SQL scripts which create some summary tables. After we run the
scripts, it looks like the database is corrupt.
Oddly enough, if we run vacuum prior to running these scripts, the database
does not seem to get corrupted.
All I really need to know is if anyone has seen anything in the code which
would explain this, and if so, do you know if is fixed in 7.1.x?
The scripts look like:
<<<<<<<<<<<<< stattbl.sql >>>
drop table musicstat ;
-- This crap is to find oldest new release
create temp table lastnr ( name varchar, value date );
insert into lastnr select 'LASTNRUPDATE' as "name", min(
to_date(released,'MM/DD/YYYY')) as "value" from new_ztitles where released !=
'n/a' ;
select name, released as "value" into musicstat from new_ztitles, lastnr where
to_date(released, 'MM/DD/YYYY') = lastnr.value limit 1;
-- Count number of songs
insert into musicstat select 'tracks' as "name", count(*) as "value" from
zsong;
-- Count number of albums
insert into musicstat select 'albums' as "name", count(*) as "value" from
ztitles;
-- Count number of artists
insert into musicstat select 'artists' as "name", count(*) as "value" from
dartists;
insert into musicstat select 'dotclick_albums' as "name", count(*) as "value"
from dotclicktitle ;
insert into musicstat select 'dotclick_songs' as "name", count(*) as "value"
from dotclicksong ;
insert into musicstat select 'deleted songs' as "name", count(*) as "value"
from old_zsong ;
insert into musicstat select 'deleted titles' as "name", count(*) as "value"
from old_ztitles ;
insert into musicstat select 'new titles' as "name", count(*) as "value" from
ztitles where acd = 'A';
insert into musicstat select 'new songs' as "name", count(*) as "value" from
zsong, ztitles where ztitles.acd = 'A' and zsong.muzenbr = ztitles.muzenbr;
insert into musicstat select 'updated albums' as "name", count( distinct(
muzenbr)) as "value" from zsong where acd = 'U' ;
insert into musicstat select 'new artists' as "name", count(*) as "value" from
tartists, dartists where tartists.artist = dartists.artist ;
insert into musicstat (name,value) values('CONCERTGO' , '1' );
<<<<<<<<< end >>>>>>>>>>>>>>>>>
<<<<<<<< newrel >>>>>>>>>>>>>>
drop table new_ztitles ;
drop table new_releases ;
create table new_ztitles as select * from ztitles T
where T.origrel > (date_part('year','now'::abstime)-1)
and to_date(T.released, 'MM/DD/YYYY') > ('now'::abstime - '30 days'::reltime)
and T.datasrc = 1;
create index newztitles_muzenbr on new_ztitles (muzenbr);
create table new_releases as select T.origrel, T.released, T.title, T.muzenbr,
T.artistid,
T.performer2 as artist, S.song, S.trackid
from new_ztitles T, zsong S where S.muzenbr = T.muzenbr
and S.datasrc = 1
order by performer2;
-- new releases indexes
create index new_rel_artist_lower on new_releases (varchar_lower(artist) );
create index new_rel_artist_meta on new_releases (metatext(artist) );
create index new_rel_artist_strip on new_releases (strip(artist) );
create index new_rel_title_lower on new_releases (varchar_lower(title) );
create index new_rel_title_meta on new_releases (metatext(title) );
create index new_rel_title_strip on new_releases (strip(title) );
create index new_rel_song_lower on new_releases (varchar_lower(song) );
create index new_rel_song_meta on new_releases (metatext(song) );
create index new_rel_song_strip on new_releases (strip(song) );
-- new_ztitles indexes
create index newztitles_artistid on new_ztitles (artistid);
create index newztitles_cat3 on new_ztitles(cat3);
<<<<<<<<end>>>>>>>>>>>>>
mlw wrote: > > I know you guys want to focus on 7.1 or 7.2, and yes I am trying to move to > 7.1, but it won't happen overnight. > > We have a serious problem with 7.0.3 and data corruption. We have a program > that compares records in two tables. It creates a set of SQL scripts that > either update, insert, or delelte records based on the comparison of the two > tables. These scripts are then run against multiple database servers, which are > slaves. > > Some of these scripts get pretty big, and take a while to run (10s of thousands > of records are affected). After we run the scripts the database seems fine. > Then we run two SQL scripts which create some summary tables. After we run the > scripts, it looks like the database is corrupt. > > Oddly enough, if we run vacuum prior to running these scripts, the database > does not seem to get corrupted. > > All I really need to know is if anyone has seen anything in the code which > would explain this, and if so, do you know if is fixed in 7.1.x? There certainly are bugs in 7.0.3 - I can describe at least two: 1. an index on varchar(8) (an user name) gets corrupted so that some names are no longer found when searching by index - they are still there when doing an unqualified select and come back after reindex for the qualified one. 1a. "FATAL: bits falling of the end of world" or something like it in logs and then broken db connection after that 2. Some kind of stuck locks - a single backend stuck in "INSERT waiting" or "DELETE waiting" state. This happens sporadically and requires a db system restart to go away ------------- Hannu
> mlw wrote:
>> After we run the
>> scripts, it looks like the database is corrupt.
It's impossible to say anything useful with such an undescriptive
description of the problem.
Hannu Krosing <hannu@tm.ee> writes:
> There certainly are bugs in 7.0.3 - I can describe at least two:
I would really like to see a reproducible example of index corruption
in 7.0.*. We've heard such reports often enough to know the problem
is real, but without a test case in hand it's difficult to do much about
it.
> 2. Some kind of stuck locks - a single backend stuck in "INSERT waiting"
7.0.*'s deadlock detection algorithm is known to have some holes, but
deadlock couldn't be the explanation for just a single stuck backend.
Again, any chance of looking at an example?
regards, tom lane
Tom Lane wrote: > > > mlw wrote: > >> After we run the > >> scripts, it looks like the database is corrupt. > > It's impossible to say anything useful with such an undescriptive > description of the problem. > > Hannu Krosing <hannu@tm.ee> writes: > > There certainly are bugs in 7.0.3 - I can describe at least two: > > I would really like to see a reproducible example of index corruption > in 7.0.*. We've heard such reports often enough to know the problem > is real, but without a test case in hand it's difficult to do much about > it. I know ;( Unfortunately this has happened only a few times on some quite busy servers receiving a workload of quite varied queries. > > 2. Some kind of stuck locks - a single backend stuck in "INSERT waiting" > > 7.0.*'s deadlock detection algorithm is known to have some holes, but > deadlock couldn't be the explanation for just a single stuck backend. that's what "ps ax| grep post" output looks like in my logs Sun Jun 10 06:31:00 EET 2001 828 ? S 0:02 /usr/bin/postmaster -i -o -F 26652 ? S 5:20 /usr/bin/postgres localhost gamer casino idle 30082 ? S 0:20 /usr/bin/postgres 127.0.0.1 nobody casino idle 30084 ? S 1:26 /usr/bin/postgres 127.0.0.1 nobody casino idle 31565 ? S 0:43 /usr/bin/postgres 127.0.0.1 nobody casino idle 31595 ? S 0:19 /usr/bin/postgres 127.0.0.1 nobody casino idle 31596 ? S 0:21 /usr/bin/postgres 127.0.0.1 nobody casino idle 31597 ? S 0:31 /usr/bin/postgres 127.0.0.1 nobody casino idle 31598 ? S 1:39 /usr/bin/postgres 127.0.0.1 nobody casino idle 31600 ? S 0:17 /usr/bin/postgres 127.0.0.1 nobody casino idle 31608 ? S 0:24 /usr/bin/postgres 127.0.0.1 nobody casino idle 31612 ? S 0:24 /usr/bin/postgres 127.0.0.1 nobody casino idle 32080 ? S 0:43 /usr/bin/postgres localhost gamer casino UPDATE waiti 32706 ? S 0:10 /usr/bin/postgres localhost gamer casino idle 302 ? S 0:00 /usr/bin/postgres 127.0.0.1nobody casino idle 361 ? S 0:00 sh -c date;ps ax|grep post 364 ? S 0:00 grep post CHECKING WAITING PIDS: ['32080'] Sun Jun 10 06:31:10 EET 2001 828 ? S 0:02 /usr/bin/postmaster -i -o -F 26652 ? S 5:20 /usr/bin/postgres localhost gamer casino idle 30082 ? S 0:20 /usr/bin/postgres 127.0.0.1 nobody casino idle 30084 ? S 1:26 /usr/bin/postgres 127.0.0.1 nobody casino idle 31565 ? S 0:43 /usr/bin/postgres 127.0.0.1 nobody casino idle 31595 ? S 0:19 /usr/bin/postgres 127.0.0.1 nobody casino idle 31596 ? S 0:21 /usr/bin/postgres 127.0.0.1 nobody casino idle 31597 ? S 0:31 /usr/bin/postgres 127.0.0.1 nobody casino idle 31598 ? S 1:39 /usr/bin/postgres 127.0.0.1 nobody casino idle 31600 ? S 0:17 /usr/bin/postgres 127.0.0.1 nobody casino idle 31608 ? S 0:24 /usr/bin/postgres 127.0.0.1 nobody casino idle 31612 ? S 0:24 /usr/bin/postgres 127.0.0.1 nobody casino idle 32080 ? S 0:43 /usr/bin/postgres localhost gamer casino UPDATE waiti 32706 ? S 0:10 /usr/bin/postgres localhost gamer casino idle 302 ? S 0:00 /usr/bin/postgres 127.0.0.1nobody casino idle 365 ? S 0:00 sh -c date;ps ax|grep post 368 ? S 0:00 grep post PROCESS 32080 STILL WAITING, RESTART TIME > Again, any chance of looking at an example? I could send you tails of postgres logfiles that are rotated on detecting the INSERT/UPDATE wait condition that does not go away in 10 sec. How long logfiles (time) would be enough ? There seems to be no general pattern that leads to it though ;( --------------- Hannu
Hannu Krosing <hannu@tm.ee> writes:
> I could send you tails of postgres logfiles that are rotated on
> detecting
> the INSERT/UPDATE wait condition that does not go away in 10 sec.
> How long logfiles (time) would be enough ?
Do the logs show the queries being executed? The queries forming
the current transaction of the stuck backend, and all the transactions
that have occurred since that transaction started, would be useful
to look at.
regards, tom lane