Обсуждение: Re: pg_dump error... Follow up
Adam Witney <awitney@sgul.ac.uk> writes: > Here you go.... > pg_filedump-3.0/pg_filedump -i -f -R 34318 34320 134401986.1 Thanks. What it looks like to me is that block 34320 (really 165392) is data from some other file altogether. It's evidently still Postgres heap data, but instead of having 3 non-null columns as any toast row ought to have, these rows have 77 columns many of which are nulls. They've got OIDs, too. Possibly you can work out which table these rows really belong to. It looks like this ought to be block 415664 of whatever table it belongs to (which would make it block 22448 of the xxx.3 file of that table, if I did the math right). So the diagnosis is that somebody wrote a data block to the wrong offset in the wrong file. Whether this is the fault of Postgres, the kernel, or the disk drive is difficult to say. We've seen a number of cases in which table pages got overwritten with data that was obviously of non-Postgres origin, and in those cases we could blame the kernel or disk drive with a clear conscience. In this case, since the bogus data is Postgres data, it could be that it's a bug lurking within Postgres itself --- or it could be that it's like those past cases. It might be worth your while to run some memory and disk drive tests. There's no particular reason to suspect a hardware fault more than a software one, but this is at least something simple to do. Check for availability of kernel updates, too. regards, tom lane
On 8/9/05 3:46 pm, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > Adam Witney <awitney@sgul.ac.uk> writes: >> Here you go.... > >> pg_filedump-3.0/pg_filedump -i -f -R 34318 34320 134401986.1 > > Thanks. What it looks like to me is that block 34320 (really 165392) > is data from some other file altogether. It's evidently still Postgres > heap data, but instead of having 3 non-null columns as any toast row > ought to have, these rows have 77 columns many of which are nulls. > They've got OIDs, too. Possibly you can work out which table these > rows really belong to. It looks like this ought to be block 415664 > of whatever table it belongs to (which would make it block 22448 of > the xxx.3 file of that table, if I did the math right). I only have one file with a .3 in that database... Or could it be from a different database altogether? (although none of the others get much updates at all) Thanks again adam -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
On Thu, Sep 08, 2005 at 04:26:16PM +0100, Adam Witney wrote: > On 8/9/05 3:46 pm, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > > > Adam Witney <awitney@sgul.ac.uk> writes: > >> Here you go.... > > > >> pg_filedump-3.0/pg_filedump -i -f -R 34318 34320 134401986.1 > > > > Thanks. What it looks like to me is that block 34320 (really 165392) > > is data from some other file altogether. It's evidently still Postgres > > heap data, but instead of having 3 non-null columns as any toast row > > ought to have, these rows have 77 columns many of which are nulls. > > They've got OIDs, too. Possibly you can work out which table these > > rows really belong to. It looks like this ought to be block 415664 > > of whatever table it belongs to (which would make it block 22448 of > > the xxx.3 file of that table, if I did the math right). > > I only have one file with a .3 in that database... How many columns does that table have? > Or could it be from a different database altogether? (although none of > the others get much updates at all) It's not impossible ... To Tom: could this be caused by a WAL recovery that wrote a page image to the wrong table? I guess it is very unlikely because the CRC of the WAL record would likely not match, but it's an idea. -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com "At least to kernel hackers, who really are human, despite occasional rumors to the contrary" (LWN.net)
> How many columns does that table have? It has 77 columns. INSERTS are always done on both this table and the table that had the toast table error within the same transaction if that matters? -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > To Tom: could this be caused by a WAL recovery that wrote a page image > to the wrong table? I guess it is very unlikely because the CRC of the > WAL record would likely not match, but it's an idea. I don't see any reason to think that WAL recovery would be more likely to cause this than normal operation. If it is a Postgres bug, my thoughts were pointing in the direction of a race condition that somehow allows the tag (identifier) of a buffer to be changed before it gets written out. I don't see any way that could happen ... but given the very small number of reports of such problems (i.e. 1) it would definitely have to be induced by some really low-probability event, such as a race condition with a very narrow window. Or it might not be our bug. We've certainly seen previous cases in which either the kernel or the disk drive wrote the wrong data --- it could be that this is the same thing, only it happened that the "wrong data" was some other Postgres data that happened to be passing through the system at about the same time. regards, tom lane