Обсуждение: Re: pg_dump error... Follow up

Поиск
Список
Период
Сортировка

Re: pg_dump error... Follow up

От
Tom Lane
Дата:
Adam Witney <awitney@sgul.ac.uk> writes:
> Here you go....

> pg_filedump-3.0/pg_filedump -i -f -R 34318 34320 134401986.1

Thanks.  What it looks like to me is that block 34320 (really 165392)
is data from some other file altogether.  It's evidently still Postgres
heap data, but instead of having 3 non-null columns as any toast row
ought to have, these rows have 77 columns many of which are nulls.
They've got OIDs, too.  Possibly you can work out which table these
rows really belong to.  It looks like this ought to be block 415664
of whatever table it belongs to (which would make it block 22448 of
the xxx.3 file of that table, if I did the math right).

So the diagnosis is that somebody wrote a data block to the wrong offset
in the wrong file.  Whether this is the fault of Postgres, the kernel,
or the disk drive is difficult to say.  We've seen a number of cases in
which table pages got overwritten with data that was obviously of
non-Postgres origin, and in those cases we could blame the kernel or
disk drive with a clear conscience.  In this case, since the bogus data
is Postgres data, it could be that it's a bug lurking within Postgres
itself --- or it could be that it's like those past cases.

It might be worth your while to run some memory and disk drive tests.
There's no particular reason to suspect a hardware fault more than a
software one, but this is at least something simple to do.  Check for
availability of kernel updates, too.

            regards, tom lane

Re: pg_dump error... Follow up

От
Adam Witney
Дата:
On 8/9/05 3:46 pm, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Adam Witney <awitney@sgul.ac.uk> writes:
>> Here you go....
>
>> pg_filedump-3.0/pg_filedump -i -f -R 34318 34320 134401986.1
>
> Thanks.  What it looks like to me is that block 34320 (really 165392)
> is data from some other file altogether.  It's evidently still Postgres
> heap data, but instead of having 3 non-null columns as any toast row
> ought to have, these rows have 77 columns many of which are nulls.
> They've got OIDs, too.  Possibly you can work out which table these
> rows really belong to.  It looks like this ought to be block 415664
> of whatever table it belongs to (which would make it block 22448 of
> the xxx.3 file of that table, if I did the math right).

I only have one file with a .3 in that database... Or could it be from a
different database altogether? (although none of the others get much updates
at all)

Thanks again

adam


--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.


Re: pg_dump error... Follow up

От
Alvaro Herrera
Дата:
On Thu, Sep 08, 2005 at 04:26:16PM +0100, Adam Witney wrote:
> On 8/9/05 3:46 pm, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>
> > Adam Witney <awitney@sgul.ac.uk> writes:
> >> Here you go....
> >
> >> pg_filedump-3.0/pg_filedump -i -f -R 34318 34320 134401986.1
> >
> > Thanks.  What it looks like to me is that block 34320 (really 165392)
> > is data from some other file altogether.  It's evidently still Postgres
> > heap data, but instead of having 3 non-null columns as any toast row
> > ought to have, these rows have 77 columns many of which are nulls.
> > They've got OIDs, too.  Possibly you can work out which table these
> > rows really belong to.  It looks like this ought to be block 415664
> > of whatever table it belongs to (which would make it block 22448 of
> > the xxx.3 file of that table, if I did the math right).
>
> I only have one file with a .3 in that database...

How many columns does that table have?

> Or could it be from a different database altogether? (although none of
> the others get much updates at all)

It's not impossible ...

To Tom: could this be caused by a WAL recovery that wrote a page image
to the wrong table?  I guess it is very unlikely because the CRC of the
WAL record would likely not match, but it's an idea.

--
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
"At least to kernel hackers, who really are human, despite occasional
rumors to the contrary" (LWN.net)

Re: pg_dump error... Follow up

От
Adam Witney
Дата:
> How many columns does that table have?

It has 77 columns. INSERTS are always done on both this table and the table
that had the toast table error within the same transaction if that matters?




--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.


Re: pg_dump error... Follow up

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> To Tom: could this be caused by a WAL recovery that wrote a page image
> to the wrong table?  I guess it is very unlikely because the CRC of the
> WAL record would likely not match, but it's an idea.

I don't see any reason to think that WAL recovery would be more likely
to cause this than normal operation.

If it is a Postgres bug, my thoughts were pointing in the direction of a
race condition that somehow allows the tag (identifier) of a buffer to
be changed before it gets written out.  I don't see any way that could
happen ... but given the very small number of reports of such problems
(i.e. 1) it would definitely have to be induced by some really
low-probability event, such as a race condition with a very narrow
window.

Or it might not be our bug.  We've certainly seen previous cases in
which either the kernel or the disk drive wrote the wrong data --- it
could be that this is the same thing, only it happened that the "wrong
data" was some other Postgres data that happened to be passing through
the system at about the same time.

            regards, tom lane