Thread: Losing data from Postgres


Losing data from Postgres

From:
Paul Breen
Date:
Hello everyone,

Can anyone help us?

We are using Postgres in a hotspare configuration, that is, we have 2
separate boxes both running identical versions of Postgres, and every time
we insert|update|delete from the database we write to both boxes (at the
application level).  All communications to the databases are in
transaction blocks, and if we cannot commit to both databases then we
roll back.
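
In outline, the write path looks something like the sketch below (the
connection strings, table and SQL are made-up placeholders, and real code
also needs error handling around PQconnectdb and each PQexec):

    /* Application-level dual write: begin a transaction on both boxes,
     * apply the change to both, and commit only if every step worked.
     * All names here are placeholders, not our real schema. */
    #include <libpq-fe.h>

    static int exec_ok(PGconn *c, const char *sql)
    {
        PGresult *r = PQexec(c, sql);
        int ok = (r != NULL && PQresultStatus(r) == PGRES_COMMAND_OK);
        if (r)
            PQclear(r);
        return ok;
    }

    int main(void)
    {
        PGconn *master = PQconnectdb("host=master dbname=prod");
        PGconn *spare  = PQconnectdb("host=spare dbname=prod");
        const char *sql = "INSERT INTO orders VALUES (42, 'widget')";

        if (exec_ok(master, "BEGIN") && exec_ok(spare, "BEGIN")
            && exec_ok(master, sql) && exec_ok(spare, sql)) {
            exec_ok(master, "COMMIT");
            exec_ok(spare, "COMMIT");
        } else {
            exec_ok(master, "ROLLBACK");
            exec_ok(spare, "ROLLBACK");
        }

        PQfinish(master);
        PQfinish(spare);
        return 0;
    }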

Recently we have discovered that some records are missing from the main
database table on the master box but are present in the table on the
spare.  We know it's not a transaction problem (rolling back on the master
but not the spare) because reports printed at the time the records were
written show the data in the master.  It seems that some time afterwards
the records just disappear from the master!  Both boxes are raided (RAID
5); we don't know if this is significant.

Originally we were vacuuming twice a day, but because some of the reports
we produce regularly were taking too long as the database grew, we added
multiple indexes to the key tables and began vacuuming every hour.  It was
only after doing this that we noticed the data loss - we don't know if this
is coincidental or not.  Yesterday we went back to vacuuming only twice a
day.

Looking in Postgres' server log we found some worrying error messages.
Again, don't know if these are significant:

1000822.00:39:44.676 [5182] NOTICE:  LockRelease: locktable lookup failed,
no lock

515 instances on the master, 7 on the spare.

1000929.15:33:40.454 [20007] NOTICE:  LockReplace: xid table corrupted

80 instances on the master, 0 on the spare - is this the problem?

1001027.11:55:04.071 [4847] NOTICE:  Index dttmprod_ndx: NUMBER OF INDEX'
TUPLES (33586) IS NOT THE SAME AS HEAP' (33578)

1065 instances on the master, 20 on the spare.  The reason there are so
many of these messages is that we didn't spot this error until after we
had done a lot of vacuums.  Since rebuilding the indexes (dropping &
re-creating them) we haven't seen this one again.

We are using Postgres version 6.5.3.  I know it's old, but this is the
customer standard and we are loath to upgrade unless there is a problem
inherent in this version.  From a cost-of-ownership viewpoint, it's
easier to support a single version than multiple versions.  The OS is SuSE
Linux 6.3 (kernel 2.2.13), the control system programs talk to Postgres
via libpq, the web interface is PHP3 and the Tcl/Tk interface is pgtksh.
The RAID controller is a DPT SmartCache 4 (28 MB RAM).  Both the master
and the spare boxes are Intel Pentium III 550 MHz machines with 256 MB RAM.

If we've left anything out, write and we'll supply more information.

Any help, especially around Postgres' server log messages, would be greatly
appreciated.


Paul M. Breen, Software Engineer - Computer Park Ltd.

Tel:   (01536) 417155
Email: pbreen@computerpark.co.uk


Re: Losing data from Postgres

From:
Jean-Marc Pigeon
Date:
Bonjour Paul Breen
>
> Hello everyone,
>
> Can anyone help us?
>
> We are using Postgres in a hotspare configuration, that is, we have 2
> separate boxes both running identical versions of Postgres, and every time
> we insert|update|delete from the database we write to both boxes (at the
> application level).  All communications to the databases are in
> transaction blocks, and if we cannot commit to both databases then we
> roll back.
[...]
> Originally we were vacuuming twice a day, but because some of the reports
> we produce regularly were taking too long as the database grew, we added
> multiple indexes to the key tables and began vacuuming every hour.  It was
> only after doing this that we noticed the data loss - we don't know if this
> is coincidental or not.  Yesterday we went back to vacuuming only twice a
> day.

    We found something similar in our application.
    It seems to be a vacuum+index problem: the index does
    not refer to ALL the data after the vacuum!

    If I am right, drop the index and create it again,
    and your data should be found again...

    On our side now, before doing a vacuum we drop the index,
    do the vacuum, then rebuild the index.  The overall time is
    the same as doing a 'simple' vacuum.
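
    In code, the cycle is roughly this (a libpq sketch; the table
    "orders", column "dttm" and the index name are made up just for
    illustration):

        /* Drop the index, vacuum the heap, then rebuild the index
         * from scratch.  Names are placeholders; real code should
         * check PQresultStatus on every call. */
        #include <libpq-fe.h>

        static void run(PGconn *c, const char *sql)
        {
            PGresult *r = PQexec(c, sql);
            if (r)
                PQclear(r);
        }

        void maintain(PGconn *conn)
        {
            run(conn, "DROP INDEX orders_dttm_ndx");
            run(conn, "VACUUM ANALYZE orders");
            run(conn, "CREATE INDEX orders_dttm_ndx ON orders (dttm)");
        }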

    Hoping that helps...


A bientot
==========================================================================
Jean-Marc Pigeon              Internet:   Jean-Marc.Pigeon@safe.ca
SAFE Inc.                Phone: (514) 493-4280  Fax: (514) 493-1946
       REGULUS,  a real time accounting/billing package for ISP
           REGULUS' Home base <"http://www.regulus.safe.ca">
==========================================================================

Re: Losing data from Postgres

From:
Serge Canizares
Date:

Paul Breen wrote:

> Both boxes are raided (RAID
> 5); we don't know if this is significant.

Not sure about the data loss problem, but from a performance perspective, RAID
5 doesn't sound to me like the way to go for a high-performance db file
system.  You are dividing up writes (and reads) between disks plus calculating
checksums during writes.  I think a better idea would be to use RAID
1+0: take a few drives and stripe them in a RAID 0 config to get the storage
space you need, then duplicate the set and mirror the RAID 0 (hence RAID
1+0).  Your reads will go a lot faster that way, and although you have to
double up on writes, at least the blocks are contiguous on the drives, and
your controller doesn't have to calculate checksums.

Of course, if someone sees a reason that RAID 5 would be better than RAID 1+0,
I'd appreciate an explanation!

-Serge


Re: Losing data from Postgres

From:
Alfred Perlstein
Date:
* Serge Canizares <serge@ephilosopher.com> [001115 08:23] wrote:
>
> Of course, if someone sees a reason that RAID 5 would be better than RAID 1+0,
> I'd appreciate an explanation!

Cost. :)

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

Re: Losing data from Postgres

From:
Alfred Perlstein
Date:
* Paul Breen <pbreen@computerpark.co.uk> [001115 07:46] wrote:
> Hello everyone,
>
[snip]
>
> Recently we have discovered that some records are missing from the main
> database table on the master box but are present in the table on the
> spare.  We know it's not a transaction problem (rolling back on the master
> but not the spare) because reports printed at the time the records were
> written show the data in the master.  It seems that some time afterwards
> the records just disappear from the master!  Both boxes are raided (RAID
> 5); we don't know if this is significant.
>
[snip]
>
> We are using Postgres version 6.5.3.  I know it's old, but this is the
> customer standard and we are loath to upgrade unless there is a problem
> inherent in this version.  From a cost-of-ownership viewpoint, it's
> easier to support a single version than multiple versions.  The OS is SuSE
> Linux 6.3 (kernel 2.2.13), the control system programs talk to Postgres
> via libpq, the web interface is PHP3 and the Tcl/Tk interface is pgtksh.
> The RAID controller is a DPT SmartCache 4 (28 MB RAM).  Both the master
> and the spare boxes are Intel Pentium III 550 MHz machines with 256 MB RAM.

My suggestion would be to try the latest version, 7.0.3; it has fixed
numerous bugs for us and performs a lot better.  We used to have a
bunch of problems with 6.5.3, and upgrading to 7.0.2 solved a lot of
them; 7.0.3 solved the rest of the problems we'd seen in 7.0.2.

best of luck,
-Alfred

Re: Losing data from Postgres

From:
"Hossein S. Zadeh"
Date:
On Wed, 15 Nov 2000, Alfred Perlstein wrote:

> * Serge Canizares <serge@ephilosopher.com> [001115 08:23] wrote:
> >
> > Of course, if someone sees a reason that RAID 5 would be better than RAID 1+0,
> > I'd appreciate an explanation!
>
> Cost. :)
>

A little bit more explanation: :-)

RAID 1+0 gives you only half of the installed space but gives you (n/2)
times the speed of the individual disks. For example, with 4 hard disks of
1 GB each, you get only 2 GB of space but double the speed of the
individual disks.

RAID 5 gives you space equal to (n-1) times the individual disks. This is
far better than RAID 1+0. For example, with 4 hard disks of 1 GB each, you
get 3 GB of space (this is only 25% waste, compared to 50% for RAID 1+0).
As you add hard disks to the array, the 25% waste ratio of RAID 5 gets
lower and lower, but that of RAID 1+0 stays at 50%. With 10 hard disks,
for example, the ratio gets down to 10%.
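
To make the numbers concrete, here is a toy calculation (C, just for
illustration; the disk size and counts are example values):

    /* Usable space and waste for RAID 1+0 vs RAID 5 with n disks. */
    #include <stdio.h>

    int main(void)
    {
        const double size_gb = 1.0;            /* size of each disk */
        for (int n = 4; n <= 10; n += 2) {
            double raid10 = (n / 2) * size_gb; /* half the disks hold mirrors */
            double raid5  = (n - 1) * size_gb; /* one disk's worth of parity  */
            printf("%2d disks: RAID 1+0 %4.1f GB (%2.0f%% waste), "
                   "RAID 5 %4.1f GB (%2.0f%% waste)\n",
                   n, raid10, 100 * (1 - raid10 / (n * size_gb)),
                   raid5,  100 * (1 - raid5  / (n * size_gb)));
        }
        return 0;
    }
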
The speed of RAID 5, however, is very much dependent on a few factors: the
speed of the controller (or CPU speed in the case of software RAID), the
type of data, and how the array is set up (how many blocks of data per
stripe, etc.). In theory it can exceed the speed of RAID 1+0, but I have
never seen that in real life (though it does approach RAID 1+0 if you
spend $$$ on the controller or CPU).


Hope it helps,
Hossein



Re: Losing data from Postgres

From:
Alfred Perlstein
Date:
* Hossein S. Zadeh <hossein@hossein.bf.rmit.edu.au> [001115 22:26] wrote:
> On Wed, 15 Nov 2000, Alfred Perlstein wrote:
>
> > * Serge Canizares <serge@ephilosopher.com> [001115 08:23] wrote:
> > >
> > > Of course, if someone sees a reason that RAID 5 would be better than RAID 1+0,
> > > I'd appreciate an explanation!
> >
> > Cost. :)
> >
>
> A little bit more explanation: :-)
>
> RAID 1+0 gives you only half of the installed space but gives you (n/2)
> times the speed of the individual disks. For example, with 4 hard disks of
> 1 GB each, you get only 2 GB of space but double the speed of the
> individual disks.

The problem with mirroring is that it can saturate the bus much
earlier than RAID 5 because of the doubling of writes that must occur.

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

RE: Losing data from Postgres

From:
Thomas Sonne Olesen
Date:

IMHO the biggest problem with RAID 5 is the performance. Under
normal conditions RAID 0+1 has double the read performance (since each
mirror can be accessed independently), and in case of an error things get
really bad for RAID 5.

If a disc fails in a RAID 5, every read to the failed disc (or disc stripe)
requires a read from all the other discs to reconstruct the data, which
decreases the bandwidth to the disc system dramatically.
Since most systems do mostly reads, this is really serious.

On a RAID 0+1 (sometimes called RAID 10), only the disc that has lost its
mirror drops to "normal" single-disc bandwidth. All the others keep running
at double speed. And write performance actually increases, since the failed
mirror no longer has to be updated.

Another point is reliability. If a second disc fails in a RAID 5, the array
goes down. If we have, e.g., a 2 x 4 RAID 10 system with a single disc
failure, the chance that a second disc failure will cause a breakdown is
only 1/7, since 3/4 of the disc system is still redundant.
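
The 1/7 figure generalises like this (a small C sketch; the pair counts are
just examples):

    /* Odds that a second random disc failure takes down a RAID 10 array
     * of p mirrored pairs, given one disc is already dead: of the 2p-1
     * survivors, only the dead disc's mirror partner is fatal. */
    #include <stdio.h>

    int main(void)
    {
        for (int pairs = 2; pairs <= 5; pairs++)
            printf("%d pairs (%d discs): fatal second failure = 1 in %d\n",
                   pairs, 2 * pairs, 2 * pairs - 1);
        return 0;
    }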

A double failure does not happen often, but it has happened, when all the
discs were installed at the same time.

If I were to set up an important server I would go for RAID 10, mostly
because on a heavily loaded server the bandwidth decrease of RAID 5 after a
disc error is about the same as not being available at all.

/Thomas


Re: Losing data from Postgres

From:
Paul Breen
Date:
Bonjour Jean-Marc,

Yeah, we get the feeling that it may be a vacuum+index related problem -
not sure though.  As I said, we've gone back to vacuuming only twice a day
and the problem (we hope) has gone away.  It leaves us feeling uneasy
though; when we fix a problem we like to understand why!

Basically we are going to monitor it for the next few weeks, and if there
is no recurrence of the data loss, we will - grudgingly - consider it no
longer a problem.  I'd still like to know what the Postgres backend
messages in the log mean, especially the one about "xid table corrupted".

Anyway, thanks to everyone for their help & support; it is greatly
appreciated.  If we have any breakthroughs on this thorny subject we will
mail the list with our findings - cheers.

Paul M. Breen, Software Engineer - Computer Park Ltd.

Tel:   (01536) 417155
Email: pbreen@computerpark.co.uk
