Discussion: [GENERAL] PostgreSQL corruption

[GENERAL] PostgreSQL corruption

From
James Sewell
Date:
Hello All,

I am working with a client who is facing issues with database corruption after a physical hard power off (the machines are at remote sites, this could be a power outage or user error).

They have an environment made up of many of the following consumer grade stand alone machines:
  • Windows 7 SP1
  • PostgreSQL 9.2.4
  • Integrated Raid Controller
    • Managed by Intel Rapid Storage Technology
    • RAID 1 over two disks
    • Disk caching disabled
    • Not battery backed
    • Disk cache disabled
  • 2x Seagate SATA disk drives (st500lm021-1kj152)
PostgreSQL is configured as follows:
  • fsync=on
  • full_page_writes=on
  • wal_sync_method=fsync_writethrough
Windows is configured as follows:
  • Disk caching disabled for the RAID1 set

They have currently proven that the corruption is repeatable in a testbed with and without OS/RAID controller caching - but I am working with them to make this process a little more detailed. 

The new process will be:
  1. Power on machine
  2. If PostgreSQL doesn't start, archive $PGDATA and run initdb
  3. Perform a pg_dumpall to test for corruption
  4. If pg_dumpall fails, archive $PGDATA and run initdb
  5. Start the test suite (which mimics high load from their application), which INSERTs and DELETEs records both inside and outside of transactions
  6. After 15 minutes cut power and repeat process
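
As a rough illustration of steps 1 to 4, the power-on check could be scripted along these lines. This is only a sketch: `check_after_power_on`, `run_ok`, and the `archive` callback are hypothetical names, and it assumes `pg_ctl`, `pg_dumpall`, and `initdb` are on the PATH with connection settings supplied through the environment.

```python
import os
import subprocess


def run_ok(cmd):
    """Run a command and report whether it exited with status 0."""
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def check_after_power_on(pgdata, archive, run=run_ok):
    """Steps 1-4: classify the cluster as 'clean', 'start_failed', or 'dump_failed'.

    `archive` is a hypothetical callback that must move `pgdata` aside,
    because initdb refuses to run in a non-empty directory.
    """
    start = ["pg_ctl", "-D", pgdata, "-w", "start"]
    if not run(start):
        outcome = "start_failed"
    elif not run(["pg_dumpall", "-f", os.devnull]):
        # The server is up but the dump failed: stop it before archiving.
        run(["pg_ctl", "-D", pgdata, "-w", "-m", "fast", "stop"])
        outcome = "dump_failed"
    else:
        return "clean"
    # Keep the damaged cluster for analysis, then start from a fresh one.
    archive(pgdata)
    run(["initdb", "-D", pgdata])
    run(start)
    return outcome
```

Logging the returned outcome per power cycle would let the two failure modes (won't start vs. starts but can't be dumped) be counted separately.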
We are hoping to get about 20 machines in this testbed, giving us around 1500 power cycles per day.
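
As a sanity check on that figure, the testbed throughput can be estimated from the cycle length. Only the 20 machines and the 15 minutes of load come from the plan; the 4 minutes of per-cycle overhead (boot, pg_dumpall, any archive/initdb step) is an assumption for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def power_cycles_per_day(machines, load_minutes, overhead_minutes):
    """Estimate the total number of power cycles per day across the testbed."""
    cycle_minutes = load_minutes + overhead_minutes
    return machines * MINUTES_PER_DAY / cycle_minutes


# 20 machines, 15 minutes of load, assumed ~4 minutes of overhead per cycle
estimate = power_cycles_per_day(machines=20, load_minutes=15, overhead_minutes=4)
print(round(estimate))
```

With no overhead at all the ceiling would be 1920 cycles per day, so "around 1500" implies roughly four minutes of boot and checking per cycle.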

Test scenarios which have been floated so far:
  • As described above, all caching off
  • As described above, all caching off, 9.2 stable
  • As described above, all caching off, 9.5 stable with checksums
Can anyone think of anything else we should be considering / testing / factoring in? 
 
Cheers,

James Sewell,
PostgreSQL Team Lead / Solutions Architect 

 

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009


The contents of this email are confidential and may be subject to legal or professional privilege and copyright. No representation is made that this email is free of viruses or other defects. If you have received this communication in error, you may not copy or distribute any part of it or otherwise disclose its contents to anyone. Please advise the sender of your incorrect receipt of this correspondence.

Re: [GENERAL] PostgreSQL corruption

From
Scott Marlowe
Date:
On Mon, Feb 13, 2017 at 9:21 PM, James Sewell <james.sewell@jirotech.com> wrote:
>
> Hello All,
>
> I am working with a client who is facing issues with database corruption after a physical hard power off (the machines are at remote sites, this could be a power outage or user error).
>
> They have an environment made up of many of the following consumer grade stand alone machines:
>
> Windows 7 SP1
> PostgreSQL 9.2.4
> Integrated Raid Controller
>
> Managed by Intel Rapid Storage Technology
> RAID 1 over two disks
> Disk caching disabled
> Not battery backed
> Disk cache disabled

Some part of your OS or hardware is lying to postgres about fsyncs.
There are a few test suites out there that can test this independent
of postgresql btw, but it's been many years since I cranked one up.
Here's a web page from 2005 describing the problem and using a fsync
tester written in perl.

Try to see if you can get the same types of fsync errors out of your
hardware. If you can, stop, figure how to fix that, and then get back
in the game etc. Til then try not to lose power under load.
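
A very small probe in the same spirit, independent of PostgreSQL, is to time how many write-plus-flush cycles per second the storage claims to complete. The sketch below is an illustration only: `measure_fsync_rate` is a hypothetical helper, it is not one of the test suites mentioned above, and it does not replicate PostgreSQL's WAL write path. On Windows, Python's `os.fsync` goes through the C runtime's `_commit()`.

```python
import os
import time


def measure_fsync_rate(path, iterations=200, block_size=8192):
    """Rewrite one block at offset 0, flushing after every write; return ops/sec."""
    block = b"\0" * block_size
    # O_BINARY only exists on Windows; fall back to 0 elsewhere.
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags)
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.lseek(fd, 0, os.SEEK_SET)
            os.write(fd, block)
            os.fsync(fd)  # ask the OS to push the write to stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return iterations / max(elapsed, 1e-9)
```

A rate far above what the disks could physically sustain suggests that some layer is acknowledging flushes from a cache.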


Re: [GENERAL] PostgreSQL corruption

From
Scott Marlowe
Date:
On Mon, Feb 13, 2017 at 9:41 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Mon, Feb 13, 2017 at 9:21 PM, James Sewell <james.sewell@jirotech.com> wrote:
>>
>> Hello All,
>>
>> I am working with a client who is facing issues with database corruption after a physical hard power off (the machines are at remote sites, this could be a power outage or user error).
>>
>> They have an environment made up of many of the following consumer grade stand alone machines:
>>
>> Windows 7 SP1
>> PostgreSQL 9.2.4
>> Integrated Raid Controller
>>
>> Managed by Intel Rapid Storage Technology
>> RAID 1 over two disks
>> Disk caching disabled
>> Not battery backed
>> Disk cache disabled
>
> Some part of your OS or hardware is lying to postgres about fsyncs.
> There are a few test suites out there that can test this independent
> of postgresql btw, but it's been many years since I cranked one up.
> Here's a web page from 2005 describing the problem and using a fsync
> tester written in perl.
>
> Try to see if you can get the same types of fsync errors out of your
> hardware. If you can, stop, figure how to fix that, and then get back
> in the game etc. Til then try not to lose power under load.

http://brad.livejournal.com/2116715.html


Re: [GENERAL] PostgreSQL corruption

From
Magnus Hagander
Date:
On Tue, Feb 14, 2017 at 5:21 AM, James Sewell <james.sewell@jirotech.com> wrote:
Hello All,

I am working with a client who is facing issues with database corruption after a physical hard power off (the machines are at remote sites, this could be a power outage or user error).

They have an environment made up of many of the following consumer grade stand alone machines:
  • Windows 7 SP1
  • PostgreSQL 9.2.4

If you're using 9.2.4, you are missing about 4 years' worth of bug fixes. While what you're talking about sounds like other issues, you should really upgrade to something that doesn't have loads of known bugs and then re-run the tests.

--

Re: [GENERAL] PostgreSQL corruption

From
James Sewell
Date:
That's the plan, but it's essentially a client managed embedded database so small steps needed. If I can prove it's the hardware first that would be preferable. 

It looks like diskcheck.pl doesn't work on Windows (no IO::Handle::sync) - does anybody know of an alternative test kit? A C one would be best, I suppose, as it could exactly mimic PostgreSQL.
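
For reference, the kind of check I'm after can be sketched in a few lines: write numbered records, flush each one, report it as durable only after the flush returns, and after the power cut verify that every reported record is really on disk. This is a hedged illustration, not a port of the Perl tool; `write_records`, `missing_acknowledged`, and the `acknowledge` callback are hypothetical names, and in a real test `acknowledge` would have to send the number to another machine so it survives the power cut.

```python
import os
import struct

RECORD = struct.Struct("<Q")  # one 8-byte little-endian sequence number per record


def write_records(path, count, acknowledge):
    """Append sequence numbers one at a time, flush each, then report it as durable."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags)
    try:
        for seq in range(count):
            os.write(fd, RECORD.pack(seq))
            os.fsync(fd)
            acknowledge(seq)  # only reached after the flush has returned
    finally:
        os.close(fd)


def missing_acknowledged(path, last_acknowledged):
    """After a restart, list acknowledged sequence numbers that are absent from the file."""
    with open(path, "rb") as f:
        data = f.read()
    usable = len(data) - len(data) % RECORD.size  # ignore a torn trailing record
    present = {RECORD.unpack_from(data, off)[0] for off in range(0, usable, RECORD.size)}
    return [seq for seq in range(last_acknowledged + 1) if seq not in present]
```

Any non-empty result from `missing_acknowledged` means a flush was acknowledged for data that never reached the platters.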

Cheers,

James Sewell,
PostgreSQL Team Lead / Solutions Architect 

 

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009

On Wed, Feb 15, 2017 at 4:10 AM, Magnus Hagander <magnus@hagander.net> wrote:
On Tue, Feb 14, 2017 at 5:21 AM, James Sewell <james.sewell@jirotech.com> wrote:
Hello All,

I am working with a client who is facing issues with database corruption after a physical hard power off (the machines are at remote sites, this could be a power outage or user error).

They have an environment made up of many of the following consumer grade stand alone machines:
  • Windows 7 SP1
  • PostgreSQL 9.2.4

If you're using 9.2.4, you are missing about 4 years' worth of bug fixes. While what you're talking about sounds like other issues, you should really upgrade to something that doesn't have loads of known bugs and then re-run the tests.

--




Re: [GENERAL] PostgreSQL corruption

From
James Sewell
Date:
OK,

So with some help from the IRC channel (thanks macdice and JanniCash)  it's come to light that my RAID1 comprised of 2 * 7200RPM disks is reporting ~500 ops/sec in pg_test_fsync.

This is higher than the ~120 ops/sec which you would expect from 7200RPM disks - therefore something is lying.
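
The ~120 figure is just rotation speed: a 7200 RPM platter turns 120 times per second, and a single stream of flushed writes to the same spot can complete at most about once per revolution. A minimal sketch of that check follows; the function names and the 1.5x tolerance are assumptions for illustration.

```python
def max_synced_writes_per_sec(rpm):
    """Rough ceiling for serial flushed writes on a spinning disk: one per revolution."""
    return rpm / 60.0


def looks_like_write_caching(measured_ops_per_sec, rpm, tolerance=1.5):
    """Flag a measured flush rate that is well above the rotational ceiling."""
    return measured_ops_per_sec > tolerance * max_synced_writes_per_sec(rpm)


ceiling = max_synced_writes_per_sec(7200)           # 120.0
raid_suspicious = looks_like_write_caching(500, 7200)  # the RAID1 result
jbod_suspicious = looks_like_write_caching(50, 7200)   # the JBOD result
```

Here the RAID1 measurement is flagged and the JBOD measurement is not.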

Breaking up the RAID and re-imaging with JBOD dropped this to 50 ops/sec - another question but still looking like a real result.

So in this case it looks like the RAID controller wasn't disabling caching as advertised.

Cheers,



James Sewell,
PostgreSQL Team Lead / Solutions Architect 

 

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009

On Wed, Feb 15, 2017 at 10:29 AM, James Sewell <james.sewell@jirotech.com> wrote:
That's the plan, but it's essentially a client managed embedded database so small steps needed. If I can prove it's the hardware first that would be preferable. 

It looks like diskcheck.pl doesn't work on Windows (no IO::Handle::sync) - does anybody know of an alternative test kit? A C one would be best, I suppose, as it could exactly mimic PostgreSQL.

Cheers,

James Sewell,
PostgreSQL Team Lead / Solutions Architect 

 

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009

On Wed, Feb 15, 2017 at 4:10 AM, Magnus Hagander <magnus@hagander.net> wrote:
On Tue, Feb 14, 2017 at 5:21 AM, James Sewell <james.sewell@jirotech.com> wrote:
Hello All,

I am working with a client who is facing issues with database corruption after a physical hard power off (the machines are at remote sites, this could be a power outage or user error).

They have an environment made up of many of the following consumer grade stand alone machines:
  • Windows 7 SP1
  • PostgreSQL 9.2.4

If you're using 9.2.4, you are missing about 4 years' worth of bug fixes. While what you're talking about sounds like other issues, you should really upgrade to something that doesn't have loads of known bugs and then re-run the tests.

--





Re: [GENERAL] PostgreSQL corruption

From
Merlin Moncure
Date:
On Tue, Feb 14, 2017 at 7:23 PM, James Sewell <james.sewell@jirotech.com> wrote:
OK,

So with some help from the IRC channel (thanks macdice and JanniCash)  it's come to light that my RAID1 comprised of 2 * 7200RPM disks is reporting ~500 ops/sec in pg_test_fsync.

This is higher than the ~120 ops/sec which you would expect from 7200RPM disks - therefore something is lying.

Breaking up the RAID and re-imaging with JBOD dropped this to 50 ops/sec - another question but still looking like a real result.

So in this case it looks like the RAID controller wasn't disabling caching as advertised.


yup -- that's the thing.  Performance numbers really tell the whole story (or at least most of it).  If it's too good to be true, it is.  These days, honestly I'd just throw out the RAID controller and install some Intel SSD drives.

merlin

Re: [GENERAL] PostgreSQL corruption

From
James Sewell
Date:
Sadly this is for a customer who has 3000 of these in the field, and the RAID controller is on the motherboard.

At least they know where to point the finger now!

Cheers,

James Sewell,
PostgreSQL Team Lead / Solutions Architect 

 

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009

On Fri, Feb 17, 2017 at 1:25 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Tue, Feb 14, 2017 at 7:23 PM, James Sewell <james.sewell@jirotech.com> wrote:
OK,

So with some help from the IRC channel (thanks macdice and JanniCash)  it's come to light that my RAID1 comprised of 2 * 7200RPM disks is reporting ~500 ops/sec in pg_test_fsync.

This is higher than the ~120 ops/sec which you would expect from 7200RPM disks - therefore something is lying.

Breaking up the RAID and re-imaging with JBOD dropped this to 50 ops/sec - another question but still looking like a real result.

So in this case it looks like the RAID controller wasn't disabling caching as advertised.


yup -- that's the thing.  Performance numbers really tell the whole story (or at least most of it).  If it's too good to be true, it is.  These days, honestly I'd just throw out the RAID controller and install some Intel SSD drives.

merlin




Re: [GENERAL] PostgreSQL corruption

From
John R Pierce
Date:
On 2/16/2017 6:48 PM, James Sewell wrote:
> Sadly this is for a customer who has 3000 of these in the field, the
> raid controller is on the motherboard.

If that's the usual Intel "Matrix" RAID, that's not actually a RAID
controller.  It's Intel SATA in fake RAID mode; the RAID is done entirely
in host software.



--
john r pierce, recycling bits in santa cruz