Discussion: Data corruption after SAN snapshot


Data corruption after SAN snapshot

From: Terry Schmitt
Date:
Hi All,

I have a pretty strange issue that I'm looking for ideas on.
I'm using Postgres Plus Advanced Server 9.1, but I believe this problem is relevant to community Postgres as well. It could certainly be an EDB bug, and I am already working with them on that.

We are migrating 1TB+ from Oracle to PPAS. Our new environment consists of a primary server with two "read-only" clones. We use NetApp SAN storage, execute a NetApp consistent snapshot on the primary server, and then use FlexClone volumes for the read-only servers. The clones power up, do a short recovery, and all should be well. We have been using this method for two years, though with PPAS 8.4, physical servers, and ext4.
The new environment is RHEL 6.x guests running inside Red Hat Virtualization, using XFS and LVM.

The problem is that after the data load, we take a warm snapshot and the cloned databases are coming up corrupt.
A classic example is: ERROR:  could not read block 1 in file "base/18511/13872": read only 0 of 8192 bytes. Looking at the data file, it is 8k in size, so obviously we are missing block 1 from the file. So far I have identified indexes and sequences as corrupt, but I believe it could be any object.
Since the snapshot is essentially a crash, this system is not crash-resistant either.

Looking through the timeline of events, it is clear that data exists in RAM on the primary server, but is not being written out to the SAN for the snapshot and hence is missing when the clone starts up. My first thought is that fsync is not working. PPAS has fsync on and is using fdatasync.

I ran a rudimentary test using: dd if=/dev/zero of=dd_test2 bs=8k count=128k conv=fdatasync and crashed the server immediately after dd completed.
Everything behaves as expected: with fsync or fdatasync, the file exists after the crash and reboot. Leaving out the sync results in a missing file after the crash/reboot, but that is expected. This simple test shows that fdatasync is working, but does not prove it under load.
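The same write-then-fdatasync pattern can also be scripted outside of dd. This is a minimal Python sketch of my own (not from the thread), using far smaller sizes than the original bs=8k count=128k run:

```python
import os

# Write COUNT 8 KiB blocks of zeros, then force them to disk with
# fdatasync(2) before closing -- the same thing dd's conv=fdatasync does.
# Sizes are kept deliberately small here for illustration.
BLOCK = 8192
COUNT = 16

fd = os.open("dd_test2", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
try:
    for _ in range(COUNT):
        os.write(fd, b"\x00" * BLOCK)
    os.fdatasync(fd)  # should not return until the data reaches stable storage
finally:
    os.close(fd)

print(os.path.getsize("dd_test2"))  # 131072
```

As with the dd version, the interesting part is crashing the machine right after the script exits and checking whether the file survives intact.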

So, at this point, I don't know whether fdatasync is being issued but not honored by the OS stack, or whether PPAS is even issuing the sync at all.

Anyone have a solid method to test whether fdatasync is working correctly, or thoughts on troubleshooting this? It is extremely time-consuming to replicate the problem, and even then the corruption moves around, so it's hard to know immediately whether there is corruption at all. I'm hoping to use a tool set outside of Postgres to positively eliminate the OS stack.

Sorry for the lengthy post, but hopefully it's clear what is going on.

Thanks!
T

Re: Data corruption after SAN snapshot

From: Simon Riggs
Date:
On 7 August 2012 23:23, Terry Schmitt <tschmitt@schmittworks.com> wrote:

> I have a pretty strange issue that I'm looking for ideas on.
> I'm using Postgres Plus Advanced Server 9.1, but I believe this problem is
> relevant to Postgres Community. It is certainly possible to be a EDB bug and
> I am already working with them on this.

I think it's important to say that as soon as someone patches/forks
Postgres code, the onus is really on whoever did that to show the
problem wasn't created by them when they did so. That is very much
the reason why my company and many others don't market a forked
product and stick closely to the community-distributed binaries.

I have much respect for your vendor's ability to diagnose faults and
recreate them on stock Postgres if appropriate, so I think we should
wait for that to happen rather than attempt to answer this just yet.
Alternatively, please submit a test case that fails on Postgres if
you doubt that.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Data corruption after SAN snapshot

From: Terry Schmitt
Date:
Simon,
While I agree with your reply in general, and am working that angle and more, I'm hoping to add to my personal tool kit and gain more insight into methods to test fsync and prove beyond doubt that it is functioning properly on any given system, no matter what type of database I'm running.

T



Re: Data corruption after SAN snapshot

From: Craig Ringer
Date:
On 08/08/2012 06:23 AM, Terry Schmitt wrote:

> Anyone have a solid method to test if fdatasync is working correctly or
> thoughts on troubleshooting this?

Try diskchecker.pl

https://gist.github.com/3177656

The other obvious step is that you've changed three things, so start
isolation testing.

- Test Postgres Plus Advanced Server 8.4, which you knew worked, on your
new file system and OS.

- Test PP9.1 on your new OS but with ext3, which you knew worked

- Test PP9.1 on your new OS but with ext4, which should work if ext3 did

- Test PP9.1 on a copy of your *old* OS with the old file system setup.

- Test mainline PostgreSQL 9.1 on your new setup to see if it's PP specific.

Since each test sounds moderately time-consuming, you'll probably need
to find a way to automate them. I'd first see if I could reproduce the
problem by running pgbench against the same setup that's currently
failing; if that reproduces the fault, you can use pgbench with the
other tests.

--
Craig Ringer


Re: Data corruption after SAN snapshot

From: Stephen Frost
Date:
Terry,

* Terry Schmitt (tschmitt@schmittworks.com) wrote:
> The new environment is RHEL 6.x guests running inside Redhat Virtualization
> using XFS and LVM.

That's quite a shift, yet you left out any details on this piece.

How is the VM connected to the NetApp LUN?  What options have
been chosen with regard to caching?  Have you tried your 8.4 "known good"
software stack under this virtualized environment?

    Thanks,

        Stephen


Re: Data corruption after SAN snapshot

From: Terry Schmitt
Date:
Thanks Craig.

"# Brad's el-ghetto do-our-storage-stacks-lie?-script" I like it already :)

I may play around with that; it looks interesting. For everyone else, here's a post describing the use of diskchecker: http://brad.livejournal.com/2116715.html
I experimented with sysbench today, which was somewhat enlightening: it clearly shows the impact that fsync/fdatasync has on file system performance. It's pretty obvious that fsync is writing out to disk, simply based on the throughput of each test.
Using pgbench is a good idea, as I can throw a high transaction rate at the database and take a snapshot during the test. So far, executing pg_dumpall seems to be fairly reliable for finding corrupt objects after my initial data load, but unfortunately much of the corruption has been in indexes, which pg_dump will not expose.
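Since the block-read errors have a fixed shape, that pg_dumpall sweep is easy to automate. Here is a rough helper of my own (not an existing tool) that pulls the damaged file names out of captured stderr:

```python
import re

# Match the error quoted earlier in the thread:
#   ERROR:  could not read block 1 in file "base/18511/13872": ...
ERR = re.compile(r'could not read block (\d+) in file "([^"]+)"')

def corrupt_files(stderr_text):
    """Return sorted, de-duplicated (file, block) pairs seen in stderr."""
    return sorted({(m.group(2), int(m.group(1))) for m in ERR.finditer(stderr_text)})

sample = ('ERROR:  could not read block 1 in file "base/18511/13872": '
          'read only 0 of 8192 bytes')
print(corrupt_files(sample))  # [('base/18511/13872', 1)]
```

Feeding it `pg_dumpall 2>&1 >/dev/null` output after each snapshot/clone cycle would give a quick per-run corruption report.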

Thanks for the input,
T



Re: Data corruption after SAN snapshot

From: Stephen Frost
Date:
Terry,

* Terry Schmitt (tschmitt@schmittworks.com) wrote:
> So far, executing pg_dumpall
> seems to be fairly reliable for finding the corrupt objects after my
> initial data load, but unfortunately much of the corruption has been with
> indexes which pgdump will not expose.

Shouldn't be too hard to write a script that'll do a query against each
table using an ORDER BY that matches each index, at least for 'simple'
indexes, which'll typically cause an in-order index traversal.

Just a thought.
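A sketch of that idea (table and column names here are hypothetical; in practice the pairs would be pulled from pg_index/pg_attribute): generate one ordered scan per simple index and run each through psql.

```python
# Given (table, [index columns]) pairs -- in practice queried from
# pg_index/pg_attribute -- build statements whose ORDER BY matches each
# simple btree index, which will typically cause an in-order traversal.
def index_check_queries(indexes):
    """indexes: iterable of (table_name, [column, ...]) tuples."""
    for table, cols in indexes:
        order = ", ".join(cols)
        yield f"SELECT {order} FROM {table} ORDER BY {order};"

# Hypothetical example indexes:
for q in index_check_queries([("accounts", ["aid"]), ("history", ["tid", "mtime"])]):
    print(q)
```

Piping the generated statements to `psql ... >/dev/null` and watching for read errors would cover the indexes that pg_dumpall misses.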

    Thanks,

        Stephen


Re: Data corruption after SAN snapshot

From: Craig Ringer
Date:
On 08/08/2012 09:39 AM, Stephen Frost wrote:
> Terry,
>
> * Terry Schmitt (tschmitt@schmittworks.com) wrote:
>> So far, executing pg_dumpall
>> seems to be fairly reliable for finding the corrupt objects after my
>> initial data load, but unfortunately much of the corruption has been with
>> indexes which pgdump will not expose.
> Shouldn't be too hard to write a script that'll do a query against each
> table using an ORDER BY that matches each index, at least for 'simple'
> indexes, which'll typically cause an in-order index traversal.

I'd really like a "VERIFY" command for PostgreSQL, though a proper one isn't really possible without block checksums.

I'm currently working on a virtual plug pull tool that uses VMs to simulate abrupt crashes of the machine PostgreSQL is running on. One of the bigger challenges is that Pg doesn't offer any reliable way to detect even simple corruption.

Maybe a pg_read_relation(oid) that simply reads all blocks in an index or table would help. It could live in the `adminpack' module ( http://www.postgresql.org/docs/9.1/static/adminpack.html) or `pageinspect' module ( http://www.postgresql.org/docs/9.1/static/pageinspect.html).
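As a stopgap outside the server, a crude page-header sanity sweep over a relation's data file is possible. This is a sketch of mine, not an existing tool; it assumes the default 8 KiB BLCKSZ, a little-endian host, and only checks that each page's pd_lower/pd_upper look plausible:

```python
import struct

# Assumptions: default 8 KiB BLCKSZ, little-endian host, and the standard
# page layout where pd_lower/pd_upper sit at byte offsets 12 and 14.
BLKSZ = 8192
HDR = 24                  # SizeOfPageHeaderData for this layout
ZERO = b"\x00" * BLKSZ    # brand-new pages are legitimately all zeros

def check_relation_file(path):
    """Yield (block_no, problem) for pages whose header looks implausible."""
    with open(path, "rb") as f:
        blkno = 0
        while True:
            page = f.read(BLKSZ)
            if not page:
                break
            if len(page) != BLKSZ:
                yield blkno, "short page (%d bytes)" % len(page)
                break
            if page != ZERO:
                pd_lower, pd_upper = struct.unpack_from("<HH", page, 12)
                if not (HDR <= pd_lower <= pd_upper <= BLKSZ):
                    yield blkno, "bad pd_lower/pd_upper (%d/%d)" % (pd_lower, pd_upper)
            blkno += 1
```

It proves nothing about logical contents, but it runs without the server and catches torn or short pages of the kind a failed sync would leave behind.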


It turns out I can use the pageinspect functions to do a rough kind of verify, but it's pretty slow and inconvenient, e.g.:

WITH pages(page) AS (
    SELECT get_raw_page('tablename'::text, pageno)
    FROM generate_series(0, (SELECT relpages FROM pg_class WHERE relname = 'tablename')-1) AS pageno
)
SELECT page_header(page), heap_page_items(page) FROM pages;

takes 90 ms when a 'SELECT * FROM tablename' takes 6.2 ms. On a bigger table, the query takes 3939.912 ms vs 125.135 ms for a table scan.

Of course, pageinspect is mostly interesting for indexes, where I'd do:

create or replace function scan_index(indexname text) returns setof record as $$
SELECT page_header(get_raw_page($1, 0));
WITH pagenumbers(pageno) AS (
    SELECT generate_series(1, (SELECT relpages FROM pg_class WHERE relname = $1)-1)
)
SELECT bt_page_items($1, pageno) FROM pagenumbers;
$$ language sql volatile;

SELECT scan_index('some_idx');

... but that's getting really quite slow and still hasn't touched the free space map or visibility map.

Of course, these checks prove nothing about subtle corruption or incorrect contents, they only make sure Pg can read them and they look vaguely sane. It doesn't do any kind of consistency checking between index and table.

--
Craig Ringer