Thread: Stuck LSI 9650SE-12 RAID Controller

Stuck LSI 9650SE-12 RAID Controller

From
Craig James
Date:
Has anyone seen anything like this?

Our LSI 9650SE-12 RAID Controller dropped the main Postgres disk offline ... it just disappeared as though the disk wasn't there.  It was an 8-disk RAID10 unit. The other unit (RAID1 for Linux & pg_xlog) was still functional.

Using tw_cli, it showed the array as "DEGRADED" and claimed to be verifying it. One disk in the array was "DEGRADED". There was no /dev entry for the device; Linux couldn't see it at all.
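The check described above (spotting a unit that is no longer "OK" in the controller's report) could be scripted roughly as follows. This is only a sketch: the sample text is made up for illustration and is not output captured from the controller in this thread, the column layout is an assumption about what a `tw_cli` unit table looks like, and the function name `unhealthy_units` is hypothetical.

```python
import re

# Hypothetical sample, loosely modeled on a tw_cli unit table.
# It is NOT real output from the controller discussed in this thread.
SAMPLE = """\
Unit  UnitType  Status     %RCmpl  %V/I/M  Stripe  Size(GB)
-----------------------------------------------------------
u0    RAID-1    OK         -       -       -       149.001
u1    RAID-10   DEGRADED   -       12%     64K     1862.61
"""

# Assumption: unit rows start with "u<N>", followed by type and status columns.
UNIT_ROW = re.compile(r"^(u\d+)\s+(\S+)\s+(\S+)")


def unhealthy_units(report):
    """Return (unit, type, status) for every unit row whose status is not OK."""
    bad = []
    for line in report.splitlines():
        match = UNIT_ROW.match(line.strip())
        if match and match.group(3) != "OK":
            bad.append(match.groups())
    return bad


if __name__ == "__main__":
    for unit, unit_type, status in unhealthy_units(SAMPLE):
        print(f"{unit} ({unit_type}) is {status}")
```

In practice the report text would come from running the CLI (for example `tw_cli /c0 show`, where the controller ID `/c0` is an assumption) from a cron job, so that a degraded unit is noticed before the device drops out of /dev.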

There were two hot spares, but it didn't use them. Worse, there was nothing I could do to make it do anything. Every command reported "Failed" and no further explanation. Booting into the RAID BIOS gave the same problem: if I selected "rebuild" or "verify", it said "You must select an array..." even though I had selected the array. It was as though the array didn't exist, yet it was shown.

I shut off the computer, unplugged the BBU from the RAID card and plugged it back in, unplugged and reinserted all the SATA cables, and then restarted. Exact same symptoms.

I finally gave up trying to recover the database (we had a backup server). The RAID controller let me delete and recreate the degraded array, and now everything seems fine. I can rebuild the Postgres database on the new unit. But I've lost a HUGE amount of trust in the LSI 9650-SE RAID controller card.

Thanks,
Craig

Re: Stuck LSI 9650SE-12 RAID Controller

From
jayknowsunix@gmail.com
Date:
Certainly doesn't sound like a PostgreSQL issue. Is there any sort of advanced diagnostics for the RAID controller? I certainly would want to thrash it top to bottom before I trusted it enough to put it back in service.
--
Jay

Sent from my iPad

> On Aug 5, 2014, at 12:00 PM, Craig James <cjames@emolecules.com> wrote:
>
> Has anyone seen anything like this?
>
> Our LSI 9650SE-12 RAID Controller dropped the main Postgres disk offline ... it just disappeared as though the disk wasn't there.  It was an 8-disk RAID10 unit. The other unit (RAID1 for Linux & pg_xlog) was still functional.
>
> Using tw_cli, it showed the array as "DEGRADED" and claimed to be verifying it. One disk in the array was "DEGRADED". There was no /dev entry for the device; Linux couldn't see it at all.
>
> There were two hot spares, but it didn't use them. Worse, there was nothing I could do to make it do anything. Every command reported "Failed" and no further explanation. Booting into the RAID BIOS gave the same problem: if I selected "rebuild" or "verify", it said "You must select an array..." even though I had selected the array. It was as though the array didn't exist, yet it was shown.
>
> I shut off the computer, unplugged the BBU from the RAID card and plugged it back in, unplugged and reinserted all the SATA cables, and then restarted. Exact same symptoms.
>
> I finally gave up trying to recover the database (we had a backup server). The RAID controller let me delete and recreate the degraded array, and now everything seems fine. I can rebuild the Postgres database on the new unit. But I've lost a HUGE amount of trust in the LSI 9650-SE RAID controller card.
>
> Thanks,
> Craig
>


Re: Stuck LSI 9650SE-12 RAID Controller

From
Scott Whitney
Date:
Unfortunately, yes, I have seen similar situations on Adaptec, IBM ServeRAID and PERC cards.

I would replace that card, personally, with a new one. Likely the card itself is going flaky.

Usually when I have seen this, swapping the card for a like card and importing the RAID config from the drives resolves it, unless the card went REAL bad and actually damaged the RAID itself (which I have also seen).





--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: Stuck LSI 9650SE-12 RAID Controller

From
Craig James
Date:
On Tue, Aug 5, 2014 at 9:00 AM, Craig James <cjames@emolecules.com> wrote:
> Has anyone seen anything like this?
>
> Our LSI 9650SE-12 RAID Controller dropped the main Postgres disk offline ... it just disappeared as though the disk wasn't there.  It was an 8-disk RAID10 unit. The other unit (RAID1 for Linux & pg_xlog) was still functional.
>
> Using tw_cli, it showed the array as "DEGRADED" and claimed to be verifying it. One disk in the array was "DEGRADED". There was no /dev entry for the device; Linux couldn't see it at all.

Aha. I found this. Check out the first item in the "bugs" section: "RAID-10 arrays going Inoperable/Verifying Mode (SCR-2278)".

A lesson ... keep a device's firmware up to date.

Craig