Thread: IO Timeout

IO Timeout

From: Alex Turner
Date:
I have a question about IO timeouts:

We are using the 3ware escalade 9500S series of cards, and we had a
drive failure this morning.  Apparently the card waits 30 seconds for
the drive to respond, and if it doesn't, it puts the drive in a fail
state.  Postgres, it seems, didn't wait 30 seconds before it decided
that the system was upset, and put the database in maintenance mode.

Is there a way to increase the IO wait timeout so this doesn't happen?

Alex Turner
netEconomist

Re: IO Timeout

From: Tom Lane
Date: Thu, 10 Mar 2005 19:04:29 -0500
Alex Turner <armtuk@gmail.com> writes:
> I have a question about IO timeouts:
> We are using the 3ware escalade 9500S series of cards, and we had a
> drive failure this morning.  Apparently the card waits 30 seconds for
> the drive to respond, and if it doesn't, it puts the drive in a fail
> state.  Postgres, it seems, didn't wait 30 seconds before it decided
> that the system was upset, and put the database in maintenance mode.

> Is there a way to increase the IO wait timeout so this doesn't happen?

Postgres hasn't got any "IO timeouts".  Your concern would be better
directed to whatever kernel you're using; any sort of timeout on a disk
operation would be happening at the kernel level.

For that matter, Postgres hasn't got any concept of "putting the
database in maintenance mode", so you haven't described what happened
very accurately at all.

            regards, tom lane
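
Editor's note: the point that any disk timeout lives in the kernel, not in Postgres, can be checked directly on Linux. The following is a minimal sketch, assuming a Linux system that exposes the per-device SCSI command timeout (in seconds) at /sys/block/<dev>/device/timeout. The helper name scsi_command_timeout and its sysfs_root parameter are hypothetical, introduced only for illustration, and the thread does not say whether the 3w-9xxx driver honours this value.

```python
from pathlib import Path
from typing import Optional


def scsi_command_timeout(device: str, sysfs_root: str = "/sys") -> Optional[int]:
    """Return the kernel's SCSI command timeout for a block device, in seconds.

    Returns None when the attribute is missing or unreadable, for example
    when the device does not exist or is not handled by the SCSI layer.
    """
    # Assumed location of the attribute on Linux: /sys/block/<dev>/device/timeout
    path = Path(sysfs_root) / "block" / device / "device" / "timeout"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None
```

Calling scsi_command_timeout("sdc") and scsi_command_timeout("sdd") on the affected host would show what the kernel is configured to wait for those devices; nothing on the Postgres side corresponds to this setting.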

Re: IO Timeout

From: Alex Turner
Date:
Well - I am sort of trying to piece together exactly what happened.

Here's what I know.

Around 02:52 I got messages in my syslog stating that there were
problems writing to a controller channel:
Mar 10 02:52:29 tsunami kernel: 3w-9xxx: scsi1: WARNING: (0x06:0x002C): Unit #1: Command (0x28) timed out, resetting card.
Mar 10 02:52:29 tsunami kernel: 3w-9xxx: scsi1: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.
Mar 10 02:52:29 tsunami kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x005E): <NULL>:unit=0.
...
Mar 10 02:58:41 tsunami kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0107): Duplicate request ID:RequestID=23.
Mar 10 02:58:41 tsunami kernel: end_request: I/O error, dev sdd, sector 282528903
Mar 10 02:58:41 tsunami kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0107): Duplicate request ID:RequestID=23.
Mar 10 02:58:41 tsunami kernel: end_request: I/O error, dev sdd, sector 282528903
Mar 10 02:58:41 tsunami kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0107): Duplicate request ID:RequestID=23.
Mar 10 02:58:41 tsunami kernel: end_request: I/O error, dev sdc, sector 28837057


At around 07:30 all connections were failing giving the error:
InternalError: FATAL:  the database system is in recovery mode
(pygresql - similar error in PHP also)

I rebooted the server, and one of the discs came up as inaccessible
(it's part of a RAID 10), but other than that, everything restarted as
normal.

Nothing significant in /var/log/pg_log, which is where I have it
logging to (my log level is pretty low though).

Alex Turner
netEconomist


On Thu, 10 Mar 2005 19:04:29 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alex Turner <armtuk@gmail.com> writes:
> > I have a question about IO timeouts:
> > We are using the 3ware escalade 9500S series of cards, and we had a
> > drive failure this morning.  Apparently the card waits 30 seconds for
> > the drive to respond, and if it doesn't, it puts the drive in a fail
> > state.  Postgres, it seems, didn't wait 30 seconds before it decided
> > that the system was upset, and put the database in maintenance mode.
>
> > Is there a way to increase the IO wait timeout so this doesn't happen?
>
> Postgres hasn't got any "IO timeouts".  Your concern would be better
> directed to whatever kernel you're using; any sort of timeout on a disk
> operation would be happening at the kernel level.
>
> For that matter, Postgres hasn't got any concept of "putting the
> database in maintenance mode", so you haven't described what happened
> very accurately at all.
>
>                         regards, tom lane
>
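
Editor's note: the syslog excerpt in the message above names the failing devices and sectors, and pulling those out mechanically makes it easier to see which disks were involved. The sketch below is one possible way to do that. The regular expression is written against the `end_request: I/O error, dev ..., sector ...` lines exactly as they appear in this thread; the function names are hypothetical, and other kernel versions may format these messages differently.

```python
import re
from collections import Counter
from typing import List, Tuple

# Matches the kernel block-layer error lines quoted in the thread.
# \s+ is used between tokens so the pattern still matches when a mail
# client has wrapped a log line in the middle.
IO_ERROR = re.compile(
    r"end_request:\s+I/O\s+error,\s+dev\s+(\w+),\s+sector\s+(\d+)"
)

# Two entries copied from the thread, including the original line wrap.
SAMPLE = """\
Mar 10 02:58:41 tsunami kernel: end_request: I/O error, dev sdd,
sector 282528903
Mar 10 02:58:41 tsunami kernel: end_request: I/O error, dev sdc, sector 28837057
"""


def io_errors(log_text: str) -> List[Tuple[str, int]]:
    """Return (device, sector) pairs for every I/O error found, in log order."""
    return [(dev, int(sector)) for dev, sector in IO_ERROR.findall(log_text)]


def errors_per_device(log_text: str) -> Counter:
    """Count I/O errors per block device."""
    return Counter(dev for dev, _ in io_errors(log_text))
```

Run over the full syslog, errors_per_device would show whether the errors are confined to sdc and sdd, the two devices named in the excerpt.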

Re: IO Timeout

From: Tom Lane
Date: Thu, 10 Mar 2005 23:09:07 -0500
Alex Turner <armtuk@gmail.com> writes:
> Well - I am sort of trying to piece together exactly what happened.
> Here's what I know.

> Around 02:52 I get messages in my syslog stating that there were
> problems writing to a controller channel:
> [ various hardware errors snipped ]

> At around 07:30 all connections were failing giving the error:
> InternalError: FATAL:  the database system is in recovery mode

I think what happened here is that Postgres got a write error on WAL,
which would probably cause a PANIC, and then the ensuing database reboot
got hung up trying to re-read WAL.  Client connection requests would be
refused with messages like the above until the recovery process
completed.  The fact that this was still going on 4+ hours later shows
that Postgres is *not* timing out on stuck disk operations ... very much
the reverse in fact.

You'd be best off to take the matter up with some kernel hackers.
If there's anything to be done to improve the behavior, it's at
the kernel device driver level.

            regards, tom lane
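
Editor's note: since connection requests are refused with "the database system is in recovery mode" until recovery completes, a client can treat that message as a temporary condition and retry for a bounded time. The following is a minimal sketch, not tied to any particular driver: connect is a caller-supplied function that opens a connection, and connect_when_ready, its parameters, and the string match on the error text are all assumptions made for illustration. It does nothing about the underlying hardware fault; in the case described here recovery never finished, so the retries would simply run out and the error would be raised.

```python
import time

# Error text reported by the clients in this thread.
RECOVERY_MESSAGE = "the database system is in recovery mode"


def connect_when_ready(connect, attempts=5, delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call connect() until it succeeds or the attempts are used up.

    Only errors whose text contains RECOVERY_MESSAGE are retried; any other
    error, and the final recovery-mode error, is re-raised to the caller.
    The wait between attempts doubles each time, capped at max_delay seconds.
    """
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as exc:
            if RECOVERY_MESSAGE not in str(exc) or attempt == attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

Keeping the number of attempts bounded matters here: a server stuck in recovery for hours should surface as an alert, not as clients waiting indefinitely.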

Re: IO Timeout

From: Alex Turner
Date:
Thanks very much, Tom, for your input -

The guys at AMCC are suggesting that the firmware on the controller
card crashed, causing the card to basically stop IO operations. This
would explain why postgres could not recover and re-read WAL, because
/dev/sdc and sdd were inaccessible at that time.

I think this puzzle is mostly solved - all we need to do now is
figure out what the heck happened on the controller card!

Thanks,

Alex Turner


On Thu, 10 Mar 2005 23:09:07 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alex Turner <armtuk@gmail.com> writes:
> > Well - I am sort of trying to piece together exactly what happened.
> > Here's what I know.
>
> > Around 02:52 I get messages in my syslog stating that there were
> > problems writing to a controller channel:
> > [ various hardware errors snipped ]
>
> > At around 07:30 all connections were failing giving the error:
> > InternalError: FATAL:  the database system is in recovery mode
>
> I think what happened here is that Postgres got a write error on WAL,
> which would probably cause a PANIC, and then the ensuing database reboot
> got hung up trying to re-read WAL.  Client connection requests would be
> refused with messages like the above until the recovery process
> completed.  The fact that this was still going on 4+ hours later shows
> that Postgres is *not* timing out on stuck disk operations ... very much
> the reverse in fact.
>
> You'd be best off to take the matter up with some kernel hackers.
> If there's anything to be done to improve the behavior, it's at
> the kernel device driver level.
>
>                         regards, tom lane
>
>