Re: emergency outage requiring database restart

Поиск

Список

Период

Сортировка

От	Merlin Moncure
Тема	Re: emergency outage requiring database restart
Дата	1 ноября 2016 г. 18:05:18
Msg-id	CAHyXU0yjuFNeHv2NDhxTQmrpabNT4HG54mTZ=E9UkWbbumFzWQ@mail.gmail.com обсуждение исходный текст
Ответ на	Re: emergency outage requiring database restart (Tom Lane <tgl@sss.pgh.pa.us>)
Список	pgsql-hackers

Дерево обсуждения

On Tue, Nov 1, 2016 at 8:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa <os@ohmu.fi> wrote:
>>> Your production system's postgres backends probably have a lot more open
>>> files associated with them than the simple test case does.  Since Postgres
>>> likes to keep files open as long as possible and only closes them when you
>>> need to free up fds to open new files, it's possible that your production
>>> backends have almost all allowed fds used when you execute your pl/sh
>>> function.
>>>
>>> If that's the case, the sqsh process that's executed may not have enough fds
>>> to do what it wanted to do and because of busted error handling could end up
>>> writing to fds that were opened by Postgres and point to $PGDATA files.
>
>> Does that apply?  the mechanics are a sqsh function that basically does:
>> cat foo.sql  | sqsh <args>
>> pipe redirection opens a new process, right?
>
> Yeah, but I doubt that either level of the shell would attempt to close
> inherited file handles.
>
> The real problem with Oskari's theory is that it requires not merely
> busted, but positively brain-dead error handling in the shell and/or
> sqsh, ie ignoring open() failures altogether.  That seems kind of
> unlikely.  Still, I suspect he might be onto something --- there must
> be some reason you can reproduce the issue in production and not in
> your test bed, and number-of-open-files is as good a theory as I've
> heard.
>
> Maybe the issue is not with open() failures, but with the resulting
> FD numbers being much larger than sqsh is expecting.  It would be
> weird to try to store an FD in something narrower than int, but
> I could see a use of select() being unprepared for large FDs.
> Still, it's hard to translate that idea into scribbling on the
> wrong file...

Looking at the sqsh code, nothing really stands out.  It's highly
developed and all obvious errors are checked.  There certainly could
be a freak bug in there (or in libfreetds which sqsh links to) doing
the damage though.  In the meantime I'll continue to try and work a
reliable reproduction. This particular routine only gets called in
batches on a quarterly basis so things have settled down.

Just a thought; could COPY be tricked into writing into the wrong file
descriptor?  For example, if a file was killed with a rm -rf and the
fd pressured backend reopened the fd immediately?

merlin

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Tomas Vondra
Дата: 01 ноября 2016 г., 17:58:10
Сообщение: Re: auto_explain vs. parallel query

Следующее

От: Mithun Cy
Дата: 01 ноября 2016 г., 18:21:50
Сообщение: Re: Patch: Implement failover on libpq connect level.

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: emergency outage requiring database restart

Предыдущее

Следующее