Re: emergency outage requiring database restart

Поиск
Список
Период
Сортировка
От Merlin Moncure
Тема Re: emergency outage requiring database restart
Дата
Msg-id CAHyXU0yjuFNeHv2NDhxTQmrpabNT4HG54mTZ=E9UkWbbumFzWQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: emergency outage requiring database restart  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
On Tue, Nov 1, 2016 at 8:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa <os@ohmu.fi> wrote:
>>> Your production system's postgres backends probably have a lot more open
>>> files associated with them than the simple test case does.  Since Postgres
>>> likes to keep files open as long as possible and only closes them when you
>>> need to free up fds to open new files, it's possible that your production
>>> backends have almost all allowed fds used when you execute your pl/sh
>>> function.
>>>
>>> If that's the case, the sqsh process that's executed may not have enough fds
>>> to do what it wanted to do and because of busted error handling could end up
>>> writing to fds that were opened by Postgres and point to $PGDATA files.
>
>> Does that apply?  the mechanics are a sqsh function that basically does:
>> cat foo.sql  | sqsh <args>
>> pipe redirection opens a new process, right?
>
> Yeah, but I doubt that either level of the shell would attempt to close
> inherited file handles.
>
> The real problem with Oskari's theory is that it requires not merely
> busted, but positively brain-dead error handling in the shell and/or
> sqsh, ie ignoring open() failures altogether.  That seems kind of
> unlikely.  Still, I suspect he might be onto something --- there must
> be some reason you can reproduce the issue in production and not in
> your test bed, and number-of-open-files is as good a theory as I've
> heard.
>
> Maybe the issue is not with open() failures, but with the resulting
> FD numbers being much larger than sqsh is expecting.  It would be
> weird to try to store an FD in something narrower than int, but
> I could see a use of select() being unprepared for large FDs.
> Still, it's hard to translate that idea into scribbling on the
> wrong file...

Looking at the sqsh code, nothing really stands out.  It's highly
developed and all obvious errors are checked.  There certainly could
be a freak bug in there (or in libfreetds which sqsh links to) doing
the damage though.  In the meantime I'll continue to try and work a
reliable reproduction. This particular routine only gets called in
batches on a quarterly basis so things have settled down.

Just a thought; could COPY be tricked into writing into the wrong file
descriptor?  For example, if a file was killed with a rm -rf and the
fd pressured backend reopened the fd immediately?

merlin



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tomas Vondra
Дата:
Сообщение: Re: auto_explain vs. parallel query
Следующее
От: Mithun Cy
Дата:
Сообщение: Re: Patch: Implement failover on libpq connect level.