On Tue, Nov 1, 2016 at 8:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa <os@ohmu.fi> wrote:
>>> Your production system's postgres backends probably have a lot more open
>>> files associated with them than the simple test case does. Since Postgres
>>> likes to keep files open as long as possible and only closes them when you
>>> need to free up fds to open new files, it's possible that your production
>>> backends have almost all allowed fds used when you execute your pl/sh
>>> function.
>>>
>>> If that's the case, the sqsh process that's executed may not have enough fds
>>> to do what it wanted to do and because of busted error handling could end up
>>> writing to fds that were opened by Postgres and point to $PGDATA files.
>
>> Does that apply? the mechanics are a sqsh function that basically does:
>> cat foo.sql | sqsh <args>
>> pipe redirection opens a new process, right?
>
> Yeah, but I doubt that either level of the shell would attempt to close
> inherited file handles.
>
> The real problem with Oskari's theory is that it requires not merely
> busted, but positively brain-dead error handling in the shell and/or
> sqsh, ie ignoring open() failures altogether. That seems kind of
> unlikely. Still, I suspect he might be onto something --- there must
> be some reason you can reproduce the issue in production and not in
> your test bed, and number-of-open-files is as good a theory as I've
> heard.
>
> Maybe the issue is not with open() failures, but with the resulting
> FD numbers being much larger than sqsh is expecting. It would be
> weird to try to store an FD in something narrower than int, but
> I could see a use of select() being unprepared for large FDs.
> Still, it's hard to translate that idea into scribbling on the
> wrong file...
Looking at the sqsh code, nothing really stands out. It's highly
developed and all obvious errors are checked. There certainly could
be a freak bug in there (or in libfreetds which sqsh links to) doing
the damage though. In the meantime I'll continue to try and work a
reliable reproduction. This particular routine only gets called in
batches on a quarterly basis so things have settled down.
Just a thought; could COPY be tricked into writing into the wrong file
descriptor? For example, if a file was killed with a rm -rf and the
fd pressured backend reopened the fd immediately?
merlin