On 12/2/19 7:35 AM, Michael Paquier wrote:
> On Sat, Nov 30, 2019 at 03:09:39PM +0300, Grigory Smolkin wrote:
>> I've dug a bit into this problem, and it turns out that in
>> SaveSlotToPath() the temp file for the replication slot is opened with the
>> 'O_CREAT | O_EXCL' flags, which makes this routine not very reentrant.
> What did you see as the I/O problem before facing the actual error
> reported here? Was it just ENOSPC, a fsync failure, or just a failure
> in closing the fd? The first pattern is mostly what I guess happened,
> still a fsync failure would not trigger a PANIC here (actually we
> really should do that!), but I am raising a different thread about
> that issue.
Hello!
I didn't see the very first error that left the temp file behind.
I've just requested it, but it will take some time to get (there
are several terabytes of text logs).
I assume it was an out-of-space error, which, judging by the
code, should produce an ERROR and leave the temp file behind, just as
happened in the aforementioned case.
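
To illustrate why a leftover temp file blocks every later attempt, here is a
minimal standalone POSIX sketch. It is not the PostgreSQL code: the helper
try_create_temp() and the file name used with it are hypothetical, and only
the open() flags correspond to what is discussed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only: try to create a temp file
 * using O_CREAT plus one extra flag (O_EXCL or O_TRUNC).
 * Returns 0 on success, or the errno value reported by open() on failure.
 */
static int
try_create_temp(const char *path, int extra_flag)
{
    int fd;

    assert(path != NULL);

    fd = open(path, O_CREAT | extra_flag | O_WRONLY, 0600);
    if (fd < 0)
        return errno;       /* EEXIST if O_EXCL meets an existing file */

    close(fd);
    return 0;
}
```

With a stale file already in place, try_create_temp(path, O_EXCL) returns
EEXIST on every call until someone removes the file, while
try_create_temp(path, O_TRUNC) succeeds and simply empties the old file.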
>
>> Since an exclusive lock is taken before temp file creation, I think it
>> should be safe to replace O_EXCL with O_TRUNC.
>> Script to reproduce and patch are attached.
> Agreed. I prefer the O_TRUNC option because that's less code churn.
> Also, as it can still be useful to have a look at the temporary state
> file after a crash or a failure, doing unlink() in the error code
> paths is no good option IMO.
I'm sorry, but it was a production system, so, as I understand it, the
stale temp file was hastily deleted without much consideration.
Thank you for your interest in this topic.
>
> Do others have thoughts or objections to share?
> --
> Michael
--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company