On Fri, 29 Oct 2010, Robert Haas wrote:
> On Thu, Oct 28, 2010 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> James Mansion <james@mansionfamily.plus.com> writes:
>>> Tom Lane wrote:
>>>> The other and probably worse problem is that there's no application
>>>> control over how soon changes to mmap'd pages get to disk. An msync
>>>> will flush them out, but the kernel is free to write dirty pages sooner.
>>>> So if they're depending for consistency on writes not happening until
>>>> msync, it's broken by design. (This is one of the big reasons we don't
>>>> use mmap'd space for Postgres disk buffers.)
>>
>>> Well, I agree that it sucks for the reason you give - but you use
>>> write and that's *exactly* the same in terms of when it gets written,
>>> as when you update a byte on an mmap'd page.
>>
>> Uh, no, it is not. The difference is that we can update a byte in a
>> shared buffer, and know that it *isn't* getting written out before we
>> say so. If the buffer were mmap'd then we'd have no control over that,
>> which makes it mighty hard to obey the WAL "write log before data"
>> paradigm.
>>
>> It's true that we don't know whether write() causes an immediate or
>> delayed disk write, but we generally don't care that much. What we do
>> care about is being able to ensure that a WAL write happens before the
>> data write, and with mmap we don't have control over that.
>
> Well, we COULD keep the data in shared buffers, and then copy it into
> an mmap()'d region rather than calling write(), but I'm not sure
> there's any advantage to it. Managing address space mappings is a
> pain in the butt.

Keep in mind that you have no way of knowing what order the data in the
mmap region gets written out to disk. The kernel is free to flush dirty
pages in any order, and at any time before you call msync().
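
To make the ordering point concrete, here is a minimal sketch of the
"write log before data" sequence using plain write()/fsync(). It is only
an illustration under stated assumptions, not how Postgres actually does
it: PAGE_SIZE, the file names, the record format, and the helper names
wal_then_data() and demo() are all hypothetical. It assumes a Unix-like
system where os.pwrite() and os.pread() are available.

```python
import os
import tempfile

PAGE_SIZE = 8192  # hypothetical page size, chosen only for this sketch


def wal_then_data(wal_fd, data_fd, page_no, page, wal_record):
    """Apply one change while obeying "write log before data"."""
    # 1. Append the log record and force it to stable storage first.
    os.write(wal_fd, wal_record)
    os.fsync(wal_fd)
    # 2. Only now hand the modified page to the kernel. Until this call
    #    the page lives in private memory, so the kernel cannot write it
    #    out early. With an mmap'd page there is no such barrier: the
    #    page may reach disk any time after it is dirtied.
    os.pwrite(data_fd, page, page_no * PAGE_SIZE)


def demo():
    """Run the sequence once against temporary files and read it back."""
    with tempfile.TemporaryDirectory() as d:
        wal_fd = os.open(os.path.join(d, "wal"),
                         os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        data_fd = os.open(os.path.join(d, "data"),
                          os.O_RDWR | os.O_CREAT, 0o600)
        try:
            page = bytearray(PAGE_SIZE)  # stand-in for a shared buffer
            page[0:5] = b"hello"         # the in-memory update
            wal_then_data(wal_fd, data_fd, 0, bytes(page),
                          b"set page 0 to hello\n")
            os.fsync(data_fd)            # checkpoint-style flush of data
            return os.pread(data_fd, 5, 0)
        finally:
            os.close(wal_fd)
            os.close(data_fd)
```

The only ordering guarantee here comes from the fsync() on the log file
sitting between the two writes. Copying the page into an mmap'd region
instead of calling pwrite() would still need that same barrier, and it
would give no say over which of several dirtied pages goes out first.
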
David Lang