Re: Re: Faster CREATE DATABASE by delaying fsync

From Mark Mielke
Subject Re: Re: Faster CREATE DATABASE by delaying fsync
Date
Msg-id 4B78906A.7020309@mark.mielke.cc
In reply to Re: Re: Faster CREATE DATABASE by delaying fsync  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 02/14/2010 03:49 PM, Andres Freund wrote:
> On Sunday 14 February 2010 21:41:02 Mark Mielke wrote:
>    
>> The widely reported problems, though, did not tend to be a problem with
>> directory changes written too late - but directory changes being written
>> too early. That is, the directory change is written to disk, but the
>> file content is not. This is likely because of the "ordered journal"
>> mode widely used in ext3/ext4 where metadata changes are journalled, but
>> file pages are not journalled. Therefore, it is important for some
>> operations, that the file pages are pushed to disk using fsync(file),
>> before the metadata changes are journalled.
>>      
> Well, but thats not a problem with pg as it fsyncs the file contents.
>    

Exactly. Not a problem.

>> If you are concerned, enable dirsync.
>>      
> If the filesystem already behaves that way a fsync on it should be fairly
> cheap. If it doesnt behave that way doing it is correct...
>    

Well, I disagree, as the whole point of this thread is that fsync() is 
*not* cheap. :-)

> Besides there is no reason to fsync the directory before the checkpoint, so
> dirsync would require a higher cost than doing it correctly.
>    

Using "ordered" metadata journaling has approximately the same effect. 
Provided that the data is fsync()'d before the metadata is required, 
either the metadata is recorded in the journal, in which case the data 
is accessible, or the metadata is NOT recorded in the journal, in which 
case, the files will appear missing. The races that theoretically exist 
would be in situations where the data of one file references a separate 
file that does not yet exist.

You said you would try and reproduce - are you going to try and 
reproduce on ext3/ext4 with ordered journalling enabled? I think 
reproducing outside of a case such as CREATE DATABASE would be 
difficult. It would have to be something like:
    open(O_CREAT)/write()/fsync()/close() of new data file, where data
      gets written, but directory data is not yet written out to journal
    open()/.../write()/fsync()/close() of existing file to point to new
      data file, but directory data is still not yet written out to journal
    crash

In this case, "dirsync" should be effective at closing this hole.
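
For a filesystem mounted without "dirsync", the same hole can in principle be closed from the application side by calling fsync() on the parent directory after the new file is created. The following is only a minimal sketch of that idea in Python, assuming a POSIX system such as Linux where a directory can be opened read-only and passed to fsync(). The helper name durable_create is hypothetical and is not taken from the PostgreSQL source.

```python
import os


def durable_create(path, data):
    """Create `path` holding `data`, then flush both the file contents
    and the parent directory entry to stable storage.

    Hypothetical helper for illustration; assumes POSIX semantics.
    """
    # Step 1: write the file contents and fsync() the file itself.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)
    finally:
        os.close(fd)

    # Step 2: fsync() the parent directory so that the new directory
    # entry is also on disk before anything else refers to the file.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Only after both steps return would it be safe to update an existing file so that it points at the new one. The cost is one extra fsync() per created file, which is the trade-off being discussed in this thread.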

As for cost? Well, most PostgreSQL data is stored within file content, 
not directory metadata. I think "dirsync" might slow down some 
operations like CREATE DATABASE or "rm -fr", but I would not expect it 
to affect day-to-day performance of the database under real load. Many 
operating systems enable the equivalent of "dirsync" by default. I 
believe Solaris does this, for example, and other than slowing down "rm 
-fr", I don't recall any real complaints about the cost of "dirsync".

After writing the above, I'm seriously considering adding "dirsync" to 
my /db mounts that hold PostgreSQL and MySQL data.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>


