Re: PERFORMANCE IMPROVEMENT by mapping WAL FILES

From: Bruce Momjian
Subject: Re: PERFORMANCE IMPROVEMENT by mapping WAL FILES
Msg-id: 200109282137.f8SLbpm01890@candle.pha.pa.us
In reply to: PERFORMANCE IMPROVEMENT by mapping WAL FILES  (Janardhana Reddy <jana-reddy@mediaring.com.sg>)
List: pgsql-hackers
> > Hi all,
> >     By mapping the WAL files into each backend's address space with the
> > "mmap" system call, there will be big improvements in performance from
> > the following points of view:
> >     1. Each backend writes directly into the address space obtained by
> >        mapping the WAL files. This saves the write system call at the
> >        end of every transaction, which transfers 8k of data from user
> >        space to the kernel.
> >     2. Since a transaction does not modify the whole 8k of a WAL page,
> >        on fsync the kernel transfers only the kernel pages that were
> >        modified, which are 4k on Linux, so fsync time is cut in half.
> > Any comments?
> 
> This is interesting.  We are concerned about using mmap() for all I/O
> because we could eat up quite a bit of address space for big tables, but
> WAL seems like an ideal use for mmap().

OK, I have talked to Tom Lane about this on the phone and we have a few
ideas.

Historically, we have avoided mmap() because of portability problems,
and because using mmap() to write to large tables could consume lots of
address space with little benefit.  However, I can perhaps see WAL as
a good use of mmap().

First, there is the issue of using mmap() at all.  On OSes that have the
mmap() MAP_SHARED flag, different backends could mmap() the same file and
each see the others' changes.  However, keep in mind we still have to
force WAL to disk, as fsync() does today, so we would need to use msync().
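
To make the MAP_SHARED plus msync() idea concrete, here is a minimal
sketch, assuming a POSIX system.  The names wal_map(), wal_append(), and
WAL_SEG_SIZE are hypothetical and are not taken from the PostgreSQL
source; the segment size is deliberately tiny for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define WAL_SEG_SIZE (16 * 1024)    /* hypothetical segment size, demo only */

/*
 * Map a WAL-like file with MAP_SHARED so every process mapping the same
 * file sees the same bytes.  Returns MAP_FAILED on error.
 */
void *
wal_map(const char *path, size_t size)
{
    void   *base;
    int     fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return MAP_FAILED;
    if (ftruncate(fd, (off_t) size) != 0)
    {
        close(fd);
        return MAP_FAILED;
    }
    base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    return base;
}

/*
 * Copy a record into the mapping (no write() call), then force the
 * touched pages to disk.  msync() wants a page-aligned start address,
 * so round the offset down to a page boundary first.
 */
int
wal_append(void *base, size_t off, const void *rec, size_t len)
{
    long    pagesz = sysconf(_SC_PAGESIZE);
    size_t  start;

    assert(pagesz > 0);
    memcpy((char *) base + off, rec, len);
    start = off - (off % (size_t) pagesz);
    return msync((char *) base + start, (off - start) + len, MS_SYNC);
}
```

Note that the msync() with MS_SYNC still waits for the disk just as
fsync() does; the mapping only removes the explicit write() copy.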

So, weighing the benefits of mmap(), there is the overhead of every
backend having to mmap() something that now sits quite easily in shared
memory.  I can see mmap() eliminating the copy from user space to the
kernel, but there are other ways to fix that.  We could modify the
write() routines to write() 8k on the first write of a WAL page and
later write only the modified part of the page to the kernel buffers.
The old kernel buffer is probably still around, so it is unlikely to
require a read from the file system to bring in the rest of the page.
This reduces the write from 8k to something probably less than 4k, which
is better than we can do with mmap().
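
The partial-write idea could look roughly like the sketch below.  This
is only an illustration under assumptions: WalPage, wal_page_write(),
and the on_disk/flushed/used fields are hypothetical names, and it
assumes records are only ever appended to a page, so the "modified part"
is always the tail that has not yet been handed to the kernel.

```c
#include <assert.h>
#include <fcntl.h>                  /* open() flags, for callers */
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define WAL_PAGE_SIZE 8192

/* Hypothetical in-memory image of one WAL page. */
typedef struct
{
    char    data[WAL_PAGE_SIZE];
    size_t  used;                   /* bytes filled with records so far */
    size_t  flushed;                /* bytes already passed to the kernel */
    int     on_disk;                /* has the full 8k been written once? */
} WalPage;

/*
 * First call for a page writes the whole 8k; later calls write only
 * the bytes appended since the previous call.  Returns the number of
 * bytes written, or -1 on error.  The caller still has to fsync().
 */
ssize_t
wal_page_write(int fd, off_t page_off, WalPage *pg)
{
    size_t  from;
    size_t  len;
    ssize_t n;

    assert(pg->used <= WAL_PAGE_SIZE && pg->flushed <= pg->used);

    if (!pg->on_disk)
    {
        from = 0;
        len = WAL_PAGE_SIZE;
    }
    else
    {
        from = pg->flushed;
        len = pg->used - pg->flushed;
    }
    if (len == 0)
        return 0;

    n = pwrite(fd, pg->data + from, len, page_off + (off_t) from);
    if (n == (ssize_t) len)
    {
        pg->flushed = pg->used;
        pg->on_disk = 1;
    }
    return n;
}
```

With this, a second commit on the same page copies only its own record
bytes to the kernel rather than the full 8k.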

I will add a TODO item to this effect.

As far as reducing the write to disk from 8k to 4k, if we have to
fsync/msync, we have to wait for the disk to spin to the proper location
and at that point writing 4k or 8k doesn't seem like much of a win.

In summary, I think it would be nice to reduce the 8k transfer from user
to kernel on secondary page writes to only the modified part of the
page.  I am uncertain if mmap() or anything else will help the physical
write to the disk.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 

