Re: BBU Cache vs. spindles

From: Bruce Momjian
Subject: Re: BBU Cache vs. spindles
Msg-id: 201012010307.oB137IA19179@momjian.us
In reply to: Re: BBU Cache vs. spindles  (Greg Smith <greg@2ndquadrant.com>)
Responses: Re: BBU Cache vs. spindles  (Bruce Momjian <bruce@momjian.us>)
List: pgsql-performance

Greg Smith wrote:
> Kevin Grittner wrote:
> > I assume that we send a full
> > 8K to the OS cache, and the file system writes disk sectors
> > according to its own algorithm.  With either platters or BBU cache,
> > the data is persisted on fsync; why do you see a risk with one but
> > not the other?
>
> I'd like a 10 minute argument please.  I started to write something to
> refute this, only to clarify in my head the sequence of events that
> leads to the most questionable result, where I feel a bit less certain
> than I did before of the safety here.  Here is the worst case I believe
> you're describing:
>
> 1) Transaction is written to the WAL and sync'd; client receives
> COMMIT.  Since full_page_writes is off, the data in the WAL consists
> only of the delta of what changed on the page.
> 2) 8K database page is written to OS cache
> 3) PG calls fsync to force the database block out
> 4) OS writes first 4K block of the change to the BBU write cache.  Worst
> case, this fills the cache, and it takes a moment for some random writes
> to process before it has space to buffer again (makes this more likely
> to happen, but it's not required to see the failure case here)
> 5) Sudden power interruption, second half of the page write is lost
> 6) Server restarts
> 7) That 4K write is now replayed from the battery's cache
>
> At this point, you now have a torn 8K page, with 1/2 old and 1/2 new

Based on this report, I think we need to update our documentation and
backpatch removal of text that says that BBU users can safely turn off
full-page writes.  Patch attached.

I think we have fallen into a trap I remember from the late 1990s, when
I assumed that an 8k-block file system would write to the disk
atomically in 8k segments, which of course it cannot.  My bet is that
even if you write to the kernel in 8k pages, and have an 8k file
system, the disk is still accessed in 512-byte blocks, even with a BBU.
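
To make the failure concrete, here is a toy Python model of the sequence
Greg describes.  It is only a sketch under stated assumptions: the 8-byte
LSN at the start of the page, the LSN values, and the helper names
(make_page, write_sectors, replay_delta, replay_full_page_image) are
invented for illustration and are not PostgreSQL's actual page handling
or recovery code.  It shows a page torn after 8 of its 16 512-byte
sectors are written, and why a delta-only WAL record may not repair it
while a full-page image does.

```python
import struct

PAGE_SIZE = 8192    # PostgreSQL's default page size
SECTOR_SIZE = 512   # assumed atomic write unit of the device
LSN_BYTES = 8       # toy header (assumption): page starts with an 8-byte LSN


def make_page(lsn, fill):
    """Build a toy page: 8-byte LSN header followed by a fill pattern."""
    return struct.pack(">Q", lsn) + fill * (PAGE_SIZE - LSN_BYTES)


def page_lsn(page):
    """Read the toy LSN stored in the first 8 bytes of the page."""
    return struct.unpack(">Q", bytes(page[:LSN_BYTES]))[0]


def write_sectors(disk, new_page, sectors_written):
    """Copy only the first N sectors, as if power failed mid-write."""
    cut = sectors_written * SECTOR_SIZE
    disk[:cut] = new_page[:cut]


def replay_delta(disk, record_lsn, offset, data):
    """Apply a delta record unless the page LSN says it was already applied."""
    if page_lsn(disk) >= record_lsn:
        return False  # skipped: the header claims the change is on disk
    disk[offset:offset + len(data)] = data
    disk[:LSN_BYTES] = struct.pack(">Q", record_lsn)
    return True


def replay_full_page_image(disk, image):
    """Overwrite the whole page with the image saved in the WAL."""
    disk[:] = image


old_page = make_page(lsn=100, fill=b"O")
new_page = make_page(lsn=200, fill=b"N")

# Power fails after 8 of the 16 sectors (the first 4K) reach stable storage.
disk = bytearray(old_page)
write_sectors(disk, new_page, sectors_written=8)
torn = bytes(disk)
is_torn = torn != old_page and torn != new_page

# Recovery with only a delta for the second half: the new LSN is already
# in the header, so the record is skipped and the page stays torn.
delta_disk = bytearray(torn)
delta_applied = replay_delta(delta_disk, 200, 4096, new_page[4096:])
delta_ok = bytes(delta_disk) == new_page

# Recovery with a full-page image: the page is restored wholesale.
fpi_disk = bytearray(torn)
replay_full_page_image(fpi_disk, new_page)
fpi_ok = bytes(fpi_disk) == new_page

print(is_torn, delta_applied, delta_ok, fpi_ok)
```

In this model the torn page carries the new LSN in its first half, so
delta replay believes the change is already on disk and leaves the old
second half in place.  Only the full-page image brings the page back to
a consistent state.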

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index a2724fa..1e67bbd 100644
*** /tmp/pgrevert.14281/7sLqTb_wal.sgml    Tue Nov 30 21:57:17 2010
--- doc/src/sgml/wal.sgml    Tue Nov 30 21:56:49 2010
***************
*** 164,173 ****
     <productname>PostgreSQL</> periodically writes full page images to
     permanent WAL storage <emphasis>before</> modifying the actual page on
     disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!    restore partially-written pages.  If you have a battery-backed disk
!    controller or file-system software that prevents partial page writes
!    (e.g., ZFS),  you can turn off this page imaging by turning off the
!    <xref linkend="guc-full-page-writes"> parameter.
    </para>
   </sect1>

--- 164,175 ----
     <productname>PostgreSQL</> periodically writes full page images to
     permanent WAL storage <emphasis>before</> modifying the actual page on
     disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!    restore partially-written pages.  If you have file-system software
!    that prevents partial page writes (e.g., ZFS),  you can turn off
!    this page imaging by turning off the <xref
!    linkend="guc-full-page-writes"> parameter. Battery-Backed unit
!    (BBU) disk controllers do not prevent partial page writes unless
!    they guarantee that data is written to the BBU as full (8kB) pages.
    </para>
   </sect1>

