Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)

From: Graeme B. Bell
Subject: Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)
Date:
Msg-id: C2A63334-1CC7-4530-B79A-E787727A6C5C@skogoglandskap.no
In reply to: Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)  (Bruce Momjian <bruce@momjian.us>)
List: pgsql-admin

Hi Bruce,

> It is my understanding that write-through is always safe as it writes
> through to the layer below and waits for acknowledgement.  Write-back
> doesn't, so when you say:

I said WT twice, and I assure you for a third time: in the tests I've carried out on Crucial SSD disks, **WT** was not
safe or reliable, whereas BATTERY-BACKED WB was.

I believe this is confusing/surprising to you because your beliefs about WT and WB, and about the persistence of fsyncs
to SSD disks, are inconsistent with what happens in reality.
Specifically: SSDs tell lies to the SATA/RAID controller about what they're really doing, but unfortunately WT mode
trusts the SSD to be honest.

Hopefully this will help explain what's happening:

1. Historically, RAID writeback caches were unreliable for fsyncs because with no battery to persist data (and no
nonvolatile WB cache, as we see on modern RAID controllers), anything in the WB cache would be lost during a power
failure. So, historically, your safe options were: WB cache with battery (safe since the cache 'never loses power'),
or WT to disk (safe if the disk can be trusted to persist writes through a power loss).

2. If you use WT at the RAID controller level, then 'in principle' an fsync call should not return until the data is
safely on the drive. For postgres, the fsync call is the most important thing. Regular data writes can fail and crash
recovery is possible via WAL replay. But if fsyncs don't work properly, crash recovery is probably not possible.

3. In reality, on most SSDs, if you use WT on the RAID controller to go direct to disk and make an fsync call, the
drive's controller tells you the data is persisted while behind the scenes the data is actually held in a volatile
writeback cache and is vulnerable to catastrophic loss during a power failure.

4. So basically: the SSD drive's controller lies about fsync and does not honor the 'fsync contract'.

5. It's not just SSDs. If you are using e.g. a NAS system, perhaps storing your DB over NFS, then the NAS is almost
certainly lying to you when you make fsync calls.

6. It's not just SSDs and NAS systems. If you use a virtual server / VMware, then to assist performance, the
virtualised disk may be lying about fsync persistence.

7. Why do SSDs, NAS systems and VMs lie about fsync persistence? Because it improves the apparent performance,
benchmarks and so on, at the expense of corruption which happens only rarely, during power failures. In some
applications this is acceptable, e.g. network file shares for powerpoints and word documents, mailservers... but for
postgres it's not acceptable.

8. The only way to get a real idea about WT and WB and what is happening under the hood is to fire up diskchecker.pl
and measure the result in the real world when you pull the plug (many times). Then you'll know for sure what happens
in terms of performance and write persistence. You should not trust anything you read online written by other people,
especially by drive manufacturers - test for yourself if your database matters to you. Remember to use a 'real' plug
pull - yank the cord out of the back - don't simply power off with the button or you'll get incorrect results. That
said, the Intel drives have a great reputation and every report I've read indicates they work correctly with WT or
direct connections.
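For anyone who hasn't run it: the test procedure is roughly as follows (a sketch from memory of Brad Fitzpatrick's diskchecker.pl - the host name, file name and size below are placeholders, so check the script's own usage text for the exact arguments):

```shell
# On a SECOND machine (one that survives the plug pull), start the listener:
./diskchecker.pl -l

# On the machine under test, write test data against the volume you
# want to check, reporting each acknowledged write to the listener:
./diskchecker.pl -s listener-host create test_file 500

# ...while that is running, yank the power cord out of the back
# (not the soft power button)...

# After reboot, ask the listener which acknowledged writes survived:
./diskchecker.pl -s listener-host verify test_file
```

Any write that was acknowledged (fsync returned) but is missing after the power cut means the storage stack is lying about persistence.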

9. That said - with non-Intel SSD drives, you may be able to use WT if you turn off all caching on the SSD, which may
stop the SSD lying about what it's doing. On the disks I've tested, this will allow fsync writes to hit the disk
correctly in WT mode.
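On Linux, turning off the drive's volatile write cache is typically done with hdparm; a minimal sketch, assuming a SATA drive and a controller that passes ATA commands through (/dev/sdX is a placeholder for your device):

```shell
# Disable the drive's volatile write cache (hdparm's -W flag):
hdparm -W 0 /dev/sdX

# Read the setting back to confirm:
hdparm -W /dev/sdX
```

Note this is a drive setting, separate from the RAID controller's WT/WB mode, and on some controllers you have to change it through the controller's own CLI instead.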

However! The costs are enormous - a substantial increase in SSD wear (e.g. much earlier failure) and a massive
decrease in performance (a 96% drop, i.e. you get 4% of the performance you expect out of your disk - it can actually
be slower than an HDD!).

Example (from memory, but I'm very certain about these numbers): I measured ~100 disk operations per second with
diskchecker.pl and Crucial M500s when all disk cache was disabled and WT used. This compared with ~2000-2500
diskchecker operations per second with a *battery-backed* WB cache and disk cache enabled. In both cases, data
persisted correctly and was not corrupted during repeated plug-pull testing.



For what it's worth: a similar problem comes up with RAID controllers and the probabilities of RAID failure, and the
massive gap between theory and practice.

You can read any statistic you like about the likelihood of e.g. RAID6 failing due to consecutive disk failures (e.g.
this ACM paper taking into account UBE errors with RAID5/6 suggests a 0.00639% risk of loss per year,
http://queue.acm.org/detail.cfm?id=1670144), but in practice what kills your database is probably the RAID controller
card failing, or having a bug or an unreliable battery... (e.g. http://www.webhostingtalk.com/showthread.php?t=1075803,
e.g. 25% failure rates on some models!).

Graeme Bell


