Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)
From | Graeme B. Bell |
---|---|
Subject | Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?) |
Date | |
Msg-id | C2A63334-1CC7-4530-B79A-E787727A6C5C@skogoglandskap.no |
In reply to | Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?) (Bruce Momjian <bruce@momjian.us>) |
List | pgsql-admin |
Hi Bruce,

> It is my understanding that write-through is always safe as it writes
> through to the layer below and waits for acknowledgement. Write-back
> doesn't, so when you say:

I said WT twice, and I assure you for a third time: in the tests I've carried out with Crucial SSD disks, **WT** was not safe or reliable, whereas BATTERY-BACKED WB was.

I believe this is confusing/surprising to you because your beliefs about WT and WB, and about the persistence of fsyncs to SSD disks, are inconsistent with what happens in reality. Specifically: SSDs tell lies to the SATA/RAID controller about what they're really doing, but unfortunately WT mode trusts the SSD to be honest.

Hopefully this will help explain what's happening:

1. Historically, RAID writeback caches were unreliable for fsyncs because, with no battery to persist data (and no non-volatile WB cache, as we see on modern RAID controllers), anything in the WB cache would be lost during a power failure. So, historically, your safe options were: WB cache with battery (safe since the cache 'never loses power'), or WT to disk (safe if the disk can be trusted to persist writes through a power loss).

2. If you use WT at the RAID controller level, then 'in principle' an fsync call should not return until the data is safely on the drive. For postgres, the fsync call is the most important thing. Regular data writes can fail, and crash recovery is possible via WAL replay. But if fsyncs don't work properly, crash recovery is probably not possible.

3. In reality, on most SSDs, if you use WT on the RAID controller to go direct to disk and make an fsync call, the drive's controller tells you the data is persisted, while behind the scenes the data is actually held in a volatile writeback cache and is vulnerable to catastrophic loss during a power failure.

4. So basically: the SSD drive's controller lies about fsync and does not honour the 'fsync contract'.

5. It's not just SSDs. If you are using e.g.
a NAS system, perhaps storing your DB over NFS, then chances are the NAS is almost certainly lying to you when you make fsync calls.

6. It's not just SSDs and NAS systems. If you use a virtual server / VMware, then to assist performance, the virtualised disk may be lying about fsync persistence.

7. Why do SSDs, NAS systems and VMs lie about fsync persistence? Because it improves the apparent performance, benchmarks and so on, at the expense of corruption which happens only rarely, during power failures. In some applications this is acceptable, e.g. network file shares for powerpoints and word documents, mailservers... but for postgres it's not acceptable.

8. The only way to get a real idea about WT and WB and what is happening under the hood is to fire up diskchecker.pl and measure the result in the real world when you pull the plug (many times). Then you'll know for sure what happens in terms of performance and write persistence. You should not trust anything you read online written by other people, especially by drive manufacturers - test for yourself if your database matters to you. Remember to use a 'real' plug pull - yank the cord out of the back - don't simply power off with the button, or you'll get incorrect results. That said, the Intel drives have a great reputation, and every report I've read indicates they work correctly with WT or direct connections.

9. That said, with non-Intel SSD drives you may be able to use WT if you turn off all caching on the SSD, which may stop the SSD lying about what it's doing. On the disks I've tested, this will allow fsync writes to hit the disk correctly in WT mode. However! The costs are enormous: a substantial increase in SSD wear (e.g. much earlier failure) and a massive decrease in performance (a 96% drop, i.e. you get 4% of the performance you expect out of your disk - it can actually be slower than an HDD!).
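The 'fsync contract' discussed in points 2-4 boils down to the pattern below (a minimal Python sketch; the filename is purely illustrative). A drive with a volatile write cache can let `os.fsync()` return while the bytes still sit in its cache, which is exactly the lie described above:

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the OS to push it to stable storage.

    The 'fsync contract': when os.fsync() returns, the data is supposed
    to be on durable media. A lying drive acknowledges the flush while
    the data is still in its volatile cache.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)  # should not return until the bytes are persistent
    finally:
        os.close(fd)

durable_write("/tmp/fsync_demo.dat", b"commit record")
```

This is (conceptually) what postgres relies on when it flushes the WAL: a transaction is only reported committed after the fsync returns, so a drive that acknowledges early silently voids crash recovery.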
Example (from memory, but I'm very certain about these numbers): I measured ~100 disk operations per second with diskchecker.pl and Crucial M500s when all disk cache was disabled and WT was used. This compared with ~2000-2500 diskchecker operations per second with a *battery-backed* WB cache and disk cache enabled. In both cases, data persisted correctly and was not corrupted during repeated plug-pull testing.

For what it's worth: a similar problem comes up with RAID controllers and the probabilities of RAID failure, and the massive gap between theory and practice. You can read any statistic you like about the likelihood of e.g. RAID6 failing due to consecutive disk failures (e.g. this ACM paper, taking into account UBE errors with RAID5/6, suggests a 0.00639% risk of loss per year: http://queue.acm.org/detail.cfm?id=1670144), but in practice what kills your database is probably the RAID controller card failing, or having a bug or an unreliable battery... (e.g. http://www.webhostingtalk.com/showthread.php?t=1075803, e.g. 25% failure rates on some models!).

Graeme Bell
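[Editor's note] For readers who can't run diskchecker.pl, the idea behind the plug-pull test in point 8 can be sketched in a few lines of Python. This is a simplified, hypothetical stand-in for the real tool, not its actual implementation: write checksummed records, fsyncing each one and counting it as acknowledged only after fsync returns; after a real power cut, verify how many acknowledged records actually survived.

```python
import hashlib
import os
import struct

# Each record: 4-byte sequence number + sha256 digest of that number.
RECORD = struct.Struct("<I32s")

def write_records(path: str, count: int) -> int:
    """Append checksummed records, fsyncing each one.

    Returns the number of records the OS/drive acknowledged as durable.
    After a genuine plug pull, every acknowledged record must still
    verify; a missing or corrupt acknowledged record means fsync lied.
    """
    acked = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for seq in range(count):
            digest = hashlib.sha256(str(seq).encode()).digest()
            os.write(fd, RECORD.pack(seq, digest))
            os.fsync(fd)   # the durability promise being tested
            acked += 1     # counted only after fsync returns
    finally:
        os.close(fd)
    return acked

def verify_records(path: str) -> int:
    """Return how many leading records survive with a correct checksum."""
    ok = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break
            seq, digest = RECORD.unpack(chunk)
            if digest != hashlib.sha256(str(seq).encode()).digest():
                break
            ok += 1
    return ok

acked = write_records("/tmp/plugpull_demo.dat", 5)
```

On an intact run, `verify_records()` equals the acknowledged count. After a real plug pull, a drive that honours fsync also passes; a drive with a lying volatile cache comes up short, which is what the ~100 vs ~2000-2500 ops/sec comparison above was measuring under safe configurations.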