Protecting against unexpected zero-pages: proposal
From | Gurjeet Singh |
---|---|
Subject | Protecting against unexpected zero-pages: proposal |
Date | |
Msg-id | AANLkTimt2xZDDUiRqMS3aTTRxVQY6ALZNhF5ou1_736w@mail.gmail.com |
Responses |
Re: Protecting against unexpected zero-pages: proposal
(Tom Lane <tgl@sss.pgh.pa.us>)
|
List | pgsql-hackers |
A customer of ours is quite bothered about finding zero pages in an index after
a system crash. The task now is to improve the diagnosability of such an issue
and be able to definitively point to the source of zero pages.

The proposed solution below has been vetted in-house at EnterpriseDB, and I am
posting it here to see any possible problems we missed, and also whether the
community would be interested in incorporating this capability.

Background:
-----------
SUSE Linux, ATCA board, 4 dual core CPUs => 8 cores, 24 GB RAM, 140 GB disk,
PG 8.3.11. RAID-1 SAS with SCSIinfo reporting that write-caching is disabled.

The corrupted index's file contents, based on hexdump:

    It has a total of 525 pages (cluster block size is 8K: per pg_controldata)
    Blocks 0 to 278 look sane.
    Blocks 279 to 518 are full of zeroes.
    Blocks 519 to 522 look sane.
    Block 523 is filled with zeroes.
    Block 524 looks sane.

The tail ends of blocks 278 and 522 have some non-zero data, meaning that those
index pages have some valid 'Special space' contents. Also, the heads of blocks
519 and 524 look sane. These two findings imply that the zeroing action happened
at 8K page boundaries. This is a standard ext3 FS with 4K block size, so this
raises the question of how we can ascertain that this was indeed a hardware/FS
malfunction. And if it was a hardware/FS problem, then why didn't we see zeroes
at a 1/2 K boundary (generally the disk's sector size) or a 4K boundary (default
ext3 FS block size) which does not align with an 8K boundary?

The backup from before the crash does not have these zero pages.

Disk Page Validity Check Using Magic Number
===========================================

Requirement:
------------
We have encountered quite a few zero pages in an index after a machine crash,
causing this index to be unusable. Although REINDEX is an option, we have
no way of telling if these zero pages were caused by hardware or filesystem or
by Postgres. Postgres code analysis shows that Postgres being the culprit is a
very low probability, and similarly, since our hardware is also considered of
good quality with hardware level RAID-1 over 2 disks, it is difficult to consider
the hardware to be a problem. The ext3 filesystem being used is also quite a
time-tested piece of software, hence it becomes very difficult to point fingers
at any of these 3 components for this corruption.

Postgres is being deployed as a component of a carrier-grade platform, and it is
required to run unattended as much as possible. There is a High Availability
monitoring component that is tasked with performing switchover to a standby node
in the event of any problem with the primary node. This HA component needs to
perform regular checks on the health of all the other components, including
Postgres, and take corrective actions.

With the zero pages comes the difficulty of ascertaining whether these are
legitimate zero pages (since Postgres considers zero pages as valid, maybe
left over from a previous extend-file followed by a crash), or whether these
zero pages are a result of FS/hardware failure.

We are required to definitively differentiate between zero pages from Postgres
vs. zero pages caused by hardware failure. Obviously this is not possible by the
very nature of the problem, so we explored a few ideas, including per-block
checksums in-block or in a checksum fork, S.M.A.R.T monitoring of disk drives,
PageInit() before smgrextend() in ReadBuffer_common(), and an additional member
in PageHeader for a magic number.

Following is an approach which we think is least invasive, and does not threaten
code breakage, yet provides a definitive detection of corruption/data loss
outside Postgres with the least performance penalty.

Implementation:
---------------

.) The basic idea is to have a magic number in every PageHeader before it is
written to disk, and check for this magic number when performing page validity
checks.

.) To avoid adding a new field to PageHeader, and any code breakage, we reuse
   an existing member of the structure.

.) We exploit the following facts and assumptions:
  -) Relations/files are extended 8 KB (BLCKSZ) at a time.
  -) Every I/O unit contains a PageHeader structure (table/index/fork files),
     which in turn contains pd_lsn as the first member.
  -) Every newly written block is considered to be zero filled.
  -) PageIsNew() assumes that if pd_upper is 0 then the page is zero.
  -) PageHeaderIsValid() allows zero filled pages to be considered valid.
  -) Anyone wishing to use a new page has to do PageInit() on the page.
  -) PageInit() does a MemSet(0) on the whole page.
  -) XLogRecPtr={x,0} is considered invalid.
  -) XLogRecPtr={x,~((uint32)0)} is not valid either (i.e. last byte of an xlog
     file (not segment)); we'll use this as the magic number.

     ... Above is my assumption, since it is not mentioned anywhere in the code.
         The XLogFileSize calculation seems to support this assumption.

     ... If this assumption doesn't hold good, then the previous assumption {x,0}
         can also be used to implement this magic number (with x > 0).
  -) There's only one implementation of Storage Manager, i.e. md.c.
  -) smgrextend() -> mdextend() is the only place where a relation is extended.
  -) Writing beyond EOF in a file causes the intermediate space to become a hole,
     and any reads from such a hole return zero filled pages.
  -) Anybody trying to extend a file makes sure that there's no concurrent
     extension going on from somewhere else.
     ... This is ensured either by the implicit nature of the calling code, or by
         calling LockRelationForExtension().

.) In mdextend(), if the buffer being written is zero filled, then we write the
   magic number in that page's pd_lsn.
   ... This check can be optimized to just check sizeof(pd_lsn) worth of buffer.

.) In mdextend(), if the buffer is being written beyond current EOF, then we
   forcibly write the intermediate blocks too, and write the magic number in
   each of those.
   ... This needs an _mdnblocks() call and FileSeek(SEEK_END)+FileWrite() calls
       for every block in the hole.

   ... Creation of holes is being assumed to be a very limited corner case,
       hence this performance hit is acceptable in these rare corner cases. Tests
       are being planned using a real application, to check how many times this
       occurs.

.) PageHeaderIsValid() needs to be modified to allow MagicNumber-followed-by-zeroes
   as a valid page (rather than a completely zero page).
   ... If the page is completely filled with zeroes, this confirms the fact that
       either the filesystem or the disk storage zeroed these pages, since
       Postgres never wrote zero pages to disk.

.) PageInit() and PageIsNew() require no change.

.) XLByteLT(), XLByteLE() and XLByteEQ() may be changed to contain
   AssertMacro( !MagicNumber(a) && !MagicNumber(b) )

.) I haven't analyzed the effects of this change on the recovery code, but I
   have a feeling that we might not need to change anything there.

.) We can create a contrib module (standalone binary or a loadable module) that
   goes through each disk page and checks it for being zero filled, and raises
   an alarm if it finds any.

Thoughts welcome.
-- 
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh.gurjeet@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device
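
As a rough illustration of the scheme described under "Implementation" above,
the stamping step in mdextend() and the modified validity check might look
something like the sketch below. This is not the actual patch and not Postgres
source: the XLogRecPtr struct is a simplified stand-in for the 8.3-era
{xlogid, xrecoff} pair, and MAGIC_XLOGID, bytes_are_zero(),
stamp_magic_if_zero(), classify_page() and the PageClass enum are all
hypothetical names invented for the example. It also scans the whole page
rather than using the sizeof(pd_lsn) shortcut mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192                     /* cluster block size from the post */

/* Simplified stand-in for the 8.3-era XLogRecPtr (assumption, not the real header). */
typedef struct
{
    uint32_t xlogid;
    uint32_t xrecoff;
} XLogRecPtr;

/* Hypothetical magic value {x, ~0}; the choice of x is an assumption. */
#define MAGIC_XLOGID  1u
#define MAGIC_XRECOFF (~(uint32_t) 0)

/* Hypothetical classification used only by this sketch. */
typedef enum
{
    PAGE_ALL_ZERO,      /* never stamped: zeroed outside Postgres under this scheme */
    PAGE_MAGIC_NEW,     /* magic number followed by zeroes: legitimately new page */
    PAGE_IN_USE         /* anything else: left to the normal header checks */
} PageClass;

/* Return true if every byte in [p, p + len) is zero. */
static bool
bytes_are_zero(const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        if (p[i] != 0)
            return false;
    }
    return true;
}

/* What mdextend() would do before writing: stamp pd_lsn of a zero-filled page. */
static void
stamp_magic_if_zero(unsigned char *page)
{
    assert(page != NULL);
    if (bytes_are_zero(page, BLCKSZ))
    {
        XLogRecPtr magic = {MAGIC_XLOGID, MAGIC_XRECOFF};

        /* pd_lsn is the first member of the page header, so it sits at offset 0. */
        memcpy(page, &magic, sizeof(magic));
    }
}

/* What a modified validity check or a scanning tool could report for a page. */
static PageClass
classify_page(const unsigned char *page)
{
    XLogRecPtr lsn;

    assert(page != NULL);
    if (bytes_are_zero(page, BLCKSZ))
        return PAGE_ALL_ZERO;

    memcpy(&lsn, page, sizeof(lsn));
    if (lsn.xrecoff == MAGIC_XRECOFF &&
        bytes_are_zero(page + sizeof(lsn), BLCKSZ - sizeof(lsn)))
        return PAGE_MAGIC_NEW;

    return PAGE_IN_USE;
}
```

A scanning tool along the lines of the proposed contrib module could read a
relation file BLCKSZ bytes at a time, call classify_page() on each block, and
raise an alarm for every PAGE_ALL_ZERO result.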