Protecting against unexpected zero-pages: proposal

From: Gurjeet Singh
Subject: Protecting against unexpected zero-pages: proposal
Msg-id AANLkTimt2xZDDUiRqMS3aTTRxVQY6ALZNhF5ou1_736w@mail.gmail.com
Responses: Re: Protecting against unexpected zero-pages: proposal  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

A customer of ours is quite bothered about finding zero pages in an index after
a system crash. The task now is to improve the diagnosability of such an issue
and be able to definitively point to the source of zero pages.

The proposed solution below has been vetted in-house at EnterpriseDB, and I am
posting it here to find any problems we missed, and also to see if the community
would be interested in incorporating this capability.

Background:
-----------
SUSE Linux, ATCA board, 4 dual core CPUs => 8 cores, 24 GB RAM, 140 GB disk,
PG 8.3.11. RAID-1 SAS with SCSIinfo reporting that write-caching is disabled.

The corrupted index's file contents, based on hexdump:

    It has a total of 525 pages (cluster block size is 8K: per pg_controldata)
    Blocks 0 to 278 look sane.
    Blocks 279 to 518 are full of zeroes.
    Blocks 519 to 522 look sane.
    Block 523 is filled with zeroes.
    Block 524 looks sane.

The tail ends of blocks 278 and 522 have some non-zero data, meaning that those
index pages have some valid 'Special space' contents. Also, the heads of blocks
519 and 524 look sane. These two findings imply that the zeroing action happened
at 8K page boundaries. This is a standard ext3 FS with 4K block size, so this
raises the question of how we can ascertain that this was indeed a hardware/FS
malfunction. And if it was a hardware/FS problem, then why didn't we see zeroes
starting at a 1/2 K boundary (generally the disk's sector size) or a 4K boundary
(default ext3 FS block size) that does not align with an 8K boundary?

The backup from before the crash does not have these zero pages.

Disk Page Validity Check Using Magic Number
===========================================

Requirement:
------------
We have encountered quite a few zero pages in an index after a machine crash,
causing this index to be unusable. Although REINDEX is an option, we have
no way of telling if these zero pages were caused by hardware, by the filesystem,
or by Postgres. Postgres code analysis shows that Postgres being the culprit has
a very low probability. Similarly, since our hardware is considered to be of
good quality, with hardware-level RAID-1 over 2 disks, it is difficult to consider
the hardware to be the problem. The ext3 filesystem being used is also quite a
time-tested piece of software, hence it becomes very difficult to point fingers
at any of these 3 components for this corruption.

Postgres is being deployed as a component of a carrier-grade platform, and it is
required to run unattended as much as possible. There is a High Availability
monitoring component that is tasked with performing switchover to a standby node
in the event of any problem with the primary node. This HA component needs to
perform regular checks on the health of all the other components, including
Postgres, and take corrective actions.

With the zero pages comes the difficulty of ascertaining whether these are
legitimate zero pages (Postgres considers zero pages valid; they may be
left over from a previous extend-file followed by a crash), or whether these
zero pages are a result of FS/hardware failure.

We are required to definitively differentiate between zero pages from Postgres
vs. zero pages caused by hardware failure. Obviously this is not possible by the
very nature of the problem, so we explored a few ideas, including per-block
checksums in-block or in a checksum fork, S.M.A.R.T monitoring of disk drives,
PageInit() before smgrextend() in ReadBuffer_common(), and an additional member
in PageHeader for a magic number.

Following is the approach which we think is least invasive and does not threaten
code breakage, yet provides definitive detection of corruption/data loss
outside Postgres with the least performance penalty.

Implementation:
---------------

.) The basic idea is to have a magic number in every PageHeader before it is
   written to disk, and check for this magic number when performing page
   validity checks.

.) To avoid adding a new field to PageHeader, and any code breakage, we reuse
   an existing member of the structure.

.) We exploit the following facts and assumptions:
  -) Relations/files are extended 8 KB (BLCKSZ) at a time.
  -) Every I/O unit contains a PageHeader structure (table/index/fork files),
     which in turn contains pd_lsn as the first member.
  -) Every newly written block is considered to be zero filled.
  -) PageIsNew() assumes that if pd_upper is 0 then the page is zero.
  -) PageHeaderIsValid() allows zero filled pages to be considered valid.
  -) Anyone wishing to use a new page has to do PageInit() on the page.
  -) PageInit() does a MemSet(0) on the whole page.
  -) XLogRecPtr={x,0} is considered invalid.
  -) XLogRecPtr={x,~((uint32)0)} is not valid either (i.e. last byte of an xlog
      file (not segment)); we'll use this as the magic number.

      ... The above is my assumption, since it is not mentioned anywhere in the
      code. The XLogFileSize calculation seems to support this assumption.

      ... If this assumption doesn't hold good, then the previous assumption
      {x,0} can also be used to implement this magic number (with x > 0).
  -) There's only one implementation of Storage Manager, i.e. md.c.
  -) smgrextend() -> mdextend() is the only place where a relation is extended.
  -) Writing beyond EOF in a file causes the intermediate space to become a hole,
     and any read from such a hole returns zero filled pages.
  -) Anybody trying to extend a file makes sure that there's no concurrent
     extension going on from somewhere else.
     ... This is ensured either by the implicit nature of the calling code, or
     by calling LockRelationForExtension().

.) In mdextend(), if the buffer being written is zero filled, then we write the
   magic number in that page's pd_lsn.
   ... This check can be optimized to just check sizeof(pd_lsn) worth of buffer.

.) In mdextend(), if the buffer is being written beyond the current EOF, then we
   forcibly write the intermediate blocks too, and write the magic number in
   each of those.
   ... This needs an _mdnblocks() call and FileSeek(SEEK_END)+FileWrite() calls
   for every block in the hole.

   ... Creation of holes is assumed to be a very limited corner case, hence
   this performance hit is acceptable in these rare corner cases. Tests are
   being planned using a real application, to check how many times this occurs.

.) PageHeaderIsValid() needs to be modified to allow MagicNumber-followed-by-zeroes
   as a valid page (rather than a completely zero page).
   ... If the page is completely filled with zeroes, this confirms that
   either the filesystem or the disk storage zeroed these pages, since Postgres
   never wrote zero pages to disk.

.) PageInit() and PageIsNew() require no change.

.) XLByteLT(), XLByteLE() and XLByteEQ() may be changed to contain
   AssertMacro( !MagicNumber(a) && !MagicNumber(b) )

.) I haven't analyzed the effects of this change on the recovery code, but I
   have a feeling that we might not need to change anything there.

.) We can create a contrib module (standalone binary or a loadable module) that
   goes through each disk page and checks it for being zero filled, and raises
   an alarm if it finds any.

Thoughts welcome.
-- 
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh.gurjeet@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device
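
For illustration only, the stamping and validity-check steps proposed above
could be sketched roughly as follows. This is a standalone model, not Postgres
source: the XLogRecPtr struct is a simplified stand-in for the 8.3-era two-field
layout, BLCKSZ is assumed to be 8192, the function and enum names are
hypothetical, and the choice of xlogid = 0 for the stamped value is an
assumption (the proposal only fixes the second field at ~((uint32)0)).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192                     /* assumed cluster block size */

/* Simplified stand-in for the two-field XLogRecPtr stored in pd_lsn. */
typedef struct
{
    uint32_t xlogid;
    uint32_t xrecoff;
} XLogRecPtr;

/* Proposed magic value for the second field; assumed never to be a real LSN. */
#define MAGIC_XRECOFF (~(uint32_t) 0)

typedef enum
{
    PAGE_ALL_ZERO,      /* never stamped: suspected FS/hardware zeroing */
    PAGE_MAGIC_EMPTY,   /* magic number followed by zeroes: extended, unused */
    PAGE_OTHER          /* anything else: falls through to the normal checks */
} PageState;

/* Return true if all len bytes starting at p are zero. */
static bool
range_is_zero(const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (p[i] != 0)
            return false;
    return true;
}

/*
 * What mdextend() would do before writing: if the pd_lsn bytes at the start
 * of the page are zero, overwrite them with the magic number. Only
 * sizeof(pd_lsn) bytes are inspected, per the optimization in the proposal.
 * Returns true if the page was stamped.
 */
static bool
stamp_magic_if_zero(unsigned char *page)
{
    XLogRecPtr  magic = {0, MAGIC_XRECOFF};     /* xlogid = 0 is an assumption */

    assert(page != NULL);
    if (!range_is_zero(page, sizeof(XLogRecPtr)))
        return false;
    memcpy(page, &magic, sizeof(magic));
    return true;
}

/* What a modified PageHeaderIsValid() or a page scanner would distinguish. */
static PageState
classify_page(const unsigned char *page)
{
    XLogRecPtr  lsn;
    bool        rest_zero;

    memcpy(&lsn, page, sizeof(lsn));
    rest_zero = range_is_zero(page + sizeof(lsn), BLCKSZ - sizeof(lsn));

    if (rest_zero && lsn.xlogid == 0 && lsn.xrecoff == 0)
        return PAGE_ALL_ZERO;
    if (rest_zero && lsn.xrecoff == MAGIC_XRECOFF)
        return PAGE_MAGIC_EMPTY;
    return PAGE_OTHER;
}
```

A scanner along the lines of the contrib module idea could read each BLCKSZ
block of a relation file, call classify_page() on it, and raise an alarm for
every PAGE_ALL_ZERO result, since under this scheme Postgres itself would never
have written such a page.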
