Extended Prefetching using Asynchronous IO - proposal and patch

From John Lumby
Subject Extended Prefetching using Asynchronous IO - proposal and patch
Msg-id BAY175-W45086073075CA064EFE9A0A33A0@phx.gbl
In reply to Re: Race condition within _bt_findinsertloc()? (new page split code)  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Extended Prefetching using Asynchronous IO - proposal and patch
List pgsql-hackers
<div dir="ltr">Claudio Freire and I are proposing new functionality for Postgresql <br />to extend the scope of
prefetching and also exploit posix asynchronous IO<br />when doing prefetching,    and have a patch based on 9.4dev<br
/>ready for consideration.<br /><br />This topic has cropped up at irregular intervals over the years,<br />e.g. this
thread back in 2012<br />   <a
href="www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com"
target="_blank">www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com</a><br
/>and this thread more recently<br />  
http://www.postgresql.org/message-id/CAGTBQpaFC_z=zdWVAXD8wWss3v6jxZ5pNmrrYPsD23LbrqGvgQ@mail.gmail.com<br/><br />We
now have an implementation which gives useful performance improvement<br />as well as other advantages compared to what
is currently available,<br />at least for certain environments.<br /><br />Below I am pasting the README we have written
for this new functionality<br />which mentions some of the measurements, advantages (and disadvantages)<br />and we
welcome all and any comments on this.<br /><br />I will send the patch to commitfest later, once this email is posted to
hackers,<br/>so that anyone who wishes can try it,  or apply directly to me if you wish.<br />The patch is currently
based on 9.4dev but a version based on 9.3.4<br />will be available soon if anyone wants that.    The patch is large 
(43 files)<br />so non-trivial to review,   but any comments on it (when posted) will be<br />appreciated and acted
on.   Note that at present the only environment<br />in which it has been applied and tested is linux.<br /><br />John
Lumby   <br />__________________________________________________________________________________________________<br
/><br/><br />Postgresql  --   Extended Prefetching using Asynchronous IO<br
/>============================================================<br/><br />Postgresql currently (9.3.4) provides a
limited prefetching capability<br />using posix_fadvise to give hints to the Operating System kernel<br />about which
pages it expects to read in the near future.<br />This capability is used only during the heap-scan phase of
bitmap-index scans.<br />It is controlled via the effective_io_concurrency configuration parameter.<br /><br />This
capability is now extended in two ways :<br />   .   use asynchronous IO into Postgresql shared buffers as an<br
/>      alternative to posix_fadvise<br />   .   Implement prefetching in other types of scan :<br />            . 
non-bitmap (i.e. simple) index scans - index pages<br />                     currently only for B-tree indexes.<br
/>                   (developed by Claudio Freire <klaussfreire(at)gmail(dot)com>)<br />            .  non-bitmap
(i.e. simple) index scans - heap pages<br />                          currently only for B-tree indexes.<br
/>           .  simple heap scans<br /><br />Posix asynchronous IO is chosen as the function library for asynchronous
IO,<br/>since this is well supported and also fits very well with the model of<br />the prefetching process, 
particularly as regards checking for completion<br />of an asynchronous read.    On linux,   Posix asynchronous IO is
provided<br/>in the librt library.    librt uses independently-schedulable threads to<br />achieve the
asynchronicity,  rather than kernel functionality.<br /><br />In this implementation,  use of asynchronous IO is
limited to prefetching<br />while performing one of the three types of scan<br />            .  B-tree bitmap index scan
- heap pages    (as already exists)<br />            .  B-tree non-bitmap (i.e. simple) index scans - index and heap
pages<br/>            .  simple heap scans<br />on permanent relations.   It is not used on temporary tables nor for
writes.<br/><br />The advantages of Posix asynchronous IO into shared buffers<br />compared to posix_fadvise are :<br
/>  .   Beneficial for non-sequential access patterns as well as sequential<br />   .   No restriction on the kinds of
IO which can be used<br />       (other kinds of asynchronous IO impose restrictions such as<br />        buffer
alignment, use of non-buffered IO).<br />   .   Does not interfere with standard linux kernel read-ahead
functionality.<br/>       (It has been stated in <br
/> www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com<br/>       that
:<br/>          "the kernel stops doing read-ahead when a call to posix_fadvise comes.<br />           I noticed the
performance hit, and checked the kernel's code.<br />           It effectively changes the prediction mode from
sequential to fadvise,<br />           negating the (assumed) kernel's prefetch logic")<br />   .   When the read
request is issued after a prefetch has completed,<br />       there is no delay associated with a kernel call to copy the page
from<br/>       kernel page buffers into the Postgresql shared buffer,<br />       since it is already there.<br
/>      Also,   in a memory-constrained environment,   there is a greater<br />       probability that the prefetched
page will "stick" in memory<br />       since the linux kernel victimizes the filesystem page cache in preference<br
/>      to swapping out user process pages.<br />   .   Statistics on prefetch success can be gathered (see
"Statistics" below)<br />       which helps the administrator to tune the prefetching settings.<br /><br />These
benefits are most likely to be obtained in a system whose usage profile<br />(e.g. from iostat)  shows:<br />     .  
high IO wait from mostly-read activity<br />     .   disk access pattern is not entirely sequential<br />         (so
kernel readahead can't predict it but postgresql can)<br />     .   sufficient spare idle CPU to run the librt
pthreads<br/>         or,  stated another way,    the CPU subsystem is relatively powerful<br />         compared to
the disk subsystem.<br />In such ideal conditions,  and with a workload with plenty of index scans,<br />around 10% -
20% improvement in throughput has been achieved.<br />In an admittedly extreme environment measured by this author,   
with a workload<br />consisting of 8 client applications each running similar complex queries<br />(same query structure
but different predicates and constants),<br />including 2 Bitmap Index Scans and 17 non-bitmap index scans,<br />on a
dual-core Intel laptop (4 hyperthreads) with the database on a single<br />USB3-attached 500GB disk drive, and no part
of the database in filesystem buffers<br />initially,  (filesystem freshly mounted),  comparing unpatched build<br
/>using posix_fadvise with effective_io_concurrency 4 against same build patched<br />with async IO and
effective_io_concurrency 4 and max_async_io_prefetchers 32,<br />elapsed time repeatably improved from around 640-670
seconds to around 530-550 seconds,<br />a 17% - 18% improvement. <br /><br />The disadvantages of Posix asynchronous IO
compared to posix_fadvise are:<br />     .   probably higher CPU utilization:<br />         Firstly, the extra work
performed by the librt threads adds CPU<br />         overhead, and secondly, if the asynchronous prefetching is
effective,<br/>         then it will deliver better (greater) overlap of CPU with IO, which<br />         will reduce
elapsed times and hence increase CPU utilization percentage<br />         still more (during that shorter elapsed
time).<br/>     .   more context switching,  because of the additional threads.<br /><br /><br />Statistics:<br
/>___________<br/><br />A number of additional statistics relating to effectiveness of asynchronous IO<br />are
provided as an extension of the existing pg_stat_statements loadable module.<br />Refer to the appendix "Additional
Supplied Modules" in the current<br />PostgreSQL Documentation for details of this module.<br /><br />The following
additional statistics are provided for asynchronous IO prefetching:<br /><br />    . aio_read_noneed  :   number of
prefetches for which no need for prefetch as block already in buffer pool<br />    . aio_read_discrd  :   number of
prefetches for which buffer not subsequently read and therefore discarded<br />    . aio_read_forgot  :   number of
prefetches for which buffer not subsequently read and then forgotten about<br />    . aio_read_noblok  :   number of
prefetches for which no available BufferAiocb  control block<br />    . aio_read_failed  :   number of aio reads for
which aio itself failed or the read failed with an errno<br />    . aio_read_wasted  :   number of aio reads for which
in-progress aio cancelled and disk block not used<br />    . aio_read_waited  :   number of aio reads for which disk
block used but had to wait for it<br />    . aio_read_ontime  :   number of aio reads for which disk block used and
ready on time when requested<br /><br />Some of these are (hopefully) self-explanatory.    Some additional notes:<br
/><br/>    . aio_read_discrd and aio_read_forgot  :<br />                    prefetch was wasted work since the buffer
was not subsequently read<br />                    The discrd case indicates that the scanner realized this and
discarded the buffer,<br />                    whereas the forgot case indicates that the scanner did not realize it,<br
/>                   which should not normally occur.<br />                    A high number in either suggests
lowering effective_io_concurrency.<br /><br />    . aio_read_noblok  :   <br />                    Any significant
number in relation to all the other numbers indicates that<br />                    max_async_io_prefetchers should be
increased.<br/><br />    . aio_read_waited  :<br />                    The page was prefetched but the asynchronous
read had not completed by the time the<br />                    scanner requested to read it.     This causes extra overhead
in waiting and indicates<br />                    prefetching is not providing much if any benefit.<br
/>                   The disk subsystem may be underpowered/overloaded in relation to the available CPU power.<br /><br
/>   . aio_read_ontime  :<br />                    The page was prefetched and the asynchronous read had completed by
the time the<br />                    scanner requested to read it.     Optimal behaviour.      If this number is
large<br/>                    in relation to all the other numbers except (possibly) aio_read_noneed,<br
/>                   then prefetching is working well.<br /><br />To create the extension with support for these
additional statistics, use the following syntax:<br />     CREATE EXTENSION pg_stat_statements VERSION '1.3'<br />or, 
if you run the new code against an existing database which already has the extension<br />( see installation and
migration below ),  you can <br />     ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'<br /><br />A suggested set of
commands for displaying these statistics might be :<br /><br /> /*  OPTIONALLY */ DROP extension pg_stat_statements;<br
/>                  CREATE extension pg_stat_statements VERSION '1.3';<br /> /*  run your workload   */<br
/>                 select userid , dbid , substring(query from 1 for 24) , calls , total_time , rows , shared_blks_read
, blk_read_time , blk_write_time \<br />                    , aio_read_noneed , aio_read_noblok , aio_read_failed ,
aio_read_wasted , aio_read_waited , aio_read_ontime , aio_read_forgot       \<br />                      from
pg_stat_statements where shared_blks_read > 0;<br /><br /><br />Installation and Build Configuration:<br
/>_____________________________________<br/><br />1. First -  a prerequisite:<br />#  as well as requiring all the usual
package build tools such as gcc , make etc,<br />#  as described in the instructions for building postgresql,<br /># 
the following is required :<br />    gnu autoconf at version 2.69 :<br /># run the following command<br />autoconf -V<br
/># it *must* return<br />autoconf (GNU Autoconf) 2.69<br /><br />2. If you don't have it or it is a different
version,<br/>then you must obtain version 2.69 (which is the current version)<br />from your distribution provider or
from the gnu software download site.<br /><br />3. Also you must have the source tree for postgresql version 9.4
(development version).<br />#   all the following commands assume your current working directory is the top of the
source tree.<br /><br />4. cd to top of source tree :<br />#   check it appears to be a postgresql source tree<br />ls
-ld configure.in src<br />#   should show both the file and the directory<br />grep PostgreSQL COPYRIGHT<br />#   should
show PostgreSQL Database Management System<br /><br />5. Apply the patch :<br />patch -b -p0 -i
<patch_file_path><br/>#   should report no errors, 43 files patched (see list at bottom of this README)<br />#  
and all hunks applied<br />#  check the patch was applied to configure.in<br />ls -ld configure.in.orig configure.in<br
/>#  should show both files<br /><br />6. Rebuild the configure script with the patched configure.in :<br />mv
configure configure.orig;<br />autoconf configure.in >configure;echo "rc= $? from autoconf"; chmod +x configure;<br
/>ls -lrt configure.orig configure;<br /><br />7. run the new configure script :<br />#   if you have run configure
before,<br/>#   then you may first want to save existing config.status and config.log if they exist,<br />#   and then
specify same configure flags and options as you specified before.<br />#   the patch does not alter or extend the set of
configure options<br />#   if unsure,   run ./configure --help<br />#   if still unsure,   run ./configure<br
/>./configure <other configure options as desired><br /><br /><br /><br />8. now check that configure decided that
this environment supports asynchronous IO :<br />grep USE_AIO_ATOMIC_BUILTIN_COMP_SWAP src/include/pg_config.h<br /># 
it should show<br />#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP 1<br />#  if not,  apparently your environment does not
support asynch IO  -<br />#  the config.log will show how it came to that conclusion,<br />#  also check for :<br />#   
. a librt.so somewhere in the loader's library path (probably under /lib , /lib64 , or /usr)<br />#    . your gcc must
support the atomic compare_and_swap __sync_bool_compare_and_swap built-in function<br />#  do not proceed without this
define being set.<br /><br />9. do you want to use the new code on an existing cluster<br />   that was created using
the same code base but without the patch?<br />   If so then run this nasty-looking command :<br />   (cut-and-paste it
into a terminal window or a shell-script file)<br />   Otherwise continue to step 10.<br />   see Migration note below
for explanation.<br />###############################################################################################<br
/>  fl=src/Makefile.global; typeset -i bkx=0; while [[ $bkx -lt 200 ]]; do {<br />       bkfl="${fl}.bak${bkx}"; if [[
-a ${bkfl} ]]; then ((bkx=bkx+1)); else break; fi;<br />   }; done;<br />   if [[ -a ${bkfl} ]]; then echo "sorry cannot
find a backup name for $fl";<br />   elif [[ -a $fl ]]; then {<br />       mv $fl $bkfl && {<br />          sed
-e "/^CFLAGS =/ s/\$/ -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO/" $bkfl > $fl;<br />          str="diff -w $bkfl
$fl";echo "$str"; eval "$str";<br />       };<br />   };<br />   else echo "ooopppss $fl is missing";<br />   fi;<br
/>###############################################################################################<br/># it should
report something like<br />diff -w Makefile.global.bak0 Makefile.global<br />222c222<br />< CFLAGS = XXXX<br />---<br
/>> CFLAGS = XXXX -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO<br />#   where XXXX is some set of flags<br /><br /><br />10.
now run the rest of the build process as usual  -<br />    follow instructions in file INSTALL if that file exists,<br
/>   else e.g. run<br />make && make install<br /><br />If the build fails with the following error:<br
/>undefined reference to `aio_init'<br />Then edit the following file<br />src/include/pg_config_manual.h<br />and add
the following line at the bottom:<br /><br />#define DONT_HAVE_AIO_INIT<br /><br />and then run<br />make clean
&& make && make install<br />See notes to section Runtime Configuration below for more information on
this.<br/><br /><br /><br />Migration , Runtime Configuration, and Use:<br
/>___________________________________________<br/><br /><br />Database Migration:<br />___________________<br /><br
/>The new prefetching code for non-bitmap index scans introduces a new btree-index<br />function named
btpeeknexttuple.   The correct way to add such a function involves<br />also adding it to the catalog as an internal
function in pg_proc.<br />However,  this results in the new built code considering an existing database to be<br
/>incompatible, i.e. requiring backup on the old code and restore on the new.<br />This is normal behaviour for
migration to a new version of postgresql,  and is<br />also a valid way of migrating a database for use with this
asynchronous IO feature,<br />but in this case it may be inconvenient.<br /><br />As an alternative,    the new code may
be compiled with the macro define<br />AVOID_CATALOG_MIGRATION_FOR_ASYNCIO<br />which does what it says by not altering
the catalog.   The patched build can then<br />be run against an existing database cluster initdb'd using the unpatched
build.<br/><br />There are no known ill-effects of so doing,  but :<br />     .  in any case,  it is strongly suggested
to make a backup of any precious database<br />        before accessing it with a patched build<br />     .  be aware
that if this asynchronous IO feature is eventually released as part of postgresql,<br />        migration will probably
be required anyway.<br /><br />This option to avoid catalog migration is intended as a convenience for a quick test,<br
/>and also makes it easier to obtain performance comparisons on the same database.<br /><br /><br /><br />Runtime
Configuration:<br/>______________________<br /><br />One new configuration parameter settable in postgresql.conf and<br
/>in any other way as described in the postgresql documentation :<br /><br />max_async_io_prefetchers<br />  Maximum
number of background processes concurrently using asynchronous<br />  librt threads to prefetch pages into shared memory
buffers<br/><br />This number can be thought of as the maximum number<br />of librt threads concurrently active,   each
working on a list of<br />from 1 to target_prefetch_pages pages ( see notes 1 and 2 ).<br /><br />In practice,    this
number simply controls how many prefetch requests in total<br />may be active concurrently :<br />       
max_async_io_prefetchers * target_prefetch_pages ( see note 1)<br /><br />default is max_connections/6<br />and recall
that the default for max_connections is 100<br /><br /><br />note 1  target_prefetch_pages is a number based on effective_io_concurrency and
approximately n * ln(n)<br />        where n is effective_io_concurrency<br /><br />note 2  Provided that the gnu
extension to Posix AIO which provides the<br />aio_init() function is present,   then aio_init() is called<br />to set
the librt maximum number of threads to max_async_io_prefetchers,<br />and to set the maximum number of concurrent aio
read requests to the product of<br />        max_async_io_prefetchers * target_prefetch_pages<br /><br /><br />As well
as this regular configuration parameter,<br />there are several other parameters that can be set via environment
variable.<br/>The reason why they are environment vars rather than regular configuration parameters<br />is that it is
not expected that they should need to be set,   but they may be useful :<br />                variable name        
values                 default        meaning<br />   PG_TRY_PREFETCHING_FOR_BITMAP      [Y|N]                   
Y        whether to prefetch bitmap heap scans<br />   PG_TRY_PREFETCHING_FOR_ISCAN       [Y|N|integer[,[N|Y]]]  
256,N     whether to prefetch  non-bitmap index scans<br
/>                                                                   also numeric size of list of prefetched blocks<br
/>                                                                   also whether to prefetch
forward-sequential-pattern index pages<br />   PG_TRY_PREFETCHING_FOR_BTREE       [Y|N]                    Y        
whether to prefetch heap pages in non-bitmap index scans<br />   PG_TRY_PREFETCHING_FOR_HEAP       
[Y|N]                   N         whether to prefetch relation (un-indexed) heap scans<br /><br /><br />The setting for
PG_TRY_PREFETCHING_FOR_ISCAN is a little complicated.<br />It can be set to Y or N to control prefetching of  non-bitmap
index scans;<br />But in addition it can be set to an integer,   which both implies Y<br />and also sets the size of a
list used to remember prefetched but unread heap pages.<br />This list is an optimization used to avoid re-prefetching
and maximise the potential<br />set of prefetchable blocks indexed by one index page.<br />And if set to an integer, 
this integer may be followed by either ,Y or ,N<br />to specify to prefetch index pages which are being accessed
forward-sequentially.<br/>It has been found that prefetching is not of great benefit for this access pattern,<br />and
so it is not the default,  but also does no harm (provided sufficient CPU capacity).<br /><br /><br /><br />Usage :<br
/>______<br/><br /><br />There are no changes in usage other than as noted under Configuration and Statistics.<br
/>However,  in order to assess benefit from this feature,   it will be useful to<br />understand the query access plans
of your workload using EXPLAIN.    Before doing that,<br />make sure that statistics are up to date using ANALYZE.<br
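/><br />To assess benefit from the feature, the aio_read_* counters described under "Statistics" can be reduced to one ratio plus a tuning hint. The following is a minimal sketch in C and is not part of the patch: the struct, enum and function names are hypothetical, and the 25% threshold is an arbitrary assumption chosen only for illustration.<br />

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the aio_read_* counters of one statement,
 * e.g. as copied out of a pg_stat_statements row. */
typedef struct AioReadStats {
    long noneed, discrd, forgot, noblok, failed, wasted, waited, ontime;
} AioReadStats;

/* Hypothetical hint codes, one per note in the "Statistics" section. */
typedef enum AioHint {
    AIO_HINT_NONE,                   /* nothing stands out */
    AIO_HINT_RAISE_MAX_PREFETCHERS,  /* many aio_read_noblok */
    AIO_HINT_LOWER_IO_CONCURRENCY,   /* many aio_read_discrd + aio_read_forgot */
    AIO_HINT_DISK_TOO_SLOW,          /* many aio_read_waited */
    AIO_HINT_NO_DATA                 /* no prefetch activity recorded */
} AioHint;

/* Sum of all counters except aio_read_noneed. */
static long aio_total(const AioReadStats *s)
{
    assert(s != NULL);
    return s->discrd + s->forgot + s->noblok + s->failed
         + s->wasted + s->waited + s->ontime;
}

/* Fraction of prefetches whose block was used and ready when requested. */
double aio_ontime_fraction(const AioReadStats *s)
{
    long total = aio_total(s);
    return total > 0 ? (double) s->ontime / (double) total : 0.0;
}

/* Pick a hint; the 25% cut-off is an assumption, not a measured value. */
AioHint aio_tuning_hint(const AioReadStats *s)
{
    long total = aio_total(s);
    double limit;

    if (total == 0)
        return AIO_HINT_NO_DATA;
    limit = 0.25 * (double) total;
    if ((double) s->noblok > limit)
        return AIO_HINT_RAISE_MAX_PREFETCHERS;
    if ((double) (s->discrd + s->forgot) > limit)
        return AIO_HINT_LOWER_IO_CONCURRENCY;
    if ((double) s->waited > limit)
        return AIO_HINT_DISK_TOO_SLOW;
    return AIO_HINT_NONE;
}
```

aio_read_noneed is left out of the denominator because the README excludes it when judging whether prefetching is working well.<br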
/><br/><br /><br />Internals:<br />__________<br /><br /><br />Internal changes span two areas and the interface
between them :<br /><br /> .  buffer manager layer<br /> .  programming interface for scanner to call buffer manager<br
/> . scanner layer<br /><br /> .  buffer manager layer<br />    ____________________<br /><br />    changes comprise
:<br/>       .   allocating,  pinning , unpinning buffers<br />            this is complex and discussed briefly below
in "Buffer Management"<br />       .   acquiring and releasing a BufferAiocb, the control block<br />           
associated with a single aio_read,  and checking for its completion<br />            a new file, 
backend/storage/buffer/buf_async.c, provides three new functions,<br />                  BufStartAsync       
BufReleaseAsync           BufCheckAsync<br />            which handle this.<br />       .   calling librt asynch io
functions<br/>            this follows the example of all other filesystem interfaces<br />            and is
straightforward.   <br />            two new functions are provided in fd.c:<br />                  
FileStartaio       FileCompleteaio<br />            and corresponding interfaces in smgr.c<br /><br /> .  programming
interface for scanner to call buffer manager<br />    ________________________________________________________<br
/>      . calling interface for existing function PrefetchBuffer is modified :<br />           .  one new argument,  
BufferAccessStrategy strategy<br />           .  now returns an int return code which indicates :<br
/>                    whether pin count on buffer has been increased by 1<br />                     whether block was
already present in a buffer<br />       .  new function DiscardBuffer<br />           .  discard buffer used for a
previously prefetched page<br />                 which scanner decides it does not want to read.<br />           .  same
arguments as for PrefetchBuffer except for omission of BufferAccessStrategy<br />           .  note - this is different
from the existing function ReleaseBuffer<br />                     in that ReleaseBuffer takes a buffer_descriptor as
argument<br/>                     for a buffer which has been read, but has similar purpose.<br /><br /> .  scanner
layer<br/>    _____________<br />        common to all scanners is that the scanner which wishes to prefetch must do
two things:<br />          .  decide which pages to prefetch and call PrefetchBuffer to prefetch them<br
/>                nodeBitmapHeapscan already does this (but note one extra argument on PrefetchBuffer)<br />         
. remember which pages it has prefetched in some list (actual or conceptual,  e.g. a page range),<br />                
removing each page from this list if and when it subsequently reads the page.<br />          .  at end of scan,  call
DiscardBuffer for every remembered (i.e. prefetched but unread) page<br />       how this list of prefetched pages is
implemented varies for each of the three scanners and four scan types:<br />            .  bitmap index scan - heap
pages<br/>            .  non-bitmap (i.e. simple) index scans - index pages<br />            .  non-bitmap (i.e.
simple) index scans - heap pages<br />            .  simple heap scans<br />       The consequences of forgetting to
call DiscardBuffer on a prefetched but unread page are:<br />            .   counted in aio_read_forgot  (see
"Statistics" above)<br />            .   may incur an annoying but harmless warning in the pg_log "Buffer Leak ... "<br
/>                 (the buffer is released at commit)<br />       This does sometimes happen ...<br />     <br /><br
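/><br />The bookkeeping required of a scanner (remember each prefetched page, drop it from the list when it is read, and discard whatever is left at end of scan) can be sketched as follows. This is an illustrative toy in C, not code from the patch: PrefetchList, BlockNum, DiscardFn and the prefetch_list_* functions are hypothetical names, the callback merely stands in for DiscardBuffer, and the fixed capacity of 256 is an assumption.<br />

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PREFETCH_LIST_MAX 256          /* assumed capacity, for illustration */

typedef unsigned int BlockNum;         /* stand-in for a block number type */

typedef struct PrefetchList {
    BlockNum blocks[PREFETCH_LIST_MAX];
    size_t   count;
} PrefetchList;

/* Stand-in for DiscardBuffer: called once per prefetched-but-unread block. */
typedef void (*DiscardFn)(BlockNum blk, void *arg);

/* Remember a block right after prefetching it; false if the list is full. */
bool prefetch_list_remember(PrefetchList *pl, BlockNum blk)
{
    assert(pl != NULL);
    if (pl->count >= PREFETCH_LIST_MAX)
        return false;
    pl->blocks[pl->count++] = blk;
    return true;
}

/* The scanner has now read the block: remove it from the list if present. */
bool prefetch_list_forget(PrefetchList *pl, BlockNum blk)
{
    assert(pl != NULL);
    for (size_t i = 0; i < pl->count; i++) {
        if (pl->blocks[i] == blk) {
            pl->blocks[i] = pl->blocks[--pl->count];  /* swap-remove */
            return true;
        }
    }
    return false;
}

/* End of scan: discard every block still remembered; returns how many. */
size_t prefetch_list_discard_all(PrefetchList *pl, DiscardFn discard, void *arg)
{
    assert(pl != NULL && discard != NULL);
    size_t n = pl->count;
    for (size_t i = 0; i < n; i++)
        discard(pl->blocks[i], arg);
    pl->count = 0;
    return n;
}

/* Example callback that only counts calls; arg points to a size_t counter. */
void count_discard(BlockNum blk, void *arg)
{
    (void) blk;
    (*(size_t *) arg)++;
}
```

A linear array is used only to keep the sketch short; as the README notes, the actual list may be conceptual (for example a page range) and differs per scan type.<br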
/><br/>Buffer Management<br />_________________<br /><br />With async io,   PrefetchBuffer must allocate and pin a
buffer, which is relatively straightforward,<br />but also every other part of buffer manager must know about the
possibility that a buffer may be in<br />a state of async_io_in_progress and be prepared to determine the possible
completion.<br/>That is,  one backend BK1 may start the io but another BK2 may try to read it before BK1 does.<br
/>Posix Asynchronous IO provides a means for waiting on this or another task's read if in progress,<br />namely
aio_suspend(), which this extension uses.    Therefore,  although StartBufferIO and TerminateBufferIO<br />are called
as part of asynchronous prefetching,   their role is limited to maintaining the buffer descriptor flags,<br />and they
do not track the asynchronous IO itself.    Instead,   asynchronous IOs are tracked in<br />a separate set of shared
control blocks,  the BufferAiocb list -<br />refer to     include/storage/buf_internals.h<br />Checking asynchronous io
status is handled in  backend/storage/buffer/buf_async.c BufCheckAsync function.<br />Read the commentary for this
function for more details.<br /><br />Pinning and unpinning of buffers is the most complex aspect of asynch io
prefetching,<br/>and the logic is spread throughout BufStartAsync , BufCheckAsync , and many functions in bufmgr.c.<br
/>When a backend BK2 requests ReadBuffer of a page for which asynch read is in progress,<br />buffer manager has to
determine which backend BK1 pinned this buffer during previous PrefetchBuffer,<br />and for example the buffer must not be
re-pinned a second time if BK2 is BK1.<br />Information concerning which backend initiated the prefetch is held in the
BufferAiocb.<br/><br />The trickiest case concerns the scenario in which :<br />   .  BK1 initiates prefetch and
acquires a pin<br />   .  BK2 possibly waits for completion and then reads the buffer,  and perhaps later on<br
/>        releases it by ReleaseBuffer.<br />   .  Since the asynchronous IO is no longer in progress,     there is no
longer any<br />         BufferAiocb associated with it.    Yet buffer manager must remember that BK1 holds a<br
/>        "prefetch" pin, i.e. a pin which must not be repeated if and when BK1 finally issues ReadBuffer.<br />   . 
The solution to this problem is to invent the concept of a "banked" pin,<br />      which is a pin obtained when
prefetch was issued,   identified as in "banked" status only if and when<br />      the associated asynchronous IO
terminates, and redeemable by the next use by same task,<br />      either by ReadBuffer or DiscardBuffer.<br />     
The pid of the backend which holds a banked pin on a buffer (there can be at most one such backend)<br />      is stored
in the buffer descriptor.<br />      This is done without increasing size of the buffer descriptor,  which is important
since<br/>      there may be a very large number of these.     This does overload the relevant field in the
descriptor.<br/>      Refer to include/storage/buf_internals.h for more details<br />      and search for
BM_AIO_PREFETCH_PIN_BANKED in storage/buffer/bufmgr.c and  backend/storage/buffer/buf_async.c<br /><br
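/><br />The "banked" pin lifecycle described above can be illustrated with a toy state model. This is a sketch under stated assumptions, not the patch's data structures: ToyBuffer and the toy_* functions are hypothetical, waiting for the asynchronous read is reduced to a direct call, and a pid of 0 is assumed to mean "no backend".<br />

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for one shared buffer; all field names are hypothetical. */
typedef struct ToyBuffer {
    int  pin_count;        /* total pins held on the buffer */
    bool aio_in_progress;  /* an asynchronous read is outstanding */
    int  aio_owner_pid;    /* prefetching backend while the read is in progress
                              (the role played by the BufferAiocb) */
    int  banked_pid;       /* backend holding a banked pin, 0 if none */
} ToyBuffer;

/* Prefetch: the initiating backend acquires a pin and starts the read. */
void toy_prefetch(ToyBuffer *b, int pid)
{
    assert(!b->aio_in_progress && b->banked_pid == 0);
    b->pin_count++;
    b->aio_in_progress = true;
    b->aio_owner_pid = pid;
}

/* Read completes: the prefetch pin becomes "banked" for its owner. */
void toy_aio_complete(ToyBuffer *b)
{
    if (!b->aio_in_progress)
        return;
    b->aio_in_progress = false;
    b->banked_pid = b->aio_owner_pid;
    b->aio_owner_pid = 0;
}

/* ReadBuffer-like call: redeem a banked pin, otherwise take a new pin. */
void toy_read(ToyBuffer *b, int pid)
{
    toy_aio_complete(b);          /* stands in for waiting on the read */
    if (b->banked_pid == pid)
        b->banked_pid = 0;        /* redeemed: the prefetch pin is reused */
    else
        b->pin_count++;
}

/* DiscardBuffer-like call: redeem the banked pin and drop it at once. */
void toy_discard(ToyBuffer *b, int pid)
{
    toy_aio_complete(b);
    if (b->banked_pid == pid) {
        b->banked_pid = 0;
        b->pin_count--;
    }
}

/* ReleaseBuffer-like call: drop one ordinary pin. */
void toy_release(ToyBuffer *b)
{
    assert(b->pin_count > 0);
    b->pin_count--;
}
```

In the trickiest case above, BK1 prefetches, BK2 reads and releases, and BK1 then reads: the pin count goes 1, 2, 1 and stays at 1 when BK1 redeems its banked pin instead of pinning again.<br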
/>______________________________________________________________________________<br/>The following 43 files are changed
in this feature (output of the patch command) :<br /><br />patching file configure.in<br />patching file
contrib/pg_stat_statements/pg_stat_statements--1.3.sql<br/>patching file contrib/pg_stat_statements/Makefile<br
/>patching file contrib/pg_stat_statements/pg_stat_statements.c<br />patching file
contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql<br/>patching file config/c-library.m4<br />patching file
src/backend/postmaster/postmaster.c<br/>patching file src/backend/executor/nodeBitmapHeapscan.c<br />patching file
src/backend/executor/nodeIndexscan.c<br/>patching file src/backend/executor/instrument.c<br />patching file
src/backend/storage/buffer/Makefile<br/>patching file src/backend/storage/buffer/bufmgr.c<br />patching file
src/backend/storage/buffer/buf_async.c<br/>patching file src/backend/storage/buffer/buf_init.c<br />patching file
src/backend/storage/smgr/md.c<br/>patching file src/backend/storage/smgr/smgr.c<br />patching file
src/backend/storage/file/fd.c<br/>patching file src/backend/storage/lmgr/proc.c<br />patching file
src/backend/access/heap/heapam.c<br/>patching file src/backend/access/heap/syncscan.c<br />patching file
src/backend/access/index/indexam.c<br/>patching file src/backend/access/index/genam.c<br />patching file
src/backend/access/nbtree/nbtsearch.c<br/>patching file src/backend/access/nbtree/nbtinsert.c<br />patching file
src/backend/access/nbtree/nbtpage.c<br/>patching file src/backend/access/nbtree/nbtree.c<br />patching file
src/backend/nodes/tidbitmap.c<br/>patching file src/backend/utils/misc/guc.c<br />patching file
src/backend/utils/mmgr/aset.c<br/>patching file src/include/executor/instrument.h<br />patching file
src/include/storage/bufmgr.h<br/>patching file src/include/storage/smgr.h<br />patching file
src/include/storage/fd.h<br/>patching file src/include/storage/buf_internals.h<br />patching file
src/include/catalog/pg_am.h<br/>patching file src/include/catalog/pg_proc.h<br />patching file
src/include/pg_config_manual.h<br/>patching file src/include/access/nbtree.h<br />patching file
src/include/access/heapam.h<br/>patching file src/include/access/relscan.h<br />patching file
src/include/nodes/tidbitmap.h<br/>patching file src/include/utils/rel.h<br />patching file
src/include/pg_config.h.in<br/><br /><br />Future Possibilities:<br />____________________<br /><br />There are several
possible extensions of this feature :<br />   .   Extend prefetching of index scans to types of index<br />       other
than B-tree.<br />       This should be fairly straightforward,  but requires some<br />       good base of
benchmarkable workloads to prove the value.<br />   .   Investigate why asynchronous IO prefetching does not greatly<br
/>      improve sequential relation heap scans and possibly find how to<br />       achieve a benefit.<br />   .  
Build knowledge of asynchronous IO prefetching into the<br />       Query Planner costing.<br />       This is far from
straightforward.   The Postgresql Query Planner's<br />       costing model is based on resource consumption rather
than elapsed time.<br />       Use of asynchronous IO prefetching is intended to improve elapsed time<br />       at the
expense of (probably) higher resource consumption.<br />       Although Costing understands about the reduced cost of
reading buffered<br />       blocks, it does not take asynchronicity or overlap of CPU with disk<br />       into
account. A naive approach might be to try to tweak the Query<br />       Planner's Cost Constant configuration
parameters<br/>       such as seq_page_cost , random_page_cost<br />       but this is hazardous as explained in the
Documentation.<br/><br /><br /><br />John Lumby,  johnlumby(at)hotmail(dot)com<br /><br /></div> 
