Avoiding redundant fetches of btree index metapages
От | Tom Lane |
---|---|
Тема | Avoiding redundant fetches of btree index metapages |
Дата | |
Msg-id | 22591.1145987035@sss.pgh.pa.us обсуждение исходный текст |
Ответы |
Re: Avoiding redundant fetches of btree index metapages
(Alvaro Herrera <alvherre@commandprompt.com>)
Re: Avoiding redundant fetches of btree index metapages (Simon Riggs <simon@2ndquadrant.com>) |
Список | pgsql-hackers |
Some experimentation over the weekend turned up the result that $SUBJECT is a good idea. I had never thought about this much before, figuring that in a heavily-used index the metapage would stay in cache anyway and so fetching it at the start of each index search isn't costing any extra I/O. That's true, but what it does cost is bufmgr contention, and in fact more than you'd expect because the amount of work actually done before releasing the page again is minuscule. (See off-list discussion attached below.) I'm working on cleaning up the patch shown below to make it really usable. The thoughts I have are: * Re-using rd_targblock was OK for a quick hack because we don't use it for indexes, but it's too klugy for real use, and it's not quite enough info anyway (we really ought to cache the root level as well as root block number). I'm planning to add a "void *" pointer to the Relation struct that the index AM is allowed to use as it sees fit, with the understanding that any pointed-to data lives in rd_indexcxt and the pointer will be reset to NULL on any relcache clear. btree would store a copy of the BTMetaPageData struct. The other AMs might have some other use for this. * The tricky part of caching this data is that we might try to use a very stale (many transactions old) root pointer, if a relcache entry sits unused for a long time. Btree's internal vacuuming protocol ensures that we won't delete a page that any active transaction is "in flight" to, but that logic won't protect a backend that tries to use an old cached pointer. So we could arrive at a page that is deleted, or has been deleted and then recycled to be a valid page (but not the one we want) or worst case ReadBuffer fails entirely because the pointer points past the current physical index end. "Deleted" isn't a problem because we can recognize such a page and fall back to fetching the metapage, but the other cases are nastier. What I'm considering doing to fix that is require any change to a btree's metapage data to send out a relcache inval message for the index. That will force all backends to flush the stale cache item not later than the start of their next transaction, and thereby guarantee that they aren't using pointers that are too old to be safe against vacuuming. (There are other ways we could handle this, but that one seems like the simplest and least invasive.) Comments? Anyone see any flaws in the reasoning? regards, tom lane ------- Forwarded Messages Date: Sun, 23 Apr 2006 13:30:48 -0400 From: Tom Lane <tgl@sss.pgh.pa.us> To: Gavin Hamill <gdh@acentral.co.uk> cc: Simon Riggs <simon@2ndquadrant.com>, jonah.harris@gmail.com Subject: Re: [HACKERS] Further reduction of bufmgr lock contention I'm still not feeling very good, so I'll pass on trying to do anything with partitioning the BufMappingLock right now --- Simon, if you feel like having a go at it, be my guest. I did think of something simple to try, though, after noting that close to a quarter of the ReadBuffer traffic in Gavin's test comes from fetching btree index metapages. Apparently the indexes are not too large and mostly have just one level below the root, so the typical index lookup sequence is metapage -> root -> leaf -> heap. It occurs to me that we could probably cache the root-page location and thereby avoid fetching the metapage at all in most cases. This won't save any actual I/O, since the metapage would've stayed in cache anyway, but what it does save is trips to the buffer manager. Since bufmgr contention is exactly our problem here, I thought that a 25% reduction in accesses might lead to better-than-25% improvement. An extremely quick and dirty patch for this is attached; it's not nearly ready for prime time, but it's enough to check performance, and what I see is 8.1.3: Concurrency Level: 4 Time taken for tests: 149.467 seconds Complete requests: 200 Failed requests: 0 Broken pipe errors: 0 Total transferred: 58600 bytes HTML transferred: 0 bytes Requests per second: 1.34 [#/sec] (mean) Time per request: 2989.34 [ms] (mean) Time per request: 747.34 [ms] (mean, across all concurrent requests) Transfer rate: 0.39 [Kbytes/sec] received Connnection Times (ms) min mean[+/-sd] median max Connect: 0 0 0.0 0 0 Processing: 868 2962 2163.3 2540 26458 Waiting: 868 2961 2163.3 2540 26457 Total: 868 2962 2163.3 2540 26458 Percentage of the requests served within a certain time (ms) 50% 2540 66% 3017 75% 3409 80% 3556 90% 4567 95% 5406 98% 8357 99% 9569100% 26458 (last request) With cache-the-rootpage-location hack: Concurrency Level: 4 Time taken for tests: 99.063 seconds Complete requests: 200 Failed requests: 0 Broken pipe errors: 0 Total transferred: 58600 bytes HTML transferred: 0 bytes Requests per second: 2.02 [#/sec] (mean) Time per request: 1981.26 [ms] (mean) Time per request: 495.32 [ms] (mean, across all concurrent requests) Transfer rate: 0.59 [Kbytes/sec] received Connnection Times (ms) min mean[+/-sd] median max Connect: 0 0 0.0 0 0 Processing: 884 1966 523.5 1922 3760 Waiting: 884 1965 523.5 1922 3760 Total: 884 1966 523.5 1922 3760 Percentage of the requests served within a certain time (ms) 50% 1922 66% 2126 75% 2259 80% 2376 90% 2584 95% 2904 98% 3479 99% 3722100% 3760 (last request) So this looks like something worth pursuing. regards, tom lane *** REL8_1/src/backend/access/nbtree/nbtpage.c.orig Tue Nov 22 16:20:01 2005 --- REL8_1/src/backend/access/nbtree/nbtpage.c Sun Apr 23 13:05:35 2006 *************** *** 163,168 **** --- 163,192 ---- uint32 rootlevel; BTMetaPageData *metad; + /* Try to use cached root-page location */ + if (BlockNumberIsValid(rel->rd_targblock)) + { + rootblkno = rel->rd_targblock; + Assert(rootblkno != P_NONE); + + rootbuf = _bt_getbuf(rel, rootblkno, BT_READ); + rootpage = BufferGetPage(rootbuf); + rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage); + + /* + * Don't check P_ISROOT, because it's not set in a "fast root". + * Instead, check that page has no siblings. + */ + if (!P_IGNORE(rootopaque) && + P_LEFTMOST(rootopaque) && + P_RIGHTMOST(rootopaque)) + { + return rootbuf; + } + + _bt_relbuf(rel, rootbuf); + } + metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ); metapg = BufferGetPage(metabuf); metaopaque = (BTPageOpaque)PageGetSpecialPointer(metapg); *************** *** 310,315 **** --- 334,341 ---- rootblkno, RelationGetRelationName(rel), rootopaque->btpo.level, rootlevel); } + + rel->rd_targblock = rootblkno; /* * By here, we have a pin and read lock on the root page, and no lock set ------- Message 2 Date: Sun, 23 Apr 2006 19:05:35 -0400 From: Tom Lane <tgl@sss.pgh.pa.us> To: Simon Riggs <simon@2ndquadrant.com> cc: Gavin Hamill <gdh@acentral.co.uk>, jonah.harris@gmail.com Subject: Re: [HACKERS] Further reduction of bufmgr lock contention Simon Riggs <simon@2ndquadrant.com> writes: > Results look much much better than I would have guessed though. It surprised me too that it worked so well, but after further thought maybe it's not so surprising. In the metapage -> root -> leaf -> heap sequence, the only page the backend will spend a trivial amount of time processing is the metapage --- there is useful work to do at each other page. Now obviously we don't have a contention problem if CPU effort per page fetched is high; the backends would spend their time doing useful work instead of contending. So the metapage fetch is contributing more than a 25% share of bufmgr accesses per CPU cycle, and that's why getting rid of it is worth more than 25%. This also suggests that it'll be worth doing even for larger indexes with two or three levels below the root. Or at least that analysis seems good to me at the moment ... regards, tom lane ------- End of Forwarded Messages
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Martijn van OosterhoutДата:
Сообщение: Re: Catalog Access (was: [GENERAL] Concurrency problem building indexes)
Следующее
От: Alvaro HerreraДата:
Сообщение: Re: Avoiding redundant fetches of btree index metapages