Re: Pluggable Storage - Andres's take
From: Andres Freund
Subject: Re: Pluggable Storage - Andres's take
Date:
Msg-id: 20190425224315.mzgusflgimaogyqb@alap3.anarazel.de
In response to: Re: Pluggable Storage - Andres's take (Andres Freund <andres@anarazel.de>)
Responses: Re: Pluggable Storage - Andres's take (Ashwin Agrawal <aagrawal@pivotal.io>)
           Re: Pluggable Storage - Andres's take (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi Heikki, Ashwin, Tom,

On 2019-04-23 15:52:01 -0700, Andres Freund wrote:
> On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> > index_update_stats() calls RelationGetNumberOfBlocks(<table>). If the AM
> > doesn't use normal data files, that won't work. I bumped into that with my
> > toy implementation, which wouldn't need to create any data files, if it
> > wasn't for this.
>
> There are a few more of these:
> I'm not sure about doing so for v12 though. 1) and 3) are fairly
> trivial, but 2) would involve changing the FDW interface, by changing
> the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH,
> we're not even in beta1.

Hm. I think some of those changes would be a bit bigger than I initially
thought. Attached is a more minimal fix that'd route
RelationGetNumberOfBlocksForFork() through tableam if necessary. I think
it's definitely the right answer for 1), probably the pragmatic answer
to 2), but certainly not for 3).

I've for now made the AM return the size in bytes, and then convert that
into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
are going to continue to want it internally as pages (otherwise there's
going to be way too much churn, without a benefit I can see), so I think
that's OK.

There's also a somewhat weird bit of returning the total relation size
for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
postgres' current forks, but would have equivalent concepts, and without
that there'd be no way to get that size. I'm not sure I like this; input
welcome. But it seems good to offer the ability to get the entire size
somehow.

Btw, isn't RelationGetNumberOfBlocksForFork() currently weirdly placed?
I don't see why bufmgr.c would be appropriate. Although I don't think
it's particularly clear where it'd best reside - I'd tentatively say
storage.c.

Heikki, Ashwin, your input would be appreciated here, in particular on
the tid fetch bit below.
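The shape being described - the AM reports a fork's size in bytes, the generic layer rounds up to blocks, and InvalidForkNumber means "total relation size" - can be sketched outside the tree. Everything below is an illustrative stand-in (mock names, not the actual patch or PostgreSQL's real definitions):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins, not PostgreSQL's real definitions. */
#define BLCKSZ 8192

typedef enum
{
    InvalidForkNumber = -1,
    MAIN_FORKNUM = 0,
    FSM_FORKNUM,
    VM_FORKNUM
} ForkNumber;

struct MockRelation;

/* Hypothetical AM callback: size of one fork in *bytes*;
 * InvalidForkNumber asks for the total across all forks. */
typedef uint64_t (*relation_size_fn) (const struct MockRelation *rel,
                                      ForkNumber forknum);

typedef struct MockRelation
{
    relation_size_fn relation_size; /* per-AM callback */
} MockRelation;

/*
 * Caller-side wrapper: most of the backend wants pages, so the byte
 * count from the AM is rounded up to whole blocks here, once, rather
 * than in every caller.
 */
static uint64_t
mock_RelationGetNumberOfBlocksForFork(const MockRelation *rel,
                                      ForkNumber forknum)
{
    uint64_t nbytes = rel->relation_size(rel, forknum);

    return (nbytes + BLCKSZ - 1) / BLCKSZ;
}

/* A toy AM: main fork is 3 full blocks plus a partial one. */
static uint64_t
toy_am_relation_size(const MockRelation *rel, ForkNumber forknum)
{
    static const uint64_t sizes[] = {3 * BLCKSZ + 100, BLCKSZ, BLCKSZ};

    (void) rel;
    if (forknum == InvalidForkNumber)
        return sizes[0] + sizes[1] + sizes[2];
    return sizes[forknum];
}
```

Note how the partial block rounds up (3 blocks + 100 bytes reports as 4 pages), matching what block-based callers expect.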
The attached patch isn't intended to be applied as-is, just as a basis
for discussion.

> 1) index_update_stats(), computing pg_class.relpages
>
> Feels like the number of both heap and index blocks should be
> computed by the index build and stored in IndexInfo. That'd also get
> a bit closer towards allowing indexams not going through smgr (useful
> e.g. for memory only ones).

Due to parallel index builds that'd actually be hard. Given the number
of places wanting to compute relpages for pg_class, I think the above
patch routing RelationGetNumberOfBlocksForFork() through tableam is the
right fix.

> 2) commands/analyze.c, computing pg_class.relpages
>
> This should imo be moved to the tableam callback. It's currently done
> a bit weirdly imo, with fdws computing relpages in the callback, but
> then also returning the acquirefunc. Seems like it should entirely be
> computed as part of calling acquirefunc.

Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
tableam wouldn't be the right minimal approach too. It has the
disadvantage of implying certain values for the
RelationGetNumberOfBlocksForFork(MAIN) return value. The alternative
would be to return the desired sampling range in
table_beginscan_analyze() - but that'd require some duplication, because
currently that just uses the generic scan_begin() callback.

I suspect - as previously mentioned - that we're going to have to extend
statistics collection beyond the current approach at some point, but I
don't think that's now. At least to me it's not clear how to best
represent the stats, and how to best use them, if the underlying storage
is fundamentally not block based. Nor how we'd avoid code duplication...

> 3) nodeTidscan, skipping over too large tids
>
> I think this should just be moved into the AMs, there's no need to
> have this in nodeTidscan.c

I think here it's *not* actually correct at all to use the relation
size.
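The trade-off in 2) can be made concrete with a small standalone sketch. All names below are hypothetical mocks, not commands/analyze.c itself; the point is only that deriving relpages from the AM's size callback implicitly fixes the sampled block range to [0, relpages):

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192

/* Hypothetical AM hook: size of the main fork in bytes. */
typedef struct
{
    uint64_t (*relation_size_bytes) (void);
} MockTableAm;

/*
 * The minimal route discussed: ANALYZE derives pg_class.relpages from
 * the AM's size callback rather than reading smgr directly.
 */
static uint64_t
mock_analyze_relpages(const MockTableAm *am)
{
    uint64_t nbytes = am->relation_size_bytes();

    return (nbytes + BLCKSZ - 1) / BLCKSZ;
}

/*
 * Block sampling is then bounded by that same value - which is exactly
 * the implied constraint on the RelationGetNumberOfBlocksForFork(MAIN)
 * return value. The alternative (returning the desired sampling range
 * from table_beginscan_analyze()) would decouple the two, at the cost
 * of some duplication in AMs whose storage isn't block based.
 */
static int
mock_block_is_sampleable(const MockTableAm *am, uint64_t blkno)
{
    return blkno < mock_analyze_relpages(am);
}

/* Toy AM reporting 10 full blocks plus one byte. */
static uint64_t
toy_size_bytes(void)
{
    return 10 * BLCKSZ + 1;
}
```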
It's currently doing:

    /*
     * We silently discard any TIDs that are out of range at the time of scan
     * start.  (Since we hold at least AccessShareLock on the table, it won't
     * be possible for someone to truncate away the blocks we intend to
     * visit.)
     */
    nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);

which is fine (except for a certain abstraction leakage) for an AM like
heap or zheap, but I suspect strongly that that's not ok for Ashwin &
Heikki's approach, where tid isn't tied to physical representation.

The obvious answer would be to just move that check into the
table_fetch_row_version implementation (currently just calling
heap_fetch()) - but that doesn't seem OK from a performance POV, because
we'd then determine the relation size once for each tid, rather than
once per tidscan. And it'd also check in cases where we know the tid is
supposed to be valid (e.g. fetching trigger tuples and such).

The proper fix seems to be to introduce a new scan variant (e.g.
table_beginscan_tid()), and then have table_fetch_row_version take a
scan as a parameter. But it seems we'd have to introduce that as a
separate tableam callback, because we'd not want to incur the overhead
of creating an additional scan / RelationGetNumberOfBlocks() checks for
triggers et al.

Greetings,

Andres Freund
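The performance argument for the proposed table_beginscan_tid() variant can be sketched standalone - determine the bound once at scan start, validate each tid against the cached value. These are mock types, not the actual tableam structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for ItemPointer and a tid-scan descriptor (hypothetical). */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} MockItemPointer;

typedef struct
{
    uint32_t nblocks;           /* captured once at scan start */
} MockTidScan;

/*
 * Sketch of the proposed table_beginscan_tid(): the relation size is
 * determined a single time here, not once per fetched tid.
 */
static MockTidScan
mock_beginscan_tid(uint32_t nblocks_at_scan_start)
{
    MockTidScan scan = {nblocks_at_scan_start};

    return scan;
}

/*
 * The fetch path validates tids against the cached bound. Callers that
 * already know the tid is valid (e.g. trigger tuple fetches) would use
 * a variant without this check, avoiding the extra scan setup - hence
 * the separate tableam callback the mail suggests.
 */
static bool
mock_tid_in_range(const MockTidScan *scan, const MockItemPointer *tid)
{
    return tid->block < scan->nblocks;
}
```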
Attachments