Обсуждение: RFC: Table access methods and scans
Hi all, I started looking into how table scans are handled for table access methods and have discovered a few things that I find odd. I cannot find any material regarding why this particular choice was made (if anybody has pointers, I would be very grateful). I am quite new to PostgreSQL so forgive me if my understanding of the code below is wrong and please clarify what I have misunderstood. I noted that `scan_begin` accepts a `ScanKey` and my *guess* was that the intention for adding this to the interface was to support primary indexes for table access methods (the comment is a little vague, but it seems to point to that). However, looking at where `scan_begin` is called from, I see that it is called from the following methods in `tableam.h`: - `table_beginscan` is always called using zero scan keys and NULL. - `table_beginscan_strat` is mostly called with zero keys and NULL, with the exception of `systable_beginscan`, which is only for system tables. It does use this feature. - `table_beginscan_bm` is only called with zero keys and NULL. - `table_beginscan_sampling` is only called with zero keys and NULL. - `table_beginscan_tid` calls `scan_begin` with zero keys and NULL. - `table_beginscan_analyze` calls `scan_begin` with zero keys and NULL. - `table_beginscan_catalog` is called with more than one key, but AFACT this is only for catalog tables. - `table_beginscan_parallel` calls `scan_begin` with zero keys and NULL. I draw the conclusion that the scan keys only make sense for a table access method for the odd case where it is used for a system tables or catalog tables, so for all practical purposes the scan key cannot be used to implement a primary index for general tables. As an example of how this is useful, I noticed the work by Heikki and Ashwin [1], where they return a `TableScanDesc` that contains information about what columns to scan, which looks very useful. Since the function `table_beginscan` in `src/include/access/tableam.h` accept a `ScanKey` as input, this is (AFAICT) what Heikki and Ashwin was exploiting to create a specialized scan for a columnar store. Another example of where this can be useful is to optimize access during a sequential scan when you can handle some specific scans very efficiently and can "skip ahead" many tuples if you know what is being looked for instead of filtering "late". Two examples of where this could be useful are: - An access method that reads data from a remote system and doesn't want to transfer all tuples unless necessary. - Some sort of log-structured storage with Bloom filters that allows you to quickly skip suites that do not have a key. Interestingly enough, `ScanKey` is generated for `IndexScan` and I think that the same approach could be used for sequential scans: pick out the quals that can be used for filtering and offer them to the table access method through the `scan_begin` callback. Thoughts around this? Best wishes, Mats Kindahl [1] https://www.postgresql-archive.org/Zedstore-compressed-in-core-columnar-storage-tp6081536.html
Hi, On Wed, 2021-03-31 at 22:10 +0200, Mats Kindahl wrote: > As an example of how this is useful, I noticed the work by Heikki and > Ashwin [1], where they return a `TableScanDesc` that contains > information about what columns to scan, which looks very useful. > Since > the function `table_beginscan` in `src/include/access/tableam.h` > accept a `ScanKey` as input, this is (AFAICT) what Heikki and Ashwin > was exploiting to create a specialized scan for a columnar store. I don't think ScanKeys are the right place to store information about what columns would be useful. See another thread[2] about that topic. > Another example of where this can be useful is to optimize access > during a sequential scan when you can handle some specific scans very > efficiently and can "skip ahead" many tuples if you know what is > being > looked for instead of filtering "late". Two examples of where this > could be useful are: > > - An access method that reads data from a remote system and doesn't > want > to transfer all tuples unless necessary. > - Some sort of log-structured storage with Bloom filters that allows > you to quickly skip suites that do not have a key. I agree that would be very conventient for non-heap AMs. There's a very old commit[3] that says: + /* + * Note that unlike IndexScan, SeqScan never use keys + * in heap_beginscan (and this is very bad) - so, here + * we have not check are keys ok or not. + */ and that language has just been carried forward for decades. I wonder if there's any major reason this hasn't been done yet. Does it just not improve performance for a heap, or is there some other reason? Regards, Jeff Davis [2] https://www.postgresql.org/message-id/CAE-ML+9RmTNzKCNTZPQf8O3b-UjHWGFbSoXpQa3Wvuc8YBbEQw@mail.gmail.com [3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e3a1ab764ef2
Hi Jeff,
On Fri, Jun 4, 2021 at 2:52 AM Jeff Davis <pgsql@j-davis.com> wrote:
Hi,
On Wed, 2021-03-31 at 22:10 +0200, Mats Kindahl wrote:
> As an example of how this is useful, I noticed the work by Heikki and
> Ashwin [1], where they return a `TableScanDesc` that contains
> information about what columns to scan, which looks very useful.
> Since
> the function `table_beginscan` in `src/include/access/tableam.h`
> accept a `ScanKey` as input, this is (AFAICT) what Heikki and Ashwin
> was exploiting to create a specialized scan for a columnar store.
I don't think ScanKeys are the right place to store information about
what columns would be useful. See another thread[2] about that topic.
Yeah, it is not a good example. The examples below are better examples. The scan keys are not sufficient to get all the columns, but AFAICT, it is this callback that is exploited in the patch.
> Another example of where this can be useful is to optimize access
> during a sequential scan when you can handle some specific scans very
> efficiently and can "skip ahead" many tuples if you know what is
> being
> looked for instead of filtering "late". Two examples of where this
> could be useful are:
>
> - An access method that reads data from a remote system and doesn't
> want
> to transfer all tuples unless necessary.
> - Some sort of log-structured storage with Bloom filters that allows
> you to quickly skip suites that do not have a key.
I agree that would be very conventient for non-heap AMs. There's a very
old commit[3] that says:
+ /*
+ * Note that unlike IndexScan, SeqScan never use keys
+ * in heap_beginscan (and this is very bad) - so, here
+ * we have not check are keys ok or not.
+ */
and that language has just been carried forward for decades. I wonder
if there's any major reason this hasn't been done yet. Does it just not
improve performance for a heap, or is there some other reason?
That is basically the question. I'm prepared to take a shot at it unless there is a good reason not to.
Best wishes,
Mats Kindahl
Regards,
Jeff Davis
[2]
https://www.postgresql.org/message-id/CAE-ML+9RmTNzKCNTZPQf8O3b-UjHWGFbSoXpQa3Wvuc8YBbEQw@mail.gmail.com
[3]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e3a1ab764ef2
On Fri, 2021-06-04 at 08:23 +0200, Mats Kindahl wrote: > That is basically the question. I'm prepared to take a shot at it > unless there is a good reason not to. Sounds good, I can review. Regards, Jeff Davis
Hi, On 2021-06-03 17:52:24 -0700, Jeff Davis wrote: > I agree that would be very conventient for non-heap AMs. There's a very > old commit[3] that says: > > + /* > + * Note that unlike IndexScan, SeqScan never use keys > + * in heap_beginscan (and this is very bad) - so, here > + * we have not check are keys ok or not. > + */ > > and that language has just been carried forward for decades. I wonder > if there's any major reason this hasn't been done yet. Does it just not > improve performance for a heap, or is there some other reason? It's not actually a good idea in general: - Without substantial refactoring more work is done while holding the content lock on the page. Whereas doing it as part of a seqscan only requires a buffer pin (and thus allows for concurrent writes to the same page) - It's hard to avoid repeated work with expressions that can't fully be evaluated as part of the ScanKey. Expression evaluation generally can be a bit smarter about evaluation, e.g. not deforming the tuple one-by-one. Greetings, Andres Freund