Discussion: Tid scan improvements


Tid scan improvements

From:
Edmund Horner
Date:
Hello,

To scratch an itch, I have been working on teaching TidScan how to do
range queries, i.e. those using >=, <, BETWEEN, etc.  This means we
can write, for instance,

    SELECT * FROM t WHERE ctid >= '(1000,0)' AND ctid < '(2000,0)';

instead of resorting to the old trick:

    SELECT * FROM t WHERE ctid = ANY (ARRAY(SELECT format('(%s,%s)', i, j)::tid
    FROM generate_series(1000,1999) AS gs(i), generate_series(1,200)
AS gs2(j)));

where "200" is some guess at how many tuples can fit on a page for that table.

There's some previous discussion about this at

https://www.postgresql.org/message-id/flat/CAHyXU0zJhg_5RtxKnNbAK%3D4ZzQEFUFi%2B52RjpLrxtkRTD6CDFw%40mail.gmail.com#3ba2c3a6be217f40130655a3250d80a4
.

Since range scan execution is rather different from the existing
TidScan execution, I ended up making a new plan type, TidRangeScan.
There is still only one TidPath, but it has an additional member that
describes which method to use.

As part of the work I also taught TidScan that its results are ordered
by ctid, i.e. to set a pathkey on a TidPath.  The benefit of this is
that queries such as

    SELECT MAX(ctid) FROM t;
    SELECT * FROM t WHERE ctid IN (...) ORDER BY ctid;

are now planned a bit more efficiently.  Execution was already
returning tuples in ascending ctid order; I just had to add support
for descending order.

Attached are a couple of patches:
  - 01_tid_scan_ordering.patch
  - 02_tid_range_scan.patch, to be applied on top of 01.

Can I add this to the next CommitFest?

Obviously the whole thing needs thorough review, and I expect there to
be numerous problems.  (I had to make this prototype to demonstrate to
myself that it wasn't completely beyond me.  I know from experience
how easy it is to enthusiastically volunteer something for an open
source project, discover that one does not have the time or skill
required, and be too embarrassed to show one's face again!)

As well as actual correctness, some aspects that I am particularly
unsure about include:

  - Is it messy to use TidPath for both types of scan?
  - What is the planning cost for plans that don't end up being a
TidScan or TidRangeScan?
  - Have I put the various helper functions in the right files?
  - Is there a less brittle way to create tables of a specific number
of blocks/tuples in the regression tests?
  - Have I got the ScanDirection right during execution?
  - Are my changes to heapam ok?

Cheers,
Edmund

Attachments

Re: Tid scan improvements

From:
David Rowley
Date:
On 12 August 2018 at 14:29, Edmund Horner <ejrh00@gmail.com> wrote:
> To scratch an itch, I have been working on teaching TidScan how to do
> range queries, i.e. those using >=, <, BETWEEN, etc.  This means we
> can write, for instance,
>
>     SELECT * FROM t WHERE ctid >= '(1000,0)' AND ctid < '(2000,0)';

I think this will be useful to UPDATE records at the end of a bloated
table to move them into space that's been freed up by vacuum to allow
the table to be trimmed back to size again.

> Since range scan execution is rather different from the existing
> TidScan execution, I ended up making a new plan type, TidRangeScan.
> There is still only one TidPath, but it has an additional member that
> describes which method to use.

I always thought that this would be implemented by overloading
TidScan.  I thought that TidListEval() could be modified to remove
duplicates accounting for range scans. For example:

SELECT * FROM t WHERE ctid BETWEEN '(0,1)' AND '(0,10)' OR ctid
IN ('(0,5)','(0,30)');

would first sort all the tids along with their operator and then make
a pass over the sorted array to remove any equality ctids that are
redundant because they're covered in a range.
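The redundancy pass suggested above can be sketched as follows (illustrative Python, not PostgreSQL code; tids are modelled as (block, offset) tuples and all names are hypothetical):

```python
# Hypothetical sketch of the de-duplication pass described above:
# sort the equality tids, then drop any that are already covered by
# one of the range quals, leaving only the tids that need their own
# single-tuple fetch.

def remove_redundant_tids(ranges, tids):
    """ranges: list of ((lo_tid, lo_incl), (hi_tid, hi_incl));
    tids: list of (block, offset) tuples from equality quals."""
    def in_range(tid, rng):
        (lo, lo_incl), (hi, hi_incl) = rng
        above = tid >= lo if lo_incl else tid > lo
        below = tid <= hi if hi_incl else tid < hi
        return above and below

    return [t for t in sorted(set(tids))
            if not any(in_range(t, r) for r in ranges)]

# ctid BETWEEN '(0,1)' AND '(0,10)' OR ctid IN ('(0,5)','(0,30)'):
# (0,5) falls inside the range, so only (0,30) survives.
ranges = [(((0, 1), True), ((0, 10), True))]
print(remove_redundant_tids(ranges, [(0, 5), (0, 30)]))  # [(0, 30)]
```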

> As part of the work I also taught TidScan that its results are ordered
> by ctid, i.e. to set a pathkey on a TidPath.  The benefit of this is
> that queries such as
>
>     SELECT MAX(ctid) FROM t;
>     SELECT * FROM t WHERE ctid IN (...) ORDER BY ctid;

I think that can be done as I see you're passing allow_sync as false
in heap_beginscan_strat(), so the scan will start at the beginning of
the heap.

> Attached are a couple of patches:
>   - 01_tid_scan_ordering.patch
>   - 02_tid_range_scan.patch, to be applied on top of 01.
>
> Can I add this to the next CommitFest?

Please do.

> As well as actual correctness, some aspects that I am particularly
> unsure about include:
>
>   - Is it messy to use TidPath for both types of scan?

I wonder if there is a good reason to have a separate node type at
all? I've not looked, but if you've managed to overload the TidPath
struct without it getting out of control, then perhaps the same can be
done with the node type.

>   - What is the planning cost for plans that don't end up being a
> TidScan or TidRangeScan?

I suppose that wouldn't matter if there was just 1 path for a single node type.

>   - Is there a less brittle way to create tables of a specific number
> of blocks/tuples in the regression tests?

Perhaps you could just populate a table with some number of records
then DELETE the ones above ctid (x,100) on each page, where 100 is
whatever you can be certain will fit on a page on any platform. I'm
not quite sure if our regress test would pass with a very small block
size anyway, but probably worth verifying that before you write the
first test that will break it.

I'll try to look in a bit more detail during the commitfest.

It's perhaps a minor detail at this stage, but generally, we don't
have code lines over 80 chars in length. There are some exceptions,
e.g. not breaking error message strings so that they're easily
greppable.  src/tools/pgindent has a tool that you can run to fix the
whitespace so it's in line with project standard.

Thanks for working on this. It will be great to see improvements made in this area.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From:
Robert Haas
Date:
On Sun, Aug 12, 2018 at 8:07 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Thanks for working on this. It will be great to see improvements made in this area.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Tid scan improvements

From:
Edmund Horner
Date:
On 12 August 2018 at 20:07, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> Since range scan execution is rather different from the existing
>> TidScan execution, I ended up making a new plan type, TidRangeScan.
>> There is still only one TidPath, but it has an additional member that
>> describes which method to use.
>
> I always thought that this would be implemented by overloading
> TidScan.  I thought that TidListEval() could be modified to remove
> duplicates accounting for range scans. For example:
>
> SELECT * FROM t WHERE ctid BETWEEN '(0,1)' AND '(0,10)' OR ctid
> IN ('(0,5)','(0,30)');
>
> would first sort all the tids along with their operator and then make
> a pass over the sorted array to remove any equality ctids that are
> redundant because they're covered in a range.

Initially, I figured that 99% of the time, the user either wants to
filter by a specific set of ctids (such as those returned by a
subquery), or wants to process a table in blocks.  Picking up an
OR-set of ctid-conditions and determining which parts should be picked
up row-wise (as in the existing code) versus which parts are blocks
that should be scanned -- and ensuring that any overlaps were removed
-- seemed more complicated than it was worth.

Having thought about it, I think what you propose might be worth it;
at least it limits us to a single TidScan plan to maintain.

The existing code:
  - Looks for a qual that's an OR-list of (ctid = ?) or (ctid IN (?))
  - Costs it by assuming each matching tuple is a separate page.
  - When beginning the scan, evaluates all the ?s and builds an array
of tids to fetch.
  - Sorts and remove duplicates.
  - Iterates over the array, fetching tuples.

So we'd extend that to:
  - Include in the OR-list "range" subquals of the form (ctid > ? AND
ctid < ?) (either side could be optional, and we have to deal with >=
and <= and having ctid on the rhs, etc.).
  - Cost the range subquals by assuming they don't overlap, and
estimating how many blocks and tuples they span.
  - When beginning the scan, evaluate all the ?s and build an array of
"tid ranges" to fetch.  A tid range is a struct with a starting tid,
and an ending tid, and might just be a single tid item.
  - Sort and remove duplicates.
  - Iterate over the array, using a single fetch for single-item tid
ranges, and starting/ending a heap scan for multi-item tid ranges.

I think I'll try implementing this.
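The sort-and-merge step in the plan above can be sketched like this (illustrative Python, not the patch's C code; names and representation are hypothetical):

```python
# Illustrative sketch of the proposed steps: build an array of tid
# ranges (a single tid becomes a degenerate one-item range), sort
# them, and merge overlapping or duplicate ranges so each heap region
# is visited exactly once.  The executor would then use a single
# fetch for one-item ranges and a bounded heap scan for the rest.
# Tids are (block, offset) tuples.

def merge_tid_ranges(ranges):
    """ranges: list of (start_tid, end_tid), inclusive on both ends."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps (or duplicates) the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

ranges = [((0, 1), (0, 10)), ((0, 8), (2, 1))]   # overlapping ranges
single_tids = [((5, 3), (5, 3))]                 # a one-item "range"
print(merge_tid_ranges(ranges + single_tids))
# [((0, 1), (2, 1)), ((5, 3), (5, 3))]
```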

>> As part of the work I also taught TidScan that its results are ordered
>> by ctid, i.e. to set a pathkey on a TidPath.  The benefit of this is
>> that queries such as
>>
>>     SELECT MAX(ctid) FROM t;
>>     SELECT * FROM t WHERE ctid IN (...) ORDER BY ctid;
>
> I think that can be done as I see you're passing allow_sync as false
> in heap_beginscan_strat(), so the scan will start at the beginning of
> the heap.

I found that heap scan caters to parallel scans, synchronised scans,
and block range indexing; but it didn't quite work for my case of
specifying a subset of a table and scanning backward or forward over
it.  Hence my changes.  I'm not overly familiar with the heap scan
code though.

>>   - Is there a less brittle way to create tables of a specific number
>> of blocks/tuples in the regression tests?
>
> Perhaps you could just populate a table with some number of records
> then DELETE the ones above ctid (x,100) on each page, where 100 is
> whatever you can be certain will fit on a page on any platform. I'm
> not quite sure if our regress test would pass with a very small block
> size anyway, but probably worth verifying that before you write the
> first test that will break it.

I don't think I've tested with extreme block sizes.

> I'll try to look in a bit more detail during the commitfest.
>
> It's perhaps a minor detail at this stage, but generally, we don't
> have code lines over 80 chars in length. There are some exceptions,
> e.g. not breaking error message strings so that they're easily
> greppable.  src/tools/pgindent has a tool that you can run to fix the
> whitespace so it's in line with project standard.

I'll try to get pgindent running before my next patch.

Thanks for the comments!


Re: Tid scan improvements

From:
David Rowley
Date:
On 15 August 2018 at 11:11, Edmund Horner <ejrh00@gmail.com> wrote:
> So we'd extend that to:
>   - Include in the OR-list "range" subquals of the form (ctid > ? AND
> ctid < ?) (either side could be optional, and we have to deal with >=
> and <= and having ctid on the rhs, etc.).
>   - Cost the range subquals by assuming they don't overlap, and
> estimating how many blocks and tuples they span.
>   - When beginning the scan, evaluate all the ?s and build an array of
> "tid ranges" to fetch.  A tid range is a struct with a starting tid,
> and an ending tid, and might just be a single tid item.
>   - Sort and remove duplicates.
>   - Iterate over the array, using a single fetch for single-item tid
> ranges, and starting/ending a heap scan for multi-item tid ranges.
>
> I think I'll try implementing this.

I've set this patch as waiting on author in the commitfest app.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Tid scan improvements

From:
Edmund Horner
Date:
On Mon, 17 Sep 2018 at 23:21, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 15 August 2018 at 11:11, Edmund Horner <ejrh00@gmail.com> wrote:
>> So we'd extend that to:
>>   - Include in the OR-list "range" subquals of the form (ctid > ? AND
>> ctid < ?) (either side could be optional, and we have to deal with >=
>> and <= and having ctid on the rhs, etc.).
>>   - Cost the range subquals by assuming they don't overlap, and
>> estimating how many blocks and tuples they span.
>>   - When beginning the scan, evaluate all the ?s and build an array of
>> "tid ranges" to fetch.  A tid range is a struct with a starting tid,
>> and an ending tid, and might just be a single tid item.
>>   - Sort and remove duplicates.
>>   - Iterate over the array, using a single fetch for single-item tid
>> ranges, and starting/ending a heap scan for multi-item tid ranges.
>>
>> I think I'll try implementing this.
>
>
> I've set this patch as waiting on author in the commitfest app.

Thanks David.

Between work I have found time here and there to work on it, but
making a path type that handles all the above turns out to be
surprisingly harder than my tid range scan.

In the earlier discussion from 2012, Tom Lane said:

> Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > On Wed, Jun 13, 2012 at 03:21:17PM -0500, Merlin Moncure wrote:
> >> IMNSHO, it's a no-brainer for the todo (but I think it's more
> >> complicated than adding some comparisons -- which are working now):
>
> > I see. Seems we have to add index smarts to those comparisons. That
> > might be complicated.
>
> Uh, the whole point of a TID scan is to *not* need an index.
>
> What would be needed is for tidpath.c to let through more kinds of TID
> comparison quals than it does now, and then for nodeTidscan.c to know
> what to do with them. The latter logic might well look something like
> btree indexscan qual preparation, but it wouldn't be the same code.

I have been generally following this approach (handling more kinds of
TID comparisons), and have found myself doing things like pairing up >
with <, estimating how much of a table is covered by some set of >, <,
or "> AND <" quals, etc.  Things that I'm sure are handled in an
advanced way by index paths; unfortunately I didn't see any easily
reusable code in the index path code.  So I've ended up writing
special-case code for TID scans.  Hopefully it will be worth it.

Edmund


Re: Tid scan improvements

From:
David Rowley
Date:
On 19 September 2018 at 18:04, Edmund Horner <ejrh00@gmail.com> wrote:
> I have been generally following this approach (handling more kinds of
> TID comparisons), and have found myself doing things like pairing up >
> with <, estimating how much of a table is covered by some set of >, <,
> or "> AND <" quals, etc.  Things that I'm sure are handled in an
> advanced way by index paths; unfortunately I didn't see any easily
> reusable code in the index path code.  So I've ended up writing
> special-case code for TID scans.  Hopefully it will be worth it.

I don't think it would need to be as complex as the index matching
code. Just looping over the quals and gathering up all compatible ctid
quals should be fine.  I imagine the complex handling of sorting the
quals by ctid and removal of redundant quals that are covered by some
range would be done in the executor.

Probably the costing will get more complex. At the moment it seems we
add a random_page_cost per ctid, but you'd probably need to make that
better and loop over the quals in each implicitly ANDed set and find
the max ctid for the > / >= quals and the min < / <= ctid, then
get the page number from each and assume max - min seq_page_cost, then
add random_page_cost for any remaining equality quals.  The costs from
other OR branches can likely just be added on.  This would double
count if someone did WHERE ctid BETWEEN '(0,0)' AND '(100,300)' OR
ctid BETWEEN '(0,0)' AND '(100,300)';  The current code seems to
double count now for duplicate ctids anyway. It even double counts if
the ctid being compared to is on the same page as another ctid, so I
don't think that would be unacceptable.
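As a rough sketch of the costing idea above (assumed cost constants and hypothetical names; this is one reading of the suggestion, not planner code): for each implicitly ANDed set, the tightest range is [max of the >/>= bounds, min of the </<= bounds], the pages it spans are charged at seq_page_cost, remaining equality quals cost one random_page_cost each, and OR branches are simply summed.

```python
# Sketch only: tids are (block, offset) tuples; assumed default
# costs seq_page_cost=1.0, random_page_cost=4.0.

seq_page_cost = 1.0
random_page_cost = 4.0

def branch_cost(lower_bounds, upper_bounds, equality_tids):
    """Cost one implicitly ANDed set of ctid quals (very roughly)."""
    cost = 0.0
    if lower_bounds or upper_bounds:
        lo_page = max(lower_bounds)[0] if lower_bounds else 0
        # A real implementation would use the relation's block count
        # when there is no upper bound; 0 is a placeholder here.
        hi_page = min(upper_bounds)[0] if upper_bounds else 0
        cost += max(hi_page - lo_page, 1) * seq_page_cost
    cost += random_page_cost * len(equality_tids)
    return cost

# ctid > '(0,0)' AND ctid <= '(44247,178)' AND ctid <= '(0,1)':
# the min of the upper bounds is (0,1), so only ~1 page is charged.
print(branch_cost([(0, 0)], [(44247, 178), (0, 1)], []))  # 1.0
```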

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From:
Edmund Horner
Date:
On Wed, 19 Sep 2018 at 18:56, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On 19 September 2018 at 18:04, Edmund Horner <ejrh00@gmail.com> wrote:
> > I have been generally following this approach (handling more kinds of
> > TID comparisons), and have found myself doing things like pairing up >
> > with <, estimating how much of a table is covered by some set of >, <,
> > or "> AND <" quals, etc.  Things that I'm sure are handled in an
> > advanced way by index paths; unfortunately I didn't see any easily
> > reusable code in the index path code.  So I've ended up writing
> > special-case code for TID scans.  Hopefully it will be worth it.
>
> I don't think it would need to be as complex as the index matching
> code. Just looping over the quals and gathering up all compatible ctid
> quals should be fine.  I imagine the complex handling of sorting the
> quals by ctid and removal of redundant quals that are covered by some
> range would be done in the executor.

I've got the path creation and execution pretty much working, though
with some inefficiencies:
  - Each individual TID is treated as a range of size 1 (but CURRENT
OF is handled as a single fetch)
  - Range scans have to scan whole blocks, and skip over the tuples
that are out of range.
But it's enough to get the tests passing.

Right now I'm looking at costing:

> Probably the costing will get more complex. At the moment it seems we
> add a random_page_cost per ctid, but you'd probably need to make that
> better and loop over the quals in each implicitly ANDed set and find
> the max ctid for the > / >= quals and the min < / <= ctid, then
> get the page number from each and assume max - min seq_page_cost, then
> add random_page_cost for any remaining equality quals.  The costs from
> other OR branches can likely just be added on.  This would double
> count if someone did WHERE ctid BETWEEN '(0,0)' AND '(100,300)' OR
> ctid BETWEEN '(0,0)' AND '(100,300)';  The current code seems to
> double count now for duplicate ctids anyway. It even double counts if
> the ctid being compared to is on the same page as another ctid, so I
> don't think that would be unacceptable.

There are two stages of costing:
1. Estimating the number of rows that the relation will return.  This
happens before path generation.
2. Estimating the cost of the path.

In the existing code, (1) goes through the normal clausesel.c
machinery, eventually getting to the restriction function defined in
pg_operator.  For range quals, e.g. >, it looks for a stats entry for
the variable, but since it's a system variable with no stats, it
returns DEFAULT_INEQ_SEL (in function scalarineqsel).  For equality
quals, it does have some special-case code (in function
get_variable_numdistinct) to use stadistinct=-1 for the CTID variable,
resulting in a selectivity estimate of 1/ntuples.

(2), on the other hand, has special-case code in costsize.c (function
cost_tidscan), which estimates each TID as being a separate tuple
fetch from a different page.  (The existing code only has to support
=, IN, and CURRENT OF as quals for a TID path.)

In my work, I have been adding support for range quals to (2), which
includes estimating the selectivity of expressions like (CTID > a AND
CTID < b).  I got tired of handling all the various ways of ordering
the quals, so I thought I would try re-using the clausesel.c
machinery.  In selfuncs.c, I've added special case code for
scalarineqsel and nulltestsel to handle CTID variables.  (This also
improves the row count estimates.)

I'm not 100% sure what the costs of each range should be.  I think the
first block should incur random_page_cost, with subsequent blocks
being seq_page_cost.  Simple "CTID = ?" quals are still estimated as 1
tuple + 1 random block.
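A minimal sketch of that per-range cost model (assumed cost constants, illustrative names): the first block of a range is a random fetch, subsequent blocks are sequential, and a simple "CTID = ?" qual is the one-block case.

```python
# Sketch of the per-range charge described above, with the assumed
# defaults seq_page_cost=1.0 and random_page_cost=4.0.

seq_page_cost = 1.0
random_page_cost = 4.0

def tid_range_cost(nblocks):
    """Cost of scanning a contiguous range spanning nblocks blocks."""
    if nblocks <= 0:
        return 0.0
    return random_page_cost + (nblocks - 1) * seq_page_cost

print(tid_range_cost(1))    # 4.0  (same as a simple "CTID = ?" fetch)
print(tid_range_cost(100))  # 103.0
```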

Have a look at the attached WIP if you like and tell me if you think
it's going in the right direction.  I'm sorry for the size of the
patch; I couldn't find a nice way to cut it up.  I did run pgindent
over it though. :)

Cheers,
Edmund

Attachments

Re: Tid scan improvements

From:
Edmund Horner
Date:
On Fri, 28 Sep 2018 at 17:02, Edmund Horner <ejrh00@gmail.com> wrote:
> I did run pgindent over it though. :)

But I didn't check if it still applied to master.  Sigh.  Here's one that does.

Attachments

Re: Tid scan improvements

From:
David Rowley
Date:
On 28 September 2018 at 18:13, Edmund Horner <ejrh00@gmail.com> wrote:
> On Fri, 28 Sep 2018 at 17:02, Edmund Horner <ejrh00@gmail.com> wrote:
>> I did run pgindent over it though. :)
>
> But I didn't check if it still applied to master.  Sigh.  Here's one that does.

I know commit fest is over, but I made a pass of this to hopefully
provide a bit of guidance so that it's closer for the November 'fest.

I've only done light testing on the patch and it does seem to work,
but there are a few things that I think should be changed. Most
importantly #11 below I think needs to be done. That might overwrite
some of the items that come before it in the list as you likely will
have to pull some of the code which I mention out due to changing #11.
I've kept them around anyway just in case some of it remains.

1. Could wrap for tables > 16TB. Please use double. See index_pages_fetched()

int nrandompages;
int nseqpages;

2. Should multiply by nseqpages, not add.

run_cost += spc_random_page_cost * nrandompages + spc_seq_page_cost + nseqpages;

3. Should be double:

BlockNumber pages = selectivity * baserel->pages;

4. Comment needs updated to mention what the new code does in
heapgettup() and heapgettup_pagemode()

+
  /* start from last page of the scan */
- if (scan->rs_startblock > 0)
- page = scan->rs_startblock - 1;
+ if (scan->rs_numblocks == InvalidBlockNumber)
+ {
+ if (scan->rs_startblock > 0)
+ page = scan->rs_startblock - 1;
+ else
+ page = scan->rs_nblocks - 1;
+ }
  else
- page = scan->rs_nblocks - 1;
+ page = scan->rs_startblock + scan->rs_numblocks - 1;
+

5. Variables should be called "inclusive". We use "strict" to indicate
an operator comparison cannot match NULL values.

+ bool strict; /* Indicates < rather than <=, or > rather */
+ bool strict2; /* than >= */

Don't break the comment like that. If you need more space don't end
the comment and use a new line and tab the next line out to match the
* of the first line.

6. Why not pass the TidExpr into MakeTidOpExprState() and have it set
the type instead of repeating code

7. It's not very obvious why the following Assert() can't fail.

+ bool invert;
+ bool invert2;
+
+ Assert(list_length(((BoolExpr *) expr)->args) == 2);

I had to hunt around quite a bit to see that
TidQualFromBaseRestrictinfo could only ever make the list have 2
elements, and we'd not form a BoolExpr with just 1. (but see #11)

8. Many instances of the word "strict" are used to mean "inclusive".
Can you please change all of them.

9. Confusing comment:

+ * If the expression was non-strict (<=) and the offset is 0, then just
+ * pretend it was strict, because offset 0 doesn't exist and we may as
+ * well exclude that block.

Shouldn't this be, "If the operator is non-inclusive, then since TID
offsets are 1-based, for simplicity, we can just class the expression
as inclusive.", or something along those lines.

10. Comment talks about LHS, but the first OpExpr in a list of two
OpExprs has nothing to do with left hand side.  You could use LHS if
you were talking about the first arg in an OpExpr, but this is not the
case here.

/* If the LHS is not the lower bound, swap them. */

You could probably just ensure that the >=, > ops is the first in the
list inside TidQualFromBaseRestrictinfo(), but you'd need to clearly
comment that this is the case in both locations. Perhaps use lcons()
for the lower end and lappend() for the upper end, but see #11.

11. I think the qual matching code needs an overhaul.  Really you
should attempt to find the smallest and largest ctid for your
implicitly ANDed ranges.  This would require you getting rid of the
BETWEEN-type clause you're trying to build in
TidQualFromBaseRestrictinfo
and instead just include all quals, don't ignore other quals when
you've already found your complete range bounds.

The problem with doing it the way that you're doing it now is in cases like:

create table t1(a int);
insert into t1 select generate_Series(1,10000000);
create index on t1 (a);
select ctid,a from t1 order by a desc limit 1; -- find the max ctid.
    ctid     |    a
-------------+----------
 (44247,178) | 10000000
(1 row)

set max_parallel_workers_per_gather=0;
explain analyze select ctid,* from t1 where ctid > '(0,0)' and ctid <=
'(44247,178)' and ctid <= '(0,1)';
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Tid Scan on t1  (cost=0.01..169248.78 rows=1 width=10) (actual
time=0.042..2123.432 rows=1 loops=1)
   TID Cond: ((ctid > '(0,0)'::tid) AND (ctid <= '(44247,178)'::tid))
   Filter: (ctid <= '(0,1)'::tid)
   Rows Removed by Filter: 9999999
 Planning Time: 4.049 ms
 Execution Time: 2123.464 ms
(6 rows)

Due to how you've coded TidQualFromBaseRestrictinfo(), the ctid <=
'(0,1)' qual does not make it into the range. It's left as a filter in
the Tid Scan.

I think I'm going to stop here as changing this is going to cause quite a
bit of churn.

but one more...

12. I think the changes to selfuncs.c to get the selectivity estimate
is a fairly credible idea, but I think it also needs to account for
offsets. You should be able to work out the average number of items
per page with rel->tuples / rel->pages and factor that in to get a
better estimate for cases like:

postgres=# explain analyze select ctid,* from t1 where ctid <= '(0,200)';
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Tid Scan on t1  (cost=0.00..5.00 rows=1 width=10) (actual
time=0.025..0.065 rows=200 loops=1)
   TID Cond: (ctid <= '(0,200)'::tid)
 Planning Time: 0.081 ms
 Execution Time: 0.088 ms
(4 rows)

You can likely add on "(offset / avg_tuples_per_page) / rel->pages" to
the selectivity and get a fairly accurate estimate... at least when
there are no dead tuples in the heap
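The suggested offset adjustment works out as follows with the approximate numbers from the example above (~10,000,000 rows across 44,248 pages; this arithmetic is illustrative, not planner code):

```python
# Worked example of the offset-aware estimate for "ctid <= (block, off)":
# whole blocks below `block`, plus the fraction of `block` covered up to
# `off`, all as a fraction of the table's pages.

tuples, pages = 10_000_000, 44_248
avg_tuples_per_page = tuples / pages          # roughly 226

def leq_selectivity(block, offset):
    return (block + offset / avg_tuples_per_page) / pages

# ctid <= '(0,200)': about 200 rows, versus the flat rows=1 estimate
# shown in the EXPLAIN output above.
print(round(leq_selectivity(0, 200) * tuples))  # 200
```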

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From:
Edmund Horner
Date:
On Wed, 3 Oct 2018 at 17:36, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I know commit fest is over, but I made a pass of this to hopefully
> provide a bit of guidance so that it's closer for the November 'fest.

Hi David.  Thanks for the review.  It's fairly thorough and you must
have put some time into it -- I really appreciate it.


> I've only done light testing on the patch and it does seem to work,
> but there are a few things that I think should be changed. Most
> importantly #11 below I think needs to be done. That might overwrite
> some of the items that come before it in the list as you likely will
> have to pull some of the code which I mention out due to changing #11.
> I've kept them around anyway just in case some of it remains.

> 1. Could wrap for tables > 16TB. Please use double. See index_pages_fetched()
> 2. Should multiply by nseqpages, not add.
> 3. Should be double.

I agree with these three.


> 4. Comment needs updated to mention what the new code does in
> heapgettup() and heapgettup_pagemode()
>
> +
>   /* start from last page of the scan */
> - if (scan->rs_startblock > 0)
> - page = scan->rs_startblock - 1;
> + if (scan->rs_numblocks == InvalidBlockNumber)
> + {
> + if (scan->rs_startblock > 0)
> + page = scan->rs_startblock - 1;
> + else
> + page = scan->rs_nblocks - 1;
> + }
>   else
> - page = scan->rs_nblocks - 1;
> + page = scan->rs_startblock + scan->rs_numblocks - 1;
> +

I'm thinking that, as they don't depend on the others, the heapam.c
changes should be a separate preparatory patch?

The heap scan code has to support things like synchronised scans and
parallel scans, but as far as I know, its support for scanning
subranges is currently used only for building BRIN indexes.  I found
that although I could specify a subrange with heap_setscanlimits, I
could not scan backward over it, because the original version of the
above code would start at the end of the whole table.

I'm not especially comfortable with this understanding of heapam, so
close review would be appreciated.

I note that there's a lot of common code in heapgettup and
heapgettup_pagemode, which my changes add to.  It might be worth
trying to factor out somehow.


> 5. Variables should be called "inclusive". We use "strict" to indicate
> an operator comparison cannot match NULL values.
>
> + bool strict; /* Indicates < rather than <=, or > rather */
> + bool strict2; /* than >= */
>
> Don't break the comment like that. If you need more space don't end
> the comment and use a new line and tab the next line out to match the
> * of the first line.

> 8. Many instances of the word "strict" are used to mean "inclusive".
> Can you please change all of them.

I don't mind renaming it.  I took "strict" from "strictly greater/less
than" but I knew it was confusable with the other usages of "strict".


> 9. Confusing comment:
>
> + * If the expression was non-strict (<=) and the offset is 0, then just
> + * pretend it was strict, because offset 0 doesn't exist and we may as
> + * well exclude that block.
>
> Shouldn't this be, "If the operator is non-inclusive, then since TID
> offsets are 1-based, for simplicity, we can just class the expression
> as inclusive.", or something along those lines.

Ok, I'll try to reword it along those lines.


> I think I'm going to stop here as changing this is going to cause quite a
> bit of churn.
>
> but one more...

> 12. I think the changes to selfuncs.c to get the selectivity estimate
> is a fairly credible idea, but I think it also needs to account for
> offsets. You should be able to work out the average number of items
> per page with rel->tuples / rel->pages and factor that in to get a
> better estimate for cases like:
>
> postgres=# explain analyze select ctid,* from t1 where ctid <= '(0,200)';
>                                           QUERY PLAN
> -----------------------------------------------------------------------------------------------
>  Tid Scan on t1  (cost=0.00..5.00 rows=1 width=10) (actual
> time=0.025..0.065 rows=200 loops=1)
>    TID Cond: (ctid <= '(0,200)'::tid)
>  Planning Time: 0.081 ms
>  Execution Time: 0.088 ms
> (4 rows)
>
> You can likely add on "(offset / avg_tuples_per_page) / rel->pages" to
> the selectivity and get a fairly accurate estimate... at least when
> there are no dead tuples in the heap

I think the changes to selfuncs.c could also be a separate patch?

I'll try to include the offset in the selectivity too.

Related -- what should the selectivity be on an empty table?  My code has:

    /* If the relation's empty, we're going to read all of it. */
    if (vardata->rel->pages == 0)
        return 1.0;

(which needs rewording, since selectivity isn't always about reading).
Is 1.0 the right thing to return?


> 6. Why not pass the TidExpr into MakeTidOpExprState() and have it set
> the type instead of repeating code
> 7. It's not very obvious why the following Assert() can't fail. [...]
> I had to hunt around quite a bit to see that
> TidQualFromBaseRestrictinfo could only ever make the list have 2
> elements, and we'd not form a BoolExpr with just 1. (but see #11)
> 10. Comment talks about LHS, but the first OpExpr in a list of two
> OpExprs has nothing to do with left hand side.  You could use LHS if
> you were talking about the first arg in an OpExpr, but this is not the
> case here.

These three might become non-issues if we change it along the lines of #11:

> 11. I think the qual matching code needs an overhaul.  Really you
> should attempt to find the smallest and largest ctid for your
> implicitly ANDed ranges.  This would require you getting rid of the
> BETWEEN type claused you're trying to build in
> TidQualFromBaseRestrictinfo
> and instead just include all quals, don't ignore other quals when
> you've already found your complete range bounds.
>
> The problem with doing it the way that you're doing it now is in cases like:
>
> create table t1(a int);
> insert into t1 select generate_Series(1,10000000);
> create index on t1 (a);
> select ctid,a from t1 order by a desc limit 1; -- find the max ctid.
>     ctid     |    a
> -------------+----------
>  (44247,178) | 10000000
> (1 row)
>
> set max_parallel_workers_per_gather=0;
> explain analyze select ctid,* from t1 where ctid > '(0,0)' and ctid <=
> '(44247,178)' and ctid <= '(0,1)';
>                                              QUERY PLAN
> -----------------------------------------------------------------------------------------------------
>  Tid Scan on t1  (cost=0.01..169248.78 rows=1 width=10) (actual
> time=0.042..2123.432 rows=1 loops=1)
>    TID Cond: ((ctid > '(0,0)'::tid) AND (ctid <= '(44247,178)'::tid))
>    Filter: (ctid <= '(0,1)'::tid)
>    Rows Removed by Filter: 9999999
>  Planning Time: 4.049 ms
>  Execution Time: 2123.464 ms
> (6 rows)
>
> Due to how you've coded TidQualFromBaseRestrictinfo(), the ctid <=
> '(0,1)' qual does not make it into the range. It's left as a filter in
> the Tid Scan.

My first thought was to support the fairly common use case of the
two-bound range "ctid >= ? AND ctid < ?" (or single-bound variations);
hence my original patch for a "TID Range Scan".

Following the comments made earlier, I tried incorporating this into
the existing TID Scan; but I still had the same use case in mind, so
only the first lower and upper bounds were used.  My thoughts were
that, while we need to *correctly* handle more complicated cases like
"ctid > '(0,0)' AND ctid <= '(44247,178)' AND ctid <= '(0,1)'", such
queries will not come up in practice and hence it's OK if those extra
bounds are applied in the filter.  For the same reason, I did not
consider it worthwhile trying to pick which bound to use in the scan.

I've since realised that such extra bounds aren't always redundant.  At
plan time we might not know which of the bounds is the "best", but we
will after evaluating them in the executor.  So I quite like the idea
of keeping all of them.

This means a TID path's quals become an OR-list of:
  - "ctid = ?"
  - "ctid = ANY (?)" / "ctid IN (?)"
  - "(ctid op ?) AND ..."   (where op is one of >,>=,<,<=)
  - "CURRENT OF"

I still don't think the scan needs to support quals like "ctid = ? AND
ctid > ?", or "ctid IN (?) AND ctid IN (?)" -- the executor *could*
try to form the intersection but I don't think it's worth the code.
In these cases, picking a simple qual is usually enough for an
efficient scan; the full qual can go into the filter.

I'm part way through implementing this.  It looks like it might
actually be less code than what I had before.


Re: Tid scan improvements

From
Edmund Horner
Date:
Hi all,

I have managed to split my changes into 4 patches:

v3-0001-Add-selectivity-and-nullness-estimates-for-the-ItemP.patch
v3-0002-Support-range-quals-in-Tid-Scan.patch
v3-0003-Support-backward-scans-over-restricted-ranges-in-hea.patch
v3-0004-Tid-Scan-results-are-ordered.patch

(1) is basically independent, and usefully improves estimates for ctid quals.
(2) is the main patch, adding basic range scan support to TidPath and TidScan.
(3) is a small change to properly support backward scans over a
restricted range in heapam.c, and is needed for (4).
(4) adds Backward Tid Scans, and adds path keys to Tid Paths so that
the planner doesn't have to add a sort for certain queries.

I have tried to apply David's suggestions.

In (1), I've included the offset part of a CTID constant in the
selectivity calculation.  I've not included "allvisfrac" in the
calculation; I'm not sure it's worth it as it would only affect the
offset part.
I have tried to use iseq to differentiate <= and >= from < and >,
but I'm not sure I've got this right.  I am also not entirely sure
it's worth it; the changes are already an improvement over the current
behaviour of using hardcoded selectivity constants.

In (2), the planner now picks up a greater variety of TID quals,
including AND-clauses with arbitrary children instead of the original
lower bound/upper bound pair.  These are resolved in the executor into
a list of ranges to scan.

(3) is the same code, but I've added a couple of comments to explain the change.

(4) is basically the same pathkey/direction code as before (but as a
separate patch).

I hope the separation will make it easier to review.  Item (2) is
still quite big, but a third of it is tests.

Cheers.
Edmund

Attachments

Re: Tid scan improvements

From
David Rowley
Date:
On 4 November 2018 at 17:20, Edmund Horner <ejrh00@gmail.com> wrote:
> I have managed to split my changes into 4 patches:
>
> v3-0001-Add-selectivity-and-nullness-estimates-for-the-ItemP.patch
> v3-0002-Support-range-quals-in-Tid-Scan.patch
> v3-0003-Support-backward-scans-over-restricted-ranges-in-hea.patch
> v3-0004-Tid-Scan-results-are-ordered.patch

Hi,

I've been looking over 0001 to 0003. I ran out of steam before 0004.

I like the design of the new patch. From what I threw so far at the
selectivity estimation code, it seems pretty good.  I also quite like
the design in nodeTidscan.c for range scans.

I didn't quite manage to wrap my head around the code that removes
redundant quals from the tidquals. For example, with:

postgres=# explain select * from t1 where ctid <= '(0,10)' and a = 0;
                    QUERY PLAN
--------------------------------------------------
 Tid Scan on t1  (cost=0.00..3.19 rows=1 width=4)
   TID Cond: (ctid <= '(0,10)'::tid)
   Filter: (a = 0)
(3 rows)

and:

postgres=# explain select * from t1 where ctid <= '(0,10)' or a = 20
and ctid >= '(0,0)';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Tid Scan on t1  (cost=0.01..176.18 rows=12 width=4)
   TID Cond: ((ctid <= '(0,10)'::tid) OR (ctid >= '(0,0)'::tid))
   Filter: ((ctid <= '(0,10)'::tid) OR ((a = 20) AND (ctid >= '(0,0)'::tid)))
(3 rows)

I understand why the 2nd query didn't remove the ctid quals from the
filter, and I understand why the first query could. I just didn't
manage to convince myself that the code behaves correctly for all
cases.

During my pass through 0001, 0002 and 0003 I noted the following:

0001:

1. I see a few instances of:

#define DatumGetItemPointer(X) ((ItemPointer) DatumGetPointer(X))
#define ItemPointerGetDatum(X) PointerGetDatum(X)

in both tid.c and ginfuncs.c, and I see you have:

+ itemptr = (ItemPointer) DatumGetPointer(constval);

Do you think it would be worth moving the macros out of tid.c and
ginfuncs.c into postgres.h and use that macro instead?

(I see the code in this file already did this, so it might not matter
about this)

0002:

2. In TidCompoundRangeQualFromExpr() rlst is not really needed. You
can just return MakeTidRangeQuals(found_quals); or return NIL.

3. Can you explain why this only needs to take place when list_length() == 1?

/*
 * In the case of a compound qual such as "ctid > ? AND ctid < ? AND ...",
 * the various parts will have come from different RestrictInfos.  So
 * remove each part separately.
 */
if (list_length(tidquals) == 1)
{
    Node       *qual = linitial(tidquals);

    if (and_clause(qual))
    {
        BoolExpr   *and_qual = ((BoolExpr *) qual);

        scan_clauses = list_difference(scan_clauses, and_qual->args);
    }
}

4. Accidental change?

- tidquals);
+ tidquals
+ );

5. Shouldn't this comment get changed?

- * NumTids    number of tids in this scan
+ * NumRanges    number of tids in this scan

6. There's no longer a field named NumTids

- * TidList    evaluated item pointers (array of size NumTids)
+ * TidRanges    evaluated item pointers (array of size NumTids)

7. The following field is not documented in TidScanState:

+ bool tss_inScan;

8. Can you name this exprtype instead?

+ TidExprType type; /* type of op */

"type" is used by Node types to indicate their type.

9. It would be neater this:

if (expr->opno == TIDLessOperator || expr->opno == TIDLessEqOperator)
    tidopexpr->type = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
else if (expr->opno == TIDGreaterOperator || expr->opno == TIDGreaterEqOperator)
    tidopexpr->type = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
else
    tidopexpr->type = TIDEXPR_EQ;

tidopexpr->exprstate = exprstate;

tidopexpr->inclusive = expr->opno == TIDLessEqOperator ||
    expr->opno == TIDGreaterEqOperator;

as a switch:

switch (expr->opno)
{
    case TIDLessEqOperator:
        tidopexpr->inclusive = true;
        /* fall through */
    case TIDLessOperator:
        tidopexpr->type = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
        break;
    case TIDGreaterEqOperator:
        tidopexpr->inclusive = true;
        /* fall through */
    case TIDGreaterOperator:
        tidopexpr->type = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
        break;
    default:
        tidopexpr->type = TIDEXPR_EQ;
}
tidopexpr->exprstate = exprstate;

10. I don't quite understand this comment:

+ * Create an ExprState corresponding to the value part of a TID comparison,
+ * and wrap it in a TidOpExpr.  Set the type and inclusivity of the TidOpExpr
+ * appropriately, depending on the operator and position of the its arguments.

I don't quite see how the code sets the inclusivity depending on the
position of the arguments.

Maybe the comment should be:

+ * For the given 'expr' build and return an appropriate TidOpExpr taking into
+ * account the expr's operator and operand order.

11. ScalarArrayOpExpr are commonly named "saop":

+static TidOpExpr *
+MakeTidScalarArrayOpExpr(ScalarArrayOpExpr *saex, TidScanState *tidstate)

(Though I see it's saex in other places in that file, so might not matter...)

12. You need to code SetTidLowerBound() with similar wraparound logic
you have in SetTidUpperBound().

It's perhaps unlikely, but the following shows incorrect results.

postgres=# select ctid from t1 where ctid > '(0,65535)' limit 1;
 ctid
-------
 (0,1)
(1 row)

-- the following is fine.

Time: 1.652 ms
postgres=# select ctid from t1 where ctid >= '(0,65535)' limit 1;
 ctid
-------
 (1,1)
(1 row)

Likely you can just upgrade to the next block when the offset is >
MaxOffsetNumber.

13. It looks like the previous code didn't make the assumption you're making in:

+ * A current-of TidExpr only exists by itself, and we should
+ * already have allocated a tidList entry for it.  We don't
+ * need to check whether the tidList array needs to be
+ * resized.

I'm not sure if it's a good idea to lock the executor code into what
the grammar currently says is possible. The previous code didn't
assume that.

14. we pass 'false' to what?

+ * save the tuple and the buffer returned to us by the access methods in
+ * our scan tuple slot and return the slot.  Note: we pass 'false' because
+ * tuples returned by heap_getnext() are pointers onto disk pages and were
+ * not created with palloc() and so should not be pfree()'d.  Note also
+ * that ExecStoreHeapTuple will increment the refcount of the buffer; the
+ * refcount will not be dropped until the tuple table slot is cleared.
  */
- return ExecClearTuple(slot);
+ if (tuple)
+ ExecStoreBufferHeapTuple(tuple, /* tuple to store */
+ slot, /* slot to store in */
+ scandesc->rs_cbuf); /* buffer associated
+ * with this tuple */
+ else
+ ExecClearTuple(slot);
+
+ return slot;

0003:

Saw nothing wrong:

0004:

Not yet reviewed.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From
Alvaro Herrera
Date:
On 2018-Nov-06, David Rowley wrote:

> 14. we pass 'false' to what?
> 
> + * save the tuple and the buffer returned to us by the access methods in
> + * our scan tuple slot and return the slot.  Note: we pass 'false' because
> + * tuples returned by heap_getnext() are pointers onto disk pages and were
> + * not created with palloc() and so should not be pfree()'d.  Note also
> + * that ExecStoreHeapTuple will increment the refcount of the buffer; the
> + * refcount will not be dropped until the tuple table slot is cleared.
>   */

Seems a mistake stemming from 29c94e03c7d0 ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Tid scan improvements

From
Edmund Horner
Date:
On Tue, 6 Nov 2018 at 16:52, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2018-Nov-06, David Rowley wrote:
> > 14. we pass 'false' to what?
> >
> > + * save the tuple and the buffer returned to us by the access methods in
> > + * our scan tuple slot and return the slot.  Note: we pass 'false' because
> > + * tuples returned by heap_getnext() are pointers onto disk pages and were
> > + * not created with palloc() and so should not be pfree()'d.  Note also
> > + * that ExecStoreHeapTuple will increment the refcount of the buffer; the
> > + * refcount will not be dropped until the tuple table slot is cleared.
> >   */
>
> Seems a mistake stemming from 29c94e03c7d0 ...

Yep -- I copied that bit from nodeSeqscan.c.  Some of the notes were
removed in that change, but nodeSeqscan.c and nodeIndexscan.c still
have them.

I made a little patch to remove them.

Attachments

Re: Tid scan improvements

From
Edmund Horner
Date:
On Tue, 6 Nov 2018 at 16:40, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've been looking over 0001 to 0003. I ran out of steam before 0004.

Hi David, thanks for another big review with lots of improvements.

> I like the design of the new patch. From what I threw so far at the
> selectivity estimation code, it seems pretty good.  I also quite like
> the design in nodeTidscan.c for range scans.


> I didn't quite manage to wrap my head around the code that removes
> redundant quals from the tidquals. For example, with:
>
> postgres=# explain select * from t1 where ctid <= '(0,10)' and a = 0;
>                     QUERY PLAN
> --------------------------------------------------
>  Tid Scan on t1  (cost=0.00..3.19 rows=1 width=4)
>    TID Cond: (ctid <= '(0,10)'::tid)
>    Filter: (a = 0)
> (3 rows)
>
> and:
>
> postgres=# explain select * from t1 where ctid <= '(0,10)' or a = 20
> and ctid >= '(0,0)';
>                                   QUERY PLAN
> ------------------------------------------------------------------------------
>  Tid Scan on t1  (cost=0.01..176.18 rows=12 width=4)
>    TID Cond: ((ctid <= '(0,10)'::tid) OR (ctid >= '(0,0)'::tid))
>    Filter: ((ctid <= '(0,10)'::tid) OR ((a = 20) AND (ctid >= '(0,0)'::tid)))
> (3 rows)
>
> I understand why the 2nd query didn't remove the ctid quals from the
> filter, and I understand why the first query could. I just didn't
> manage to convince myself that the code behaves correctly for all
> cases.

I agree it's not obvious.

1. We extract a set of tidquals that can be directly implemented by
the Tid scan.  This set is of the form:  "(CTID op ? AND ...) OR
(...)" (with some limitations).
2. If they happened to come verbatim from the original RestrictInfos,
then they will be found in scan_clauses, and we can remove them.
3. If they're not verbatim, i.e. the original RestrictInfos have
additional criteria that the Tid scan can't use, then tidquals won't
match anything in scan_clauses, and hence scan_clauses will be
unchanged.
4. We do a bit of extra work for the common and useful case of "(CTID
op ? AND ...)".  Since the top-level operation of the input quals is
an AND, it will typically be split into multiple RestrictInfo items.
We remove each part from scan_clauses.

> 1. I see a few instances of:
>
> #define DatumGetItemPointer(X) ((ItemPointer) DatumGetPointer(X))
> #define ItemPointerGetDatum(X) PointerGetDatum(X)
>
> in both tid.c and ginfuncs.c, and I see you have:
>
> + itemptr = (ItemPointer) DatumGetPointer(constval);
>
> Do you think it would be worth moving the macros out of tid.c and
> ginfuncs.c into postgres.h and use that macro instead?
>
> (I see the code in this file already did this, so it might not matter
> about this)

I'm not sure about this one -- I think it's better as a separate
patch, since we'd also change ginfuncs.c.  I have left it alone for
now.

> 2. In TidCompoundRangeQualFromExpr() rlst is not really needed. You
> can just return MakeTidRangeQuals(found_quals); or return NIL.

Yup, gone.

> 3. Can you explain why this only needs to take place when list_length() == 1?
>
> /*
> * In the case of a compound qual such as "ctid > ? AND ctid < ? AND ...",
> * the various parts will have come from different RestrictInfos.  So
> * remove each part separately.
> */
> ...

I've tried to improve the comment.

> 4. Accidental change?
>
> - tidquals);
> + tidquals
> + );
>
> 5. Shouldn't this comment get changed?
>
> - * NumTids    number of tids in this scan
> + * NumRanges    number of tids in this scan
>
> 6. There's no longer a field named NumTids
>
> - * TidList    evaluated item pointers (array of size NumTids)
> + * TidRanges    evaluated item pointers (array of size NumTids)
>
> 7. The following field is not documented in TidScanState:
>
> + bool tss_inScan;
>
> 8. Can you name this exprtype instead?
>
> + TidExprType type; /* type of op */
>
> "type" is used by Node types to indicate their type.

Yup, yup, yup, yup, yup.

> 9. It would be neater this:
>
> if (expr->opno == TIDLessOperator || expr->opno == TIDLessEqOperator)
> tidopexpr->type = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
> else if (expr->opno == TIDGreaterOperator || expr->opno == TIDGreaterEqOperator)
> tidopexpr->type = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
> else
> tidopexpr->type = TIDEXPR_EQ;
>
> tidopexpr->exprstate = exprstate;
>
> tidopexpr->inclusive = expr->opno == TIDLessEqOperator || expr->opno
> == TIDGreaterEqOperator;
>
> as a switch: ...

Yup, I think the switch is a bit nicer.

> 10. I don't quite understand this comment:
>
> + * Create an ExprState corresponding to the value part of a TID comparison,
> + * and wrap it in a TidOpExpr.  Set the type and inclusivity of the TidOpExpr
> + * appropriately, depending on the operator and position of the its arguments.
>
> I don't quite see how the code sets the inclusivity depending on the
> position of the arguments.
>
> Maybe the comment should be:
>
> + * For the given 'expr' build and return an appropriate TidOpExpr taking into
> + * account the expr's operator and operand order.

I'll go with your wording.

> 11. ScalarArrayOpExpr are commonly named "saop": ...

Yup.

> 12. You need to code SetTidLowerBound() with similar wraparound logic
> you have in SetTidUpperBound().
>
> It's perhaps unlikely, but the following shows incorrect results.
>
> postgres=# select ctid from t1 where ctid > '(0,65535)' limit 1;
>  ctid
> -------
>  (0,1)
> (1 row)
>
> -- the following is fine.
>
> Time: 1.652 ms
> postgres=# select ctid from t1 where ctid >= '(0,65535)' limit 1;
>  ctid
> -------
>  (1,1)
> (1 row)
>
> Likely you can just upgrade to the next block when the offset is >
> MaxOffsetNumber.

This is important, thanks for spotting it.

I've tried to add some code to handle this case (and also that of
"ctid < '(0,0)'") with a couple of tests too.

> 13. It looks like the previous code didn't make the assumption you're making in:
>
> + * A current-of TidExpr only exists by itself, and we should
> + * already have allocated a tidList entry for it.  We don't
> + * need to check whether the tidList array needs to be
> + * resized.
>
> I'm not sure if it's a good idea to lock the executor code into what
> the grammar currently says is possible. The previous code didn't
> assume that.

Fair enough, I've restored the previous code without the assumption.

> 14. we pass 'false' to what?

Obsolete comment (see reply to Alvaro).

I've applied most of these, and I'll post a new patch soon.


Re: Tid scan improvements

From
Edmund Horner
Date:
Hi, here's the new patch(s).

Mostly the same, but trying to address your comments from earlier as
well as clean up a few other things I noticed.

Cheers,
Edmund

Attachments

Re: Tid scan improvements

From
David Rowley
Date:
On Mon, 12 Nov 2018 at 17:35, Edmund Horner <ejrh00@gmail.com> wrote:
> Hi, here's the new patch(s).
>
> Mostly the same, but trying to address your comments from earlier as
> well as clean up a few other things I noticed.

Thanks for making those changes.

I've now had a look over the latest patches and I've found a few more
things.  Many of these are a bit nitpicky, but certainly not all. I
also reviewed 0004 this time.

0001:

1. The row estimates are not quite right.  This causes the row
estimation to go the wrong way for isgt.

For example, the following gets 24 rows instead of 26.

postgres=# create table t (a int);
CREATE TABLE
postgres=# insert into t select generate_Series(1,100);
INSERT 0 100
postgres=# analyze t;
postgres=# explain analyze select * from t where ctid >= '(0,75)';
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..2.25 rows=24 width=4) (actual
time=0.046..0.051 rows=26 loops=1)
   Filter: (ctid >= '(0,75)'::tid)
   Rows Removed by Filter: 74
 Planning Time: 0.065 ms
 Execution Time: 0.074 ms
(5 rows)

The < and <= case is not quite right either. < should have 1 fewer
tuple than the calculated average tuples per page, and <= should have
the same (assuming no gaps)

I've attached a small delta patch that I think improves things here.

0002:

2. You should test for a non-empty List with list != NIL

/*
 * If no quals were specified, then a complete scan is assumed.  Make a
 * TidExpr with an empty list of TidOpExprs.
 */
if (!node->tidquals)

Also, can you not just return after that if test? I think the code
would be easier to read with it like that.

3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
allocation until it reaches the required size. See how
MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
we always have a power of two sized array which is much nicer if we
ever reach the palloc() limit as if the array is sized at the palloc()
limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
unlikely to be a problem here, but... the question would be how to
decide on the initial size.

4. "at" needs shifted left a couple of words

/*
 * If the lower bound was already or above at the maximum block
 * number, then there is no valid range.
 */

but I don't see how it could be "or above".  The ctid type does not
have the room for that. Although, that's not to say you should test if
(block == MaxBlockNumber), the >= seems better for the code. I'm just
complaining about the comment.

5. TidInArrayExprEval() lacks a header comment, and any other comments
to mention what it does. The function args also push over the 80 char
line length.  There's also a few other functions in nodeTidscan.c that
are missing a header comment.

6. In MergeTidRanges(), you have:

ItemPointerData a_last = a->last;
ItemPointerData b_last;

if (!ItemPointerIsValid(&a_last))
    a_last = a->first;

but I don't see anywhere you're setting ->last to an invalid item
pointer. Is this left over from a previous design of the range scan?
It looks like in TidExprEval() you're setting the upperbound to the
last page on the relation.

7. "fist" -> "first"

* If the tuple is in the fist block of the range and before the first

8. tss_TidRangePtr is a pretty confusingly named field.

if (node->tss_TidRangePtr >= numRanges || node->tss_TidRangePtr < 0)
    break;

I'd expect anything with ptr in it to be a pointer, but this seems to
be an array index. Maybe "idx" is better than "ptr", or take note from
nodeAppend.c and have something like "tts_whichRange".

UPDATE: I see you've just altered what's there already. Perhaps it's
okay to leave it as you have it, but it's still not ideal.

9. This comment seems to indicate that a range can only have one
bound, but that does not seem to be the case.

* Ranges with only one item -- including one resulting from a
* CURRENT-OF qual -- are handled by looking up the item directly.

It seems open bounded ranges just have the lowest or highest possible
value for a ctid on the open side.

Perhaps the comment could be written as:

/*
 * For ranges containing a single tuple, we can simply make an
 * attempt to fetch the tuple directly.
 */

10. In cost_tidscan() I think you should ceil() the following:

double pages = selectivity * baserel->pages;

Otherwise, you'll end up partially charging a seq_page_cost, which
seems pretty invalid since you can't partially read a page.

11. In the comment:

/* TODO decide what the costs should be */

I think you can just explain why you're charging 1 random_page_cost
and the remainder in seq_page_cost. Or is there something left to do
here that I've forgotten about?

12. expected_comparison_operator is a bit long a name:

IsTidComparison(OpExpr *node, int varno, Oid expected_comparison_operator)

How about just expected_opno?

13. !rlst -> rlst == NIL

/* if no range qual was found, look for any other TID qual */
if (!rlst)

(Yeah I know there's various cases where it's done incorrectly there
already :-( )

14. This is not great:

#define IsTidEqualClause(node, varno) IsTidComparison(node, varno,
TIDEqualOperator)
#define IsTidLTClause(node, varno) IsTidComparison(node, varno, TIDLessOperator)
#define IsTidLEClause(node, varno) IsTidComparison(node, varno,
TIDLessEqOperator)
#define IsTidGTClause(node, varno) IsTidComparison(node, varno,
TIDGreaterOperator)
#define IsTidGEClause(node, varno) IsTidComparison(node, varno,
TIDGreaterEqOperator)

#define IsTidRangeClause(node, varno) (IsTidLTClause(node, varno) || \
IsTidLEClause(node, varno) || \
IsTidGTClause(node, varno) || \
IsTidGEClause(node, varno))

The 4 macros for >, >=, < and <= are only used by IsTidRangeClause()
which means IsTidComparison() could get called up to 4 times. Most of
the work it does would be redundant in that case.  Maybe it's better
to rethink that?

15. There's no field named NumTids:

 * TidRanges evaluated item pointers (array of size NumTids)

0003:

16. I think the following comment needs to be updated:

/* start from last page of the scan */

to:

/* When scanning the whole relation, start from the last page of the scan */

and drop:

/* Scanning the full relation: start just before start block. */

then maybe change:

/* Scanning a restricted range: start at end of range. */

to

/* Otherwise, if scanning just a subset of the relation, start at the
final block in the range */

0004:

17. Can you make a few changes to build_tidscan_pathkeys():

a. build_index_pathkeys() uses ScanDirectionIsBackward(scandir), can
you set the opno based on that rather than doing "direction ==
ForwardScanDirection"
b. varexpr can be an Expr and just be named expr. Please move the
declaration and assignment out onto separate lines and wrap the long
line.
c. wrap long line with the call to build_expression_pathkey(). Get rid
of the (Expr *) cast.

18. I'd expect the following not to produce a sort above the Tid Scan.

postgres=# set enable_seqscan=0;
SET
postgres=# explain select * from t inner join t t1 on t.ctid = t1.ctid
where t.ctid < '(0,10)' ;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Merge Join  (cost=10000000008.65..10000000009.28 rows=9 width=8)
   Merge Cond: (t.ctid = t1.ctid)
   ->  Sort  (cost=3.33..3.35 rows=9 width=10)
         Sort Key: t.ctid
         ->  Tid Scan on t  (cost=0.00..3.18 rows=9 width=10)
               TID Cond: (ctid < '(0,10)'::tid)
   ->  Sort  (cost=10000000005.32..10000000005.57 rows=100 width=10)
         Sort Key: t1.ctid
         ->  Seq Scan on t t1  (cost=10000000000.00..10000000002.00
rows=100 width=10)
(9 rows)

On looking at why the planner did this, I see it's down to how you've
coded create_tidscan_paths(). You're creating a tidpath if there's any
quals or any pathkeys useful to the query's ORDER BY, but only
including the pathkeys if they're useful for the query's ORDER BY.  I
think it'll be better to include the forward pathkeys in all cases,
and just make it a backward Tid Scan if backward keys are useful for
the ORDER BY.   There's still a problem with this as a Merge Join
would need a Sort if there was an ORDER BY ctid DESC for one relation
even if the other relation had some valid ctid quals since the 2nd
scan would create a forward Tid Scan.  Maybe that's not worth worrying
about.   The only fix I can imagine is to always create a forward and
backward Tid Scan path, which is pretty bad as it's two more paths
that likely won't get used 99.9% of the time.

This also caused me to notice the costs are pretty broken for this:

postgres=# explain select * from t order by ctid;
                     QUERY PLAN
---------------------------------------------------
 Tid Scan on t  (cost=0.00..0.00 rows=100 width=10)
(1 row)

19. Looks like the ScanDirection's normally get named "scandir":

static TidScan *make_tidscan(List *qptlist, List *qpqual, Index scanrelid,
List *tidquals, ScanDirection direction);

Likewise for the various .h files you've added that as a new field to
various structs.

Setting back to waiting on author.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments

Re: Tid scan improvements

From
Tomas Vondra
Date:
On 11/22/18 8:41 AM, David Rowley wrote:
 >
> ...
> 
> 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
> allocation until it reaches the required size. See how
> MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
> we always have a power of two sized array which is much nicer if we
> ever reach the palloc() limit as if the array is sized at the palloc()
> limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
> unlikely to be a problem here, but... the question would be how to
> decide on the initial size.
> 

I think it kinda tries to do that in some cases, by doing this:

     *numAllocRanges *= 2;

     ...

     tidRanges = (TidRange *)
         repalloc(tidRanges,
                  *numAllocRanges * sizeof(TidRange));

The problem here is that what matters is not numAllocRanges being 2^N, 
but the number of bytes allocated being 2^N. Because that's what ends up 
in AllocSet, which keeps lists of 2^N chunks.

And as TidRange is 12B, this is guaranteed to waste memory, because 
no matter what the first factor is, the result will never be 2^N.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Tid scan improvements

From
Edmund Horner
Date:
On Fri, 23 Nov 2018 at 07:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 11/22/18 8:41 AM, David Rowley wrote:
> > ...
> > 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
> > allocation until it reaches the required size. See how
> > MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
> > we always have a power of two sized array which is much nicer if we
> > ever reach the palloc() limit as if the array is sized at the palloc()
> > limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
> > unlikely to be a problem here, but... the question would be how to
> > decide on the initial size.
>
> I think it kinda tries to do that in some cases, by doing this:
>
>      *numAllocRanges *= 2;
>      ...
>      tidRanges = (TidRange *)
>          repalloc(tidRanges,
>                   *numAllocRanges * sizeof(TidRange));
>
> The problem here is that what matters is not numAllocRanges being 2^N,
> but the number of bytes allocated being 2^N. Because that's what ends up
> in AllocSet, which keeps lists of 2^N chunks.
>
> And as TidRange is 12B, so this is guaranteed to waste memory, because
> no matter what the first factor is, the result will never be 2^N.

For simplicity, I think making it a strict doubling of capacity each
time is fine.  That's what we see in numerous other places in the
backend code.

What we don't really see is intentionally setting the initial capacity
so that each subsequent capacity is close-to-but-not-exceeding a power
of 2 bytes.  You can't really do that optimally if working in terms of
whole numbers of items that aren't each a power of 2 size.  This step,
there may be 2/3 of an item spare; next step, we'll have a whole item
spare that we're not going to use.  So we could keep track in terms of
bytes allocated, and then figure out how many items we can fit at the
current time.

In my opinion, such complexity is overkill for Tid scans.

Currently, we try to pick an initial size based on the number of
expressions.  We assume each expression will yield one range, and
allow that a saop expression might require us to enlarge the array.

Again, for simplicity, we should scrap that and pick something like
floor(256/sizeof(TidRange)) = 21 items, with about 1.5% wastage.
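For concreteness, that sizing works out as in this compilable sketch (the struct here is only a stand-in that reproduces the 12-byte size of the patch's TidRange; the macro name is illustrative):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in for the patch's TidRange: two 6-byte ItemPointerData bounds,
 * 12 bytes in total.
 */
typedef struct TidRange
{
    unsigned short first[3];    /* packed block number + offset */
    unsigned short last[3];
} TidRange;

/*
 * Initial capacity: the largest item count whose allocation fits in a
 * 256-byte AllocSet chunk.  floor(256 / 12) = 21 items, leaving 4 of
 * 256 bytes unused (about 1.5% wastage).
 */
#define TIDRANGE_INITIAL_ALLOC (256 / sizeof(TidRange))
```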


Re: Tid scan improvements

From
Tomas Vondra
Date:

On 11/24/18 1:56 AM, Edmund Horner wrote:
> On Fri, 23 Nov 2018 at 07:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> On 11/22/18 8:41 AM, David Rowley wrote:
>>> ...
>>> 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
>>> allocation until it reaches the required size. See how
>>> MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
>>> we always have a power of two sized array which is much nicer if we
>>> ever reach the palloc() limit as if the array is sized at the palloc()
>>> limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
>>> unlikely to be a problem here, but... the question would be how to
>>> decide on the initial size.
>>
>> I think it kinda tries to do that in some cases, by doing this:
>>
>>      *numAllocRanges *= 2;
>>      ...
>>      tidRanges = (TidRange *)
>>          repalloc(tidRanges,
>>                   *numAllocRanges * sizeof(TidRange));
>>
>> The problem here is that what matters is not numAllocRanges being 2^N,
>> but the number of bytes allocated being 2^N. Because that's what ends up
>> in AllocSet, which keeps lists of 2^N chunks.
>>
>> And as TidRange is 12B, so this is guaranteed to waste memory, because
>> no matter what the first factor is, the result will never be 2^N.
> 
> For simplicity, I think making it a strict doubling of capacity each
> time is fine.  That's what we see in numerous other places in the
> backend code.
> 

Sure.

> What we don't really see is intentionally setting the initial capacity
> so that each subsequent capacity is close-to-but-not-exceeding a power
> of 2 bytes.  You can't really do that optimally if working in terms of
> whole numbers of items that aren't each a power of 2 size.  This step,
> there may be 2/3 of an item spare; next step, we'll have a whole item
> spare that we're not going to use.

Ah, I missed the detail with setting initial size.

> So we could keep track in terms of bytes allocated, and then figure
> out how many items we can fit at the current time.
> 
> In my opinion, such complexity is overkill for Tid scans.
> 
> Currently, we try to pick an initial size based on the number of
> expressions.  We assume each expression will yield one range, and
> allow that a saop expression might require us to enlarge the array.
> 
> Again, for simplicity, we should scrap that and pick something like
> floor(256/sizeof(TidRange)) = 21 items, with about 1.5% wastage.
> 

Probably. I don't think it'd be a lot of code to do the exact sizing,
but you're right 1.5% is close enough. As long as there is a comment
explaining the initial sizing, I'm fine with that.

If I could suggest one more thing, I'd define a struct combining the
array of ranges, numRanges and numAllocRanges like:

typedef struct TidRanges
{
    int         numRanges;
    int         numAllocRanges;
    TidRange    ranges[FLEXIBLE_ARRAY_MEMBER];
} TidRanges;

and use that instead of the plain array. I find it easier to follow
compared to passing the various fields directly (sometimes as a value,
sometimes pointer to the value, etc.).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Tid scan improvements

From
Edmund Horner
Date:
On Sat, 24 Nov 2018 at 15:46, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 11/24/18 1:56 AM, Edmund Horner wrote:
> > On Fri, 23 Nov 2018 at 07:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> >> On 11/22/18 8:41 AM, David Rowley wrote:
> >>> ...
> >>> 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
> >>> allocation until it reaches the required size. See how
> >>> MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
> >>> we always have a power of two sized array which is much nicer if we
> >>> ever reach the palloc() limit as if the array is sized at the palloc()
> >>> limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
> >>> unlikely to be a problem here, but... the question would be how to
> >>> decide on the initial size.
> >>
> >> I think it kinda tries to do that in some cases, by doing this:
> >>
> >>      *numAllocRanges *= 2;
> >>      ...
> >>      tidRanges = (TidRange *)
> >>          repalloc(tidRanges,
> >>                   *numAllocRanges * sizeof(TidRange));
> >>
> >> The problem here is that what matters is not numAllocRanges being 2^N,
> >> but the number of bytes allocated being 2^N. Because that's what ends up
> >> in AllocSet, which keeps lists of 2^N chunks.
> >>
> >> And as TidRange is 12B, so this is guaranteed to waste memory, because
> >> no matter what the first factor is, the result will never be 2^N.
> >
> > For simplicity, I think making it a strict doubling of capacity each
> > time is fine.  That's what we see in numerous other places in the
> > backend code.
> >
>
> Sure.
>
> > What we don't really see is intentionally setting the initial capacity
> > so that each subsequent capacity is close-to-but-not-exceeding a power
> > of 2 bytes.  You can't really do that optimally if working in terms of
> > whole numbers of items that aren't each a power of 2 size.  This step,
> > there may be 2/3 of an item spare; next step, we'll have a whole item
> > spare that we're not going to use.
>
> Ah, I missed the detail with setting initial size.
>
> > So we could keep track in terms of bytes allocated, and then figure
> > out how many items we can fit at the current time.
> >
> > In my opinion, such complexity is overkill for Tid scans.
> >
> > Currently, we try to pick an initial size based on the number of
> > expressions.  We assume each expression will yield one range, and
> > allow that a saop expression might require us to enlarge the array.
> >
> > Again, for simplicity, we should scrap that and pick something like
> > floor(256/sizeof(TidRange)) = 21 items, with about 1.5% wastage.
> >
>
> Probably. I don't think it'd be a lot of code to do the exact sizing,
> but you're right 1.5% is close enough. As long as there is a comment
> explaining the initial sizing, I'm fine with that.
>
> If I could suggest one more thing, I'd define a struct combining the
> array of ranges, numRanges and numAllocRangeslike:
>
> typedef struct TidRanges
> {
>     int         numRanges;
>     int         numAllocRanges;
>     TidRange    ranges[FLEXIBLE_ARRAY_MEMBER];
> } TidRanges;
>
> and use that instead of the plain array. I find it easier to follow
> compared to passing the various fields directly (sometimes as a value,
> sometimes pointer to the value, etc.).

OK, I've rewritten it to use a struct:

typedef struct TidRangeArray {
    TidRange *ranges;
    int numRanges;
    int numAllocated;
}           TidRangeArray;

which is slightly different from the flexible array member version you
suggested.  The TidRangeArray is allocated on the stack in the
function that builds it, and then ranges and numRanges are copied into
the TidScanState before the function returns.

Any particular pros/cons of this versus your approach?  With yours, I
presume we'd have a pointer to TidRanges in TidScanState.

My other concern now is that EnsureTidRangeSpace needs a loop to
double the allocated size.  Most such arrays in the backend only ever
grow by 1, so a single doubling is fine, but the TID scan one can grow
by an arbitrary number with a scalar array op, and it's nice to not
have to check the space for each individual item.  Here's what I've
got.

void
EnsureTidRangeSpace(TidRangeArray *tidRangeArray, int numNewItems)
{
    int requiredSpace = tidRangeArray->numRanges + numNewItems;
    if (requiredSpace <= tidRangeArray->numAllocated)
        return;

    /* it's not safe to double the size unless we're less than half INT_MAX */
    if (requiredSpace >= INT_MAX / 2)
        tidRangeArray->numAllocated = requiredSpace;
    else
        while (tidRangeArray->numAllocated < requiredSpace)
            tidRangeArray->numAllocated *= 2;

    tidRangeArray->ranges = (TidRange *)
        repalloc(tidRangeArray->ranges,
                 tidRangeArray->numAllocated * sizeof(TidRange));
}

If you're in danger of overflowing numAllocated with the number of
TIDs in your query, you're probably going to have other problems.  But
I'd prefer to at least not get stuck in an infinite doubling loop.

Note that you don't need any single ScalarArrayOp to return a huge
result, because you can have multiple such ops in your query, and the
results for each all need to get put into the TidRangeArray before
de-duplication occurs.

What's a safe way to check that we're not trying to process too many items?


Re: Tid scan improvements

From
Edmund Horner
Date:
On Thu, 22 Nov 2018 at 20:41, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've now had a look over the latest patches and I've found a few more
> things.  Many of these are a bit nitpicky, but certainly not all. I
> also reviewed 0004 this time.


Whew!  A lot more things to look at.

I've tried to address most of what you've raised, and attach yet
another set of patches.  There are are few things that I'm not settled
on, discussed below under Big Items.

CC'd Tomas, if he wants to check what I've done with the TidRange
array allocation.


***** Big Items *****

> 0001:
>
> 1. The row estimates are not quite right.  This causes the row
> estimation to go the wrong way for isgt.
>
> For example, the following gets 24 rows instead of 26.
>
> postgres=# create table t (a int);
> CREATE TABLE
> postgres=# insert into t select generate_Series(1,100);
> INSERT 0 100
> postgres=# analyze t;
> postgres=# explain analyze select * from t where ctid >= '(0,75)';
>                                          QUERY PLAN
> ---------------------------------------------------------------------------------------------
>  Seq Scan on t  (cost=0.00..2.25 rows=24 width=4) (actual
> time=0.046..0.051 rows=26 loops=1)
>    Filter: (ctid >= '(0,75)'::tid)
>    Rows Removed by Filter: 74
>  Planning Time: 0.065 ms
>  Execution Time: 0.074 ms
> (5 rows)
>
> The < and <= case is not quite right either. < should have 1 fewer
> tuple than the calculated average tuples per page, and <= should have
> the same (assuming no gaps)
>
> I've attached a small delta patch that I think improves things here.

Thanks, I've incorporated your patch.  I think the logic for iseq and
isgt makes sense now.

Since we only have the total number of tuples and the total number of
pages, and no real statistics, this might be the best we can
reasonably do.  There's still a noticable rowcount error for the last
page, and slighter rowcount errors for other pages.  We estimate
density = ntuples/npages for all pages; but in a densely populated
table, we'll average only half the number of tuples in the last page
as earlier pages.

I guess we *could* estimate density = ntuples/(npages - 0.5) for all
but the last page; and half that for the last.  But that adds
complexity, and you'd still only get a good row count when the last
page was about half full.

I implemented this anyway, and it does improve row counts a bit.  I'll
include it in the next patch set and you can take a look.
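In outline, the adjusted estimator works like this (a hypothetical helper for illustration, not the patch's actual code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Estimated tuples on a given page, assuming every page but the last is
 * equally dense and the last page averages half that density.  Solving
 *     ntuples = d * (npages - 1) + d / 2
 * for the per-page density d gives d = ntuples / (npages - 0.5).
 */
static double
estimate_tuples_on_page(double ntuples, double npages, bool is_last_page)
{
    double      density;

    if (npages <= 1.0)
        return ntuples;         /* a single page holds everything */

    density = ntuples / (npages - 0.5);
    return is_last_page ? density / 2.0 : density;
}
```

For 95 tuples over 10 pages this charges 10 tuples to each full page and 5 to the last, summing back to 95.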

I also spent some time today agonising over how visibility would affect
things, but did not come up with anything useful to add to our
formulas.

> 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
> allocation until it reaches the required size. See how
> MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
> we always have a power of two sized array which is much nicer if we
> ever reach the palloc() limit as if the array is sized at the palloc()
> limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
> unlikely to be a problem here, but... the question would be how to
> decide on the initial size.

I've tried to change things that way, but we still need to deal with
excessive numbers of items.

I've defined a constant MaxTidRanges = MaxAllocSize/sizeof(TidRange),
and raise an error if the required size exceeds that.
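A sketch of that cap, with stand-ins for the backend's MaxAllocSize and the TidRange size (the constant name MaxTidRanges is from the patch; the helper function is illustrative, and the real code raises an error rather than returning false):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MaxAllocSize ((size_t) 0x3fffffff) /* the palloc() limit, ~1GB */

typedef struct TidRange { char bytes[12]; } TidRange;   /* size stand-in */

/* The largest number of ranges a single palloc() can hold. */
#define MaxTidRanges (MaxAllocSize / sizeof(TidRange))

/*
 * Returns true if growing the array to requiredSpace items stays within
 * the palloc() limit.
 */
static bool
TidRangeCountIsSafe(size_t requiredSpace)
{
    return requiredSpace <= MaxTidRanges;
}
```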

> 4. "at" needs shifted left a couple of words
>
> /*
> * If the lower bound was already or above at the maximum block
> * number, then there is no valid range.
> */
>
> but I don't see how it could be "or above".  The ctid type does not
> have the room for that. Although, that's not to say you should test if
> (block == MaxBlockNumber), the >= seems better for the code. I'm just
> complaining about the comment.

We have to deal with TIDs entered by the user, which can include
invalid ones like (4294967295,0).  MaxBlockNumber is 4294967294.

> 12. expected_comparison_operator is a bit long a name:
>
> IsTidComparison(OpExpr *node, int varno, Oid expected_comparison_operator)
>
> How about just expected_opno?
>
> 14. This is not great:
>
> [horrible macros in tidpath.c]
>
> The 4 macros for >, >=, < and <= are only used by IsTidRangeClause()
> which means IsTidComparison() could get called up to 4 times. Most of
> the work it does would be redundant in that case.  Maybe it's better
> to rethink that?

Yeah.  I've rewritten all this as two functions, IsTidEqualClause and
IsTidRangeClause, which each check the opno, with a helper function
IsTidBinaryExpression that checks everything else.
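In outline, that rewrite looks something like the following sketch. The OpExpr stand-in and the body of IsTidBinaryExpression are simplified to a flag for illustration (the real helper inspects the argument list for a CTID Var and a pseudoconstant); the operator OIDs are PostgreSQL's tid comparison operators from pg_operator:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;

/* OIDs of the tid comparison operators, from pg_operator. */
#define TIDEqualOperator     387
#define TIDLessOperator      2799
#define TIDGreaterOperator   2800
#define TIDLessEqOperator    2801
#define TIDGreaterEqOperator 2802

/* Simplified stand-in for the planner's OpExpr node. */
typedef struct OpExpr
{
    Oid         opno;       /* the operator's OID */
    bool        args_ok;    /* stand-in: CTID Var of varno vs. pseudoconstant? */
} OpExpr;

/*
 * Checks everything except the operator itself: two arguments, one the
 * CTID system column of the given relation, the other a pseudoconstant.
 * (Reduced to a flag here.)
 */
static bool
IsTidBinaryExpression(OpExpr *node, int varno)
{
    (void) varno;
    return node->args_ok;
}

static bool
IsTidEqualClause(OpExpr *node, int varno)
{
    return node->opno == TIDEqualOperator &&
        IsTidBinaryExpression(node, varno);
}

static bool
IsTidRangeClause(OpExpr *node, int varno)
{
    if (node->opno != TIDLessOperator &&
        node->opno != TIDLessEqOperator &&
        node->opno != TIDGreaterOperator &&
        node->opno != TIDGreaterEqOperator)
        return false;
    return IsTidBinaryExpression(node, varno);
}
```

This way the shared checks run once per clause, instead of up to four times through the old macros.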

> 18. I'd expect the following not to produce a sort above the Tid Scan.
>
> postgres=# set enable_seqscan=0;
> SET
> postgres=# explain select * from t inner join t t1 on t.ctid = t1.ctid
> where t.ctid < '(0,10)' ;
>                                       QUERY PLAN
> ---------------------------------------------------------------------------------------
>  Merge Join  (cost=10000000008.65..10000000009.28 rows=9 width=8)
>    Merge Cond: (t.ctid = t1.ctid)
>    ->  Sort  (cost=3.33..3.35 rows=9 width=10)
>          Sort Key: t.ctid
>          ->  Tid Scan on t  (cost=0.00..3.18 rows=9 width=10)
>                TID Cond: (ctid < '(0,10)'::tid)
>    ->  Sort  (cost=10000000005.32..10000000005.57 rows=100 width=10)
>          Sort Key: t1.ctid
>          ->  Seq Scan on t t1  (cost=10000000000.00..10000000002.00
> rows=100 width=10)
> (9 rows)
>
> On looking at why the planner did this, I see it's down to how you've
> coded create_tidscan_paths(). You're creating a tidpath if there's any
> quals or any useful pathkeys useful to the query's ORDER BY, but only
> including the pathkeys if they're useful for the query's ORDER BY.  I
> think it'll be better to include the forward pathkeys in all cases,
> and just make it a backward Tid Scan if backward keys are useful for
> the ORDER BY.   There's still a problem with this as a Merge Join
> would need a Sort if there was an ORDER BY ctid DESC for one relation
> even if the other relation had some valid ctid quals since the 2nd
> scan would create a forward Tid Scan.  Maybe that's not worth worrying
> about.   The only fix I can imagine is to always create a forward and
> backward Tid Scan path, which is pretty bad as it's two more paths
> that likely won't get used 99.9% of the time.

Two paths seems excessive just to cater for these unlikely plans.  We
don't provide any other support for joining on CTID.

But setting the path keys doesn't cost much, so we should do that.

> This also caused me to notice the costs are pretty broken for this:
>
> postgres=# explain select * from t order by ctid;
>                      QUERY PLAN
> ---------------------------------------------------
>  Tid Scan on t  (cost=0.00..0.00 rows=100 width=10)
> (1 row)

Yeah -- a side effect of treating empty tidquals as a scan over the
whole table.  I've added costing code for this case.


***** Smaller items *****

Compacted for brevity (hope you don't mind):

> 2. You should test for a non-empty List with list != NIL  [...]  Also, can you not just return after that if test? I think
> the code would be easier to read with it like that.
> 5. TidInArrayExprEval() lacks a header comment [...]
> 6. In MergeTidRanges(), you have: [leftover code]
> 7. "fist" -> "first" [...]
> 8. tss_TidRangePtr is a pretty confusingly named field. [...]
> 9. This comment seems to indicate that a range can only have one bound, but that does not seem to be the case. [...]
> 10. In cost_tidscan() I think you should ceil() the following: [...]
> 11. In the comment: /* TODO decide what the costs should be */ [...]
> 13. !rlst -> rlst == NIL
> 15. There's no field named NumTids: [...]
> 16. I think the following comment needs to be updated: [heapam comments]
> 17. Can you make a few changed to build_tidscan_pathkeys(): [...]
> 19. Looks like the ScanDirection's normally get named "scandir": [...]

These are mostly trivial and I've generally gone with your recommendation.

Attachments

Re: Tid scan improvements

From
David Rowley
Date:
Review of v5:

0001: looks good.

0002:

1. I don't think you need palloc0() here. palloc() looks like it would be fine.

if (tidRangeArray->ranges == NULL)
tidRangeArray->ranges = (TidRange *)
palloc0(tidRangeArray->numAllocated * sizeof(TidRange));

if that wasn't the case, then you'll need to also zero the additional
memory when you repalloc().

2. Can't the following code be moved into the correct
forwards/backwards if block inside the if inscan block above?

/* If we've finished iterating over the ranges, exit the loop. */
if (node->tss_CurrentTidRange >= numRanges ||
node->tss_CurrentTidRange < 0)
break;

Something like:

if (bBackward)
{
    if (node->tss_CurrentTidRange < 0)
    {
        /* initialize for backward scan */
        node->tss_CurrentTidRange = numRanges - 1;
    }
    else if (node->tss_CurrentTidRange == 0)
        break;
    else
        node->tss_CurrentTidRange--;
 }
else
{
    if (node->tss_CurrentTidRange < 0)
    {
        /* initialize for forward scan */
        node->tss_CurrentTidRange = 0;
    }
    else if (node->tss_CurrentTidRange >= numRanges - 1)
        break;
    else
        node->tss_CurrentTidRange++;
}

I think that's a few less lines and instructions and (I think) a bit neater too.

3. if (found_quals != NIL) (yeah, I know there's already lots of
places not doing this)

/* If we found any, make an AND clause out of them. */
if (found_quals)

likewise in:

/* Use a range qual if any were found. */
if (found_quals)

4. The new tests in tidscan.sql should drop the newly created tables.
(I see some get dropped in the 0004 patch, but not all. Best not to
rely on a later patch to do work that this patch should do)

0003: looks okay.

0004:

5. Please add a comment to scandir in:

typedef struct TidScan
{
Scan scan;
List    *tidquals; /* qual(s) involving CTID = something */
ScanDirection scandir;
} TidScan;

/* forward or backward or don't care */ would do.

Likewise for struct TidPath. Likely IndexPath can be used for guidance.

6. Is it worth adding a Merge Join regression test for this patch?

Something like:

postgres=# explain select * from t1 inner join t1 t2 on t1.ctid =
t2.ctid order by t1.ctid desc;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Merge Join  (cost=0.00..21.25 rows=300 width=14)
   Merge Cond: (t1.ctid = t2.ctid)
   ->  Tid Scan Backward on t1  (cost=0.00..8.00 rows=300 width=10)
   ->  Materialize  (cost=0.00..8.75 rows=300 width=10)
         ->  Tid Scan Backward on t1 t2  (cost=0.00..8.00 rows=300 width=10)
(5 rows)

0005:

7. I see the logic behind this new patch, but quite possibly the
majority of the time the relpages will be out of date and you'll
mistakenly apply this to not the final page. I'm neither here nor
there with it. I imagine you might feel the same since you didn't
merge it with 0001. Maybe we can leave it out for now and see what
others think.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From
Tom Lane
Date:
Edmund Horner <ejrh00@gmail.com> writes:
> [ tid scan patches ]

I'm having a hard time wrapping my mind around why you'd bother with
backwards TID scans.  The amount of code needed versus the amount of
usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
see that there might be value in knowing that a regular scan has
"ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
on TID without an explicit sort).  It does not, however, follow that
there's any additional value in supporting the DESC case.

            regards, tom lane


Re: Tid scan improvements

From
Andres Freund
Date:
Hi,

On 2018-12-20 17:21:07 -0500, Tom Lane wrote:
> Edmund Horner <ejrh00@gmail.com> writes:
> > [ tid scan patches ]
> 
> I'm having a hard time wrapping my mind around why you'd bother with
> backwards TID scans.  The amount of code needed versus the amount of
> usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
> see that there might be value in knowing that a regular scan has
> "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
> on TID without an explicit sort).  It does not, however, follow that
> there's any additional value in supporting the DESC case.

I've not followed this thread, but wouldn't that be quite useful to be
able to move old tuples to free space earlier in the table?

I've written multiple scripts that update the later pages in a table, to
force reuse of earlier free pages (in my case by generating ctid = ANY()
style queries with all possible tids for the last few pages, the most
efficient way I could think of).

Greetings,

Andres Freund


Re: Tid scan improvements

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2018-12-20 17:21:07 -0500, Tom Lane wrote:
>> I'm having a hard time wrapping my mind around why you'd bother with
>> backwards TID scans.

> I've not followed this thread, but wouldn't that be quite useful to be
> able to move old tuples to free space earlier in the table?
> I've written multiple scripts that update the later pages in a table, to
> force reuse of earlier free pages (in my case by generating ctid = ANY()
> style queries with all possible tids for the last few pages, the most
> efficient way I could think of).

Sure, but wouldn't you now write those using something on the order of

      WHERE ctid >= '(cutoff_page_here, 1)'

?  I don't see that you'd want to write "ORDER BY ctid DESC LIMIT n"
because you wouldn't know what value of n to use to get all the
tuples on some-number-of-ending-pages.

            regards, tom lane


Re: Tid scan improvements

From
Andres Freund
Date:
Hi,

On 2018-12-20 18:06:41 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-12-20 17:21:07 -0500, Tom Lane wrote:
> >> I'm having a hard time wrapping my mind around why you'd bother with
> >> backwards TID scans.
> 
> > I've not followed this thread, but wouldn't that be quite useful to be
> > able to move old tuples to free space earlier in the table?
> > I've written multiple scripts that update the later pages in a table, to
> > force reuse of earlier free pages (in my case by generating ctid = ANY()
> > style queries with all possible tids for the last few pages, the most
> > efficient way I could think of).
> 
> Sure, but wouldn't you now write those using something on the order of
> 
>       WHERE ctid >= '(cutoff_page_here, 1)'
> 
> ?  I don't see that you'd want to write "ORDER BY ctid DESC LIMIT n"
> because you wouldn't know what value of n to use to get all the
> tuples on some-number-of-ending-pages.

I think you'd want both, to make sure there's not more tuples than
estimated. With the limit calculated to ensure there's enough free space
for them to actually fit.

Greetings,

Andres Freund


Re: Tid scan improvements

From
Edmund Horner
Date:
On Fri, 21 Dec 2018 at 11:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Edmund Horner <ejrh00@gmail.com> writes:
> > [ tid scan patches ]
>
> I'm having a hard time wrapping my mind around why you'd bother with
> backwards TID scans.  The amount of code needed versus the amount of
> usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
> see that there might be value in knowing that a regular scan has
> "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
> on TID without an explicit sort).  It does not, however, follow that
> there's any additional value in supporting the DESC case.

I have occasionally found myself running "SELECT MAX(ctid) FROM t"
when I was curious about why a table is so big after vacuuming.

Perhaps that's not a common enough use case to justify the amount of
code, especially the changes to heapam.c and explain.c.

We'd still need the pathkeys to make good use of forward scans.  (And
I think the executor still needs to support seeking backward for
cursors.)


Re: Tid scan improvements

From
David Rowley
Date
On Fri, 21 Dec 2018 at 13:09, Edmund Horner <ejrh00@gmail.com> wrote:
> On Fri, 21 Dec 2018 at 11:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'm having a hard time wrapping my mind around why you'd bother with
> > backwards TID scans.  The amount of code needed versus the amount of
> > usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
> > see that there might be value in knowing that a regular scan has
> > "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
> > on TID without an explicit sort).  It does not, however, follow that
> > there's any additional value in supporting the DESC case.
>
> I have occasionally found myself running "SELECT MAX(ctid) FROM t"
> when I was curious about why a table is so big after vacuuming.
>
> Perhaps that's not a common enough use case to justify the amount of
> code, especially the changes to heapam.c and explain.c.
>
> We'd still need the pathkeys to make good use of forward scans.  (And
> I think the executor still needs to support seeking backward for
> cursors.)

I think the best thing to do here is separate out all the additional
backwards scan code into a separate patch to allow it to be more
easily considered and approved, or rejected. I think if there's any
hint of this blocking the main patch then it should be a separate
patch to allow its worth to be considered independently.

Also, my primary interest in this patch is to find tuples that are
stopping the heap being truncated during a vacuum. Generally, when I'm
looking for that I have a good idea of what size I expect the relation
should be, (otherwise I'd not think it was bloated), in which case I'd
be doing WHERE ctid >= '(N,1)'. However, it might be easier to write
some auto-bloat-removal script if we could have an ORDER BY ctid DESC
LIMIT n.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From
Edmund Horner
Date
On Fri, 21 Dec 2018 at 13:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On Fri, 21 Dec 2018 at 13:09, Edmund Horner <ejrh00@gmail.com> wrote:
> > On Fri, 21 Dec 2018 at 11:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > I'm having a hard time wrapping my mind around why you'd bother with
> > > backwards TID scans.  The amount of code needed versus the amount of
> > > usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
> > > see that there might be value in knowing that a regular scan has
> > > "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
> > > on TID without an explicit sort).  It does not, however, follow that
> > > there's any additional value in supporting the DESC case.
> >
> > I have occasionally found myself running "SELECT MAX(ctid) FROM t"
> > when I was curious about why a table is so big after vacuuming.
> >
> > Perhaps that's not a common enough use case to justify the amount of
> > code, especially the changes to heapam.c and explain.c.
> >
> > We'd still need the pathkeys to make good use of forward scans.  (And
> > I think the executor still needs to support seeking backward for
> > cursors.)
>
> I think the best thing to do here is separate out all the additional
> backwards scan code into a separate patch to allow it to be more
> easily considered and approved, or rejected. I think if there's any
> hint of this blocking the main patch then it should be a separate
> patch to allow its worth to be considered independently.

Yeah I think you're right.  I'll separate those parts into the basic
forward scan, and then the optional backward scan support.  I think
we'll still only generate a backward scan if the query_pathkeys makes
use of it.

For the forward scan, I seem to recall, from your merge join example,
that it's useful to set the pathkeys even when there are no
query_pathkeys.  We just have to unconditionally set them so that the
larger plan can make use of them.

> Also, my primary interest in this patch is to find tuples that are
> stopping the heap being truncated during a vacuum. Generally, when I'm
> looking for that I have a good idea of what size I expect the relation
> should be, (otherwise I'd not think it was bloated), in which case I'd
> be doing WHERE ctid >= '(N,1)'. However, it might be easier to write
> some auto-bloat-removal script if we could have an ORDER BY ctid DESC
> LIMIT n.


Re: Tid scan improvements

From
Tom Lane
Date
Edmund Horner <ejrh00@gmail.com> writes:
> For the forward scan, I seem to recall, from your merge join example,
> that it's useful to set the pathkeys even when there are no
> query_pathkeys.  We just have to unconditionally set them so that the
> larger plan can make use of them.

No.  Look at indxpath.c: it does not worry about pathkeys unless
has_useful_pathkeys is true, and it definitely does not generate
pathkeys that don't get past truncate_useless_pathkeys.  Those
functions are responsible for worrying about whether mergejoin
can use the pathkeys.  It's not tidpath.c's job to outthink them.

            regards, tom lane


Re: Tid scan improvements

From
Edmund Horner
Date
On Fri, 21 Dec 2018 at 16:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Edmund Horner <ejrh00@gmail.com> writes:
> > For the forward scan, I seem to recall, from your merge join example,
> > that it's useful to set the pathkeys even when there are no
> > query_pathkeys.  We just have to unconditionally set them so that the
> > larger plan can make use of them.
>
> No.  Look at indxpath.c: it does not worry about pathkeys unless
> has_useful_pathkeys is true, and it definitely does not generate
> pathkeys that don't get past truncate_useless_pathkeys.  Those
> functions are responsible for worrying about whether mergejoin
> can use the pathkeys.  It's not tidpath.c's job to outthink them.

Ok.  I think that will simplify things.  So if I follow you correctly,
we should do:

1. If has_useful_pathkeys is true: generate pathkeys (for CTID ASC),
and use truncate_useless_pathkeys on them.
2. If we have tid quals or pathkeys, emit a TID scan path.

For the (optional) backwards scan support patch, should we separately
emit another path, in the reverse direction?  (My current patch only
creates one path, and tries to decide what the best direction is by
looking at query_pathkeys.  This doesn't fit into the above
algorithm.)


Re: Tid scan improvements

From
Tom Lane
Date
Edmund Horner <ejrh00@gmail.com> writes:
> Ok.  I think that will simplify things.  So if I follow you correctly,
> we should do:

> 1. If has_useful_pathkeys is true: generate pathkeys (for CTID ASC),
> and use truncate_useless_pathkeys on them.
> 2. If we have tid quals or pathkeys, emit a TID scan path.

Check.

> For the (optional) backwards scan support patch, should we separately
> emit another path, in the reverse direction?

What indxpath.c does is, if has_useful_pathkeys is true, to generate
pathkeys both ways and then build paths if the pathkeys get past
truncate_useless_pathkeys.  That seems sufficient in this case too.
There are various heuristics about whether it's really useful to
consider both sort directions, but that intelligence is already
built into truncate_useless_pathkeys.  tid quals with no pathkeys
would be reason to generate a forward path, but not reason to
generate a reverse path, because then that would be duplicative.

            regards, tom lane


Re: Tid scan improvements

From
Tom Lane
Date
BTW, with respect to this bit in 0001:

@@ -1795,6 +1847,15 @@ nulltestsel(PlannerInfo *root, NullTestType nulltesttype, Node *arg,
                 return (Selectivity) 0; /* keep compiler quiet */
         }
     }
+    else if (vardata.var && IsA(vardata.var, Var) &&
+             ((Var *) vardata.var)->varattno == SelfItemPointerAttributeNumber)
+    {
+        /*
+         * There are no stats for system columns, but we know CTID is never
+         * NULL.
+         */
+        selec = (nulltesttype == IS_NULL) ? 0.0 : 1.0;
+    }
     else
     {
         /*

I'm not entirely sure why you're bothering; surely nulltestsel is
unrelated to what this patch is about?  And would anybody really
write "WHERE ctid IS NULL"?

However, if we do think it's worth adding code to cover this case,
I wouldn't make it specific to CTID.  *All* system columns can be
assumed not null, see heap_getsysattr().

            regards, tom lane


Re: Tid scan improvements

From
Edmund Horner
Date
On Sat, 22 Dec 2018 at 07:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, with respect to this bit in 0001:

@@ -1795,6 +1847,15 @@ nulltestsel(PlannerInfo *root, NullTestType nulltesttype, Node *arg,
                 return (Selectivity) 0; /* keep compiler quiet */
         }
     }
+    else if (vardata.var && IsA(vardata.var, Var) &&
+             ((Var *) vardata.var)->varattno == SelfItemPointerAttributeNumber)
+    {
+        /*
+         * There are no stats for system columns, but we know CTID is never
+         * NULL.
+         */
+        selec = (nulltesttype == IS_NULL) ? 0.0 : 1.0;
+    }
     else
     {
         /*

I'm not entirely sure why you're bothering; surely nulltestsel is
unrelated to what this patch is about?  And would anybody really
write "WHERE ctid IS NULL"?

I found that it made a difference with selectivity of range comparisons, because clauselist_selectivity tries to correct for it (clausesel.c:274):

    s2 = rqlist->hibound + rqlist->lobound - 1.0;

    /* Adjust for double-exclusion of NULLs */
    s2 += nulltestsel(root, IS_NULL, rqlist->var,
                      varRelid, jointype, sjinfo);
 
It was adding DEFAULT_UNK_SEL = 0.005 to the selectivity, which (while not major) did make the selectivity less accurate.
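To make the effect concrete, here is the arithmetic as a small sketch (the 0.501 inputs are made-up one-sided estimates, not values taken from the planner):

```python
# Worked example of the adjustment quoted above.  A range condition
# "ctid > lo AND ctid < hi" is combined as s2 = hibound + lobound - 1.0,
# and nulltestsel(IS_NULL) is added back because each inequality
# excluded the (same) NULL fraction.  CTID has no statistics, so
# before the patch nulltestsel fell back to DEFAULT_UNK_SEL; with the
# patch it can return 0.0, since CTID is never NULL.

DEFAULT_UNK_SEL = 0.005

def range_selectivity(hibound, lobound, is_null_sel):
    s2 = hibound + lobound - 1.0
    s2 += is_null_sel          # adjust for double-exclusion of NULLs
    return s2

patched = range_selectivity(0.501, 0.501, 0.0)                # ~0.002
unpatched = range_selectivity(0.501, 0.501, DEFAULT_UNK_SEL)  # ~0.007
print(patched, unpatched)
```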

However, if we do think it's worth adding code to cover this case,
I wouldn't make it specific to CTID.  *All* system columns can be
assumed not null, see heap_getsysattr().
I guess we could have a standalone patch to add this for all system columns?


Re: Tid scan improvements

From
Tom Lane
Date
Edmund Horner <ejrh00@gmail.com> writes:
> On Sat, 22 Dec 2018 at 07:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm not entirely sure why you're bothering; surely nulltestsel is
>> unrelated to what this patch is about?

> I found that it made a difference with selectivity of range comparisons,
> because clauselist_selectivity tries to correct for it (clausesel.c:274):

Oh, I see.

> I guess we could have a standalone patch to add this for all system columns?

+1

            regards, tom lane


Re: Tid scan improvements

From
Edmund Horner
Date
Hi all,

I am a bit stuck and I think it's best to try to explain where.

I'm still rebasing the patches for the changes Tom made to support parameterised TID paths for joins.  While the addition of join support itself does not really touch the same code, the modernisation -- in particular, returning a list of RestrictInfos rather than raw quals -- does rewrite quite a bit of tidpath.c.

The original code returned:

    List (with OR semantics)
      CTID = ?   or   CTID = ANY (...)   or   IS CURRENT OF
      (more items)

That changed recently to return:

    List (with OR semantics)
      RestrictInfo
        CTID = ?   or ...
      (more items)

My last set of patches extended the tidqual extraction to pull out lists (with AND semantics) of range quals of the form CTID < ?, etc.  Each list of more than one item was converted into an AND clause before being added to the tidqual list; a single range qual can be added to tidquals as is.

This required looking across multiple RestrictInfos at the top level, for example:

  - "WHERE ctid > ? AND ctid < ?" would arrive at tidpath as a list of two RestrictInfos, from which we extract a single tidqual in the form of an AND clause.
  - "WHERE ctid = ? OR (ctid > ? AND ctid < ?)" arrives as only one RestrictInfo, but we extract two tidquals (an OpExpr, and an AND clause).

The code could also ignore additional unusable quals from a list of top-level RestrictInfos, or from a list of quals from an AND clause, for example:

  - "WHERE foo = ? AND ctid > ? AND ctid < ?" gives us the single tidqual "ctid > ? AND ctid < ?".
  - "WHERE (ctid = ? AND bar = ?) OR (foo = ? AND ctid > ? AND ctid < ?)" gives us the two tidquals "ctid = ?" and "ctid > ? AND ctid < ?".

As the extracted tidquals no longer match the original query quals, they aren't removed from scan_clauses in createplan.c, and hence are correctly checked by the filter.
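As a rough illustration (plain Python over a toy qual representation, not the actual RestrictInfo machinery), the top-level extraction behaves like:

```python
# Rough sketch of the extraction described above, using a toy
# (operator, column) representation: keep the CTID range quals,
# ignore everything else, and AND the survivors together.

RANGE_OPS = {'<', '<=', '>', '>='}

def extract_tid_range_quals(quals):
    """Return the quals usable for a TID range scan; the remaining
    quals would stay behind as ordinary filter conditions."""
    return [(op, col) for (op, col) in quals
            if col == 'ctid' and op in RANGE_OPS]

# "WHERE foo = ? AND ctid > ? AND ctid < ?" keeps just the range part:
quals = [('=', 'foo'), ('>', 'ctid'), ('<', 'ctid')]
print(extract_tid_range_quals(quals))  # [('>', 'ctid'), ('<', 'ctid')]
```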

Aside: The analogous situation with an indexed user attribute "x" behaves a bit differently:
  - "WHERE x = ? OR (x > ? AND x < ?)" won't use a regular index scan, but might use a bitmap index scan.

My patch uses the same path type and executor for all extractable tidquals.

This worked pretty well, but I am finding it difficult to reimplement it in the new tidpath.c code.

In the query information given to the path generator, there is no existing RestrictInfo relating to the whole expression "ctid > ? AND ctid < ?".  I am still learning about RestrictInfos, but my understanding is it doesn't make sense to have a RestrictInfo for an AND clause, anyway; you're supposed to have them for the sub-expressions of it.

And it doesn't seem a good idea to try to create new RestrictInfos in the path generation just to pass the tidquals back to plan creation.  They're complicated objects.

There's also the generation of scan_clauses in create_tidscan_plan (createplan.c:3107).  This now uses RestrictInfos -- I'd imagine we'd need each AND clause to be wrapped in a RestrictInfo to be able to check it properly.

To summarise, I'm not sure what kind of structure I should add to the tidquals list to represent a compound range expression.  Maybe it's better to create a different path (either a new path type, or a flag in TidPath to say what kind of quals are attached) ?

Edmund

Re: Tid scan improvements

From
Tom Lane
Date
Edmund Horner <ejrh00@gmail.com> writes:
> My patch uses the same path type and executor for all extractable tidquals.

> This worked pretty well, but I am finding it difficult to reimplement it in
> the new tidpath.c code.

I didn't like that approach to begin with, and would suggest that you go
over to using a separate path type and executor node.  I don't think the
amount of commonality for the two cases is all that large, and doing it
as you had it required some ugly ad-hoc conventions about the semantics
of the tidquals list.  Where I think this should go is that the tidquals
list still has OR semantics in the existing path type, but you use AND
semantics in the new path type, so that "ctid > ? AND ctid < ?" is just
represented as an implicit-AND list of two simple RestrictInfos.

Now admittedly, this wouldn't give us an efficient way to execute
queries with conditions like "WHERE ctid = X OR (ctid > Y AND ctid < Z)",
but I find myself quite unable to get excited about supporting that.
I see no reason for the new code to worry about any cases more complex
than one or two TID inequalities at top level of the restriction list.

> In the query information given to the path generator, there is no existing
> RestrictInfo relating to the whole expression "ctid > ? AND ctid < ?".  I
> am still learning about RestrictInfos, but my understanding is it doesn't
> make sense to have a RestrictInfo for an AND clause, anyway; you're
> supposed to have them for the sub-expressions of it.

FWIW, the actual data structure for cases like that is that there's
a RestrictInfo for the whole clause ctid = X OR (ctid > Y AND ctid < Z),
and if you look into its "orclause" field, you will find RestrictInfos
attached to the primitive clauses ctid = X, ctid > Y, ctid < Z.  (The
old code in tidpath.c didn't know that, because it'd never been rewritten
since RestrictInfos were invented.)  However, I think this new code should
not worry about OR cases at all, but just pull out top-level TID
comparison clauses.

> And it doesn't seem a good idea to try to create new RestrictInfos in the
> path generation just to pass the tidquals back to plan creation.

No, you should avoid that.  There are places that assume there's only
one RestrictInfo for any given original clause (or sub-clause).

            regards, tom lane


Re: Tid scan improvements

From
Edmund Horner
Date
On Sat, 19 Jan 2019 at 05:35, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Edmund Horner <ejrh00@gmail.com> writes:
> My patch uses the same path type and executor for all extractable tidquals.

> This worked pretty well, but I am finding it difficult to reimplement it in
> the new tidpath.c code.

I didn't like that approach to begin with, and would suggest that you go
over to using a separate path type and executor node.  I don't think the
amount of commonality for the two cases is all that large, and doing it
as you had it required some ugly ad-hoc conventions about the semantics
of the tidquals list.  Where I think this should go is that the tidquals
list still has OR semantics in the existing path type, but you use AND
semantics in the new path type, so that "ctid > ? AND ctid < ?" is just
represented as an implicit-AND list of two simple RestrictInfos.

Thanks for the advice.  This approach resembles my first draft, which had a separate executor type.  However, it did have a combined path type, with an enum TidPathMethod to determine how tidquals was interpreted.  At this point, I think a different path type is clearer, though generation of both types can live in tidpath.c (just as indxpath.c generates different index path types).
 
Now admittedly, this wouldn't give us an efficient way to execute
queries with conditions like "WHERE ctid = X OR (ctid > Y AND ctid < Z)",
but I find myself quite unable to get excited about supporting that.
I see no reason for the new code to worry about any cases more complex
than one or two TID inequalities at top level of the restriction list.

I'm a bit sad to see support for multiple ranges go, though I never saw such queries as being particularly common.  (And there was always a nagging feeling that tidpath.c was beginning to perform feats of boolean acrobatics out of proportion to its importance.  Perhaps in some distant future, TID quals will become another way of supplying TIDs to a bitmap heap scan, which would enable complicated boolean queries using both indexes and TID scans.  But that's just musing, not a proposal.)

> In the query information given to the path generator, there is no existing
> RestrictInfo relating to the whole expression "ctid > ? AND ctid < ?".  I
> am still learning about RestrictInfos, but my understanding is it doesn't
> make sense to have a RestrictInfo for an AND clause, anyway; you're
> supposed to have them for the sub-expressions of it.

FWIW, the actual data structure for cases like that is that there's
a RestrictInfo for the whole clause ctid = X OR (ctid > Y AND ctid < Z),
and if you look into its "orclause" field, you will find RestrictInfos
attached to the primitive clauses ctid = X, ctid > Y, ctid < Z.  (The
old code in tidpath.c didn't know that, because it'd never been rewritten
since RestrictInfos were invented.)  However, I think this new code should
not worry about OR cases at all, but just pull out top-level TID
comparison clauses.

Thanks for the explanation.

> And it doesn't seem a good idea to try to create new RestrictInfos in the
> path generation just to pass the tidquals back to plan creation.

No, you should avoid that.  There are places that assume there's only
one RestrictInfo for any given original clause (or sub-clause).
 

Re: Tid scan improvements

From
Andres Freund
Date
Hi,

On 2019-01-19 17:04:13 +1300, Edmund Horner wrote:
> [...]

The commitfest has ended, and you've not updated the patch to address
the feedback yet. Are you planning to do so soon? Otherwise I think we
ought to mark the patch as returned with feedback?

Greetings,

Andres Freund


Re: Tid scan improvements

From
Edmund Horner
Date
Hi, my apologies for the delay.

I've finished rebasing and rewriting it for Tom's changes to tidpath.c and his recommendations for tid range scans, but I then found a bug with cursor interaction.  Specifically, FETCH LAST scans through the whole range, and then proceeds to scan backwards to get the last row.  It worked in both my very first draft, and in the most recent draft before the changes to tidpath, but I haven't got it working yet for the new version.

I'm hoping to get that fixed in the next 24 hours, and I'll then post the new patch.

Edmund


On Sun, 3 Feb 2019 at 23:34, Andres Freund <andres@anarazel.de> wrote:
Hi,

[...]

The commitfest has ended, and you've not updated the patch to address
the feedback yet. Are you planning to do so soon? Otherwise I think we
ought to mark the patch as returned with feedback?

Greetings,

Andres Freund

Re: Tid scan improvements

From
Edmund Horner
Date
On Sat, 19 Jan 2019 at 17:04, Edmund Horner <ejrh00@gmail.com> wrote:
[...]

Hi, here's a new set of patches.  This one adds a new path type called TidRangePath and a new execution node called TidRangeScan.  I haven't included any of the patches for adding pathkeys to TidPaths or TidRangePaths.

1. v6-0001-Add-selectivity-estimate-for-CTID-system-variables.patch
2. v6-0002-Support-backward-scans-over-restricted-ranges-in-hea.patch
3. v6-0003-Support-range-quals-in-Tid-Scan.patch
4. v6-0004-TID-selectivity-reduce-the-density-of-the-last-page-.patch

Patches 1, 2, and 4 are basically unchanged from my previous post.  Patch 4 is an optional tweak to the CTID selectivity estimates.

Patch 3 is a substantial rewrite from what I had before.  I've checked David's most recent review and tried to make sure the new code meets his suggestions where applicable, although there is one spot where I left the code as "if (tidrangequals) ..." instead of the preferred "if (tidrangequals != NIL) ...", just for consistency with the surrounding code.

Questions --

1. Tid Range Paths are costed as random_page_cost for the first page, and sequential page cost for the remaining pages.  It made sense when there could be multiple non-overlapping ranges.  Now that there's only one range, it might not, but it has the benefit of making Tid Range Scans a little bit more expensive than Sequential Scans, so that they are less likely to be picked when a Seq Scan will do just as well.  Is there a better cost formula to use?

2. Is it worth trying to get rid of some of the code duplication between the TidPath and TidRangePath handling, such as in costsize.c or createplan.c?

3. TidRangeRecheck (copied from TidRecheck) has an existing comment asking whether it should actually be performing a check on the returned tuple.  It seems to me that as long as TidRangeNext doesn't return a tuple outside the requested range, then the check shouldn't be necessary (and we'd simplify the comment to "nothing to check").  If a range key can change at runtime, it should never have been included in the TidRangePath.  Is my understanding correct?

4. I'm a little uncomfortable with the way heapam.c changes the scan limits ("--scan->rs_numblocks") as it progresses through the pages.  I have the executor node reset the scan limits after scanning all the tuples, which seems to work for the tests I have, but I'm using the heap_setscanlimits feature in a slightly different way from the only existing use, which is the one-off scans when building a BRIN index.  I have added some tests for cursor fetches which seem to exercise the code, but I'd still appreciate close review of how I'm using heapam.
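The rs_numblocks concern in question 4 can be shown with a toy model: the scan consumes its block budget as it advances, so the executor must re-establish the limits before rescanning the same range.  Field and function names below follow heapam's style but this is an illustration, not the real code:

```c
#include <assert.h>

/*
 * Toy model of the heap_setscanlimits interaction described above.  The
 * scan decrements rs_numblocks per page visited ("--scan->rs_numblocks"),
 * so after a full scan the limits are used up and must be reset.
 */
typedef struct ToyScan
{
    unsigned int rs_startblock;
    unsigned int rs_numblocks;
} ToyScan;

static void
toy_setscanlimits(ToyScan *scan, unsigned int startBlk, unsigned int numBlks)
{
    scan->rs_startblock = startBlk;
    scan->rs_numblocks = numBlks;
}

/* Scan until the block budget runs out; return pages visited. */
static unsigned int
toy_run_scan(ToyScan *scan)
{
    unsigned int visited = 0;

    while (scan->rs_numblocks > 0)
    {
        --scan->rs_numblocks;   /* mirrors the in-place decrement above */
        ++visited;
    }
    return visited;
}
```

Without the executor-node reset described above, a second pass over the same cursor range would see rs_numblocks == 0 and visit nothing.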

Edmund

Attachments

Re: Tid scan improvements

From
David Rowley
Date
On Mon, 4 Feb 2019 at 18:37, Edmund Horner <ejrh00@gmail.com> wrote:
> 1. v6-0001-Add-selectivity-estimate-for-CTID-system-variables.patch

I think 0001 is good to go. It's a clear improvement over what we do today.

(t1 = 1 million row table with a single int column.)

Patched:
# explain (analyze, timing off) select * from t1 where ctid < '(1, 90)';
 Seq Scan on t1  (cost=0.00..16925.00 rows=315 width=4) (actual
rows=315 loops=1)

# explain (analyze, timing off) select * from t1 where ctid <= '(1, 90)';
 Seq Scan on t1  (cost=0.00..16925.00 rows=316 width=4) (actual
rows=316 loops=1)

Master:
# explain (analyze, timing off) select * from t1 where ctid < '(1, 90)';
 Seq Scan on t1  (cost=0.00..16925.00 rows=333333 width=4) (actual
rows=315 loops=1)

# explain (analyze, timing off) select * from t1 where ctid <= '(1, 90)';
 Seq Scan on t1  (cost=0.00..16925.00 rows=333333 width=4) (actual
rows=316 loops=1)

The only possible risk I can foresee is that it may be more likely we
underestimate the selectivity and that causes something like a nested
loop join due to the estimation being, say 1 row.

It could happen in a case like:

SELECT * FROM bloated_table WHERE ctid >= <last ctid that would exist
without bloat>

but I don't think we should keep using DEFAULT_INEQ_SEL just in case
this happens. We could probably fix 90% of those cases by returning 2
rows instead of 1.
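The estimate being validated above can be sketched as follows.  This is an illustrative reconstruction of the idea, not the committed selfuncs code: assume tuples are evenly distributed, turn the TID into a fractional block position, and divide by the table's block count:

```c
#include <assert.h>

/*
 * Illustrative sketch of a "ctid < constant" selectivity estimate
 * (hypothetical names, not the actual patch code): with an assumed even
 * tuple distribution, the fraction of the table before a TID is its
 * fractional block position divided by the number of blocks.
 */
static double
ctid_lt_selectivity(double block, double offset,
                    double tuples_per_block, double nblocks)
{
    double      fraction;

    if (nblocks <= 0.0 || tuples_per_block <= 0.0)
        return 0.5;             /* fall back to a default guess */

    fraction = (block + offset / tuples_per_block) / nblocks;

    /* Clamp so out-of-range TIDs can't yield nonsense estimates. */
    if (fraction < 0.0)
        fraction = 0.0;
    if (fraction > 1.0)
        fraction = 1.0;
    return fraction;
}
```

For the 1-million-row example above, this yields a small fraction for ctid <= '(1, 90)' instead of the hardcoded DEFAULT_INEQ_SEL third of the table that master estimates.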

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From
David Rowley
Date
On Mon, 4 Feb 2019 at 18:37, Edmund Horner <ejrh00@gmail.com> wrote:
> 2. v6-0002-Support-backward-scans-over-restricted-ranges-in-hea.patch
> 3. v6-0003-Support-range-quals-in-Tid-Scan.patch
> 4. v6-0004-TID-selectivity-reduce-the-density-of-the-last-page-.patch

These ones need a rebase.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From
Edmund Horner
Date
On Thu, 14 Mar 2019 at 16:46, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> The only possible risk I can foresee is that it may be more likely we
> underestimate the selectivity and that causes something like a nested
> loop join due to the estimation being, say 1 row.
>
> It could happen in a case like:
>
> SELECT * FROM bloated_table WHERE ctid >= <last ctid that would exist
> without bloat>
>
> but I don't think we should keep using DEFAULT_INEQ_SEL just in case
> this happens. We could probably fix 90% of those cases by returning 2
> rows instead of 1.

Thanks for looking at the patch David.

I'm not sure how an unreasonable underestimation would occur here.  If
you have a table bloated to say 10x its minimal size, the estimator
still assumes an even distribution of tuples (I don't think we can do
much better than that).  So the selectivity of "ctid >= <last ctid
that would exist without bloat>" is still going to be 0.9.

Edmund


Re: Tid scan improvements

From
David Rowley
Date
On Thu, 14 Mar 2019 at 21:12, Edmund Horner <ejrh00@gmail.com> wrote:

> I'm not sure how an unreasonable underestimation would occur here.  If
> you have a table bloated to say 10x its minimal size, the estimator
> still assumes an even distribution of tuples (I don't think we can do
> much better than that).  So the selectivity of "ctid >= <last ctid
> that would exist without bloat>" is still going to be 0.9.

Okay, I think you're right there.  I guess the only risk there is just
varying tuple density per page, and that seems no greater risk than we
have with the existing stats.

Just looking again, I think the block of code starting:

+ if (density > 0.0)

needs a comment to mention what it's doing. Perhaps:

+ /*
+ * Using the average tuples per page, calculate how far into
+ * the page the itemptr is likely to be and adjust block
+ * accordingly.
+ */
+ if (density > 0.0)

Or some better choice of words.  With that done, I think 0001 is good to go.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From
Edmund Horner
Date
On Thu, 14 Mar 2019 at 23:06, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On Thu, 14 Mar 2019 at 21:12, Edmund Horner <ejrh00@gmail.com> wrote:
> > I'm not sure how an unreasonable underestimation would occur here.  If
> > you have a table bloated to say 10x its minimal size, the estimator
> > still assumes an even distribution of tuples (I don't think we can do
> > much better than that).  So the selectivity of "ctid >= <last ctid
> > that would exist without bloat>" is still going to be 0.9.
>
> Okay, I think you're right there.  I guess the only risk there is just
> varying tuple density per page, and that seems no greater risk than we
> have with the existing stats.

Yeah that is a risk, and will probably come up in practice.  But at
least we're not just picking a hardcoded selectivity any more.

> Just looking again, I think the block of code starting:
>
> + if (density > 0.0)
>
> needs a comment to mention what it's doing. Perhaps:
>
> + /*
> + * Using the average tuples per page, calculate how far into
> + * the page the itemptr is likely to be and adjust block
> + * accordingly.
> + */
> + if (density > 0.0)
>
> Or some better choice of words.  With that done, I think 0001 is good to go.

Ok, I'll look at it and hopefully get a new patch up soon.

Edmund


Re: Tid scan improvements

From
Edmund Horner
Date
On Thu, 14 Mar 2019 at 23:37, Edmund Horner <ejrh00@gmail.com> wrote:
> On Thu, 14 Mar 2019 at 23:06, David Rowley <david.rowley@2ndquadrant.com> wrote:
> > Just looking again, I think the block of code starting:
> >
> > + if (density > 0.0)
> >
> > needs a comment to mention what it's doing. Perhaps:
> >
> > + /*
> > + * Using the average tuples per page, calculate how far into
> > + * the page the itemptr is likely to be and adjust block
> > + * accordingly.
> > + */
> > + if (density > 0.0)
> >
> > Or some better choice of words.  With that done, I think 0001 is good to go.
>
> Ok, I'll look at it and hopefully get a new patch up soon.

Hullo,

Here's a new set of patches.

It includes new versions of the other patches, which needed to be
rebased because of the introduction of the "tableam" API by
c2fe139c20.

I've had to adapt it to use the table scan API.  I've got it compiling
and passing tests, but I'm uneasy about some things that still use the
heapam API.

1. I call heap_setscanlimits as I'm not sure there is a tableam equivalent.
2. I'm not sure whether non-heap tableam implementations can also be
supported by my TID Range Scan: we need to be able to set the scan
limits.  There may not be any other implementations yet, but when
there are, how do we stop the planner using a TID Range Scan for
non-heap relations?
3. When fetching tuples, I see that nodeSeqscan.c uses
table_scan_getnextslot, which saves dealing with HeapTuples.  But
nodeTidrangescan wants to do some checking of the block and offset
before returning the slot.  So I have it using heap_getnext and
ExecStoreBufferHeapTuple.  Apart from being heapam-specific, it's just
not as clean as the new API calls.

Ideally, we can get this to support general tableam implementations
rather than using heapam-specific calls.  Any advice on how to do
this?

Thanks
Edmund

Attachments

Re: Tid scan improvements

From
David Rowley
Date
On Fri, 15 Mar 2019 at 18:42, Edmund Horner <ejrh00@gmail.com> wrote:
> I've had to adapt it to use the table scan API.  I've got it compiling
> and passing tests, but I'm uneasy about some things that still use the
> heapam API.
>
> 1. I call heap_setscanlimits as I'm not sure there is a tableam equivalent.
> 2. I'm not sure whether non-heap tableam implementations can also be
> supported by my TID Range Scan: we need to be able to set the scan
> limits.  There may not be any other implementations yet, but when
> there are, how do we stop the planner using a TID Range Scan for
> non-heap relations?
> 3. When fetching tuples, I see that nodeSeqscan.c uses
> table_scan_getnextslot, which saves dealing with HeapTuples.  But
> nodeTidrangescan wants to do some checking of the block and offset
> before returning the slot.  So I have it using heap_getnext and
> ExecStoreBufferHeapTuple.  Apart from being heapam-specific, it's just
> not as clean as the new API calls.
>
> Ideally, we can get this to support general tableam implementations
> rather than using heapam-specific calls.  Any advice on how to do
> this?

The commit message in 8586bf7ed mentions:

> Subsequent commits will incrementally abstract table access
> functionality to be routed through table access methods. That change
> is too large to be reviewed & committed at once, so it'll be done
> incrementally.

and looking at [1] I see patch 0004 introduces some changes in
nodeTidscan.c to call a new tableam API function named
heapam_fetch_row_version. I see this function does have a ItemPointer
argument, so I guess we must be keeping those as unique row
identifiers in the API.

Patch 0001 does change the signature of heap_setscanlimits() (appears
to be committed already), and then in 0010 the only code that calls
heap_setscanlimits() (IndexBuildHeapRangeScan()) is moved and renamed
to heapam_index_build_range_scan() and set to be called via the
index_build_range_scan TableAmRoutine method.  So it looks like out of
that patch series nothing is there to allow you to access
heap_setscanlimits() directly via the TableAmRoutine API, so perhaps
for this to work heap_setscanlimits will need to be interfaced,
however, I'm unsure if that'll violate any assumptions that Andres
wants to keep out of the API...  Andres?

[1] https://www.postgresql.org/message-id/20190311193746.hhv4e4e62nxtq3k6@alap3.anarazel.de

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Re: Tid scan improvements

From
David Steele
Date
On 3/18/19 1:35 PM, David Rowley wrote:
> On Fri, 15 Mar 2019 at 18:42, Edmund Horner <ejrh00@gmail.com> wrote:
> 
>> Subsequent commits will incrementally abstract table access
>> functionality to be routed through table access methods. That change
>> is too large to be reviewed & committed at once, so it'll be done
>> incrementally.
> 
> and looking at [1] I see patch 0004 introduces some changes in
> nodeTidscan.c to call a new tableam API function named
> heapam_fetch_row_version. I see this function does have a ItemPointer
> argument, so I guess we must be keeping those as unique row
> identifiers in the API.
> 
> Patch 0001 does change the signature of heap_setscanlimits() (appears
> to be committed already), and then in 0010 the only code that calls
> heap_setscanlimits() (IndexBuildHeapRangeScan()) is moved and renamed
> to heapam_index_build_range_scan() and set to be called via the
> index_build_range_scan TableAmRoutine method.  So it looks like out of
> that patch series nothing is there to allow you to access
> heap_setscanlimits() directly via the TableAmRoutine API, so perhaps
> for this to work heap_setscanlimits will need to be interfaced,
> however, I'm unsure if that'll violate any assumptions that Andres
> wants to keep out of the API...  Andres?

Thoughts on this, Andres?

Regards,
-- 
-David
david@pgmasters.net


Re: Tid scan improvements

From
Andres Freund
Date
Hi,

On 2019-03-15 18:42:40 +1300, Edmund Horner wrote:
> I've had to adapt it to use the table scan API.  I've got it compiling
> and passing tests, but I'm uneasy about some things that still use the
> heapam API.
> 
> 1. I call heap_setscanlimits as I'm not sure there is a tableam
> equivalent.

There used to be, but it wasn't clear that it was useful. In core pg the
only caller are index range scans, and those are - in a later patch in
the series - moved into the AM as well, as they need to deal with things
like HOT.


> 2. I'm not sure whether non-heap tableam implementations can also be
> supported by my TID Range Scan: we need to be able to set the scan
> limits.  There may not be any other implementations yet, but when
> there are, how do we stop the planner using a TID Range Scan for
> non-heap relations?

I've not yet looked through your code, but if required we'd probably
need to add a new tableam callback. It'd be marked optional, and the
planner could just check for its presence. A later part of the pluggable
storage series does that for bitmap scans, perhaps it's worth looking at
that?


> 3. When fetching tuples, I see that nodeSeqscan.c uses
> table_scan_getnextslot, which saves dealing with HeapTuples.  But
> nodeTidrangescan wants to do some checking of the block and offset
> before returning the slot.  So I have it using heap_getnext and
> ExecStoreBufferHeapTuple.  Apart from being heapam-specific, it's just
> not as clean as the new API calls.

Yea, that's not ok.  Note that, since yesterday, nodeTidscan doesn't
call heap_fetch() anymore (there's still a heap dependency, but that's
just for heap_get_latest_tid(), which I'll move into execMain or such).


> Ideally, we can get this to support general tableam implementations
> rather than using heapam-specific calls.  Any advice on how to do
> this?

Not yet - could you perhaps look at the bitmap scan patch in the tableam
queue, and see if that gives you inspiration?

- Andres


Re: Tid scan improvements

From
Andres Freund
Date
Hi,

On 2019-03-18 22:35:05 +1300, David Rowley wrote:
> The commit message in 8586bf7ed mentions:
> 
> > Subsequent commits will incrementally abstract table access
> > functionality to be routed through table access methods. That change
> > is too large to be reviewed & committed at once, so it'll be done
> > incrementally.
> 
> and looking at [1] I see patch 0004 introduces some changes in
> nodeTidscan.c to call a new tableam API function named
> heapam_fetch_row_version. I see this function does have a ItemPointer
> argument, so I guess we must be keeping those as unique row
> identifiers in the API.

Right, we are. At least for now - there's some discussions around
allowing different format for TIDs, to allow things like index organized
tables, but that's for later.


> Patch 0001 does change the signature of heap_setscanlimits() (appears
> to be committed already), and then in 0010 the only code that calls
> heap_setscanlimits() (IndexBuildHeapRangeScan()) is moved and renamed
> to heapam_index_build_range_scan() and set to be called via the
> index_build_range_scan TableAmRoutine method.  So it looks like out of
> that patch series nothing is there to allow you to access
> heap_setscanlimits() directly via the TableAmRoutine API, so perhaps
> for this to work heap_setscanlimits will need to be interfaced,
> however, I'm unsure if that'll violate any assumptions that Andres
> wants to keep out of the API...

I was kinda hoping to keep block numbers out of the "main" APIs, to
avoid assuming everything is BLCKSZ based. I don't have a particular
problem allowing an optional setscanlimits type callback that works with
block numbers. The planner could check its presence and just not build
tid range scans if not present.  Alternatively a bespoke scan API for
tid range scans, like the later patches in the tableam series for
bitmap, sample, analyze scans, might be an option.

Greetings,

Andres Freund


Re: Tid scan improvements

From
Tom Lane
Date
Andres Freund <andres@anarazel.de> writes:
> I was kinda hoping to keep block numbers out of the "main" APIs, to
> avoid assuming everything is BLCKSZ based. I don't have a particular
> problem allowing an optional setscanlimits type callback that works with
> block numbers. The planner could check its presence and just not build
> tid range scans if not present.  Alternatively a bespoke scan API for
> tid range scans, like the later patches in the tableam series for
> bitmap, sample, analyze scans, might be an option.

Given Andres' API concerns, and the short amount of time remaining in
this CF, I'm not sure how much of this patch set we can expect to land
in v12.  It seems like it might be a good idea to scale back our ambitions
and see whether there's a useful subset we can push in easily.

With that in mind, I went ahead and pushed 0001+0004, since improving
the planner's selectivity estimate for a "ctid vs constant" qual is
likely to be helpful whether the executor is smart about it or not.

FWIW, I don't really see the point of treating 0002 as a separate patch.
If it had some utility on its own, then it'd be sensible, but what
would that be?  Also, it looks from 0002 like you are trying to overload
rs_startblock with a different meaning than it has for syncscans, and
I think that might be a bad idea. 

            regards, tom lane


Re: Tid scan improvements

From
Edmund Horner
Date
On Tue, 26 Mar 2019 at 11:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andres Freund <andres@anarazel.de> writes:
> > I was kinda hoping to keep block numbers out of the "main" APIs, to
> > avoid assuming everything is BLCKSZ based. I don't have a particular
> > problem allowing an optional setscanlimits type callback that works with
> > block numbers. The planner could check its presence and just not build
> > tid range scans if not present.  Alternatively a bespoke scan API for
> > tid range scans, like the later patches in the tableam series for
> > bitmap, sample, analyze scans, might be an option.
>
> Given Andres' API concerns, and the short amount of time remaining in
> this CF, I'm not sure how much of this patch set we can expect to land
> in v12.  It seems like it might be a good idea to scale back our ambitions
> and see whether there's a useful subset we can push in easily.

I agree.  It'll take some time to digest Andres' advice and write a
better patch.

Should I update the CF app to a) set the target version to 13, and/or
b) move it to the next commitfest?

> With that in mind, I went ahead and pushed 0001+0004, since improving
> the planner's selectivity estimate for a "ctid vs constant" qual is
> likely to be helpful whether the executor is smart about it or not.

Cool.

> FWIW, I don't really see the point of treating 0002 as a separate patch.
> If it had some utility on its own, then it'd be sensible, but what
> would that be?  Also, it looks from 0002 like you are trying to overload
> rs_startblock with a different meaning than it has for syncscans, and
> I think that might be a bad idea.

Yeah I don't think either patch is useful without the other.  They
were separate because, initially, only some of the TidRangeScan
functionality depended on it, and I was particularly uncomfortable
with what I was doing to heapam.c.

The changes in heapam.c were required for backward scan support, as
used by ORDER BY ctid DESC and MAX(ctid); and also for FETCH LAST and
FETCH PRIOR.  I have removed the backward scans functionality from the
current set of patches, but support for backward cursor fetches
remains.

I guess to brutally simplify the patch further, we could give up
backward cursor fetches entirely?  This means such cursors that end up
using a TidRangeScan will require SCROLL to go backwards (which is a
small pain for user experience), but TBH I don't think backwards-going
cursors on CTID will be hugely common.

I'm still not familiar enough with heapam.c to have any better ideas
on how to support backward scanning a limited range.


Re: Tid scan improvements

From
David Steele
Date
On 3/26/19 8:11 AM, Edmund Horner wrote:
> On Tue, 26 Mar 2019 at 11:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Andres Freund <andres@anarazel.de> writes:
>>> I was kinda hoping to keep block numbers out of the "main" APIs, to
>>> avoid assuming everything is BLCKSZ based. I don't have a particular
>>> problem allowing an optional setscanlimits type callback that works with
>>> block numbers. The planner could check its presence and just not build
>>> tid range scans if not present.  Alternatively a bespoke scan API for
>>> tid range scans, like the later patches in the tableam series for
>>> bitmap, sample, analyze scans, might be an option.
>>
>> Given Andres' API concerns, and the short amount of time remaining in
>> this CF, I'm not sure how much of this patch set we can expect to land
>> in v12.  It seems like it might be a good idea to scale back our ambitions
>> and see whether there's a useful subset we can push in easily.
> 
> I agree.  It'll take some time to digest Andres' advice and write a
> better patch.
> 
> Should I update the CF app to a) set the target version to 13, and/or
> b) move it to the next commitfest?

If you plan to continue working on it in this CF then you can just 
change the target to PG13.  If you plan to take a break and pick up the 
work later then go ahead and push it to the next CF.

Regards,
-- 
-David
david@pgmasters.net


Re: Tid scan improvements

From
Thomas Munro
Date
On Tue, Mar 26, 2019 at 7:25 PM David Steele <david@pgmasters.net> wrote:
> On 3/26/19 8:11 AM, Edmund Horner wrote:
> > Should I update the CF app to a) set the target version to 13, and/or
> > b) move it to the next commitfest?
>
> If you plan to continue working on it in this CF then you can just
> change the target to PG13.  If you plan to take a break and pick up the
> work later then go ahead and push it to the next CF.

Hi Edmund,

The new CF is here.  I'm going through poking threads for submissions
that don't apply, but it sounds like this needs more than a rebase?
Perhaps this belongs in the next CF?

--
Thomas Munro
https://enterprisedb.com



Re: Tid scan improvements

From
David Rowley
Date
On Mon, 1 Jul 2019 at 23:29, Thomas Munro <thomas.munro@gmail.com> wrote:
> The new CF is here.  I'm going through poking threads for submissions
> that don't apply, but it sounds like this needs more than a rebase?
> Perhaps this belongs in the next CF?

0001 and 0004 of v7 got pushed in PG12. The CFbot will be trying to
apply 0001 still, but on testing 0002, no joy there either.

It would be good to see this back in PG13. For now, I'll mark it as
waiting on author.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Tid scan improvements

From
Edmund Horner
Date
On Thu, 4 Jul 2019 at 15:43, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On Mon, 1 Jul 2019 at 23:29, Thomas Munro <thomas.munro@gmail.com> wrote:
> > The new CF is here.  I'm going through poking threads for submissions
> > that don't apply, but it sounds like this needs more than a rebase?
> > Perhaps this belongs in the next CF?
>
> 0001 and 0004 of v7 got pushed in PG12. The CFbot will be trying to
> apply 0001 still, but on testing 0002, no joy there either.
>
> It would be good to see this back in PG13. For now, I'll mark it as
> waiting on author.

Hi,

I'm not really sure how to proceed.  I started with a fairly pragmatic
solution to "WHERE ctid > ? AND ctid < ?" for tables, and then tableam
came along.

The options I see are:

A.  Continue to target only heapam tables, making the bare minimum
changes necessary for the new tableam api.
B.  Try to do something more general that works on all tableam
implementations for which it may be useful.

There may not be much different between them, but B. means a bit more
research into zheap, zstore and other possible tableams.

Next question, how will the executor access the table?

1. Continue to use the seqscan tableam methods, by setting limits.
2. Use the bitmap scan methods, for instance by faking a BitmapIteratorResult.
3. Add new tableam methods specially for scanning a range of TIDs.

Any thoughts?



Re: Tid scan improvements

From
David Rowley
Date
On Sun, 7 Jul 2019 at 15:32, Edmund Horner <ejrh00@gmail.com> wrote:
> I'm not really sure how to proceed.  I started with a fairly pragmatic
> solution to "WHERE ctid > ? AND ctid < ?" for tables, and then tableam
> came along.
>
> The options I see are:
>
> A.  Continue to target only heapam tables, making the bare minimum
> changes necessary for the new tableam api.
> B.  Try to do something more general that works on all tableam
> implementations for which it may be useful.

Going by the conversation with Andres above:

On Tue, 26 Mar 2019 at 10:52, Andres Freund <andres@anarazel.de> wrote:
>
> On 2019-03-18 22:35:05 +1300, David Rowley wrote:
> > The commit message in 8586bf7ed mentions:
> >
> > > Subsequent commits will incrementally abstract table access
> > > functionality to be routed through table access methods. That change
> > > is too large to be reviewed & committed at once, so it'll be done
> > > incrementally.
> >
> > and looking at [1] I see patch 0004 introduces some changes in
> > nodeTidscan.c to call a new tableam API function named
> > heapam_fetch_row_version. I see this function does have a ItemPointer
> > argument, so I guess we must be keeping those as unique row
> > identifiers in the API.
>
> Right, we are. At least for now - there's some discussions around
> allowing different format for TIDs, to allow things like index organized
> tables, but that's for later.

So it seems that the plan is to insist that TIDs are tuple identifiers
for all table AMs, for now.  If that changes in the future, then so be
it, but I don't think that's cause for delaying any work on TID Range
Scans.  Also from scanning around tableam.h, I see that there's no
shortage of usages of BlockNumber, so it seems reasonable to assume
table AMs must use blocks... It's hard to imagine moving away from
that given that we have shared buffers.

We do appear to have some table AM methods that are optional, although
I'm not sure where the documentation is about that. For example, in
get_relation_info() I see:

info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
relation->rd_tableam->scan_bitmap_next_block != NULL;

so it appears that at least scan_bitmap_next_block is optional.

I think what I'd do would be to add a table_setscanlimits API method
for table AM and perhaps have the planner only add TID range scan
paths if the relation has a non-NULL function pointer for that API
function.  It would be good to stick a comment at least in tableam.h
that mentions that the callback is optional.  That might help a bit
when it comes to writing documentation on each API function and what
they do.

> There may not be much different between them, but B. means a bit more
> research into zheap, zstore and other possible tableams.
>
> Next question, how will the executor access the table?
>
> 1. Continue to use the seqscan tableam methods, by setting limits.
> 2. Use the bitmap scan methods, for instance by faking a BitmapIteratorResult.
> 3. Add new tableam methods specially for scanning a range of TIDs.
>
> Any thoughts?

I think #1 is fine for now. #3 might be slightly more efficient since
you'd not need to read each tuple in the given page before the given
offset and throw it away, but I don't really think it's worth adding a
new table AM method for.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Tid scan improvements

From
Edmund Horner
Date
On Thu, 11 Jul 2019 at 10:22, David Rowley <david.rowley@2ndquadrant.com> wrote:
> > A.  Continue to target only heapam tables, making the bare minimum
> > changes necessary for the new tableam api.
> > B.  Try to do something more general that works on all tableam
> > implementations for which it may be useful.
>
> Going by the conversation with Andres above:
>
> On Tue, 26 Mar 2019 at 10:52, Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2019-03-18 22:35:05 +1300, David Rowley wrote:
> > > The commit message in 8586bf7ed mentions:
> > >
> > > > Subsequent commits will incrementally abstract table access
> > > > functionality to be routed through table access methods. That change
> > > > is too large to be reviewed & committed at once, so it'll be done
> > > > incrementally.
> > >
> > > and looking at [1] I see patch 0004 introduces some changes in
> > > nodeTidscan.c to call a new tableam API function named
> > > heapam_fetch_row_version. I see this function does have a ItemPointer
> > > argument, so I guess we must be keeping those as unique row
> > > identifiers in the API.
> >
> > Right, we are. At least for now - there's some discussions around
> > allowing different format for TIDs, to allow things like index organized
> > tables, but that's for later.
>
> So it seems that the plan is to insist that TIDs are tuple identifiers
> for all table AMs, for now.  If that changes in the future, then so be
> it, but I don't think that's cause for delaying any work on TID Range
> Scans.  Also from scanning around tableam.h, I see that there's no
> shortage of usages of BlockNumber, so it seems reasonable to assume
> table AMs must use blocks... It's hard to imagine moving away from
> that given that we have shared buffers.
>
> We do appear to have some table AM methods that are optional, although
> I'm not sure where the documentation is about that. For example, in
> get_relation_info() I see:
>
> info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
> relation->rd_tableam->scan_bitmap_next_block != NULL;
>
> so it appears that at least scan_bitmap_next_block is optional.

> I think what I'd do would be to add a table_setscanlimits API method
> for table AM and perhaps have the planner only add TID range scan
> paths if the relation has a non-NULL function pointer for that API
> function.  It would be good to stick a comment at least in tableam.h
> that mentions that the callback is optional.  That might help a bit
> when it comes to writing documentation on each API function and what
> they do.

This seems like a good path forward.

> > There may not be much different between them, but B. means a bit more
> > research into zheap, zstore and other possible tableams.
> >
> > Next question, how will the executor access the table?
> >
> > 1. Continue to use the seqscan tableam methods, by setting limits.
> > 2. Use the bitmap scan methods, for instance by faking a BitmapIteratorResult.
> > 3. Add new tableam methods specially for scanning a range of TIDs.
> >
> > Any thoughts?
>
> I think #1 is fine for now. #3 might be slightly more efficient since
> you'd not need to read each tuple in the given page before the given
> offset and throw it away, but I don't really think it's worth adding a
> new table AM method for.

Yeah, it's not perfectly efficient in that regard.  But it's at least
a step in the right direction.

Thanks for the guidance David.  I think I'll be able to make some
progress now before hitting the next obstacle!

Edmund



Re: Tid scan improvements

From
Edmund Horner
Date
On Thu, 11 Jul 2019 at 10:22, David Rowley <david.rowley@2ndquadrant.com> wrote:
> So it seems that the plan is to insist that TIDs are tuple identifiers
> for all table AMs, for now.  If that changes in the future, then so be
> it, but I don't think that's cause for delaying any work on TID Range
> Scans.  Also from scanning around tableam.h, I see that there's no
> shortage of usages of BlockNumber, so it seems reasonable to assume
> table AMs must use blocks... It's hard to imagine moving away from
> that given that we have shared buffers.
>
> We do appear to have some table AM methods that are optional, although
> I'm not sure where the documentation is about that. For example, in
> get_relation_info() I see:
>
> info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
> relation->rd_tableam->scan_bitmap_next_block != NULL;
>
> so it appears that at least scan_bitmap_next_block is optional.
>
> I think what I'd do would be to add a table_setscanlimits API method
> for table AM and perhaps have the planner only add TID range scan
> paths if the relation has a non-NULL function pointer for that API
> function.  It would be good to stick a comment at least in tableam.h
> that mentions that the callback is optional.  That might help a bit
> when it comes to writing documentation on each API function and what
> they do.

Hi.  Here's a new patch.

Summary of changes compared to last time:
  - I've added the additional "scan_setlimits" table AM method.  To
check whether it's implemented in the planner, I have added an
additional "has_scan_setlimits" flag to RelOptInfo.  It seems to work.
  - I've also changed nodeTidrangescan to not require anything from heapam now.
  - To simplify the patch and avoid changing heapam, I've removed the
backward scan support (which was needed for FETCH LAST/PRIOR) and made
ExecSupportsBackwardScan return false for this plan type.
  - I've removed the vestigial passing of "direction" through
nodeTidrangescan.  If my understanding is correct, the direction
passed to TidRangeNext will always be forward.  But I did leave FETCH
LAST/PRIOR in the regression tests (after adding SCROLL to the
cursor).
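
The planner-side check in the first item follows the same pattern as the
amhasgetbitmap test quoted up-thread: an optional table AM callback is just a
possibly-NULL function pointer.  A self-contained sketch of that pattern (all
names here are illustrative stand-ins, not the actual PostgreSQL structs):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in; the real TableAmRoutine has many more members. */
typedef struct TableAmRoutineExample
{
    /* Optional: may be NULL if the AM cannot limit a scan by block range. */
    void        (*scan_setlimits) (void *scan, uint32_t startBlk,
                                   uint32_t numBlks);
} TableAmRoutineExample;

static void
heap_setlimits_stub(void *scan, uint32_t startBlk, uint32_t numBlks)
{
    (void) scan;
    (void) startBlk;
    (void) numBlks;             /* a heap-like AM would record the limits */
}

/*
 * Mirrors the amhasgetbitmap test quoted up-thread: decide whether TID
 * range paths are possible by checking the callback pointer for NULL, and
 * record the answer in a RelOptInfo-style flag for the planner.
 */
static bool
rel_has_scan_setlimits(const TableAmRoutineExample *am)
{
    return am != NULL && am->scan_setlimits != NULL;
}
```

A heap-like AM would fill in the pointer; an AM without block-range support
would leave it NULL and the planner would simply not build TID range paths.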

The patch now only targets the simple use case of restricting the
range of a table to scan.  I think it would be nice to eventually
support the other major use cases of ORDER BY ctid ASC/DESC and
MIN/MAX(ctid), but that can be another feature...

Edmund

Attachments

Re: Tid scan improvements

From
David Rowley
Date:
On Mon, 15 Jul 2019 at 17:54, Edmund Horner <ejrh00@gmail.com> wrote:
> Summary of changes compared to last time:
>   - I've added the additional "scan_setlimits" table AM method.  To
> check whether it's implemented in the planner, I have added an
> additional "has_scan_setlimits" flag to RelOptInfo.  It seems to work.
>   - I've also changed nodeTidrangescan to not require anything from heapam now.
>   - To simply the patch and avoid changing heapam, I've removed the
> backward scan support (which was needed for FETCH LAST/PRIOR) and made
> ExecSupportsBackwardScan return false for this plan type.
>   - I've removed the vestigial passing of "direction" through
> nodeTidrangescan.  If my understanding is correct, the direction
> passed to TidRangeNext will always be forward.  But I did leave FETCH
> LAST/PRIOR in the regression tests (after adding SCROLL to the
> cursor).

I spent some time today hacking at this.  I fixed a bug in how
has_scan_setlimits is set, rewrote a few comments and simplified some of
the code.

When I mentioned up-thread about the optional scan_setlimits table AM
callback, I'd forgotten that you'd not have access to check that
directly during planning. As you mention above, you've added
RelOptInfo has_scan_setlimits so the planner knows if it can use TID
Range scans or not. It would be nice to not have to add this flag, but
that would require either:

1. Making scan_setlimits a non-optional callback function in table AM, or;
2. Allowing the planner to have access to the opened Relation.

#2 is not for this patch, but there has been some talk about it. It
was done for the executor last year in d73f4c74dd3.

I wonder if Andres has any thoughts on #1?

The other thing I was thinking about was if enable_tidscan should be
in charge of TID Range scans too. I see you have it that way, but
should we be adding enable_tidrangescan?  The docs claim that
enable_tidscan: "Enables or disables the query planner's use of TID
scan plan types.". Note: "types" is  plural.  Maybe we could call that
fate and keep it the way the patch has it already.  Does anyone have
another idea about that?

I've attached a delta of the changes I made and also a complete v9 patch.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments

Re: Tid scan improvements

From
Edmund Horner
Date:
Thanks for the edits and fixing that pretty glaring copy-paste bug.

Regarding enable_tidscan, I couldn't decide whether we really need it,
and erred on the side of not adding yet another setting.

The current patch only creates a tid range path if there's at least
one ctid qual.  But during development of earlier patches I was a bit
concerned about the possibility of tid range scan being picked instead
of seq scan when the whole table is scanned, perhaps due to a tiny
discrepancy in costing.  Both scans will scan the whole table, but seq
scan is preferred since it can be parallelised, synchronised with
other scans, and has a bit less overhead with tuple checking.  If a
future change creates tid range paths for more queries, for instance
for MIN/MAX(ctid) or ORDER BY ctid, then it might be more important to
have a separate setting for it.

On Wed, 17 Jul 2019 at 23:11, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On Mon, 15 Jul 2019 at 17:54, Edmund Horner <ejrh00@gmail.com> wrote:
> > Summary of changes compared to last time:
> >   - I've added the additional "scan_setlimits" table AM method.  To
> > check whether it's implemented in the planner, I have added an
> > additional "has_scan_setlimits" flag to RelOptInfo.  It seems to work.
> >   - I've also changed nodeTidrangescan to not require anything from heapam now.
> >   - To simply the patch and avoid changing heapam, I've removed the
> > backward scan support (which was needed for FETCH LAST/PRIOR) and made
> > ExecSupportsBackwardScan return false for this plan type.
> >   - I've removed the vestigial passing of "direction" through
> > nodeTidrangescan.  If my understanding is correct, the direction
> > passed to TidRangeNext will always be forward.  But I did leave FETCH
> > LAST/PRIOR in the regression tests (after adding SCROLL to the
> > cursor).
>
> I spent some time today hacking at this.  I fixed a bug in how
> has_scan_setlimits set, rewrite a few comments and simplified some of
> the code.
>
> When I mentioned up-thread about the optional scan_setlimits table AM
> callback, I'd forgotten that you'd not have access to check that
> directly during planning. As you mention above, you've added
> RelOptInfo has_scan_setlimits so the planner knows if it can use TID
> Range scans or not. It would be nice to not have to add this flag, but
> that would require either:
>
> 1. Making scan_setlimits a non-optional callback function in table AM, or;
> 2. Allowing the planner to have access to the opened Relation.
>
> #2 is not for this patch, but there has been some talk about it. It
> was done for the executor last year in d73f4c74dd3.
>
> I wonder if Andres has any thoughts on #1?
>
> The other thing I was thinking about was if enable_tidscan should be
> in charge of TID Range scans too. I see you have it that way, but
> should we be adding enable_tidrangescan?  The docs claim that
> enable_tidscan: "Enables or disables the query planner's use of TID
> scan plan types.". Note: "types" is  plural.  Maybe we could call that
> fate and keep it the way the patch has it already.  Does anyone have
> another idea about that?
>
> I've attached a delta of the changes I made and also a complete v9 patch.
>
> --
>  David Rowley                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



Re: Tid scan improvements

From
Andres Freund
Date:
Hi,

On 2019-07-17 23:10:52 +1200, David Rowley wrote:
> When I mentioned up-thread about the optional scan_setlimits table AM
> callback, I'd forgotten that you'd not have access to check that
> directly during planning. As you mention above, you've added
> RelOptInfo has_scan_setlimits so the planner knows if it can use TID
> Range scans or not. It would be nice to not have to add this flag, but
> that would require either:

Is it really a problem to add that flag?  We've obviously so far not
cared about space in RelOptInfo; otherwise it'd have union members for
the per-RelOptInfo contents...


> 1. Making scan_setlimits a non-optional callback function in table AM, or;
> 2. Allowing the planner to have access to the opened Relation.

> #2 is not for this patch, but there has been some talk about it. It
> was done for the executor last year in d73f4c74dd3.
> 
> I wonder if Andres has any thoughts on #1?

I'm inclined to think that 1) isn't a good idea. I'd very much like to
avoid adding further dependencies on BlockNumber in non-optional APIs
(we ought to go the other way). Most of the current ones are at least
semi-reasonably implementable for most AMs (e.g. converting to PG
pagesize for relation_estimate_size isn't a problem), but it doesn't
seem to make sense to implement this for scan limits: Many AMs won't use
the BlockNumber/Offset split as heap does.

I think the AM part of this patch might be the wrong approach - it won't
do anything meaningful for an AM that doesn't directly map the ctid to a
specific location in a block (e.g. zedstore).  To me it seems the
callback ought to be to get a range of tids, and the tidrange scan
shouldn't do anything but determine the range of tids the AM should
return.
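
If the scan node's only job is to determine the TID range, it would fold the
quals down to a single [mintid, maxtid] pair before handing it to the AM.  A
self-contained sketch of that folding (types and names are illustrative, not
taken from the patch):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t    block;
    uint16_t    offset;
} TidExample;

typedef enum
{
    TID_LOWER_BOUND,            /* from a >= / > style qual */
    TID_UPPER_BOUND             /* from a <= / < style qual */
} TidBoundKindExample;

typedef struct
{
    TidBoundKindExample kind;
    TidExample  tid;            /* bound value, already made inclusive */
} TidBoundExample;

static int
tid_cmp(TidExample a, TidExample b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/*
 * Fold a list of inclusive bounds into one [*mintid, *maxtid] range: the
 * tightest lower bound wins, as does the tightest upper bound.  Returns
 * false if the resulting range is empty.
 */
static bool
compute_tid_range(const TidBoundExample *bounds, size_t nbounds,
                  TidExample *mintid, TidExample *maxtid)
{
    *mintid = (TidExample) {0, 0};
    *maxtid = (TidExample) {UINT32_MAX, UINT16_MAX};

    for (size_t i = 0; i < nbounds; i++)
    {
        if (bounds[i].kind == TID_LOWER_BOUND)
        {
            if (tid_cmp(bounds[i].tid, *mintid) > 0)
                *mintid = bounds[i].tid;
        }
        else if (tid_cmp(bounds[i].tid, *maxtid) < 0)
            *maxtid = bounds[i].tid;
    }
    return tid_cmp(*mintid, *maxtid) <= 0;
}
```

The AM then receives only the folded pair and is free to decide which pages
(if it even has pages) contain tuples in that range.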

- Andres



Re: Tid scan improvements

From
David Rowley
Date:
On Thu, 18 Jul 2019 at 14:30, Andres Freund <andres@anarazel.de> wrote:
> I think the AM part of this patch might be the wrong approach - it won't
> do anything meaningful for an AM that doesn't directly map the ctid to a
> specific location in a block (e.g. zedstore).  To me it seems the
> callback ought to be to get a range of tids, and the tidrange scan
> shouldn't do anything but determine the range of tids the AM should
> return.

Sounds like that's going to require adding some new fields to
HeapScanDescData, then some callback similar to heap_setscanlimits to
set those fields.

Then, we'd either need to:

1. Make the table_scan_getnextslot() implementations check the tuple
falls within the range, or
2. add another callback that pays attention to the set TID range.

The problem with #1 is that would add overhead to normal seqscans,
which seems like a bad idea.

Did you imagined two additional callbacks, 1 to set the TID range,
then one to scan it?  Duplicating the logic in heapgettup_pagemode()
and heapgettup() looks pretty horrible, but I guess we could add a
wrapper around it that loops until it gets the first tuple and bails
once it scans beyond the final tuple.

Is that what you had in mind?
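
A wrapper of that shape, boiled down to a self-contained sketch (the array of
TIDs stands in for the underlying forward heap scan; none of these names are
the real heapam functions):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t    block;
    uint16_t    offset;
} TidExample;

static int
tid_cmp(TidExample a, TidExample b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/*
 * Loop over an underlying forward scan (here just TIDs in ascending order)
 * until the first tuple in [mintid, maxtid], and bail out once the scan
 * goes beyond maxtid.  Returns the index of the next in-range tuple, or -1
 * when the scan is done.
 */
static int
getnext_in_range(const TidExample *tids, size_t ntids, size_t *pos,
                 TidExample mintid, TidExample maxtid)
{
    while (*pos < ntids)
    {
        size_t      i = (*pos)++;

        if (tid_cmp(tids[i], mintid) < 0)
            continue;           /* before the range: skip and keep going */
        if (tid_cmp(tids[i], maxtid) > 0)
            return -1;          /* past the range: no more matches */
        return (int) i;
    }
    return -1;
}
```

This is exactly the "read and throw away" cost mentioned earlier: tuples
before the range are fetched only to be discarded.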

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Tid scan improvements

From
Andres Freund
Date:
Hi,

On 2019-07-19 13:54:59 +1200, David Rowley wrote:
> On Thu, 18 Jul 2019 at 14:30, Andres Freund <andres@anarazel.de> wrote:
> > I think the AM part of this patch might be the wrong approach - it won't
> > do anything meaningful for an AM that doesn't directly map the ctid to a
> > specific location in a block (e.g. zedstore).  To me it seems the
> > callback ought to be to get a range of tids, and the tidrange scan
> > shouldn't do anything but determine the range of tids the AM should
> > return.
> 
> Sounds like that's going to require adding some new fields to
> HeapScanDescData, then some callback similar to heap_setscanlimits to
> set those fields.
> 
> Then, we'd either need to:
> 
> 1. Make the table_scan_getnextslot() implementations check the tuple
> falls within the range, or
> 2. add another callback that pays attention to the set TID range.

> The problem with #1 is that would add overhead to normal seqscans,
> which seems like a bad idea.
> 
> Did you imagined two additional callbacks, 1 to set the TID range,
> then one to scan it?  Duplicating the logic in heapgettup_pagemode()
> and heapgettup() looks pretty horrible, but I guess we could add a
> wrapper around it that loops until it gets the first tuple and bails
> once it scans beyond the final tuple.
> 
> Is that what you had in mind?

Yea, I was thinking of something like 2.  We already have a few extra
types of scan nodes (bitmap heap and sample scan), it'd not be bad to
add another. And as you say, they can just share most of the guts: For
heap I'd just implement pagemode, and perhaps split heapgettup_pagemode
into two parts (one to do the page processing, the other to determine
the relevant page).

You say that we'd need new fields in HeapScanDescData - not so sure
about that, it seems feasible to just provide the boundaries in the
call? But I think it'd also just be fine to have the additional fields.

Greetings,

Andres Freund



Re: Tid scan improvements

From
David Rowley
Date:
On Sat, 20 Jul 2019 at 06:21, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-07-19 13:54:59 +1200, David Rowley wrote:
> > Did you imagined two additional callbacks, 1 to set the TID range,
> > then one to scan it?  Duplicating the logic in heapgettup_pagemode()
> > and heapgettup() looks pretty horrible, but I guess we could add a
> > wrapper around it that loops until it gets the first tuple and bails
> > once it scans beyond the final tuple.
> >
> > Is that what you had in mind?
>
> Yea, I was thinking of something like 2.  We already have a few extra
> types of scan nodes (bitmap heap and sample scan), it'd not be bad to
> add another. And as you say, they can just share most of the guts: For
> heap I'd just implement pagemode, and perhaps split heapgettup_pagemode
> into two parts (one to do the page processing, the other to determine
> the relevant page).
>
> You say that we'd need new fields in HeapScanDescData - not so sure
> about that, it seems feasible to just provide the boundaries in the
> call? But I think it'd also just be fine to have the additional fields.

Thanks for explaining.

I've set the CF entry for the patch back to waiting on author.

I think if we get this part the way Andres would like it, then we're
pretty close.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Tid scan improvements

From
Edmund Horner
Date:
On Mon, 22 Jul 2019 at 19:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On Sat, 20 Jul 2019 at 06:21, Andres Freund <andres@anarazel.de> wrote:
> > Yea, I was thinking of something like 2.  We already have a few extra
> > types of scan nodes (bitmap heap and sample scan), it'd not be bad to
> > add another. And as you say, they can just share most of the guts: For
> > heap I'd just implement pagemode, and perhaps split heapgettup_pagemode
> > into two parts (one to do the page processing, the other to determine
> > the relevant page).
> >
> > You say that we'd need new fields in HeapScanDescData - not so sure
> > about that, it seems feasible to just provide the boundaries in the
> > call? But I think it'd also just be fine to have the additional fields.
>
> Thanks for explaining.
>
> I've set the CF entry for the patch back to waiting on author.
>
> I think if we get this part the way Andres would like it, then we're
> pretty close.

I've moved the code in question into heapam, with:

  * a new scan type SO_TYPE_TIDRANGE (renumbering some of the other
    enums).

  * a wrapper table_beginscan_tidrange(Relation rel, Snapshot snapshot);
    I'm not sure whether we need scankeys and the other flags?

  * a new optional callback scan_settidrange(TableScanDesc scan,
    ItemPointer startTid, ItemPointer endTid) with wrapper
table_scan_settidrange.
    I thought about combining it with table_beginscan_tidrange but we're not
    really doing that with any of the other beginscan methods.

  * another optional callback scan_getnexttidrangeslot.  The presence of
    these two callbacks indicates that TidRangeScan is supported for
this relation.

Let me know if you can think of better names.

However, the heap_getnexttidrangeslot function is just the same
iterative code calling heap_getnextslot and checking the tuples
against the tid range (two fields added to the ScanDesc).

I'll have to spend a bit of time looking at heapgettup_pagemode to
figure how to split it as Andres suggests.

Thanks,
Edmund



Re: Tid scan improvements

From
Edmund Horner
Date:
On Mon, 22 Jul 2019 at 19:44, Edmund Horner <ejrh00@gmail.com> wrote:
> On Mon, 22 Jul 2019 at 19:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
> > On Sat, 20 Jul 2019 at 06:21, Andres Freund <andres@anarazel.de> wrote:
> > > Yea, I was thinking of something like 2.  We already have a few extra
> > > types of scan nodes (bitmap heap and sample scan), it'd not be bad to
> > > add another. And as you say, they can just share most of the guts: For
> > > heap I'd just implement pagemode, and perhaps split heapgettup_pagemode
> > > into two parts (one to do the page processing, the other to determine
> > > the relevant page).
> > >
> > > You say that we'd need new fields in HeapScanDescData - not so sure
> > > about that, it seems feasible to just provide the boundaries in the
> > > call? But I think it'd also just be fine to have the additional fields.
> >
> > Thanks for explaining.
> >
> > I've set the CF entry for the patch back to waiting on author.
> >
> > I think if we get this part the way Andres would like it, then we're
> > pretty close.

> [...]

> I'll have to spend a bit of time looking at heapgettup_pagemode to
> figure how to split it as Andres suggests.

Hi everyone,

Sadly it is the end of the CF and I have not had much time to work on
this.  I'll probably get some time in the next month to look at the
heapam changes.

Should we move to CF?  We have been in the CF cycle for almost a year now...

Edmund



Re: Tid scan improvements

From
Thomas Munro
Date:
On Thu, Aug 1, 2019 at 5:34 PM Edmund Horner <ejrh00@gmail.com> wrote:
> Sadly it is the end of the CF and I have not had much time to work on
> this.  I'll probably get some time in the next month to look at the
> heapam changes.
>
> Should we move to CF?  We have been in the CF cycle for almost a year now...

Hi Edmund,

No worries, that's how it goes sometimes.  Please move it to the next
CF if you think you'll find some time before September.  Don't worry
if it might have to be moved again.  We want the feature, and good
things take time!

-- 
Thomas Munro
https://enterprisedb.com



Re: Tid scan improvements

From
Edmund Horner
Date:
On Thu, 1 Aug 2019 at 17:47, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Aug 1, 2019 at 5:34 PM Edmund Horner <ejrh00@gmail.com> wrote:
> > Should we move to CF?  We have been in the CF cycle for almost a year now...
>
> Hi Edmund,
>
> No worries, that's how it goes sometimes.  Please move it to the next
> CF if you think you'll find some time before September.  Don't worry
> if it might have to be moved again.  We want the feature, and good
> things take time!

Ok thanks.

I tried moving it to the new commitfest, but it seems I cannot from
its current state.

If it's ok, I'll just leave it to you in 7 hours' time!

Edmund



Re: Tid scan improvements

From
Thomas Munro
Date:
On Thu, Aug 1, 2019 at 5:51 PM Edmund Horner <ejrh00@gmail.com> wrote:
> On Thu, 1 Aug 2019 at 17:47, Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Thu, Aug 1, 2019 at 5:34 PM Edmund Horner <ejrh00@gmail.com> wrote:
> > > Should we move to CF?  We have been in the CF cycle for almost a year now...
> >
> > Hi Edmund,
> >
> > No worries, that's how it goes sometimes.  Please move it to the next
> > CF if you think you'll find some time before September.  Don't worry
> > if it might have to be moved again.  We want the feature, and good
> > things take time!
>
> Ok thanks.
>
> I tried moving it to the new commitfest, but it seems I cannot from
> its current state.

Done.  You have to move it to "Needs review" first, and then move to
next.  (Perhaps we should change that... I don't think that obstacle
achieves anything?)

-- 
Thomas Munro
https://enterprisedb.com



Re: Tid scan improvements

From
David Rowley
Date:
On Thu, 1 Aug 2019 at 17:57, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Aug 1, 2019 at 5:51 PM Edmund Horner <ejrh00@gmail.com> wrote:
> > I tried moving it to the new commitfest, but it seems I cannot from
> > its current state.
>
> Done.  You have to move it to "Needs review" first, and then move to
> next.  (Perhaps we should change that... I don't think that obstacle
> achieves anything?)

I think it's there as a measure to try to trim down the number of
patches that are constantly bounced to the next 'fest that are still
waiting on author.  In my experience, it's a little annoying since if
you don't set it to "needs review" first, then it means closing the
patch and having to create a new CF entry when you're ready, with all
the history of the previous one lost.

It seems reasonable to me to keep the patch in the queue if the author
is still actively working on the patch and it seems pretty unfair if a
last-minute review came in just before the end of the CF and that
means their patch must be "returned with feedback" instead of pushed
to the next 'fest.

Perhaps there are other measures we could take to reduce the number of
patches getting kicked out to the next CF all the time. Maybe some
icons that appear if it's been waiting on author for more than 2
months, or if it went through an entire CF as waiting on author.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Tid scan improvements

From
Alvaro Herrera
Date:
On 2019-Aug-01, Edmund Horner wrote:

> Hi everyone,
> 
> Sadly it is the end of the CF and I have not had much time to work on
> this.  I'll probably get some time in the next month to look at the
> heapam changes.
> 
> Should we move to CF?  We have been in the CF cycle for almost a year now...

Hello, do we have an update on this patch?  Last version that was posted
was v9 from David on July 17th; you said you had made some changes on
July 22nd but didn't attach any patch.  v9 doesn't apply anymore.

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Tid scan improvements

From
Edmund Horner
Date:
On Wed, 4 Sep 2019 at 10:34, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2019-Aug-01, Edmund Horner wrote:
>
> > Hi everyone,
> >
> > Sadly it is the end of the CF and I have not had much time to work on
> > this.  I'll probably get some time in the next month to look at the
> > heapam changes.
> >
> > Should we move to CF?  We have been in the CF cycle for almost a year now...
>
> Hello, do we have an update on this patch?  Last version that was posted
> was v9 from David on July 17th; you said you had made some changes on
> July 22nd but didn't attach any patch.  v9 doesn't apply anymore.

Hi pgsql-hackers,

I have had a lot of difficulty making the changes to heapam.c and I
think it's becoming obvious I won't get them done by myself.

The last *working* patch pushed the work into heapam.c, but it was
just a naive loop over the whole table.  I haven't figured out how to
rewrite heapgettup_pagemode in the way Andres suggests.

So, I think we need to either get some help from someone familiar with
heapam.c, or maybe shelve the patch.  It has been work in progress for
over a year now.

Edmund



Re: Tid scan improvements

From
Michael Paquier
Date:
On Thu, Sep 05, 2019 at 01:06:56PM +1200, Edmund Horner wrote:
> So, I think we need to either get some help from someone familiar with
> heapam.c, or maybe shelve the patch.  It has been work in progress for
> over a year now.

Okay, still nothing has happened after two months.  Once this is
solved a new patch submission could be done.  For now I have marked
the entry as returned with feedback.
--
Michael

Attachments

Re: Tid scan improvements

From
David Fetter
Date:
On Sun, Dec 01, 2019 at 11:34:16AM +0900, Michael Paquier wrote:
> On Thu, Sep 05, 2019 at 01:06:56PM +1200, Edmund Horner wrote:
> > So, I think we need to either get some help from someone familiar with
> > heapam.c, or maybe shelve the patch.  It has been work in progress for
> > over a year now.
> 
> Okay, still nothing has happened after two months.  Once this is
> solved a new patch submission could be done.  For now I have marked
> the entry as returned with feedback.  

I dusted off the last version of the patch without implementing the
suggestions in this thread between here and there.

I think this is a capability worth having, as I was surprised when it
turned out that it didn't exist when I was looking to make an
improvement to pg_dump. My idea, which I'll get back to when this is
in, was to use special knowledge of heap AM tables to make it possible
to parallelize dumps of large tables by working separately on each
underlying file, of which there could be quite a few for a large one.

Will try to understand the suggestions upthread better and implement
same.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachments

Re: Tid scan improvements

From
Edmund Horner
Date:
On Fri, 1 Jan 2021 at 14:30, David Fetter <david@fetter.org> wrote:
> On Sun, Dec 01, 2019 at 11:34:16AM +0900, Michael Paquier wrote:
> > Okay, still nothing has happened after two months.  Once this is
> > solved a new patch submission could be done.  For now I have marked
> > the entry as returned with feedback.
>
> I dusted off the last version of the patch without implementing the
> suggestions in this thread between here and there.
>
> I think this is a capability worth having, as I was surprised when it
> turned out that it didn't exist when I was looking to make an
> improvement to pg_dump. My idea, which I'll get back to when this is
> in, was to use special knowledge of heap AM tables to make it possible
> to parallelize dumps of large tables by working separately on each
> underlying file, of which there could be quite a few for a large one.
>
> Will try to understand the suggestions upthread better and implement
> same.

Hi David,

Thanks for updating the patch.  I'd be very happy if this got picked up again, and I'd certainly follow along and do some review.

+                               splan->tidrangequals =
+                                       fix_scan_list(root, splan->tidrangequals,
+                                                                 rtoffset, 1); /* v9_tid XXX Not sure this is right */

I'm pretty sure the parameter num_exec = 1 is fine; it seems to only affect correlated subselects, which shouldn't really make their way into the tidrangequals expressions.  It's more or less the same situation as tidquals for TidPath, anyway.  We could put a little comment:  /* correlated subselects shouldn't get into tidquals/tidrangequals */ or something to that effect.

Edmund

Re: Tid scan improvements

From
David Rowley
Date:
On Wed, 13 Jan 2021 at 15:38, Edmund Horner <ejrh00@gmail.com> wrote:
> Thanks for updating the patch.  I'd be very happy if this got picked up again, and I'd certainly follow along and do
> some review.


Likewise here. I think this patch was pretty close, so it seems a shame to
let it slip through the cracks.

I spoke to Andres off-list about this patch. He mentioned that he
wasn't too keen on seeing the setscanlimits being baked into the table
AM API.  He mentioned that he'd rather not assume too much about table
AMs having all of their tids in a given range consecutively on a set
of pages.  That seems reasonable to me.  He suggested that we add a
new callback that just specifies a range of ItemPointers to scan and
leaves it up to the implementation to decide which pages should be
scanned to fetch the tuples within the given range.  I didn't argue. I
just went and coded it all, hopefully to Andres' description. The new
table AM callback is optional.

I've implemented this in the attached.

I also took the time to support backwards TID Range scans and added a
few more tests to test rescans.  I just found it annoying that TID
Scans supported backwards scans and TID Range scans did not.

The 0002 patch is the guts of it. The 0001 patch is an existing bug
that needs to be fixed before 0002 could go in (backwards TID Range
Scans are broken without this). I've posted separately about this bug
in [1].

I also didn't really like the idea of adding possibly lots of bool
fields to RelOptInfo to describe what the planner can do in regards to
what the given table AM supports.   I know that IndexOptInfo has such
a set of bool fields.  I'd rather not repeat that, so I just went with
a single int field named "amflags" and just added a single constant to
define a flag that specifies if the rel supports scanning ranges of
TIDs.
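
That is, a single bit-flag field rather than one boolean per capability;
roughly like this (the flag name follows the patch, while the struct here is
a cut-down illustration of RelOptInfo):

```c
#include <stdbool.h>

/* One int of bit flags instead of a bool field per table AM capability. */
#define AMFLAG_HAS_TID_RANGE (1 << 0)

/* Cut-down stand-in for the planner's RelOptInfo. */
typedef struct RelOptInfoExample
{
    int         amflags;        /* capabilities of the rel's table AM */
} RelOptInfoExample;

/* The planner would only build TID Range paths when this returns true. */
static bool
rel_supports_tid_range_scan(const RelOptInfoExample *rel)
{
    return (rel->amflags & AMFLAG_HAS_TID_RANGE) != 0;
}
```

Further capabilities can later claim the remaining bits without growing the
struct.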

Edmund, will you get time to a look at this?

David

[1] https://www.postgresql.org/message-id/CAApHDvpGc9h0_oVD2CtgBcxCS1N-qDYZSeBRnUh+0CWJA9cMaA@mail.gmail.com

Attachments

Re: Tid scan improvements

From
David Rowley
Date:
On Thu, 21 Jan 2021 at 18:16, David Rowley <dgrowleyml@gmail.com> wrote:
> I've implemented this in the attached.

The bug fix in 0001 is now committed, so I'm just attaching the 0002
patch again after having rebased... This is mostly just to keep the
CFbot happy.

David

Attachments

Re: Tid scan improvements

From
Zhihong Yu
Date:
Hi,

bq. within this range.  Table AMs where scanning ranges of TIDs does not make
sense or is difficult to implement efficiently may choose to not implement

Is there a criterion for judging efficiency?

+       if (tidopexpr->exprtype == TIDEXPR_LOWER_BOUND)
...
+       if (tidopexpr->exprtype == TIDEXPR_UPPER_BOUND)

The if statement for the upper bound should be prefixed with 'else', right?

+ * TidRecheck -- access method routine to recheck a tuple in EvalPlanQual
...
+TidRangeRecheck(TidRangeScanState *node, TupleTableSlot *slot)

The method name in the comment doesn't match the real method name.

+ *     ExecTidRangeScan(node)
...
+ExecTidRangeScan(PlanState *pstate)

Parameter names don't match.

Cheers

On Mon, Jan 25, 2021 at 5:23 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Thu, 21 Jan 2021 at 18:16, David Rowley <dgrowleyml@gmail.com> wrote:
> > I've implemented this in the attached.
>
> The bug fix in 0001 is now committed, so I'm just attaching the 0002
> patch again after having rebased... This is mostly just to keep the
> CFbot happy.
>
> David

Re: Tid scan improvements

From
David Rowley
Date:
Thanks for having a look at this.

On Tue, 26 Jan 2021 at 15:48, Zhihong Yu <zyu@yugabyte.com> wrote:
> bq. within this range.  Table AMs where scanning ranges of TIDs does not make
> sense or is difficult to implement efficiently may choose to not implement
>
> Is there criterion on how to judge efficiency ?

For example, if the AM had no way to index such a column and the
method needed to scan the entire table to find TIDs in that range. The
planner may as well just pick a SeqScan. If that's the case, then the
table AM may as well not bother implementing that function.

> +       if (tidopexpr->exprtype == TIDEXPR_LOWER_BOUND)
> ...
> +       if (tidopexpr->exprtype == TIDEXPR_UPPER_BOUND)
>
> The if statement for upper bound should be prefixed with 'else', right ?

Yeah, thanks.

> + * TidRecheck -- access method routine to recheck a tuple in EvalPlanQual
> ...
> +TidRangeRecheck(TidRangeScanState *node, TupleTableSlot *slot)
>
> The method name in the comment doesn't match the real method name.

Well noticed.

> + *     ExecTidRangeScan(node)
> ...
> +ExecTidRangeScan(PlanState *pstate)
>
> Parameter names don't match.

hmm. Looking around, it seems there are lots of other places that do
this.  I think the comment is really just indicating that the
function is taking an executor node state as a parameter.

Have a look at: git grep -E "\s\*.*\(node\)$"
That shows the other places. Some of these have the parameter named
"node", and many others have some other name.

I've made the two changes locally. Since the two issues were mostly
cosmetic, I'll hold off posting an updated patch until some bigger
changes are required.

Thanks again for looking at this.

David



Re: Tid scan improvements

From:
David Rowley
Date:
Hi,

On 2021-01-26 14:22:42 +1300, David Rowley wrote:
> diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
> index 33bffb6815..d1c608b176 100644
> --- a/src/include/access/tableam.h
> +++ b/src/include/access/tableam.h
> @@ -325,6 +325,26 @@ typedef struct TableAmRoutine
>                                       ScanDirection direction,
>                                       TupleTableSlot *slot);
>  
> +    /*
> +     * Return next tuple from `scan` where TID is within the defined range.
> +     * This behaves like scan_getnextslot but only returns tuples from the
> +     * given range of TIDs.  Ranges are inclusive.  This function is optional
> +     * and may be set to NULL if TID range scans are not supported by the AM.
> +     *
> +     * Implementations of this function must themselves handle ItemPointers
> +     * of any value, i.e., they must handle each of the following:
> +     *
> +     * 1) mintid or maxtid is beyond the end of the table; and
> +     * 2) mintid is above maxtid; and
> +     * 3) item offset for mintid or maxtid is beyond the maximum offset
> +     * allowed by the AM.
> +     */
> +    bool        (*scan_getnextslot_inrange) (TableScanDesc scan,
> +                                             ScanDirection direction,
> +                                             TupleTableSlot *slot,
> +                                             ItemPointer mintid,
> +                                             ItemPointer maxtid);
> +
>  
>      /* ------------------------------------------------------------------------
>       * Parallel table scan related functions.
> @@ -1015,6 +1035,36 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
>      return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
>  }
>  
> +/*
> + * Return next tuple from defined TID range from `scan` and store in slot.
> + */
> +static inline bool
> +table_scan_getnextslot_inrange(TableScanDesc sscan, ScanDirection direction,
> +                               TupleTableSlot *slot, ItemPointer mintid,
> +                               ItemPointer maxtid)
> +{
> +    /*
> +     * The planner should never make a plan which uses this function when the
> +     * table AM has not defined any function for this callback.
> +     */
> +    Assert(sscan->rs_rd->rd_tableam->scan_getnextslot_inrange != NULL);
> +
> +    slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
> +
> +    /*
> +     * We don't expect direct calls to table_scan_getnextslot_inrange with
> +     * valid CheckXidAlive for catalog or regular tables.  See detailed
> +     * comments in xact.c where these variables are declared.
> +     */
> +    if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
> +        elog(ERROR, "unexpected table_scan_getnextslot_inrange call during logical decoding");
> +
> +    return sscan->rs_rd->rd_tableam->scan_getnextslot_inrange(sscan,
> +                                                              direction,
> +                                                              slot,
> +                                                              mintid,
> +                                                              maxtid);
> +}


I don't really like that this API relies on mintid/maxtid to stay the
same across multiple scan_getnextslot_inrange() calls. I think we'd at
least need to document that that's required and what needs to be done to
scan a different set of min/max tid (or re-scan the same min/max from
scratch).

Perhaps something like

typedef struct TableScanTidRange TableScanTidRange;

TableScanTidRange* table_scan_tid_range_start(TableScanDesc sscan, ItemPointer mintid, ItemPointer maxtid);
bool table_scan_tid_range_nextslot(TableScanDesc sscan, TableScanTidRange *range, TupleTableSlot *slot);
void table_scan_tid_range_end(TableScanDesc sscan, TableScanTidRange* range);

would work better? That'd allow an AM to have arbitrarily large state
for a tid range scan, would make it clear what the lifetime of the
ItemPointer mintid, ItemPointer maxtid are etc.  Wouldn't, on the API
level, prevent multiple tid range scans from being in progress at the
same times though :(. Perhaps we could add a TableScanTidRange* pointer
to TableScanDesc which'd be checked/set by tableam.h which'd prevent that?

Greetings,

Andres Freund



Re: Tid scan improvements

From:
Andres Freund
Date:
Thanks for looking at this.

On Thu, 4 Feb 2021 at 10:19, Andres Freund <andres@anarazel.de> wrote:
> Perhaps something like
>
> typedef struct TableScanTidRange TableScanTidRange;
>
> TableScanTidRange* table_scan_tid_range_start(TableScanDesc sscan, ItemPointer mintid, ItemPointer maxtid);
> bool table_scan_tid_range_nextslot(TableScanDesc sscan, TableScanTidRange *range, TupleTableSlot *slot);
> void table_scan_tid_range_end(TableScanDesc sscan, TableScanTidRange* range);
>
> would work better? That'd allow an AM to have arbitrarily large state
> for a tid range scan, would make it clear what the lifetime of the
> ItemPointer mintid, ItemPointer maxtid are etc.  Wouldn't, on the API
> level, prevent multiple tid range scans from being in progress at the
> same times though :(. Perhaps we could add a TableScanTidRange* pointer
> to TableScanDesc which'd be checked/set by tableam.h which'd prevent that?

Maybe the TableScanTidRange can just have a field to store the
TableScanDesc. That way table_scan_tid_range_nextslot and
table_scan_tid_range_end can just pass the TableScanTidRange pointer.

That way it seems like it would be ok for multiple scans to be going
on concurrently as nobody should be reusing the TableScanDesc.

Does that seem ok?

David



Re: Tid scan improvements

From:
David Rowley
Date:
On Thu, 4 Feb 2021 at 10:31, David Rowley <dgrowleyml@gmail.com> wrote:
>
> Thanks for looking at this.
>
> On Thu, 4 Feb 2021 at 10:19, Andres Freund <andres@anarazel.de> wrote:
> > Perhaps something like
> >
> > typedef struct TableScanTidRange TableScanTidRange;
> >
> > TableScanTidRange* table_scan_tid_range_start(TableScanDesc sscan, ItemPointer mintid, ItemPointer maxtid);
> > bool table_scan_tid_range_nextslot(TableScanDesc sscan, TableScanTidRange *range, TupleTableSlot *slot);
> > void table_scan_tid_range_end(TableScanDesc sscan, TableScanTidRange* range);
> >
> > would work better? That'd allow an AM to have arbitrarily large state
> > for a tid range scan, would make it clear what the lifetime of the
> > ItemPointer mintid, ItemPointer maxtid are etc.  Wouldn't, on the API
> > level, prevent multiple tid range scans from being in progress at the
> > same times though :(. Perhaps we could add a TableScanTidRange* pointer
> > to TableScanDesc which'd be checked/set by tableam.h which'd prevent that?
>
> Maybe the TableScanTidRange can just have a field to store the
> TableScanDesc. That way table_scan_tid_range_nextslot and
> table_scan_tid_range_end can just pass the TableScanTidRange pointer.
>
> That way it seems like it would be ok for multiple scans to be going
> on concurrently as nobody should be reusing the TableScanDesc.

I ended up adding just two new API functions to table AM.

void (*scan_set_tid_range) (TableScanDesc sscan,
                            ItemPointer mintid,
                            ItemPointer maxtid);

and
bool (*scan_tid_range_nextslot) (TableScanDesc sscan,
                                 ScanDirection direction,
                                 TupleTableSlot *slot);

I added an additional function in tableam.h that does not have a
corresponding API function:

static inline TableScanDesc
table_tid_range_start(Relation rel, Snapshot snapshot,
  ItemPointer mintid,
  ItemPointer maxtid)

This just calls the standard scan_begin then calls scan_set_tid_range
setting the specified mintid and maxtid.

I also added 2 new fields to TableScanDesc:

ItemPointerData rs_mintid;
ItemPointerData rs_maxtid;

I didn't quite see a need to have a new start and end scan API function.

Updated patch attached.

David

Attachments

Re: Tid scan improvements

From:
David Rowley
Date:
On Thu, 4 Feb 2021 at 23:51, David Rowley <dgrowleyml@gmail.com> wrote:
> Updated patch attached.

I made another pass over this patch and did a bit of renaming work
around the heap_* functions and the tableam functions. I think the new
names are a bit more aligned to the existing names.

I don't really see anything else that I'm unhappy with about this
patch, so pending any objections or last-minute reviews, I plan to
push it later this week.

David

Attachments

Re: Tid scan improvements

From:
David Fetter
Date:
On Tue, Feb 16, 2021 at 10:22:41PM +1300, David Rowley wrote:
> On Thu, 4 Feb 2021 at 23:51, David Rowley <dgrowleyml@gmail.com> wrote:
> > Updated patch attached.
> 
> I made another pass over this patch and did a bit of renaming work
> around the heap_* functions and the tableam functions. I think the new
> names are a bit more aligned to the existing names.

Thanks! I'm looking forward to making use of this :)

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Tid scan improvements

From:
Andres Freund
Date:
Hi,

On 2021-02-04 23:51:39 +1300, David Rowley wrote:

> I ended up adding just two new API functions to table AM.
> 
> void (*scan_set_tid_range) (TableScanDesc sscan,
>    ItemPointer mintid,
>    ItemPointer maxtid);
> 
> and
> bool (*scan_tid_range_nextslot) (TableScanDesc sscan,
> ScanDirection direction,
> TupleTableSlot *slot);
> 
> I added an additional function in tableam.h that does not have a
> corresponding API function:
> 
> static inline TableScanDesc
> table_tid_range_start(Relation rel, Snapshot snapshot,
>   ItemPointer mintid,
>   ItemPointer maxtid)
> 
> This just calls the standard scan_begin then calls scan_set_tid_range
> setting the specified mintid and maxtid.

Hm. But that means we can't rescan?


> I also added 2 new fields to TableScanDesc:
> 
> ItemPointerData rs_mintid;
> ItemPointerData rs_maxtid;
> 
> I didn't quite see a need to have a new start and end scan API function.

Yea. I guess they're not that large. Avoiding that was one of the two
reasons to have a separate scan state somewhere. The other was that it
seemed like it'd possibly be a bit cleaner API-wise to deal with rescan.


> +bool
> +heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
> +                          TupleTableSlot *slot)
> +{
> +    HeapScanDesc scan = (HeapScanDesc) sscan;
> +    ItemPointer mintid = &sscan->rs_mintid;
> +    ItemPointer maxtid = &sscan->rs_maxtid;
> +
> +    /* Note: no locking manipulations needed */
> +    for (;;)
> +    {
> +        if (sscan->rs_flags & SO_ALLOW_PAGEMODE)
> +            heapgettup_pagemode(scan, direction, sscan->rs_nkeys, sscan->rs_key);
> +        else
> +            heapgettup(scan, direction, sscan->rs_nkeys, sscan->rs_key);
> +
> +        if (scan->rs_ctup.t_data == NULL)
> +        {
> +            ExecClearTuple(slot);
> +            return false;
> +        }
> +
> +        /*
> +         * heap_set_tidrange will have used heap_setscanlimits to limit the
> +         * range of pages we scan to only ones that can contain the TID range
> +         * we're scanning for.  Here we must filter out any tuples from these
> +         * pages that are outwith that range.
> +         */
> +        if (ItemPointerCompare(&scan->rs_ctup.t_self, mintid) < 0)
> +        {
> +            ExecClearTuple(slot);
> +
> +            /*
> +             * When scanning backwards, the TIDs will be in descending order.
> +             * Future tuples in this direction will be lower still, so we can
> +             * just return false to indicate there will be no more tuples.
> +             */
> +            if (ScanDirectionIsBackward(direction))
> +                return false;
> +
> +            continue;
> +        }
> +
> +        /*
> +         * Likewise for the final page, we must filter out TIDs greater than
> +         * maxtid.
> +         */
> +        if (ItemPointerCompare(&scan->rs_ctup.t_self, maxtid) > 0)
> +        {
> +            ExecClearTuple(slot);
> +
> +            /*
> +             * When scanning forward, the TIDs will be in ascending order.
> +             * Future tuples in this direction will be higher still, so we can
> +             * just return false to indicate there will be no more tuples.
> +             */
> +            if (ScanDirectionIsForward(direction))
> +                return false;
> +            continue;
> +        }
> +
> +        break;
> +    }
> +
> +    /*
> +     * if we get here it means we have a new current scan tuple, so point to
> +     * the proper return buffer and return the tuple.
> +     */
> +    pgstat_count_heap_getnext(scan->rs_base.rs_rd);

I wonder if there's an argument for counting the misses above via
pgstat_count_heap_fetch()? Probably not, right?


> +#define IsCTIDVar(node)  \
> +    ((node) != NULL && \
> +     IsA((node), Var) && \
> +     ((Var *) (node))->varattno == SelfItemPointerAttributeNumber && \
> +     ((Var *) (node))->varlevelsup == 0)
> +
> +typedef enum
> +{
> +    TIDEXPR_UPPER_BOUND,
> +    TIDEXPR_LOWER_BOUND
> +} TidExprType;
> +
> +/* Upper or lower range bound for scan */
> +typedef struct TidOpExpr
> +{
> +    TidExprType exprtype;        /* type of op; lower or upper */
> +    ExprState  *exprstate;        /* ExprState for a TID-yielding subexpr */
> +    bool        inclusive;        /* whether op is inclusive */
> +} TidOpExpr;
> +
> +/*
> + * For the given 'expr', build and return an appropriate TidOpExpr taking into
> + * account the expr's operator and operand order.
> + */
> +static TidOpExpr *
> +MakeTidOpExpr(OpExpr *expr, TidRangeScanState *tidstate)
> +{
> +    Node       *arg1 = get_leftop((Expr *) expr);
> +    Node       *arg2 = get_rightop((Expr *) expr);
> +    ExprState  *exprstate = NULL;
> +    bool        invert = false;
> +    TidOpExpr  *tidopexpr;
> +
> +    if (IsCTIDVar(arg1))
> +        exprstate = ExecInitExpr((Expr *) arg2, &tidstate->ss.ps);
> +    else if (IsCTIDVar(arg2))
> +    {
> +        exprstate = ExecInitExpr((Expr *) arg1, &tidstate->ss.ps);
> +        invert = true;
> +    }
> +    else
> +        elog(ERROR, "could not identify CTID variable");
> +
> +    tidopexpr = (TidOpExpr *) palloc(sizeof(TidOpExpr));
> +    tidopexpr->inclusive = false;    /* for now */
> +
> +    switch (expr->opno)
> +    {
> +        case TIDLessEqOperator:
> +            tidopexpr->inclusive = true;
> +            /* fall through */
> +        case TIDLessOperator:
> +            tidopexpr->exprtype = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
> +            break;
> +        case TIDGreaterEqOperator:
> +            tidopexpr->inclusive = true;
> +            /* fall through */
> +        case TIDGreaterOperator:
> +            tidopexpr->exprtype = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
> +            break;
> +        default:
> +            elog(ERROR, "could not identify CTID operator");
> +    }
> +
> +    tidopexpr->exprstate = exprstate;
> +
> +    return tidopexpr;
> +}
> +
> +/*
> + * Extract the qual subexpressions that yield TIDs to search for,
> + * and compile them into ExprStates if they're ordinary expressions.
> + */
> +static void
> +TidExprListCreate(TidRangeScanState *tidrangestate)
> +{
> +    TidRangeScan *node = (TidRangeScan *) tidrangestate->ss.ps.plan;
> +    List       *tidexprs = NIL;
> +    ListCell   *l;
> +
> +    foreach(l, node->tidrangequals)
> +    {
> +        OpExpr       *opexpr = lfirst(l);
> +        TidOpExpr  *tidopexpr;
> +
> +        if (!IsA(opexpr, OpExpr))
> +            elog(ERROR, "could not identify CTID expression");
> +
> +        tidopexpr = MakeTidOpExpr(opexpr, tidrangestate);
> +        tidexprs = lappend(tidexprs, tidopexpr);
> +    }
> +
> +    tidrangestate->trss_tidexprs = tidexprs;
> +}

Architecturally it feels like this is something that really belongs more
into plan time?


> +/*
> + * table_set_tidrange resets the minimum and maximum TID range to scan for a
> + * TableScanDesc created by table_beginscan_tidrange.
> + */
> +static inline void
> +table_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
> +                   ItemPointer maxtid)
> +{
> +    /* Ensure table_beginscan_tidrange() was used. */
> +    Assert((sscan->rs_flags & SO_TYPE_TIDRANGESCAN) != 0);
> +
> +    sscan->rs_rd->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
> +}

How does this interact with rescans?

Greetings,

Andres Freund



Re: Tid scan improvements

From:
David Rowley
Date:
Thanks for having a look at this.

On Wed, 17 Feb 2021 at 11:05, Andres Freund <andres@anarazel.de> wrote:
>
> On 2021-02-04 23:51:39 +1300, David Rowley wrote:
> > and
> > bool (*scan_tid_range_nextslot) (TableScanDesc sscan,
> > ScanDirection direction,
> > TupleTableSlot *slot);
> >
> > I added an additional function in tableam.h that does not have a
> > corresponding API function:
> >
> > static inline TableScanDesc
> > table_tid_range_start(Relation rel, Snapshot snapshot,
> >   ItemPointer mintid,
> >   ItemPointer maxtid)
> >
> > This just calls the standard scan_begin then calls scan_set_tid_range
> > setting the specified mintid and maxtid.
>
> Hm. But that means we can't rescan?

It might not be perfect, but to rescan, we must call table_rescan()
then table_set_tidrange() before calling the
table_scan_getnextslot_tidrange() function.

> > +bool
> > +heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
> > +                                               TupleTableSlot *slot)
> > +{
> > +     HeapScanDesc scan = (HeapScanDesc) sscan;
> > +     ItemPointer mintid = &sscan->rs_mintid;
> > +     ItemPointer maxtid = &sscan->rs_maxtid;
> > +
> > +     /* Note: no locking manipulations needed */
> > +     for (;;)
> > +     {
> > +             if (sscan->rs_flags & SO_ALLOW_PAGEMODE)
> > +                     heapgettup_pagemode(scan, direction, sscan->rs_nkeys, sscan->rs_key);
> > +             else
> > +                     heapgettup(scan, direction, sscan->rs_nkeys, sscan->rs_key);
> > +
> > +             if (scan->rs_ctup.t_data == NULL)
> > +             {
> > +                     ExecClearTuple(slot);
> > +                     return false;
> > +             }
> > +
> > +             /*
> > +              * heap_set_tidrange will have used heap_setscanlimits to limit the
> > +              * range of pages we scan to only ones that can contain the TID range
> > +              * we're scanning for.  Here we must filter out any tuples from these
> > +              * pages that are outwith that range.
> > +              */
> > +             if (ItemPointerCompare(&scan->rs_ctup.t_self, mintid) < 0)
> > +             {
> > +                     ExecClearTuple(slot);
> > +
> > +                     /*
> > +                      * When scanning backwards, the TIDs will be in descending order.
> > +                      * Future tuples in this direction will be lower still, so we can
> > +                      * just return false to indicate there will be no more tuples.
> > +                      */
> > +                     if (ScanDirectionIsBackward(direction))
> > +                             return false;
> > +
> > +                     continue;
> > +             }
> > +
> > +             /*
> > +              * Likewise for the final page, we must filter out TIDs greater than
> > +              * maxtid.
> > +              */
> > +             if (ItemPointerCompare(&scan->rs_ctup.t_self, maxtid) > 0)
> > +             {
> > +                     ExecClearTuple(slot);
> > +
> > +                     /*
> > +                      * When scanning forward, the TIDs will be in ascending order.
> > +                      * Future tuples in this direction will be higher still, so we can
> > +                      * just return false to indicate there will be no more tuples.
> > +                      */
> > +                     if (ScanDirectionIsForward(direction))
> > +                             return false;
> > +                     continue;
> > +             }
> > +
> > +             break;
> > +     }
> > +
> > +     /*
> > +      * if we get here it means we have a new current scan tuple, so point to
> > +      * the proper return buffer and return the tuple.
> > +      */
> > +     pgstat_count_heap_getnext(scan->rs_base.rs_rd);
>
> I wonder if there's an argument for counting the misses above via
> pgstat_count_heap_fetch()? Probably not, right?

I'm a bit undecided about that. In theory, we're doing the heap
fetches of tuples on the target page which are outside of the range so
we should maybe count them. On the other hand, it might be a little
confusing for very observant users if they see the fetches going up
for the tuples we skip over in TID Range scans.

> > +#define IsCTIDVar(node)  \
> > +     ((node) != NULL && \
> > +      IsA((node), Var) && \
> > +      ((Var *) (node))->varattno == SelfItemPointerAttributeNumber && \
> > +      ((Var *) (node))->varlevelsup == 0)
> > +
> > +typedef enum
> > +{
> > +     TIDEXPR_UPPER_BOUND,
> > +     TIDEXPR_LOWER_BOUND
> > +} TidExprType;
> > +
> > +/* Upper or lower range bound for scan */
> > +typedef struct TidOpExpr
> > +{
> > +     TidExprType exprtype;           /* type of op; lower or upper */
> > +     ExprState  *exprstate;          /* ExprState for a TID-yielding subexpr */
> > +     bool            inclusive;              /* whether op is inclusive */
> > +} TidOpExpr;
> > +
> > +/*
> > + * For the given 'expr', build and return an appropriate TidOpExpr taking into
> > + * account the expr's operator and operand order.
> > + */
> > +static TidOpExpr *
> > +MakeTidOpExpr(OpExpr *expr, TidRangeScanState *tidstate)
> > +{
> > +     Node       *arg1 = get_leftop((Expr *) expr);
> > +     Node       *arg2 = get_rightop((Expr *) expr);
> > +     ExprState  *exprstate = NULL;
> > +     bool            invert = false;
> > +     TidOpExpr  *tidopexpr;
> > +
> > +     if (IsCTIDVar(arg1))
> > +             exprstate = ExecInitExpr((Expr *) arg2, &tidstate->ss.ps);
> > +     else if (IsCTIDVar(arg2))
> > +     {
> > +             exprstate = ExecInitExpr((Expr *) arg1, &tidstate->ss.ps);
> > +             invert = true;
> > +     }
> > +     else
> > +             elog(ERROR, "could not identify CTID variable");
> > +
> > +     tidopexpr = (TidOpExpr *) palloc(sizeof(TidOpExpr));
> > +     tidopexpr->inclusive = false;   /* for now */
> > +
> > +     switch (expr->opno)
> > +     {
> > +             case TIDLessEqOperator:
> > +                     tidopexpr->inclusive = true;
> > +                     /* fall through */
> > +             case TIDLessOperator:
> > +                     tidopexpr->exprtype = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
> > +                     break;
> > +             case TIDGreaterEqOperator:
> > +                     tidopexpr->inclusive = true;
> > +                     /* fall through */
> > +             case TIDGreaterOperator:
> > +                     tidopexpr->exprtype = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
> > +                     break;
> > +             default:
> > +                     elog(ERROR, "could not identify CTID operator");
> > +     }
> > +
> > +     tidopexpr->exprstate = exprstate;
> > +
> > +     return tidopexpr;
> > +}
> > +
> > +/*
> > + * Extract the qual subexpressions that yield TIDs to search for,
> > + * and compile them into ExprStates if they're ordinary expressions.
> > + */
> > +static void
> > +TidExprListCreate(TidRangeScanState *tidrangestate)
> > +{
> > +     TidRangeScan *node = (TidRangeScan *) tidrangestate->ss.ps.plan;
> > +     List       *tidexprs = NIL;
> > +     ListCell   *l;
> > +
> > +     foreach(l, node->tidrangequals)
> > +     {
> > +             OpExpr     *opexpr = lfirst(l);
> > +             TidOpExpr  *tidopexpr;
> > +
> > +             if (!IsA(opexpr, OpExpr))
> > +                     elog(ERROR, "could not identify CTID expression");
> > +
> > +             tidopexpr = MakeTidOpExpr(opexpr, tidrangestate);
> > +             tidexprs = lappend(tidexprs, tidopexpr);
> > +     }
> > +
> > +     tidrangestate->trss_tidexprs = tidexprs;
> > +}
>
> Architecturally it feels like this is something that really belongs more
> into plan time?

Possibly. It would mean TidOpExpr would have to become a Node type.
TID Range scan is really just following the lead set by TID Scan here.

I'm not sure if it's worth the trouble making these Node types for the
small gains there'd be in the performance of having the planner make
them once for prepared queries rather than them having to be built
during each execution.

Do you think it is?

> > +/*
> > + * table_set_tidrange resets the minimum and maximum TID range to scan for a
> > + * TableScanDesc created by table_beginscan_tidrange.
> > + */
> > +static inline void
> > +table_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
> > +                                ItemPointer maxtid)
> > +{
> > +     /* Ensure table_beginscan_tidrange() was used. */
> > +     Assert((sscan->rs_flags & SO_TYPE_TIDRANGESCAN) != 0);
> > +
> > +     sscan->rs_rd->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
> > +}
>
> How does this interact with rescans?

We must call table_rescan() before calling table_set_tidrange() again.
That perhaps could be documented better. I'm just unsure if that
should be documented in tableam.h or if it's a restriction that only
needs to exist in heapam.c

David



Re: Tid scan improvements

From:
David Rowley
Date:
On Thu, 18 Feb 2021 at 09:45, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 17 Feb 2021 at 11:05, Andres Freund <andres@anarazel.de> wrote:
> > Architecturally it feels like this is something that really belongs more
> > into plan time?
>
> Possibly. It would mean TidOpExpr would have to become a Node type.
> TID Range scan is really just following the lead set by TID Scan here.
>
> I'm not sure if it's worth the trouble making these Node types for the
> small gains there'd be in the performance of having the planner make
> them once for prepared queries rather than them having to be built
> during each execution.

I changed the code around and added a new Node type to the planner and
made it create the TidRangeExpr during planning.

However, I'm pretty much set on this being pretty horrible and I ended
up ripping it back out again.  The reason is that there's quite a bit
of extra boilerplate code that goes with the new node type. e.g it
must be handled in setrefs.c.  EXPLAIN also needs to know about the
new Node. That either means teaching the deparse code about
TidRangeExprs or having the Plan node carry along the OpExprs just so
we can make EXPLAIN work.  Translating between the two might be
possible, but it just seemed like too much code, and I started feeling
pretty bad about the whole idea.

> > > +/*
> > > + * table_set_tidrange resets the minimum and maximum TID range to scan for a
> > > + * TableScanDesc created by table_beginscan_tidrange.
> > > + */
> > > +static inline void
> > > +table_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
> > > +                                ItemPointer maxtid)
> > > +{
> > > +     /* Ensure table_beginscan_tidrange() was used. */
> > > +     Assert((sscan->rs_flags & SO_TYPE_TIDRANGESCAN) != 0);
> > > +
> > > +     sscan->rs_rd->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
> > > +}
> >
> > How does this interact with rescans?
>
> We must call table_rescan() before calling table_set_tidrange() again.
> That perhaps could be documented better. I'm just unsure if that
> should be documented in tableam.h or if it's a restriction that only
> needs to exist in heapam.c

I've changed things around so that we no longer explicitly call
table_rescan() in nodeTidrangescan.c. Instead table_set_tidrange()
does a rescan call. I also adjusted the documentation to mention that
changing the tid range starts the scan again.  This does mean we'll do
a ->scan_rescan() the first time we do table_set_tidrange(). I'm not
all that sure that matters.

v15 attached.

David

Attachments

Re: Tid scan improvements

From:
Peter Eisentraut
Date:
This patch didn't add _outTidRangePath() even though we have outNode() 
coverage for most/all path nodes.  Was this just forgotten?  See 
attached patch.

Attachments

Re: Tid scan improvements

From:
Edmund Horner
Date:
On Mon, 7 Jun 2021 at 22:11, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
This patch didn't add _outTidRangePath() even though we have outNode()
coverage for most/all path nodes.  Was this just forgotten?  See
attached patch.

Yes, it looks like an omission.  Thanks for spotting it.  Patch looks good to me.

Edmund


Re: Tid scan improvements

From:
David Rowley
Date:
On Mon, 7 Jun 2021 at 23:46, Edmund Horner <ejrh00@gmail.com> wrote:
>
> On Mon, 7 Jun 2021 at 22:11, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>>
>> This patch didn't add _outTidRangePath() even though we have outNode()
>> coverage for most/all path nodes.  Was this just forgotten?  See
>> attached patch.
>
>
> Yes, it looks like an omission.  Thanks for spotting it.  Patch looks good to me.

Yeah. That was forgotten.  Patch also looks fine to me.  Do you want
to push it, Peter?

David



Re: Tid scan improvements

From:
Peter Eisentraut
Date:
On 07.06.21 13:50, David Rowley wrote:
> On Mon, 7 Jun 2021 at 23:46, Edmund Horner <ejrh00@gmail.com> wrote:
>>
>> On Mon, 7 Jun 2021 at 22:11, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>>>
>>> This patch didn't add _outTidRangePath() even though we have outNode()
>>> coverage for most/all path nodes.  Was this just forgotten?  See
>>> attached patch.
>>
>>
>> Yes, it looks like an omission.  Thanks for spotting it.  Patch looks good to me.
> 
> Yeah. That was forgotten.  Patch also looks fine to me.  Do you want
> to push it, Peter?

done