Discussion: Tid scan improvements


Tid scan improvements

From:
Edmund Horner
Date:
Hello,

To scratch an itch, I have been working on teaching TidScan how to do
range queries, i.e. those using >=, <, BETWEEN, etc.  This means we
can write, for instance,

    SELECT * FROM t WHERE ctid >= '(1000,0)' AND ctid < '(2000,0)';

instead of resorting to the old trick:

    SELECT * FROM t WHERE ctid = ANY (ARRAY(SELECT format('(%s,%s)', i, j)::tid
    FROM generate_series(1000,1999) AS gs(i), generate_series(1,200)
AS gs2(j)));

where "200" is some guess at how many tuples can fit on a page for that table.

There's some previous discussion about this at

https://www.postgresql.org/message-id/flat/CAHyXU0zJhg_5RtxKnNbAK%3D4ZzQEFUFi%2B52RjpLrxtkRTD6CDFw%40mail.gmail.com#3ba2c3a6be217f40130655a3250d80a4
.

Since range scan execution is rather different from the existing
TidScan execution, I ended up making a new plan type, TidRangeScan.
There is still only one TidPath, but it has an additional member that
describes which method to use.

As part of the work I also taught TidScan that its results are ordered
by ctid, i.e. to set a pathkey on a TidPath.  The benefit of this is
that queries such as

    SELECT MAX(ctid) FROM t;
    SELECT * FROM t WHERE ctid IN (...) ORDER BY ctid;

are now planned a bit more efficiently.  Execution was already
returning tuples in ascending ctid order; I just had to add support
for descending order.

Attached are a couple of patches:
  - 01_tid_scan_ordering.patch
  - 02_tid_range_scan.patch, to be applied on top of 01.

Can I add this to the next CommitFest?

Obviously the whole thing needs thorough review, and I expect there to
be numerous problems.  (I had to make this prototype to demonstrate to
myself that it wasn't completely beyond me.  I know from experience
how easy it is to enthusiastically volunteer something for an open
source project, discover that one does not have the time or skill
required, and be too embarrassed to show one's face again!)

As well as actual correctness, some aspects that I am particularly
unsure about include:

  - Is it messy to use TidPath for both types of scan?
  - What is the planning cost for plans that don't end up being a
TidScan or TidRangeScan?
  - Have I put the various helper functions in the right files?
  - Is there a less brittle way to create tables of a specific number
of blocks/tuples in the regression tests?
  - Have I got the ScanDirection right during execution?
  - Are my changes to heapam ok?

Cheers,
Edmund

Attachments

Re: Tid scan improvements

From:
David Rowley
Date:
On 12 August 2018 at 14:29, Edmund Horner <ejrh00@gmail.com> wrote:
> To scratch an itch, I have been working on teaching TidScan how to do
> range queries, i.e. those using >=, <, BETWEEN, etc.  This means we
> can write, for instance,
>
>     SELECT * FROM t WHERE ctid >= '(1000,0)' AND ctid < '(2000,0)';

I think this will be useful to UPDATE records at the end of a bloated
table to move them into space that's been freed up by vacuum to allow
the table to be trimmed back to size again.

> Since range scan execution is rather different from the existing
> TidScan execution, I ended up making a new plan type, TidRangeScan.
> There is still only one TidPath, but it has an additional member that
> describes which method to use.

I always thought that this would be implemented by overloading
TidScan.  I thought that TidListEval() could be modified to remove
duplicates accounting for range scans. For example:

SELECT * FROM t WHERE ctid BETWEEN '(0,1)' AND '(0,10)' OR ctid
IN ('(0,5)','(0,30)');

would first sort all the tids along with their operator and then make
a pass over the sorted array to remove any equality ctids that are
redundant because they're covered in a range.
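The redundancy pass suggested above can be sketched as follows (illustrative Python, not PostgreSQL code; tids are modelled as (block, offset) tuples and all names are hypothetical):

```python
# Hypothetical sketch of the de-duplication pass described above:
# sort the equality tids, then drop any that are already covered by
# one of the range quals, leaving only the tids that need their own
# single-tuple fetch.

def remove_redundant_tids(ranges, tids):
    """ranges: list of ((lo_tid, lo_incl), (hi_tid, hi_incl));
    tids: list of (block, offset) tuples from equality quals."""
    def in_range(tid, rng):
        (lo, lo_incl), (hi, hi_incl) = rng
        above = tid >= lo if lo_incl else tid > lo
        below = tid <= hi if hi_incl else tid < hi
        return above and below

    return [t for t in sorted(set(tids))
            if not any(in_range(t, r) for r in ranges)]

# ctid BETWEEN '(0,1)' AND '(0,10)' OR ctid IN ('(0,5)','(0,30)'):
# (0,5) falls inside the range, so only (0,30) survives.
ranges = [(((0, 1), True), ((0, 10), True))]
print(remove_redundant_tids(ranges, [(0, 5), (0, 30)]))  # [(0, 30)]
```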

> As part of the work I also taught TidScan that its results are ordered
> by ctid, i.e. to set a pathkey on a TidPath.  The benefit of this is
> that queries such as
>
>     SELECT MAX(ctid) FROM t;
>     SELECT * FROM t WHERE ctid IN (...) ORDER BY ctid;

I think that can be done as I see you're passing allow_sync as false
in heap_beginscan_strat(), so the scan will start at the beginning of
the heap.

> Attached are a couple of patches:
>   - 01_tid_scan_ordering.patch
>   - 02_tid_range_scan.patch, to be applied on top of 01.
>
> Can I add this to the next CommitFest?

Please do.

> As well as actual correctness, some aspects that I am particularly
> unsure about include:
>
>   - Is it messy to use TidPath for both types of scan?

I wonder if there is a good reason to have a separate node type at
all? I've not looked, but if you've managed to overload the TidPath
struct without it getting out of control, then perhaps the same can be
done with the node type.

>   - What is the planning cost for plans that don't end up being a
> TidScan or TidRangeScan?

I suppose that wouldn't matter if there was just 1 path for a single node type.

>   - Is there a less brittle way to create tables of a specific number
> of blocks/tuples in the regression tests?

Perhaps you could just populate a table with some number of records
then DELETE the ones above ctid (x,100) on each page, where 100 is
whatever you can be certain will fit on a page on any platform. I'm
not quite sure if our regress test would pass with a very small block
size anyway, but probably worth verifying that before you write the
first test that will break it.

I'll try to look in a bit more detail during the commitfest.

It's perhaps a minor detail at this stage, but generally, we don't
have code lines over 80 chars in length. There are some exceptions,
e.g. not breaking error message strings so that they're easily
greppable.  src/tools/pgindent has a tool that you can run to fix the
whitespace so it's in line with project standard.

Thanks for working on this. It will be great to see improvements made in this area.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From:
Robert Haas
Date:
On Sun, Aug 12, 2018 at 8:07 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Thanks for working on this. It will be great to see improvements made in this area.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Tid scan improvements

From:
Edmund Horner
Date:
On 12 August 2018 at 20:07, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> Since range scan execution is rather different from the existing
>> TidScan execution, I ended up making a new plan type, TidRangeScan.
>> There is still only one TidPath, but it has an additional member that
>> describes which method to use.
>
> I always thought that this would be implemented by overloading
> TidScan.  I thought that TidListEval() could be modified to remove
> duplicates accounting for range scans. For example:
>
> SELECT * FROM t WHERE ctid BETWEEN '(0,1)' AND '(0,10)' OR ctid
> IN ('(0,5)','(0,30)');
>
> would first sort all the tids along with their operator and then make
> a pass over the sorted array to remove any equality ctids that are
> redundant because they're covered in a range.

Initially, I figured that 99% of the time, the user either wants to
filter by a specific set of ctids (such as those returned by a
subquery), or wants to process a table in blocks.  Picking up an
OR-set of ctid-conditions and determining which parts should be picked
up row-wise (as in the existing code) versus which parts are blocks
that should be scanned -- and ensuring that any overlaps were removed
-- seemed more complicated than it was worth.

Having thought about it, I think what you propose might be worth it;
at least it limits us to a single TidScan plan to maintain.

The existing code:
  - Looks for a qual that's an OR-list of (ctid = ?) or (ctid IN (?))
  - Costs it by assuming each matching tuple is a separate page.
  - When beginning the scan, evaluates all the ?s and builds an array
of tids to fetch.
  - Sorts and remove duplicates.
  - Iterates over the array, fetching tuples.

So we'd extend that to:
  - Include in the OR-list "range" subquals of the form (ctid > ? AND
ctid < ?) (either side could be optional, and we have to deal with >=
and <= and having ctid on the rhs, etc.).
  - Cost the range subquals by assuming they don't overlap, and
estimating how many blocks and tuples they span.
  - When beginning the scan, evaluate all the ?s and build an array of
"tid ranges" to fetch.  A tid range is a struct with a starting tid,
and an ending tid, and might just be a single tid item.
  - Sort and remove duplicates.
  - Iterate over the array, using a single fetch for single-item tid
ranges, and starting/ending a heap scan for multi-item tid ranges.

I think I'll try implementing this.
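The sort-and-merge step in the plan above can be sketched like this (illustrative Python, not the patch's C code; names and representation are hypothetical):

```python
# Illustrative sketch of the proposed steps: build an array of tid
# ranges (a single tid becomes a degenerate one-item range), sort
# them, and merge overlapping or duplicate ranges so each heap region
# is visited exactly once.  The executor would then use a single
# fetch for one-item ranges and a bounded heap scan for the rest.
# Tids are (block, offset) tuples.

def merge_tid_ranges(ranges):
    """ranges: list of (start_tid, end_tid), inclusive on both ends."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps (or duplicates) the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

ranges = [((0, 1), (0, 10)), ((0, 8), (2, 1))]   # overlapping ranges
single_tids = [((5, 3), (5, 3))]                 # a one-item "range"
print(merge_tid_ranges(ranges + single_tids))
# [((0, 1), (2, 1)), ((5, 3), (5, 3))]
```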

>> As part of the work I also taught TidScan that its results are ordered
>> by ctid, i.e. to set a pathkey on a TidPath.  The benefit of this is
>> that queries such as
>>
>>     SELECT MAX(ctid) FROM t;
>>     SELECT * FROM t WHERE ctid IN (...) ORDER BY ctid;
>
> I think that can be done as I see you're passing allow_sync as false
> in heap_beginscan_strat(), so the scan will start at the beginning of
> the heap.

I found that heap scan caters to parallel scans, synchronised scans,
and block range indexing; but it didn't quite work for my case of
specifying a subset of a table and scanning backward or forward over
it.  Hence my changes.  I'm not overly familiar with the heap scan
code though.

>>   - Is there a less brittle way to create tables of a specific number
>> of blocks/tuples in the regression tests?
>
> Perhaps you could just populate a table with some number of records
> then DELETE the ones above ctid (x,100) on each page, where 100 is
> whatever you can be certain will fit on a page on any platform. I'm
> not quite sure if our regress test would pass with a very small block
> size anyway, but probably worth verifying that before you write the
> first test that will break it.

I don't think I've tested with extreme block sizes.

> I'll try to look in a bit more detail during the commitfest.
>
> It's perhaps a minor detail at this stage, but generally, we don't
> have code lines over 80 chars in length. There are some exceptions,
> e.g. not breaking error message strings so that they're easily
> greppable.  src/tools/pgindent has a tool that you can run to fix the
> whitespace so it's in line with project standard.

I'll try to get pgindent running before my next patch.

Thanks for the comments!


Re: Tid scan improvements

From:
David Rowley
Date:
On 15 August 2018 at 11:11, Edmund Horner <ejrh00@gmail.com> wrote:
> So we'd extend that to:
>   - Include in the OR-list "range" subquals of the form (ctid > ? AND
> ctid < ?) (either side could be optional, and we have to deal with >=
> and <= and having ctid on the rhs, etc.).
>   - Cost the range subquals by assuming they don't overlap, and
> estimating how many blocks and tuples they span.
>   - When beginning the scan, evaluate all the ?s and build an array of
> "tid ranges" to fetch.  A tid range is a struct with a starting tid,
> and an ending tid, and might just be a single tid item.
>   - Sort and remove duplicates.
>   - Iterate over the array, using a single fetch for single-item tid
> ranges, and starting/ending a heap scan for multi-item tid ranges.
>
> I think I'll try implementing this.

I've set this patch as waiting on author in the commitfest app.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Tid scan improvements

From:
Edmund Horner
Date:
On Mon, 17 Sep 2018 at 23:21, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 15 August 2018 at 11:11, Edmund Horner <ejrh00@gmail.com> wrote:
>> So we'd extend that to:
>>   - Include in the OR-list "range" subquals of the form (ctid > ? AND
>> ctid < ?) (either side could be optional, and we have to deal with >=
>> and <= and having ctid on the rhs, etc.).
>>   - Cost the range subquals by assuming they don't overlap, and
>> estimating how many blocks and tuples they span.
>>   - When beginning the scan, evaluate all the ?s and build an array of
>> "tid ranges" to fetch.  A tid range is a struct with a starting tid,
>> and an ending tid, and might just be a single tid item.
>>   - Sort and remove duplicates.
>>   - Iterate over the array, using a single fetch for single-item tid
>> ranges, and starting/ending a heap scan for multi-item tid ranges.
>>
>> I think I'll try implementing this.
>
>
> I've set this patch as waiting on author in the commitfest app.

Thanks David.

Between work I have found time here and there to work on it, but
making a path type that handles all the above turns out to be
surprisingly harder than my tid range scan.

In the earlier discussion from 2012, Tom Lane said:

> Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > On Wed, Jun 13, 2012 at 03:21:17PM -0500, Merlin Moncure wrote:
> >> IMNSHO, it's a no-brainer for the todo (but I think it's more
> >> complicated than adding some comparisons -- which are working now):
>
> > I see. Seems we have to add index smarts to those comparisons. That
> > might be complicated.
>
> Uh, the whole point of a TID scan is to *not* need an index.
>
> What would be needed is for tidpath.c to let through more kinds of TID
> comparison quals than it does now, and then for nodeTidscan.c to know
> what to do with them. The latter logic might well look something like
> btree indexscan qual preparation, but it wouldn't be the same code.

I have been generally following this approach (handling more kinds of
TID comparisons), and have found myself doing things like pairing up >
with <, estimating how much of a table is covered by some set of >, <,
or "> AND <" quals, etc.  Things that I'm sure are handled in an
advanced way by index paths; unfortunately I didn't see any easily
reusable code in the index path code.  So I've ended up writing
special-case code for TID scans.  Hopefully it will be worth it.

Edmund


Re: Tid scan improvements

From:
David Rowley
Date:
On 19 September 2018 at 18:04, Edmund Horner <ejrh00@gmail.com> wrote:
> I have been generally following this approach (handling more kinds of
> TID comparisons), and have found myself doing things like pairing up >
> with <, estimating how much of a table is covered by some set of >, <,
> or "> AND <" quals, etc.  Things that I'm sure are handled in an
> advanced way by index paths; unfortunately I didn't see any easily
> reusable code in the index path code.  So I've ended up writing
> special-case code for TID scans.  Hopefully it will be worth it.

I don't think it would need to be as complex as the index matching
code. Just looping over the quals and gathering up all compatible ctid
quals should be fine.  I imagine the complex handling of sorting the
quals by ctid and removal of redundant quals that are covered by some
range would be done in the executor.

Probably the costing will get more complex. At the moment it seems we
add a random_page_cost per ctid, but you'd probably need to make that
better and loop over the quals in each implicitly ANDed set and find
the max ctid for the > / >= quals and the min < / <= ctid, then
get the page number from each and assume max - min seq_page_cost, then
add random_page_cost for any remaining equality quals.  The costs from
other OR branches can likely just be added on.  This would double
count if someone did WHERE ctid BETWEEN '(0,0)' AND '(100,300)' OR
ctid BETWEEN '(0,0)' AND '(100,300)';  The current code seems to
double count now for duplicate ctids anyway. It even double counts if
the ctid being compared to is on the same page as another ctid, so I
don't think that would be unacceptable.
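As a rough sketch of the costing idea above (assumed cost constants and hypothetical names; this is one reading of the suggestion, not planner code): for each implicitly ANDed set, the tightest range is [max of the >/>= bounds, min of the </<= bounds], the pages it spans are charged at seq_page_cost, remaining equality quals cost one random_page_cost each, and OR branches are simply summed.

```python
# Sketch only: tids are (block, offset) tuples; assumed default
# costs seq_page_cost=1.0, random_page_cost=4.0.

seq_page_cost = 1.0
random_page_cost = 4.0

def branch_cost(lower_bounds, upper_bounds, equality_tids):
    """Cost one implicitly ANDed set of ctid quals (very roughly)."""
    cost = 0.0
    if lower_bounds or upper_bounds:
        lo_page = max(lower_bounds)[0] if lower_bounds else 0
        # A real implementation would use the relation's block count
        # when there is no upper bound; 0 is a placeholder here.
        hi_page = min(upper_bounds)[0] if upper_bounds else 0
        cost += max(hi_page - lo_page, 1) * seq_page_cost
    cost += random_page_cost * len(equality_tids)
    return cost

# ctid > '(0,0)' AND ctid <= '(44247,178)' AND ctid <= '(0,1)':
# the min of the upper bounds is (0,1), so only ~1 page is charged.
print(branch_cost([(0, 0)], [(44247, 178), (0, 1)], []))  # 1.0
```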

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From:
Edmund Horner
Date:
On Wed, 19 Sep 2018 at 18:56, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On 19 September 2018 at 18:04, Edmund Horner <ejrh00@gmail.com> wrote:
> > I have been generally following this approach (handling more kinds of
> > TID comparisons), and have found myself doing things like pairing up >
> > with <, estimating how much of a table is covered by some set of >, <,
> > or "> AND <" quals, etc.  Things that I'm sure are handled in an
> > advanced way by index paths; unfortunately I didn't see any easily
> > reusable code in the index path code.  So I've ended up writing
> > special-case code for TID scans.  Hopefully it will be worth it.
>
> I don't think it would need to be as complex as the index matching
> code. Just looping over the quals and gathering up all compatible ctid
> quals should be fine.  I imagine the complex handling of sorting the
> quals by ctid and removal of redundant quals that are covered by some
> range would be done in the executor.

I've got the path creation and execution pretty much working, though
with some inefficiencies:
  - Each individual TID is treated as a range of size 1 (but CURRENT
OF is handled as a single fetch)
  - Range scans have to scan whole blocks, and skip over the tuples
that are out of range.
But it's enough to get the tests passing.

Right now I'm looking at costing:

> Probably the costing will get more complex. At the moment it seems we
> add a random_page_cost per ctid, but you'd probably need to make that
> better and loop over the quals in each implicitly ANDed set and find
> the max ctid for the > / >= quals and the min < / <= ctid, then
> get the page number from each and assume max - min seq_page_cost, then
> add random_page_cost for any remaining equality quals.  The costs from
> other OR branches can likely just be added on.  This would double
> count if someone did WHERE ctid BETWEEN '(0,0)' AND '(100,300)' OR
> ctid BETWEEN '(0,0)' AND '(100,300)';  The current code seems to
> double count now for duplicate ctids anyway. It even double counts if
> the ctid being compared to is on the same page as another ctid, so I
> don't think that would be unacceptable.

There are two stages of costing:
1. Estimating the number of rows that the relation will return.  This
happens before path generation.
2. Estimating the cost of the path.

In the existing code, (1) goes through the normal clausesel.c
machinery, eventually getting to the restriction function defined in
pg_operator.  For range quals, e.g. >, it looks for a stats entry for
the variable, but since it's a system variable with no stats, it
returns DEFAULT_INEQ_SEL (in function scalarineqsel).  For equality
quals, it does have some special-case code (in function
get_variable_numdistinct) to use stadistinct=-1 for the CTID variable,
resulting in a selectivity estimate of 1/ntuples.

(2), on the other hand, has special-case code in costsize.c (function
cost_tidscan), which estimates each TID as being a separate tuple
fetch from a different page.  (The existing code only has to support
=, IN, and CURRENT OF as quals for a TID path.)

In my work, I have been adding support for range quals to (2), which
includes estimating the selectivity of expressions like (CTID > a AND
CTID < b).  I got tired of handling all the various ways of ordering
the quals, so I thought I would try re-using the clausesel.c
machinery.  In selfuncs.c, I've added special case code for
scalarineqsel and nulltestsel to handle CTID variables.  (This also
improves the row count estimates.)

I'm not 100% sure what the costs of each range should be.  I think the
first block should incur random_page_cost, with subsequent blocks
being seq_page_cost.  Simple "CTID = ?" quals are still estimated as 1
tuple + 1 random block.
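A minimal sketch of that per-range cost model (assumed cost constants, illustrative names): the first block of a range is a random fetch, subsequent blocks are sequential, and a simple "CTID = ?" qual is the one-block case.

```python
# Sketch of the per-range charge described above, with the assumed
# defaults seq_page_cost=1.0 and random_page_cost=4.0.

seq_page_cost = 1.0
random_page_cost = 4.0

def tid_range_cost(nblocks):
    """Cost of scanning a contiguous range spanning nblocks blocks."""
    if nblocks <= 0:
        return 0.0
    return random_page_cost + (nblocks - 1) * seq_page_cost

print(tid_range_cost(1))    # 4.0  (same as a simple "CTID = ?" fetch)
print(tid_range_cost(100))  # 103.0
```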

Have a look at the attached WIP if you like and tell me if you think
it's going in the right direction.  I'm sorry for the size of the
patch; I couldn't find a nice way to cut it up.  I did run pgindent
over it though. :)

Cheers,
Edmund

Attachments

Re: Tid scan improvements

From:
Edmund Horner
Date:
On Fri, 28 Sep 2018 at 17:02, Edmund Horner <ejrh00@gmail.com> wrote:
> I did run pgindent over it though. :)

But I didn't check if it still applied to master.  Sigh.  Here's one that does.

Attachments

Re: Tid scan improvements

From:
David Rowley
Date:
On 28 September 2018 at 18:13, Edmund Horner <ejrh00@gmail.com> wrote:
> On Fri, 28 Sep 2018 at 17:02, Edmund Horner <ejrh00@gmail.com> wrote:
>> I did run pgindent over it though. :)
>
> But I didn't check if it still applied to master.  Sigh.  Here's one that does.

I know commit fest is over, but I made a pass of this to hopefully
provide a bit of guidance so that it's closer for the November 'fest.

I've only done light testing on the patch and it does seem to work,
but there are a few things that I think should be changed. Most
importantly #11 below I think needs to be done. That might overwrite
some of the items that come before it in the list as you likely will
have to pull some of the code which I mention out due to changing #11.
I've kept them around anyway just in case some of it remains.

1. Could wrap for tables > 16TB. Please use double. See index_pages_fetched()

int nrandompages;
int nseqpages;

2. Should multiply by nseqpages, not add.

run_cost += spc_random_page_cost * nrandompages + spc_seq_page_cost + nseqpages;

3. Should be double:

BlockNumber pages = selectivity * baserel->pages;

4. Comment needs updated to mention what the new code does in
heapgettup() and heapgettup_pagemode()

+
  /* start from last page of the scan */
- if (scan->rs_startblock > 0)
- page = scan->rs_startblock - 1;
+ if (scan->rs_numblocks == InvalidBlockNumber)
+ {
+ if (scan->rs_startblock > 0)
+ page = scan->rs_startblock - 1;
+ else
+ page = scan->rs_nblocks - 1;
+ }
  else
- page = scan->rs_nblocks - 1;
+ page = scan->rs_startblock + scan->rs_numblocks - 1;
+

5. Variables should be called "inclusive". We use "strict" to indicate
an operator comparison cannot match NULL values.

+ bool strict; /* Indicates < rather than <=, or > rather */
+ bool strict2; /* than >= */

Don't break the comment like that. If you need more space don't end
the comment and use a new line and tab the next line out to match the
* of the first line.

6. Why not pass the TidExpr into MakeTidOpExprState() and have it set
the type instead of repeating code

7. It's not very obvious why the following Assert() can't fail.

+ bool invert;
+ bool invert2;
+
+ Assert(list_length(((BoolExpr *) expr)->args) == 2);

I had to hunt around quite a bit to see that
TidQualFromBaseRestrictinfo could only ever make the list have 2
elements, and we'd not form a BoolExpr with just 1. (but see #11)

8. Many instances of the word "strict" are used to mean "inclusive".
Can you please change all of them.

9. Confusing comment:

+ * If the expression was non-strict (<=) and the offset is 0, then just
+ * pretend it was strict, because offset 0 doesn't exist and we may as
+ * well exclude that block.

Shouldn't this be, "If the operator is non-inclusive, then since TID
offsets are 1-based, for simplicity, we can just class the expression
as inclusive.", or something along those lines.

10. Comment talks about LHS, but the first OpExpr in a list of two
OpExprs has nothing to do with left hand side.  You could use LHS if
you were talking about the first arg in an OpExpr, but this is not the
case here.

/* If the LHS is not the lower bound, swap them. */

You could probably just ensure that the >=, > ops is the first in the
list inside TidQualFromBaseRestrictinfo(), but you'd need to clearly
comment that this is the case in both locations. Perhaps use lcons()
for the lower end and lappend() for the upper end, but see #11.

11. I think the qual matching code needs an overhaul.  Really you
should attempt to find the smallest and largest ctid for your
implicitly ANDed ranges.  This would require you getting rid of the
BETWEEN-type clause you're trying to build in
TidQualFromBaseRestrictinfo
and instead just include all quals, don't ignore other quals when
you've already found your complete range bounds.

The problem with doing it the way that you're doing it now is in cases like:

create table t1(a int);
insert into t1 select generate_Series(1,10000000);
create index on t1 (a);
select ctid,a from t1 order by a desc limit 1; -- find the max ctid.
    ctid     |    a
-------------+----------
 (44247,178) | 10000000
(1 row)

set max_parallel_workers_per_gather=0;
explain analyze select ctid,* from t1 where ctid > '(0,0)' and ctid <=
'(44247,178)' and ctid <= '(0,1)';
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Tid Scan on t1  (cost=0.01..169248.78 rows=1 width=10) (actual
time=0.042..2123.432 rows=1 loops=1)
   TID Cond: ((ctid > '(0,0)'::tid) AND (ctid <= '(44247,178)'::tid))
   Filter: (ctid <= '(0,1)'::tid)
   Rows Removed by Filter: 9999999
 Planning Time: 4.049 ms
 Execution Time: 2123.464 ms
(6 rows)

Due to how you've coded TidQualFromBaseRestrictinfo(), the ctid <=
'(0,1)' qual does not make it into the range. It's left as a filter in
the Tid Scan.

I think I'm going to stop here as changing this is going to cause quite a
bit of churn.

but one more...

12. I think the changes to selfuncs.c to get the selectivity estimate
is a fairly credible idea, but I think it also needs to account for
offsets. You should be able to work out the average number of items
per page with rel->tuples / rel->pages and factor that in to get a
better estimate for cases like:

postgres=# explain analyze select ctid,* from t1 where ctid <= '(0,200)';
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Tid Scan on t1  (cost=0.00..5.00 rows=1 width=10) (actual
time=0.025..0.065 rows=200 loops=1)
   TID Cond: (ctid <= '(0,200)'::tid)
 Planning Time: 0.081 ms
 Execution Time: 0.088 ms
(4 rows)

You can likely add on "(offset / avg_tuples_per_page) / rel->pages" to
the selectivity and get a fairly accurate estimate... at least when
there are no dead tuples in the heap
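The suggested offset adjustment works out as follows with the approximate numbers from the example above (~10,000,000 rows across 44,248 pages; this arithmetic is illustrative, not planner code):

```python
# Worked example of the offset-aware estimate for "ctid <= (block, off)":
# whole blocks below `block`, plus the fraction of `block` covered up to
# `off`, all as a fraction of the table's pages.

tuples, pages = 10_000_000, 44_248
avg_tuples_per_page = tuples / pages          # roughly 226

def leq_selectivity(block, offset):
    return (block + offset / avg_tuples_per_page) / pages

# ctid <= '(0,200)': about 200 rows, versus the flat rows=1 estimate
# shown in the EXPLAIN output above.
print(round(leq_selectivity(0, 200) * tuples))  # 200
```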

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From:
Edmund Horner
Date:
On Wed, 3 Oct 2018 at 17:36, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I know commit fest is over, but I made a pass of this to hopefully
> provide a bit of guidance so that it's closer for the November 'fest.

Hi David.  Thanks for the review.  It's fairly thorough and you must
have put some time into it -- I really appreciate it.


> I've only done light testing on the patch and it does seem to work,
> but there are a few things that I think should be changed. Most
> importantly #11 below I think needs to be done. That might overwrite
> some of the items that come before it in the list as you likely will
> have to pull some of the code which I mention out due to changing #11.
> I've kept them around anyway just in case some of it remains.

> 1. Could wrap for tables > 16TB. Please use double. See index_pages_fetched()
> 2. Should multiply by nseqpages, not add.
> 3. Should be double.

I agree with these three.


> 4. Comment needs updated to mention what the new code does in
> heapgettup() and heapgettup_pagemode()
>
> +
>   /* start from last page of the scan */
> - if (scan->rs_startblock > 0)
> - page = scan->rs_startblock - 1;
> + if (scan->rs_numblocks == InvalidBlockNumber)
> + {
> + if (scan->rs_startblock > 0)
> + page = scan->rs_startblock - 1;
> + else
> + page = scan->rs_nblocks - 1;
> + }
>   else
> - page = scan->rs_nblocks - 1;
> + page = scan->rs_startblock + scan->rs_numblocks - 1;
> +

I'm thinking that, as they don't depend on the others, the heapam.c
changes should be a separate preparatory patch?

The heap scan code has to support things like synchronised scans and
parallel scans, but as far as I know, its support for scanning
subranges is currently used only for building BRIN indexes.  I found
that although I could specify a subrange with heap_setscanlimits, I
could not scan backward over it, because the original version of the
above code would start at the end of the whole table.

I'm not especially comfortable with this understanding of heapam, so
close review would be appreciated.

I note that there's a lot of common code in heapgettup and
heapgettup_pagemode, which my changes add to.  It might be worth
trying to factor out somehow.


> 5. Variables should be called "inclusive". We use "strict" to indicate
> an operator comparison cannot match NULL values.
>
> + bool strict; /* Indicates < rather than <=, or > rather */
> + bool strict2; /* than >= */
>
> Don't break the comment like that. If you need more space don't end
> the comment and use a new line and tab the next line out to match the
> * of the first line.

> 8. Many instances of the word "strict" are used to mean "inclusive".
> Can you please change all of them.

I don't mind renaming it.  I took "strict" from "strictly greater/less
than" but I knew it was confusable with the other usages of "strict".


> 9. Confusing comment:
>
> + * If the expression was non-strict (<=) and the offset is 0, then just
> + * pretend it was strict, because offset 0 doesn't exist and we may as
> + * well exclude that block.
>
> Shouldn't this be, "If the operator is non-inclusive, then since TID
> offsets are 1-based, for simplicity, we can just class the expression
> as inclusive.", or something along those lines.

Ok, I'll try to reword it along those lines.


> I think I'm going to stop here as changing this is going to cause quite a
> bit of churn.
>
> but one more...

> 12. I think the changes to selfuncs.c to get the selectivity estimate
> is a fairly credible idea, but I think it also needs to account for
> offsets. You should be able to work out the average number of items
> per page with rel->tuples / rel->pages and factor that in to get a
> better estimate for cases like:
>
> postgres=# explain analyze select ctid,* from t1 where ctid <= '(0,200)';
>                                           QUERY PLAN
> -----------------------------------------------------------------------------------------------
>  Tid Scan on t1  (cost=0.00..5.00 rows=1 width=10) (actual
> time=0.025..0.065 rows=200 loops=1)
>    TID Cond: (ctid <= '(0,200)'::tid)
>  Planning Time: 0.081 ms
>  Execution Time: 0.088 ms
> (4 rows)
>
> You can likely add on "(offset / avg_tuples_per_page) / rel->pages" to
> the selectivity and get a fairly accurate estimate... at least when
> there are no dead tuples in the heap

I think the changes to selfuncs.c could also be a separate patch?

I'll try to include the offset in the selectivity too.

Related -- what should the selectivity be on an empty table?  My code has:

    /* If the relation's empty, we're going to read all of it. */
    if (vardata->rel->pages == 0)
        return 1.0;

(which needs rewording, since selectivity isn't always about reading).
Is 1.0 the right thing to return?


> 6. Why not pass the TidExpr into MakeTidOpExprState() and have it set
> the type instead of repeating code
> 7. It's not very obvious why the following Assert() can't fail. [...]
> I had to hunt around quite a bit to see that
> TidQualFromBaseRestrictinfo could only ever make the list have 2
> elements, and we'd not form a BoolExpr with just 1. (but see #11)
> 10. Comment talks about LHS, but the first OpExpr in a list of two
> OpExprs has nothing to do with left hand side.  You could use LHS if
> you were talking about the first arg in an OpExpr, but this is not the
> case here.

These three might become non-issues if we change it along the lines of #11:

> 11. I think the qual matching code needs an overhaul.  Really you
> should attempt to find the smallest and largest ctid for your
> implicitly ANDed ranges.  This would require you getting rid of the
> BETWEEN type claused you're trying to build in
> TidQualFromBaseRestrictinfo
> and instead just include all quals, don't ignore other quals when
> you've already found your complete range bounds.
>
> The problem with doing it the way that you're doing it now is in cases like:
>
> create table t1(a int);
> insert into t1 select generate_Series(1,10000000);
> create index on t1 (a);
> select ctid,a from t1 order by a desc limit 1; -- find the max ctid.
>     ctid     |    a
> -------------+----------
>  (44247,178) | 10000000
> (1 row)
>
> set max_parallel_workers_per_gather=0;
> explain analyze select ctid,* from t1 where ctid > '(0,0)' and ctid <=
> '(44247,178)' and ctid <= '(0,1)';
>                                              QUERY PLAN
> -----------------------------------------------------------------------------------------------------
>  Tid Scan on t1  (cost=0.01..169248.78 rows=1 width=10) (actual
> time=0.042..2123.432 rows=1 loops=1)
>    TID Cond: ((ctid > '(0,0)'::tid) AND (ctid <= '(44247,178)'::tid))
>    Filter: (ctid <= '(0,1)'::tid)
>    Rows Removed by Filter: 9999999
>  Planning Time: 4.049 ms
>  Execution Time: 2123.464 ms
> (6 rows)
>
> Due to how you've coded TidQualFromBaseRestrictinfo(), the ctid <=
> '(0,1)' qual does not make it into the range. It's left as a filter in
> the Tid Scan.

My first thought was to support the fairly common use case of the
two-bound range "ctid >= ? AND ctid < ?" (or single-bound variations);
hence my original patch for a "TID Range Scan".

Following the comments made earlier, I tried incorporating this into
the existing TID Scan; but I still had the same use case in mind, so
only the first lower and upper bounds were used.  My thoughts were
that, while we need to *correctly* handle more complicated cases like
"ctid > '(0,0)' AND ctid <= '(44247,178)' AND ctid <= '(0,1)'", such
queries will not come up in practice and hence it's OK if those extra
bounds are applied in the filter.  For the same reason, I did not
consider it worthwhile trying to pick which bound to use in the scan.

I've since realised that such extra bounds aren't always redundant.  At
plan time we might not know which of the bounds is the "best", but we
will after evaluating them in the executor.  So I quite like the idea
of keeping all of them.

This means a TID path's quals become an OR-list of:
  - "ctid = ?"
  - "ctid = ANY (?)" / "ctid IN (?)"
  - "(ctid op ?) AND ..."   (where op is one of >,>=,<,<=)
  - "CURRENT OF"

I still don't think the scan needs to support quals like "ctid = ? AND
ctid > ?", or "ctid IN (?) AND ctid IN (?)" -- the executor *could*
try to form the intersection but I don't think it's worth the code.
In these cases, picking a simple qual is usually enough for an
efficient scan; the full qual can go into the filter.

I'm part way through implementing this.  It looks like it might
actually be less code than what I had before.


Re: Tid scan improvements

From
Edmund Horner
Date:
Hi all,

I have managed to split my changes into 4 patches:

v3-0001-Add-selectivity-and-nullness-estimates-for-the-ItemP.patch
v3-0002-Support-range-quals-in-Tid-Scan.patch
v3-0003-Support-backward-scans-over-restricted-ranges-in-hea.patch
v3-0004-Tid-Scan-results-are-ordered.patch

(1) is basically independent, and usefully improves estimates for ctid quals.
(2) is the main patch, adding basic range scan support to TidPath and TidScan.
(3) is a small change to properly support backward scans over a
restricted range in heapam.c, and is needed for (4).
(4) adds Backward Tid Scans, and adds path keys to Tid Paths so that
the planner doesn't have to add a sort for certain queries.

I have tried to apply David's suggestions.

In (1), I've included the offset part of a CTID constant in the
selectivity calculation.  I've not included "allvisfrac" in the
calculation; I'm not sure it's worth it as it would only affect the
offset part.
I have tried to use iseq to differentiate <= and >= from < and >,
but I'm not sure I've got this right.  I am also not entirely sure
it's worth it; the changes are already an improvement over the current
behaviour of using hardcoded selectivity constants.

In (2), the planner now picks up a greater variety of TID quals,
including AND-clauses with arbitrary children instead of the original
lower bound/upper bound pair.  These are resolved in the executor into
a list of ranges to scan.

(3) is the same code, but I've added a couple of comments to explain the change.

(4) is basically the same pathkey/direction code as before (but as a
separate patch).

I hope the separation will make it easier to review.  Item (2) is
still quite big, but a third of it is tests.

Cheers.
Edmund

Attachments

Re: Tid scan improvements

From
David Rowley
Date:
On 4 November 2018 at 17:20, Edmund Horner <ejrh00@gmail.com> wrote:
> I have managed to split my changes into 4 patches:
>
> v3-0001-Add-selectivity-and-nullness-estimates-for-the-ItemP.patch
> v3-0002-Support-range-quals-in-Tid-Scan.patch
> v3-0003-Support-backward-scans-over-restricted-ranges-in-hea.patch
> v3-0004-Tid-Scan-results-are-ordered.patch

Hi,

I've been looking over 0001 to 0003. I ran out of steam before 0004.

I like the design of the new patch. From what I threw so far at the
selectivity estimation code, it seems pretty good.  I also quite like
the design in nodeTidscan.c for range scans.

I didn't quite manage to wrap my head around the code that removes
redundant quals from the tidquals. For example, with:

postgres=# explain select * from t1 where ctid <= '(0,10)' and a = 0;
                    QUERY PLAN
--------------------------------------------------
 Tid Scan on t1  (cost=0.00..3.19 rows=1 width=4)
   TID Cond: (ctid <= '(0,10)'::tid)
   Filter: (a = 0)
(3 rows)

and:

postgres=# explain select * from t1 where ctid <= '(0,10)' or a = 20
and ctid >= '(0,0)';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Tid Scan on t1  (cost=0.01..176.18 rows=12 width=4)
   TID Cond: ((ctid <= '(0,10)'::tid) OR (ctid >= '(0,0)'::tid))
   Filter: ((ctid <= '(0,10)'::tid) OR ((a = 20) AND (ctid >= '(0,0)'::tid)))
(3 rows)

I understand why the 2nd query didn't remove the ctid quals from the
filter, and I understand why the first query could. I just didn't
manage to convince myself that the code behaves correctly for all
cases.

During my pass through 0001, 0002 and 0003 I noted the following:

0001:

1. I see a few instances of:

#define DatumGetItemPointer(X) ((ItemPointer) DatumGetPointer(X))
#define ItemPointerGetDatum(X) PointerGetDatum(X)

in both tid.c and ginfuncs.c, and I see you have:

+ itemptr = (ItemPointer) DatumGetPointer(constval);

Do you think it would be worth moving the macros out of tid.c and
ginfuncs.c into postgres.h and use that macro instead?

(I see the code in this file already did this, so it might not matter
about this)

0002:

2. In TidCompoundRangeQualFromExpr() rlst is not really needed. You
can just return MakeTidRangeQuals(found_quals); or return NIL.

3. Can you explain why this only needs to take place when list_length() == 1?

/*
 * In the case of a compound qual such as "ctid > ? AND ctid < ? AND ...",
 * the various parts will have come from different RestrictInfos.  So
 * remove each part separately.
 */
if (list_length(tidquals) == 1)
{
    Node       *qual = linitial(tidquals);

    if (and_clause(qual))
    {
        BoolExpr   *and_qual = ((BoolExpr *) qual);

        scan_clauses = list_difference(scan_clauses, and_qual->args);
    }
}

4. Accidental change?

- tidquals);
+ tidquals
+ );

5. Shouldn't this comment get changed?

- * NumTids    number of tids in this scan
+ * NumRanges    number of tids in this scan

6. There's no longer a field named NumTids

- * TidList    evaluated item pointers (array of size NumTids)
+ * TidRanges    evaluated item pointers (array of size NumTids)

7. The following field is not documented in TidScanState:

+ bool tss_inScan;

8. Can you name this exprtype instead?

+ TidExprType type; /* type of op */

"type" is used by Node types to indicate their type.

9. It would be neater this:

if (expr->opno == TIDLessOperator || expr->opno == TIDLessEqOperator)
    tidopexpr->type = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
else if (expr->opno == TIDGreaterOperator || expr->opno == TIDGreaterEqOperator)
    tidopexpr->type = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
else
    tidopexpr->type = TIDEXPR_EQ;

tidopexpr->exprstate = exprstate;

tidopexpr->inclusive = expr->opno == TIDLessEqOperator ||
    expr->opno == TIDGreaterEqOperator;

as a switch:

switch (expr->opno)
{
    case TIDLessEqOperator:
        tidopexpr->inclusive = true;
        /* fall through */
    case TIDLessOperator:
        tidopexpr->type = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
        break;
    case TIDGreaterEqOperator:
        tidopexpr->inclusive = true;
        /* fall through */
    case TIDGreaterOperator:
        tidopexpr->type = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
        break;
    default:
        tidopexpr->type = TIDEXPR_EQ;
}
tidopexpr->exprstate = exprstate;

10. I don't quite understand this comment:

+ * Create an ExprState corresponding to the value part of a TID comparison,
+ * and wrap it in a TidOpExpr.  Set the type and inclusivity of the TidOpExpr
+ * appropriately, depending on the operator and position of the its arguments.

I don't quite see how the code sets the inclusivity depending on the
position of the arguments.

Maybe the comment should be:

+ * For the given 'expr' build and return an appropriate TidOpExpr taking into
+ * account the expr's operator and operand order.

11. ScalarArrayOpExpr are commonly named "saop":

+static TidOpExpr *
+MakeTidScalarArrayOpExpr(ScalarArrayOpExpr *saex, TidScanState *tidstate)

(Though I see it's saex in other places in that file, so might not matter...)

12. You need to code SetTidLowerBound() with similar wraparound logic
you have in SetTidUpperBound().

It's perhaps unlikely, but the following shows incorrect results.

postgres=# select ctid from t1 where ctid > '(0,65535)' limit 1;
 ctid
-------
 (0,1)
(1 row)

-- the following is fine.

Time: 1.652 ms
postgres=# select ctid from t1 where ctid >= '(0,65535)' limit 1;
 ctid
-------
 (1,1)
(1 row)

Likely you can just upgrade to the next block when the offset is >
MaxOffsetNumber.

13. It looks like the previous code didn't make the assumption you're making in:

+ * A current-of TidExpr only exists by itself, and we should
+ * already have allocated a tidList entry for it.  We don't
+ * need to check whether the tidList array needs to be
+ * resized.

I'm not sure if it's a good idea to lock the executor code into what
the grammar currently says is possible. The previous code didn't
assume that.

14. we pass 'false' to what?

+ * save the tuple and the buffer returned to us by the access methods in
+ * our scan tuple slot and return the slot.  Note: we pass 'false' because
+ * tuples returned by heap_getnext() are pointers onto disk pages and were
+ * not created with palloc() and so should not be pfree()'d.  Note also
+ * that ExecStoreHeapTuple will increment the refcount of the buffer; the
+ * refcount will not be dropped until the tuple table slot is cleared.
  */
- return ExecClearTuple(slot);
+ if (tuple)
+ ExecStoreBufferHeapTuple(tuple, /* tuple to store */
+ slot, /* slot to store in */
+ scandesc->rs_cbuf); /* buffer associated
+ * with this tuple */
+ else
+ ExecClearTuple(slot);
+
+ return slot;

0003:

Saw nothing wrong:

0004:

Not yet reviewed.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From
Alvaro Herrera
Date:
On 2018-Nov-06, David Rowley wrote:

> 14. we pass 'false' to what?
> 
> + * save the tuple and the buffer returned to us by the access methods in
> + * our scan tuple slot and return the slot.  Note: we pass 'false' because
> + * tuples returned by heap_getnext() are pointers onto disk pages and were
> + * not created with palloc() and so should not be pfree()'d.  Note also
> + * that ExecStoreHeapTuple will increment the refcount of the buffer; the
> + * refcount will not be dropped until the tuple table slot is cleared.
>   */

Seems a mistake stemming from 29c94e03c7d0 ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Tid scan improvements

From
Edmund Horner
Date:
On Tue, 6 Nov 2018 at 16:52, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2018-Nov-06, David Rowley wrote:
> > 14. we pass 'false' to what?
> >
> > + * save the tuple and the buffer returned to us by the access methods in
> > + * our scan tuple slot and return the slot.  Note: we pass 'false' because
> > + * tuples returned by heap_getnext() are pointers onto disk pages and were
> > + * not created with palloc() and so should not be pfree()'d.  Note also
> > + * that ExecStoreHeapTuple will increment the refcount of the buffer; the
> > + * refcount will not be dropped until the tuple table slot is cleared.
> >   */
>
> Seems a mistake stemming from 29c94e03c7d0 ...

Yep -- I copied that bit from nodeSeqscan.c.  Some of the notes were
removed in that change, but nodeSeqscan.c and nodeIndexscan.c still
have them.

I made a little patch to remove them.

Attachments

Re: Tid scan improvements

From
Edmund Horner
Date:
On Tue, 6 Nov 2018 at 16:40, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've been looking over 0001 to 0003. I ran out of steam before 0004.

Hi David, thanks for another big review with lots of improvements.

> I like the design of the new patch. From what I threw so far at the
> selectivity estimation code, it seems pretty good.  I also quite like
> the design in nodeTidscan.c for range scans.


> I didn't quite manage to wrap my head around the code that removes
> redundant quals from the tidquals. For example, with:
>
> postgres=# explain select * from t1 where ctid <= '(0,10)' and a = 0;
>                     QUERY PLAN
> --------------------------------------------------
>  Tid Scan on t1  (cost=0.00..3.19 rows=1 width=4)
>    TID Cond: (ctid <= '(0,10)'::tid)
>    Filter: (a = 0)
> (3 rows)
>
> and:
>
> postgres=# explain select * from t1 where ctid <= '(0,10)' or a = 20
> and ctid >= '(0,0)';
>                                   QUERY PLAN
> ------------------------------------------------------------------------------
>  Tid Scan on t1  (cost=0.01..176.18 rows=12 width=4)
>    TID Cond: ((ctid <= '(0,10)'::tid) OR (ctid >= '(0,0)'::tid))
>    Filter: ((ctid <= '(0,10)'::tid) OR ((a = 20) AND (ctid >= '(0,0)'::tid)))
> (3 rows)
>
> I understand why the 2nd query didn't remove the ctid quals from the
> filter, and I understand why the first query could. I just didn't
> manage to convince myself that the code behaves correctly for all
> cases.

I agree it's not obvious.

1. We extract a set of tidquals that can be directly implemented by
the Tid scan.  This set is of the form:  "(CTID op ? AND ...) OR
(...)" (with some limitations).
2. If they happened to come verbatim from the original RestrictInfos,
then they will be found in scan_clauses, and we can remove them.
3. If they're not verbatim, i.e. the original RestrictInfos have
additional criteria that the Tid scan can't use, then tidquals won't
match anything in scan_clauses, and hence scan_clauses will be
unchanged.
4. We do a bit of extra work for the common and useful case of "(CTID
op ? AND ...)".  Since the top-level operation of the input quals is
an AND, it will typically be split into multiple RestrictInfo items.
We remove each part from scan_clauses.

> 1. I see a few instances of:
>
> #define DatumGetItemPointer(X) ((ItemPointer) DatumGetPointer(X))
> #define ItemPointerGetDatum(X) PointerGetDatum(X)
>
> in both tid.c and ginfuncs.c, and I see you have:
>
> + itemptr = (ItemPointer) DatumGetPointer(constval);
>
> Do you think it would be worth moving the macros out of tid.c and
> ginfuncs.c into postgres.h and use that macro instead?
>
> (I see the code in this file already did this, so it might not matter
> about this)

I'm not sure about this one -- I think it's better as a separate
patch, since we'd also change ginfuncs.c.  I have left it alone for
now.

> 2. In TidCompoundRangeQualFromExpr() rlst is not really needed. You
> can just return MakeTidRangeQuals(found_quals); or return NIL.

Yup, gone.

> 3. Can you explain why this only needs to take place when list_length() == 1?
>
> /*
> * In the case of a compound qual such as "ctid > ? AND ctid < ? AND ...",
> * the various parts will have come from different RestrictInfos.  So
> * remove each part separately.
> */
> ...

I've tried to improve the comment.

> 4. Accidental change?
>
> - tidquals);
> + tidquals
> + );
>
> 5. Shouldn't this comment get changed?
>
> - * NumTids    number of tids in this scan
> + * NumRanges    number of tids in this scan
>
> 6. There's no longer a field named NumTids
>
> - * TidList    evaluated item pointers (array of size NumTids)
> + * TidRanges    evaluated item pointers (array of size NumTids)
>
> 7. The following field is not documented in TidScanState:
>
> + bool tss_inScan;
>
> 8. Can you name this exprtype instead?
>
> + TidExprType type; /* type of op */
>
> "type" is used by Node types to indicate their type.

Yup, yup, yup, yup, yup.

> 9. It would be neater this:
>
> if (expr->opno == TIDLessOperator || expr->opno == TIDLessEqOperator)
> tidopexpr->type = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
> else if (expr->opno == TIDGreaterOperator || expr->opno == TIDGreaterEqOperator)
> tidopexpr->type = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
> else
> tidopexpr->type = TIDEXPR_EQ;
>
> tidopexpr->exprstate = exprstate;
>
> tidopexpr->inclusive = expr->opno == TIDLessEqOperator || expr->opno
> == TIDGreaterEqOperator;
>
> as a switch: ...

Yup, I think the switch is a bit nicer.

> 10. I don't quite understand this comment:
>
> + * Create an ExprState corresponding to the value part of a TID comparison,
> + * and wrap it in a TidOpExpr.  Set the type and inclusivity of the TidOpExpr
> + * appropriately, depending on the operator and position of the its arguments.
>
> I don't quite see how the code sets the inclusivity depending on the
> position of the arguments.
>
> Maybe the comment should be:
>
> + * For the given 'expr' build and return an appropriate TidOpExpr taking into
> + * account the expr's operator and operand order.

I'll go with your wording.

> 11. ScalarArrayOpExpr are commonly named "saop": ...

Yup.

> 12. You need to code SetTidLowerBound() with similar wraparound logic
> you have in SetTidUpperBound().
>
> It's perhaps unlikely, but the following shows incorrect results.
>
> postgres=# select ctid from t1 where ctid > '(0,65535)' limit 1;
>  ctid
> -------
>  (0,1)
> (1 row)
>
> -- the following is fine.
>
> Time: 1.652 ms
> postgres=# select ctid from t1 where ctid >= '(0,65535)' limit 1;
>  ctid
> -------
>  (1,1)
> (1 row)
>
> Likely you can just upgrade to the next block when the offset is >
> MaxOffsetNumber.

This is important, thanks for spotting it.

I've tried to add some code to handle this case (and also that of
"ctid < '(0,0)'") with a couple of tests too.

> 13. It looks like the previous code didn't make the assumption you're making in:
>
> + * A current-of TidExpr only exists by itself, and we should
> + * already have allocated a tidList entry for it.  We don't
> + * need to check whether the tidList array needs to be
> + * resized.
>
> I'm not sure if it's a good idea to lock the executor code into what
> the grammar currently says is possible. The previous code didn't
> assume that.

Fair enough, I've restored the previous code without the assumption.

> 14. we pass 'false' to what?

Obsolete comment (see reply to Alvaro).

I've applied most of these, and I'll post a new patch soon.


Re: Tid scan improvements

From
Edmund Horner
Date:
Hi, here's the new patch(s).

Mostly the same, but trying to address your comments from earlier as
well as clean up a few other things I noticed.

Cheers,
Edmund

Attachments

Re: Tid scan improvements

From
David Rowley
Date:
On Mon, 12 Nov 2018 at 17:35, Edmund Horner <ejrh00@gmail.com> wrote:
> Hi, here's the new patch(s).
>
> Mostly the same, but trying to address your comments from earlier as
> well as clean up a few other things I noticed.

Thanks for making those changes.

I've now had a look over the latest patches and I've found a few more
things.  Many of these are a bit nitpicky, but certainly not all. I
also reviewed 0004 this time.

0001:

1. The row estimates are not quite right.  This causes the row
estimation to go the wrong way for isgt.

For example, the following gets 24 rows instead of 26.

postgres=# create table t (a int);
CREATE TABLE
postgres=# insert into t select generate_Series(1,100);
INSERT 0 100
postgres=# analyze t;
postgres=# explain analyze select * from t where ctid >= '(0,75)';
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..2.25 rows=24 width=4) (actual
time=0.046..0.051 rows=26 loops=1)
   Filter: (ctid >= '(0,75)'::tid)
   Rows Removed by Filter: 74
 Planning Time: 0.065 ms
 Execution Time: 0.074 ms
(5 rows)

The < and <= case is not quite right either. < should have 1 fewer
tuple than the calculated average tuples per page, and <= should have
the same (assuming no gaps)

I've attached a small delta patch that I think improves things here.

0002:

2. You should test for a non-empty List with list != NIL

/*
 * If no quals were specified, then a complete scan is assumed.  Make a
 * TidExpr with an empty list of TidOpExprs.
 */
if (!node->tidquals)

Also, can you not just return after that if test? I think the code
would be easier to read with it like that.

3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
allocation until it reaches the required size. See how
MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
we always have a power of two sized array which is much nicer if we
ever reach the palloc() limit as if the array is sized at the palloc()
limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
unlikely to be a problem here, but... the question would be how to
decide on the initial size.

4. "at" needs shifted left a couple of words

/*
 * If the lower bound was already or above at the maximum block
 * number, then there is no valid range.
 */

but I don't see how it could be "or above".  The ctid type does not
have the room for that. Although, that's not to say you should test if
(block == MaxBlockNumber), the >= seems better for the code. I'm just
complaining about the comment.

5. TidInArrayExprEval() lacks a header comment, and any other comments
to mention what it does. The function args also push over the 80 char
line length.  There's also a few other functions in nodeTidscan.c that
are missing a header comment.

6. In MergeTidRanges(), you have:

ItemPointerData a_last = a->last;
ItemPointerData b_last;

if (!ItemPointerIsValid(&a_last))
    a_last = a->first;

but I don't see anywhere you're setting ->last to an invalid item
pointer. Is this left over from a previous design of the range scan?
It looks like in TidExprEval() you're setting the upperbound to the
last page on the relation.

7. "fist" -> "first"

* If the tuple is in the fist block of the range and before the first

8. tss_TidRangePtr is a pretty confusingly named field.

if (node->tss_TidRangePtr >= numRanges || node->tss_TidRangePtr < 0)
    break;

I'd expect anything with ptr in it to be a pointer, but this seems to
be an array index. Maybe "idx" is better than "ptr", or take note from
nodeAppend.c and have something like "tts_whichRange".

UPDATE: I see you've just altered what's there already. Perhaps it's
okay to leave it as you have it, but it's still not ideal.

9. This comment seems to indicate that a range can only have one
bound, but that does not seem to be the case.

* Ranges with only one item -- including one resulting from a
* CURRENT-OF qual -- are handled by looking up the item directly.

It seems open bounded ranges just have the lowest or highest possible
value for a ctid on the open side.

Perhaps the comment could be written as:

/*
 * For ranges containing a single tuple, we can simply make an
 * attempt to fetch the tuple directly.
 */

10. In cost_tidscan() I think you should ceil() the following:

double pages = selectivity * baserel->pages;

Otherwise, you'll end up partially charging a seq_page_cost, which
seems pretty invalid since you can't partially read a page.

11. In the comment:

/* TODO decide what the costs should be */

I think you can just explain why you're charging 1 random_page_cost
and the remainder in seq_page_cost. Or is there something left to do
here that I've forgotten about?

12. expected_comparison_operator is a bit long a name:

IsTidComparison(OpExpr *node, int varno, Oid expected_comparison_operator)

How about just expected_opno?

13. !rlst -> rlst == NIL

/* if no range qual was found, look for any other TID qual */
if (!rlst)

(Yeah I know there's various cases where it's done incorrectly there
already :-( )

14. This is not great:

#define IsTidEqualClause(node, varno) IsTidComparison(node, varno,
TIDEqualOperator)
#define IsTidLTClause(node, varno) IsTidComparison(node, varno, TIDLessOperator)
#define IsTidLEClause(node, varno) IsTidComparison(node, varno,
TIDLessEqOperator)
#define IsTidGTClause(node, varno) IsTidComparison(node, varno,
TIDGreaterOperator)
#define IsTidGEClause(node, varno) IsTidComparison(node, varno,
TIDGreaterEqOperator)

#define IsTidRangeClause(node, varno) (IsTidLTClause(node, varno) || \
IsTidLEClause(node, varno) || \
IsTidGTClause(node, varno) || \
IsTidGEClause(node, varno))

The 4 macros for >, >=, < and <= are only used by IsTidRangeClause()
which means IsTidComparison() could get called up to 4 times. Most of
the work it does would be redundant in that case.  Maybe it's better
to rethink that?

15. There's no field named NumTids:

 * TidRanges evaluated item pointers (array of size NumTids)

0003:

16. I think the following comment needs to be updated:

/* start from last page of the scan */

to:

/* When scanning the whole relation, start from the last page of the scan */

and drop:

/* Scanning the full relation: start just before start block. */

then maybe change:

/* Scanning a restricted range: start at end of range. */

to

/* Otherwise, if scanning just a subset of the relation, start at the
final block in the range */

0004:

17. Can you make a few changes to build_tidscan_pathkeys():

a. build_index_pathkeys() uses ScanDirectionIsBackward(scandir), can
you set the opno based on that rather than doing "direction ==
ForwardScanDirection"
b. varexpr can be an Expr and just be named expr. Please move the
declaration and assignment out onto separate lines and wrap the long
line.
c. wrap long line with the call to build_expression_pathkey(). Get rid
of the (Expr *) cast.

18. I'd expect the following not to produce a sort above the Tid Scan.

postgres=# set enable_seqscan=0;
SET
postgres=# explain select * from t inner join t t1 on t.ctid = t1.ctid
where t.ctid < '(0,10)' ;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Merge Join  (cost=10000000008.65..10000000009.28 rows=9 width=8)
   Merge Cond: (t.ctid = t1.ctid)
   ->  Sort  (cost=3.33..3.35 rows=9 width=10)
         Sort Key: t.ctid
         ->  Tid Scan on t  (cost=0.00..3.18 rows=9 width=10)
               TID Cond: (ctid < '(0,10)'::tid)
   ->  Sort  (cost=10000000005.32..10000000005.57 rows=100 width=10)
         Sort Key: t1.ctid
         ->  Seq Scan on t t1  (cost=10000000000.00..10000000002.00
rows=100 width=10)
(9 rows)

On looking at why the planner did this, I see it's down to how you've
coded create_tidscan_paths(). You're creating a tidpath if there's any
quals or any pathkeys useful to the query's ORDER BY, but only
including the pathkeys if they're useful for the query's ORDER BY.  I
think it'll be better to include the forward pathkeys in all cases,
and just make it a backward Tid Scan if backward keys are useful for
the ORDER BY.   There's still a problem with this as a Merge Join
would need a Sort if there was an ORDER BY ctid DESC for one relation
even if the other relation had some valid ctid quals since the 2nd
scan would create a forward Tid Scan.  Maybe that's not worth worrying
about.   The only fix I can imagine is to always create a forward and
backward Tid Scan path, which is pretty bad as it's two more paths
that likely won't get used 99.9% of the time.

This also caused me to notice the costs are pretty broken for this:

postgres=# explain select * from t order by ctid;
                     QUERY PLAN
---------------------------------------------------
 Tid Scan on t  (cost=0.00..0.00 rows=100 width=10)
(1 row)

19. Looks like the ScanDirection's normally get named "scandir":

static TidScan *make_tidscan(List *qptlist, List *qpqual, Index scanrelid,
List *tidquals, ScanDirection direction);

Likewise for the various .h files you've added that as a new field to
various structs.

Setting back to waiting on author.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments

Re: Tid scan improvements

From
Tomas Vondra
Date:
On 11/22/18 8:41 AM, David Rowley wrote:
 >
> ...
> 
> 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
> allocation until it reaches the required size. See how
> MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
> we always have a power of two sized array which is much nicer if we
> ever reach the palloc() limit as if the array is sized at the palloc()
> limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
> unlikely to be a problem here, but... the question would be how to
> decide on the initial size.
> 

I think it kinda tries to do that in some cases, by doing this:

     *numAllocRanges *= 2;

     ...

     tidRanges = (TidRange *)
         repalloc(tidRanges,
                  *numAllocRanges * sizeof(TidRange));

The problem here is that what matters is not numAllocRanges being 2^N, 
but the number of bytes allocated being 2^N. Because that's what ends up 
in AllocSet, which keeps lists of 2^N chunks.

And as TidRange is 12B, this is guaranteed to waste memory, because 
no matter what the first factor is, the result will never be 2^N.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Tid scan improvements

From
Edmund Horner
Date:
On Fri, 23 Nov 2018 at 07:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 11/22/18 8:41 AM, David Rowley wrote:
> > ...
> > 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
> > allocation until it reaches the required size. See how
> > MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
> > we always have a power of two sized array which is much nicer if we
> > ever reach the palloc() limit as if the array is sized at the palloc()
> > limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
> > unlikely to be a problem here, but... the question would be how to
> > decide on the initial size.
>
> I think it kinda tries to do that in some cases, by doing this:
>
>      *numAllocRanges *= 2;
>      ...
>      tidRanges = (TidRange *)
>          repalloc(tidRanges,
>                   *numAllocRanges * sizeof(TidRange));
>
> The problem here is that what matters is not numAllocRanges being 2^N,
> but the number of bytes allocated being 2^N. Because that's what ends up
> in AllocSet, which keeps lists of 2^N chunks.
>
> And as TidRange is 12B, so this is guaranteed to waste memory, because
> no matter what the first factor is, the result will never be 2^N.

For simplicity, I think making it a strict doubling of capacity each
time is fine.  That's what we see in numerous other places in the
backend code.

What we don't really see is intentionally setting the initial capacity
so that each subsequent capacity is close-to-but-not-exceeding a power
of 2 bytes.  You can't really do that optimally if working in terms of
whole numbers of items that aren't each a power of 2 size.  This step,
there may be 2/3 of an item spare; next step, we'll have a whole item
spare that we're not going to use.  So we could keep track in terms of
bytes allocated, and then figure out how many items we can fit at the
current time.

In my opinion, such complexity is overkill for Tid scans.

Currently, we try to pick an initial size based on the number of
expressions.  We assume each expression will yield one range, and
allow that a saop expression might require us to enlarge the array.

Again, for simplicity, we should scrap that and pick something like
floor(256/sizeof(TidRange)) = 21 items, with about 1.5% wastage.
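For concreteness, that sizing works out as in this compilable sketch (the struct here is only a stand-in that reproduces the 12-byte size of the patch's TidRange; the macro name is illustrative):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in for the patch's TidRange: two 6-byte ItemPointerData bounds,
 * 12 bytes in total.
 */
typedef struct TidRange
{
    unsigned short first[3];    /* packed block number + offset */
    unsigned short last[3];
} TidRange;

/*
 * Initial capacity: the largest item count whose allocation fits in a
 * 256-byte AllocSet chunk.  floor(256 / 12) = 21 items, leaving 4 of
 * 256 bytes unused (about 1.5% wastage).
 */
#define TIDRANGE_INITIAL_ALLOC (256 / sizeof(TidRange))
```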


Re: Tid scan improvements

From
Tomas Vondra
Date:

On 11/24/18 1:56 AM, Edmund Horner wrote:
> On Fri, 23 Nov 2018 at 07:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> On 11/22/18 8:41 AM, David Rowley wrote:
>>> ...
>>> 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
>>> allocation until it reaches the required size. See how
>>> MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
>>> we always have a power of two sized array which is much nicer if we
>>> ever reach the palloc() limit as if the array is sized at the palloc()
>>> limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
>>> unlikely to be a problem here, but... the question would be how to
>>> decide on the initial size.
>>
>> I think it kinda tries to do that in some cases, by doing this:
>>
>>      *numAllocRanges *= 2;
>>      ...
>>      tidRanges = (TidRange *)
>>          repalloc(tidRanges,
>>                   *numAllocRanges * sizeof(TidRange));
>>
>> The problem here is that what matters is not numAllocRanges being 2^N,
>> but the number of bytes allocated being 2^N. Because that's what ends up
>> in AllocSet, which keeps lists of 2^N chunks.
>>
>> And as TidRange is 12B, so this is guaranteed to waste memory, because
>> no matter what the first factor is, the result will never be 2^N.
> 
> For simplicity, I think making it a strict doubling of capacity each
> time is fine.  That's what we see in numerous other places in the
> backend code.
> 

Sure.

> What we don't really see is intentionally setting the initial capacity
> so that each subsequent capacity is close-to-but-not-exceeding a power
> of 2 bytes.  You can't really do that optimally if working in terms of
> whole numbers of items that aren't each a power of 2 size.  This step,
> there may be 2/3 of an item spare; next step, we'll have a whole item
> spare that we're not going to use.

Ah, I missed the detail with setting initial size.

> So we could keep track in terms of bytes allocated, and then figure
> out how many items we can fit at the current time.
> 
> In my opinion, such complexity is overkill for Tid scans.
> 
> Currently, we try to pick an initial size based on the number of
> expressions.  We assume each expression will yield one range, and
> allow that a saop expression might require us to enlarge the array.
> 
> Again, for simplicity, we should scrap that and pick something like
> floor(256/sizeof(TidRange)) = 21 items, with about 1.5% wastage.
> 

Probably. I don't think it'd be a lot of code to do the exact sizing,
but you're right 1.5% is close enough. As long as there is a comment
explaining the initial sizing, I'm fine with that.

If I could suggest one more thing, I'd define a struct combining the
array of ranges, numRanges and numAllocRanges like:

typedef struct TidRanges
{
    int         numRanges;
    int         numAllocRanges;
    TidRange    ranges[FLEXIBLE_ARRAY_MEMBER];
} TidRanges;

and use that instead of the plain array. I find it easier to follow
compared to passing the various fields directly (sometimes as a value,
sometimes pointer to the value, etc.).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Tid scan improvements

From
Edmund Horner
Date:
On Sat, 24 Nov 2018 at 15:46, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 11/24/18 1:56 AM, Edmund Horner wrote:
> > On Fri, 23 Nov 2018 at 07:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> >> On 11/22/18 8:41 AM, David Rowley wrote:
> >>> ...
> >>> 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
> >>> allocation until it reaches the required size. See how
> >>> MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
> >>> we always have a power of two sized array which is much nicer if we
> >>> ever reach the palloc() limit as if the array is sized at the palloc()
> >>> limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
> >>> unlikely to be a problem here, but... the question would be how to
> >>> decide on the initial size.
> >>
> >> I think it kinda tries to do that in some cases, by doing this:
> >>
> >>      *numAllocRanges *= 2;
> >>      ...
> >>      tidRanges = (TidRange *)
> >>          repalloc(tidRanges,
> >>                   *numAllocRanges * sizeof(TidRange));
> >>
> >> The problem here is that what matters is not numAllocRanges being 2^N,
> >> but the number of bytes allocated being 2^N. Because that's what ends up
> >> in AllocSet, which keeps lists of 2^N chunks.
> >>
> >> And as TidRange is 12B, so this is guaranteed to waste memory, because
> >> no matter what the first factor is, the result will never be 2^N.
> >
> > For simplicity, I think making it a strict doubling of capacity each
> > time is fine.  That's what we see in numerous other places in the
> > backend code.
> >
>
> Sure.
>
> > What we don't really see is intentionally setting the initial capacity
> > so that each subsequent capacity is close-to-but-not-exceeding a power
> > of 2 bytes.  You can't really do that optimally if working in terms of
> > whole numbers of items that aren't each a power of 2 size.  This step,
> > there may be 2/3 of an item spare; next step, we'll have a whole item
> > spare that we're not going to use.
>
> Ah, I missed the detail with setting initial size.
>
> > So we could keep track in terms of bytes allocated, and then figure
> > out how many items we can fit at the current time.
> >
> > In my opinion, such complexity is overkill for Tid scans.
> >
> > Currently, we try to pick an initial size based on the number of
> > expressions.  We assume each expression will yield one range, and
> > allow that a saop expression might require us to enlarge the array.
> >
> > Again, for simplicity, we should scrap that and pick something like
> > floor(256/sizeof(TidRange)) = 21 items, with about 1.5% wastage.
> >
>
> Probably. I don't think it'd be a lot of code to do the exact sizing,
> but you're right 1.5% is close enough. As long as there is a comment
> explaining the initial sizing, I'm fine with that.
>
> If I could suggest one more thing, I'd define a struct combining the
> array of ranges, numRanges and numAllocRangeslike:
>
> typedef struct TidRanges
> {
>     int         numRanges;
>     int         numAllocRanges;
>     TidRange    ranges[FLEXIBLE_ARRAY_MEMBER];
> } TidRanges;
>
> and use that instead of the plain array. I find it easier to follow
> compared to passing the various fields directly (sometimes as a value,
> sometimes pointer to the value, etc.).

OK, I've rewritten it to use a struct:

typedef struct TidRangeArray {
    TidRange *ranges;
    int numRanges;
    int numAllocated;
}           TidRangeArray;

which is slightly different from the flexible array member version you
suggested.  The TidRangeArray is allocated on the stack in the
function that builds it, and then ranges and numRanges are copied into
the TidScanState before the function returns.

Any particular pros/cons of this versus your approach?  With yours, I
presume we'd have a pointer to TidRanges in TidScanState.

My other concern now is that EnsureTidRangeSpace needs a loop to
double the allocated size.  Most such arrays in the backend only ever
grow by 1, so a single doubling is fine, but the TID scan one can grow
by an arbitrary number with a scalar array op, and it's nice to not
have to check the space for each individual item.  Here's what I've
got.

void
EnsureTidRangeSpace(TidRangeArray *tidRangeArray, int numNewItems)
{
    int requiredSpace = tidRangeArray->numRanges + numNewItems;
    if (requiredSpace <= tidRangeArray->numAllocated)
        return;

    /* it's not safe to double the size unless we're less than half INT_MAX */
    if (requiredSpace >= INT_MAX / 2)
        tidRangeArray->numAllocated = requiredSpace;
    else
        while (tidRangeArray->numAllocated < requiredSpace)
            tidRangeArray->numAllocated *= 2;

    tidRangeArray->ranges = (TidRange *)
        repalloc(tidRangeArray->ranges,
                 tidRangeArray->numAllocated * sizeof(TidRange));
}

If you're in danger of overflowing numAllocated with the number of
TIDs in your query, you're probably going to have other problems.  But
I'd prefer to at least not get stuck in an infinite doubling loop.

Note that you don't need any single ScalarArrayOp to return a huge
result, because you can have multiple such ops in your query, and the
results for each all need to get put into the TidRangeArray before
de-duplication occurs.

What's a safe way to check that we're not trying to process too many items?


Re: Tid scan improvements

From
Edmund Horner
Date:
On Thu, 22 Nov 2018 at 20:41, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've now had a look over the latest patches and I've found a few more
> things.  Many of these are a bit nitpicky, but certainly not all. I
> also reviewed 0004 this time.


Whew!  A lot more things to look at.

I've tried to address most of what you've raised, and attach yet
another set of patches.  There are are few things that I'm not settled
on, discussed below under Big Items.

CC'd Tomas, if he wants to check what I've done with the TidRange
array allocation.


***** Big Items *****

> 0001:
>
> 1. The row estimates are not quite right.  This causes the row
> estimation to go the wrong way for isgt.
>
> For example, the following gets 24 rows instead of 26.
>
> postgres=# create table t (a int);
> CREATE TABLE
> postgres=# insert into t select generate_Series(1,100);
> INSERT 0 100
> postgres=# analyze t;
> postgres=# explain analyze select * from t where ctid >= '(0,75)';
>                                          QUERY PLAN
> ---------------------------------------------------------------------------------------------
>  Seq Scan on t  (cost=0.00..2.25 rows=24 width=4) (actual
> time=0.046..0.051 rows=26 loops=1)
>    Filter: (ctid >= '(0,75)'::tid)
>    Rows Removed by Filter: 74
>  Planning Time: 0.065 ms
>  Execution Time: 0.074 ms
> (5 rows)
>
> The < and <= case is not quite right either. < should have 1 fewer
> tuple than the calculated average tuples per page, and <= should have
> the same (assuming no gaps)
>
> I've attached a small delta patch that I think improves things here.

Thanks, I've incorporated your patch.  I think the logic for iseq and
isgt makes sense now.

Since we only have the total number of tuples and the total number of
pages, and no real statistics, this might be the best we can
reasonably do.  There's still a noticable rowcount error for the last
page, and slighter rowcount errors for other pages.  We estimate
density = ntuples/npages for all pages; but in a densely populated
table, we'll average only half the number of tuples in the last page
as earlier pages.

I guess we *could* estimate density = ntuples/(npages - 0.5) for all
but the last page; and half that for the last.  But that adds
complexity, and you'd still only get a good row count when the last
page was about half full.

I implemented this anyway, and it does improve row counts a bit.  I'll
include it in the next patch set and you can take a look.
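In outline, the adjusted estimator works like this (a hypothetical helper for illustration, not the patch's actual code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Estimated tuples on a given page, assuming every page but the last is
 * equally dense and the last page averages half that density.  Solving
 *     ntuples = d * (npages - 1) + d / 2
 * for the per-page density d gives d = ntuples / (npages - 0.5).
 */
static double
estimate_tuples_on_page(double ntuples, double npages, bool is_last_page)
{
    double      density;

    if (npages <= 1.0)
        return ntuples;         /* a single page holds everything */

    density = ntuples / (npages - 0.5);
    return is_last_page ? density / 2.0 : density;
}
```

For 95 tuples over 10 pages this charges 10 tuples to each full page and 5 to the last, summing back to 95.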

I also spent some time today agonising over how visibility would affect
things, but did not come up with anything useful to add to our
formulas.

> 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
> allocation until it reaches the required size. See how
> MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
> we always have a power of two sized array which is much nicer if we
> ever reach the palloc() limit as if the array is sized at the palloc()
> limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
> unlikely to be a problem here, but... the question would be how to
> decide on the initial size.

I've tried to change things that way, but we still need to deal with
excessive numbers of items.

I've defined a constant MaxTidRanges = MaxAllocSize/sizeof(TidRange),
and raise an error if the required size exceeds that.
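A sketch of that cap, with stand-ins for the backend's MaxAllocSize and the TidRange size (the constant name MaxTidRanges is from the patch; the helper function is illustrative, and the real code raises an error rather than returning false):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MaxAllocSize ((size_t) 0x3fffffff) /* the palloc() limit, ~1GB */

typedef struct TidRange { char bytes[12]; } TidRange;   /* size stand-in */

/* The largest number of ranges a single palloc() can hold. */
#define MaxTidRanges (MaxAllocSize / sizeof(TidRange))

/*
 * Returns true if growing the array to requiredSpace items stays within
 * the palloc() limit.
 */
static bool
TidRangeCountIsSafe(size_t requiredSpace)
{
    return requiredSpace <= MaxTidRanges;
}
```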

> 4. "at" needs shifted left a couple of words
>
> /*
> * If the lower bound was already or above at the maximum block
> * number, then there is no valid range.
> */
>
> but I don't see how it could be "or above".  The ctid type does not
> have the room for that. Although, that's not to say you should test if
> (block == MaxBlockNumber), the >= seems better for the code. I'm just
> complaining about the comment.

We have to deal with TIDs entered by the user, which can include
invalid ones like (4294967295,0).  MaxBlockNumber is 4294967294.

> 12. expected_comparison_operator is a bit long a name:
>
> IsTidComparison(OpExpr *node, int varno, Oid expected_comparison_operator)
>
> How about just expected_opno?
>
> 14. This is not great:
>
> [horrible macros in tidpath.c]
>
> The 4 macros for >, >=, < and <= are only used by IsTidRangeClause()
> which means IsTidComparison() could get called up to 4 times. Most of
> the work it does would be redundant in that case.  Maybe it's better
> to rethink that?

Yeah.  I've rewritten all this as two functions, IsTidEqualClause and
IsTidRangeClause, which each check the opno, with a helper function
IsTidBinaryExpression that checks everything else.
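In outline, that rewrite looks something like the following sketch. The OpExpr stand-in and the body of IsTidBinaryExpression are simplified to a flag for illustration (the real helper inspects the argument list for a CTID Var and a pseudoconstant); the operator OIDs are PostgreSQL's tid comparison operators from pg_operator:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;

/* OIDs of the tid comparison operators, from pg_operator. */
#define TIDEqualOperator     387
#define TIDLessOperator      2799
#define TIDGreaterOperator   2800
#define TIDLessEqOperator    2801
#define TIDGreaterEqOperator 2802

/* Simplified stand-in for the planner's OpExpr node. */
typedef struct OpExpr
{
    Oid         opno;       /* the operator's OID */
    bool        args_ok;    /* stand-in: CTID Var of varno vs. pseudoconstant? */
} OpExpr;

/*
 * Checks everything except the operator itself: two arguments, one the
 * CTID system column of the given relation, the other a pseudoconstant.
 * (Reduced to a flag here.)
 */
static bool
IsTidBinaryExpression(OpExpr *node, int varno)
{
    (void) varno;
    return node->args_ok;
}

static bool
IsTidEqualClause(OpExpr *node, int varno)
{
    return node->opno == TIDEqualOperator &&
        IsTidBinaryExpression(node, varno);
}

static bool
IsTidRangeClause(OpExpr *node, int varno)
{
    if (node->opno != TIDLessOperator &&
        node->opno != TIDLessEqOperator &&
        node->opno != TIDGreaterOperator &&
        node->opno != TIDGreaterEqOperator)
        return false;
    return IsTidBinaryExpression(node, varno);
}
```

This way the shared checks run once per clause, instead of up to four times through the old macros.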

> 18. I'd expect the following not to produce a sort above the Tid Scan.
>
> postgres=# set enable_seqscan=0;
> SET
> postgres=# explain select * from t inner join t t1 on t.ctid = t1.ctid
> where t.ctid < '(0,10)' ;
>                                       QUERY PLAN
> ---------------------------------------------------------------------------------------
>  Merge Join  (cost=10000000008.65..10000000009.28 rows=9 width=8)
>    Merge Cond: (t.ctid = t1.ctid)
>    ->  Sort  (cost=3.33..3.35 rows=9 width=10)
>          Sort Key: t.ctid
>          ->  Tid Scan on t  (cost=0.00..3.18 rows=9 width=10)
>                TID Cond: (ctid < '(0,10)'::tid)
>    ->  Sort  (cost=10000000005.32..10000000005.57 rows=100 width=10)
>          Sort Key: t1.ctid
>          ->  Seq Scan on t t1  (cost=10000000000.00..10000000002.00
> rows=100 width=10)
> (9 rows)
>
> On looking at why the planner did this, I see it's down to how you've
> coded create_tidscan_paths(). You're creating a tidpath if there's any
> quals or any useful pathkeys useful to the query's ORDER BY, but only
> including the pathkeys if they're useful for the query's ORDER BY.  I
> think it'll be better to include the forward pathkeys in all cases,
> and just make it a backward Tid Scan if backward keys are useful for
> the ORDER BY.   There's still a problem with this as a Merge Join
> would need a Sort if there was an ORDER BY ctid DESC for one relation
> even if the other relation had some valid ctid quals since the 2nd
> scan would create a forward Tid Scan.  Maybe that's not worth worrying
> about.   The only fix I can imagine is to always create a forward and
> backward Tid Scan path, which is pretty bad as it's two more paths
> that likely won't get used 99.9% of the time.

Two paths seems excessive just to cater for these unlikely plans.  We
don't provide any other support for joining on CTID.

But setting the path keys doesn't cost much, so we should do that.

> This also caused me to notice the costs are pretty broken for this:
>
> postgres=# explain select * from t order by ctid;
>                      QUERY PLAN
> ---------------------------------------------------
>  Tid Scan on t  (cost=0.00..0.00 rows=100 width=10)
> (1 row)

Yeah -- a side effect of treating empty tidquals as a scan over the
whole table.  I've added costing code for this case.


***** Smaller items *****

Compacted for brevity (hope you don't mind):

> 2. You should test for a non-empty List with list != NIL  [...]  Also, can you not just return after that if test? I think
> the code would be easier to read with it like that.
> 5. TidInArrayExprEval() lacks a header comment [...]
> 6. In MergeTidRanges(), you have: [leftover code]
> 7. "fist" -> "first" [...]
> 8. tss_TidRangePtr is a pretty confusingly named field. [...]
> 9. This comment seems to indicate that a range can only have one bound, but that does not seem to be the case. [...]
> 10. In cost_tidscan() I think you should ceil() the following: [...]
> 11. In the comment: /* TODO decide what the costs should be */ [...]
> 13. !rlst -> rlst == NIL
> 15. There's no field named NumTids: [...]
> 16. I think the following comment needs to be updated: [heapam comments]
> 17. Can you make a few changed to build_tidscan_pathkeys(): [...]
> 19. Looks like the ScanDirection's normally get named "scandir": [...]

These are mostly trivial and I've generally gone with your recommendation.

Attachments

Re: Tid scan improvements

From
David Rowley
Date:
Review of v5:

0001: looks good.

0002:

1. I don't think you need palloc0() here. palloc() looks like it would be fine.

if (tidRangeArray->ranges == NULL)
tidRangeArray->ranges = (TidRange *)
palloc0(tidRangeArray->numAllocated * sizeof(TidRange));

if that wasn't the case, then you'll need to also zero the additional
memory when you repalloc().

2. Can't the following code be moved into the correct
forwards/backwards if block inside the if inscan block above?

/* If we've finished iterating over the ranges, exit the loop. */
if (node->tss_CurrentTidRange >= numRanges ||
node->tss_CurrentTidRange < 0)
break;

Something like:

if (bBackward)
{
    if (node->tss_CurrentTidRange < 0)
    {
        /* initialize for backward scan */
        node->tss_CurrentTidRange = numRanges - 1;
    }
    else if (node->tss_CurrentTidRange == 0)
        break;
    else
        node->tss_CurrentTidRange--;
 }
else
{
    if (node->tss_CurrentTidRange < 0)
    {
        /* initialize for forward scan */
        node->tss_CurrentTidRange = 0;
    }
    else if (node->tss_CurrentTidRange >= numRanges - 1)
        break;
    else
        node->tss_CurrentTidRange++;
}

I think that's a few less lines and instructions and (I think) a bit neater too.

3. if (found_quals != NIL) (yeah, I know there's already lots of
places not doing this)

/* If we found any, make an AND clause out of them. */
if (found_quals)

likewise in:

/* Use a range qual if any were found. */
if (found_quals)

4. The new tests in tidscan.sql should drop the newly created tables.
(I see some get dropped in the 0004 patch, but not all. Best not to
rely on a later patch to do work that this patch should do)

0003: looks okay.

0004:

5. Please add a comment to scandir in:

typedef struct TidScan
{
Scan scan;
List    *tidquals; /* qual(s) involving CTID = something */
ScanDirection scandir;
} TidScan;

/* forward or backward or don't care */ would do.

Likewise for struct TidPath. Likely IndexPath can be used for guidance.

6. Is it worth adding a Merge Join regression test for this patch?

Something like:

postgres=# explain select * from t1 inner join t1 t2 on t1.ctid =
t2.ctid order by t1.ctid desc;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Merge Join  (cost=0.00..21.25 rows=300 width=14)
   Merge Cond: (t1.ctid = t2.ctid)
   ->  Tid Scan Backward on t1  (cost=0.00..8.00 rows=300 width=10)
   ->  Materialize  (cost=0.00..8.75 rows=300 width=10)
         ->  Tid Scan Backward on t1 t2  (cost=0.00..8.00 rows=300 width=10)
(5 rows)

0005:

7. I see the logic behind this new patch, but quite possibly the
majority of the time the relpages will be out of date and you'll
mistakenly apply this to not the final page. I'm neither here nor
there with it. I imagine you might feel the same since you didn't
merge it with 0001. Maybe we can leave it out for now and see what
others think.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From
Tom Lane
Date:
Edmund Horner <ejrh00@gmail.com> writes:
> [ tid scan patches ]

I'm having a hard time wrapping my mind around why you'd bother with
backwards TID scans.  The amount of code needed versus the amount of
usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
see that there might be value in knowing that a regular scan has
"ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
on TID without an explicit sort).  It does not, however, follow that
there's any additional value in supporting the DESC case.

            regards, tom lane


Re: Tid scan improvements

From
Andres Freund
Date:
Hi,

On 2018-12-20 17:21:07 -0500, Tom Lane wrote:
> Edmund Horner <ejrh00@gmail.com> writes:
> > [ tid scan patches ]
> 
> I'm having a hard time wrapping my mind around why you'd bother with
> backwards TID scans.  The amount of code needed versus the amount of
> usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
> see that there might be value in knowing that a regular scan has
> "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
> on TID without an explicit sort).  It does not, however, follow that
> there's any additional value in supporting the DESC case.

I've not followed this thread, but wouldn't that be quite useful to be
able to move old tuples to free space earlier in the table?

I've written multiple scripts that update the later pages in a table, to
force reuse of earlier free pages (in my case by generating ctid = ANY()
style queries with all possible tids for the last few pages, the most
efficient way I could think of).

Greetings,

Andres Freund


Re: Tid scan improvements

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2018-12-20 17:21:07 -0500, Tom Lane wrote:
>> I'm having a hard time wrapping my mind around why you'd bother with
>> backwards TID scans.

> I've not followed this thread, but wouldn't that be quite useful to be
> able to move old tuples to free space earlier in the table?
> I've written multiple scripts that update the later pages in a table, to
> force reuse of earlier free pages (in my case by generating ctid = ANY()
> style queries with all possible tids for the last few pages, the most
> efficient way I could think of).

Sure, but wouldn't you now write those using something on the order of

      WHERE ctid >= '(cutoff_page_here, 1)'

?  I don't see that you'd want to write "ORDER BY ctid DESC LIMIT n"
because you wouldn't know what value of n to use to get all the
tuples on some-number-of-ending-pages.

            regards, tom lane


Re: Tid scan improvements

From
Andres Freund
Date:
Hi,

On 2018-12-20 18:06:41 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-12-20 17:21:07 -0500, Tom Lane wrote:
> >> I'm having a hard time wrapping my mind around why you'd bother with
> >> backwards TID scans.
> 
> > I've not followed this thread, but wouldn't that be quite useful to be
> > able to move old tuples to free space earlier in the table?
> > I've written multiple scripts that update the later pages in a table, to
> > force reuse of earlier free pages (in my case by generating ctid = ANY()
> > style queries with all possible tids for the last few pages, the most
> > efficient way I could think of).
> 
> Sure, but wouldn't you now write those using something on the order of
> 
>       WHERE ctid >= '(cutoff_page_here, 1)'
> 
> ?  I don't see that you'd want to write "ORDER BY ctid DESC LIMIT n"
> because you wouldn't know what value of n to use to get all the
> tuples on some-number-of-ending-pages.

I think you'd want both, to make sure there's not more tuples than
estimated. With the limit calculated to ensure there's enough free space
for them to actually fit.

Greetings,

Andres Freund


Re: Tid scan improvements

From
Edmund Horner
Date:
On Fri, 21 Dec 2018 at 11:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Edmund Horner <ejrh00@gmail.com> writes:
> > [ tid scan patches ]
>
> I'm having a hard time wrapping my mind around why you'd bother with
> backwards TID scans.  The amount of code needed versus the amount of
> usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
> see that there might be value in knowing that a regular scan has
> "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
> on TID without an explicit sort).  It does not, however, follow that
> there's any additional value in supporting the DESC case.

I have occasionally found myself running "SELECT MAX(ctid) FROM t"
when I was curious about why a table is so big after vacuuming.

Perhaps that's not a common enough use case to justify the amount of
code, especially the changes to heapam.c and explain.c.

We'd still need the pathkeys to make good use of forward scans.  (And
I think the executor still needs to support seeking backward for
cursors.)


Re: Tid scan improvements

From
David Rowley
Date
On Fri, 21 Dec 2018 at 13:09, Edmund Horner <ejrh00@gmail.com> wrote:
> On Fri, 21 Dec 2018 at 11:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'm having a hard time wrapping my mind around why you'd bother with
> > backwards TID scans.  The amount of code needed versus the amount of
> > usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
> > see that there might be value in knowing that a regular scan has
> > "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
> > on TID without an explicit sort).  It does not, however, follow that
> > there's any additional value in supporting the DESC case.
>
> I have occasionally found myself running "SELECT MAX(ctid) FROM t"
> when I was curious about why a table is so big after vacuuming.
>
> Perhaps that's not a common enough use case to justify the amount of
> code, especially the changes to heapam.c and explain.c.
>
> We'd still need the pathkeys to make good use of forward scans.  (And
> I think the executor still needs to support seeking backward for
> cursors.)

I think the best thing to do here is separate out all the additional
backwards scan code into a separate patch to allow it to be more
easily considered and approved, or rejected. I think if there's any
hint of this blocking the main patch then it should be a separate
patch to allow its worth to be considered independently.

Also, my primary interest in this patch is to find tuples that are
stopping the heap being truncated during a vacuum. Generally, when I'm
looking for that I have a good idea of what size I expect the relation
should be, (otherwise I'd not think it was bloated), in which case I'd
be doing WHERE ctid >= '(N,1)'. However, it might be easier to write
some auto-bloat-removal script if we could have an ORDER BY ctid DESC
LIMIT n.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From
Edmund Horner
Date
On Fri, 21 Dec 2018 at 13:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On Fri, 21 Dec 2018 at 13:09, Edmund Horner <ejrh00@gmail.com> wrote:
> > On Fri, 21 Dec 2018 at 11:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > I'm having a hard time wrapping my mind around why you'd bother with
> > > backwards TID scans.  The amount of code needed versus the amount of
> > > usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
> > > see that there might be value in knowing that a regular scan has
> > > "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
> > > on TID without an explicit sort).  It does not, however, follow that
> > > there's any additional value in supporting the DESC case.
> >
> > I have occasionally found myself running "SELECT MAX(ctid) FROM t"
> > when I was curious about why a table is so big after vacuuming.
> >
> > Perhaps that's not a common enough use case to justify the amount of
> > code, especially the changes to heapam.c and explain.c.
> >
> > We'd still need the pathkeys to make good use of forward scans.  (And
> > I think the executor still needs to support seeking backward for
> > cursors.)
>
> I think the best thing to do here is separate out all the additional
> backwards scan code into a separate patch to allow it to be more
> easily considered and approved, or rejected. I think if there's any
> hint of this blocking the main patch then it should be a separate
> patch to allow its worth to be considered independently.

Yeah I think you're right.  I'll separate those parts into the basic
forward scan, and then the optional backward scan support.  I think
we'll still only generate a backward scan if the query_pathkeys makes
use of it.

For the forward scan, I seem to recall, from your merge join example,
that it's useful to set the pathkeys even when there are no
query_pathkeys.  We just have to unconditionally set them so that the
larger plan can make use of them.

> Also, my primary interest in this patch is to find tuples that are
> stopping the heap being truncated during a vacuum. Generally, when I'm
> looking for that I have a good idea of what size I expect the relation
> should be, (otherwise I'd not think it was bloated), in which case I'd
> be doing WHERE ctid >= '(N,1)'. However, it might be easier to write
> some auto-bloat-removal script if we could have an ORDER BY ctid DESC
> LIMIT n.


Re: Tid scan improvements

From
Tom Lane
Date
Edmund Horner <ejrh00@gmail.com> writes:
> For the forward scan, I seem to recall, from your merge join example,
> that it's useful to set the pathkeys even when there are no
> query_pathkeys.  We just have to unconditionally set them so that the
> larger plan can make use of them.

No.  Look at indxpath.c: it does not worry about pathkeys unless
has_useful_pathkeys is true, and it definitely does not generate
pathkeys that don't get past truncate_useless_pathkeys.  Those
functions are responsible for worrying about whether mergejoin
can use the pathkeys.  It's not tidpath.c's job to outthink them.

            regards, tom lane


Re: Tid scan improvements

From
Edmund Horner
Date
On Fri, 21 Dec 2018 at 16:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Edmund Horner <ejrh00@gmail.com> writes:
> > For the forward scan, I seem to recall, from your merge join example,
> > that it's useful to set the pathkeys even when there are no
> > query_pathkeys.  We just have to unconditionally set them so that the
> > larger plan can make use of them.
>
> No.  Look at indxpath.c: it does not worry about pathkeys unless
> has_useful_pathkeys is true, and it definitely does not generate
> pathkeys that don't get past truncate_useless_pathkeys.  Those
> functions are responsible for worrying about whether mergejoin
> can use the pathkeys.  It's not tidpath.c's job to outthink them.

Ok.  I think that will simplify things.  So if I follow you correctly,
we should do:

1. If has_useful_pathkeys is true: generate pathkeys (for CTID ASC),
and use truncate_useless_pathkeys on them.
2. If we have tid quals or pathkeys, emit a TID scan path.

For the (optional) backwards scan support patch, should we separately
emit another path, in the reverse direction?  (My current patch only
creates one path, and tries to decide what the best direction is by
looking at query_pathkeys.  This doesn't fit into the above
algorithm.)


Re: Tid scan improvements

From
Tom Lane
Date
Edmund Horner <ejrh00@gmail.com> writes:
> Ok.  I think that will simplify things.  So if I follow you correctly,
> we should do:

> 1. If has_useful_pathkeys is true: generate pathkeys (for CTID ASC),
> and use truncate_useless_pathkeys on them.
> 2. If we have tid quals or pathkeys, emit a TID scan path.

Check.

> For the (optional) backwards scan support patch, should we separately
> emit another path, in the reverse direction?

What indxpath.c does is, if has_useful_pathkeys is true, to generate
pathkeys both ways and then build paths if the pathkeys get past
truncate_useless_pathkeys.  That seems sufficient in this case too.
There are various heuristics about whether it's really useful to
consider both sort directions, but that intelligence is already
built into truncate_useless_pathkeys.  tid quals with no pathkeys
would be reason to generate a forward path, but not reason to
generate a reverse path, because then that would be duplicative.

            regards, tom lane


Re: Tid scan improvements

From
Tom Lane
Date
BTW, with respect to this bit in 0001:

@@ -1795,6 +1847,15 @@ nulltestsel(PlannerInfo *root, NullTestType nulltesttype, Node *arg,
                 return (Selectivity) 0; /* keep compiler quiet */
         }
     }
+    else if (vardata.var && IsA(vardata.var, Var) &&
+             ((Var *) vardata.var)->varattno == SelfItemPointerAttributeNumber)
+    {
+        /*
+         * There are no stats for system columns, but we know CTID is never
+         * NULL.
+         */
+        selec = (nulltesttype == IS_NULL) ? 0.0 : 1.0;
+    }
     else
     {
         /*

I'm not entirely sure why you're bothering; surely nulltestsel is
unrelated to what this patch is about?  And would anybody really
write "WHERE ctid IS NULL"?

However, if we do think it's worth adding code to cover this case,
I wouldn't make it specific to CTID.  *All* system columns can be
assumed not null, see heap_getsysattr().

            regards, tom lane


Re: Tid scan improvements

From
Edmund Horner
Date
On Sat, 22 Dec 2018 at 07:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, with respect to this bit in 0001:

@@ -1795,6 +1847,15 @@ nulltestsel(PlannerInfo *root, NullTestType nulltesttype, Node *arg,
                 return (Selectivity) 0; /* keep compiler quiet */
         }
     }
+    else if (vardata.var && IsA(vardata.var, Var) &&
+             ((Var *) vardata.var)->varattno == SelfItemPointerAttributeNumber)
+    {
+        /*
+         * There are no stats for system columns, but we know CTID is never
+         * NULL.
+         */
+        selec = (nulltesttype == IS_NULL) ? 0.0 : 1.0;
+    }
     else
     {
         /*

I'm not entirely sure why you're bothering; surely nulltestsel is
unrelated to what this patch is about?  And would anybody really
write "WHERE ctid IS NULL"?

I found that it made a difference with selectivity of range comparisons, because clauselist_selectivity tries to correct for it (clausesel.c:274):

    s2 = rqlist->hibound + rqlist->lobound - 1.0;

    /* Adjust for double-exclusion of NULLs */
    s2 += nulltestsel(root, IS_NULL, rqlist->var,
                      varRelid, jointype, sjinfo);
 
It was adding DEFAULT_UNK_SEL = 0.005 to the selectivity, which (while not major) did make the selectivity less accurate.
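To make the effect concrete, here is the arithmetic as a small sketch (the 0.501 inputs are made-up one-sided estimates, not values taken from the planner):

```python
# Worked example of the adjustment quoted above.  A range condition
# "ctid > lo AND ctid < hi" is combined as s2 = hibound + lobound - 1.0,
# and nulltestsel(IS_NULL) is added back because each inequality
# excluded the (same) NULL fraction.  CTID has no statistics, so
# before the patch nulltestsel fell back to DEFAULT_UNK_SEL; with the
# patch it can return 0.0, since CTID is never NULL.

DEFAULT_UNK_SEL = 0.005

def range_selectivity(hibound, lobound, is_null_sel):
    s2 = hibound + lobound - 1.0
    s2 += is_null_sel          # adjust for double-exclusion of NULLs
    return s2

patched = range_selectivity(0.501, 0.501, 0.0)                # ~0.002
unpatched = range_selectivity(0.501, 0.501, DEFAULT_UNK_SEL)  # ~0.007
print(patched, unpatched)
```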

However, if we do think it's worth adding code to cover this case,
I wouldn't make it specific to CTID.  *All* system columns can be
assumed not null, see heap_getsysattr().
I guess we could have a standalone patch to add this for all system columns?


Re: Tid scan improvements

From
Tom Lane
Date
Edmund Horner <ejrh00@gmail.com> writes:
> On Sat, 22 Dec 2018 at 07:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm not entirely sure why you're bothering; surely nulltestsel is
>> unrelated to what this patch is about?

> I found that it made a difference with selectivity of range comparisons,
> because clauselist_selectivity tries to correct for it (clausesel.c:274):

Oh, I see.

> I guess we could have a standalone patch to add this for all system columns?

+1

            regards, tom lane


Re: Tid scan improvements

From
Edmund Horner
Date
Hi all,

I am a bit stuck and I think it's best to try to explain where.

I'm still rebasing the patches for the changes Tom made to support parameterised TID paths for joins.  While the addition of join support itself does not really touch the same code, the modernisation -- in particular, returning a list of RestrictInfos rather than raw quals -- does rewrite quite a bit of tidpath.c.

The original code returned:

    List (with OR semantics)
      CTID = ?   or   CTID = ANY (...)   or   IS CURRENT OF
      (more items)

That changed recently to return:

    List (with OR semantics)
      RestrictInfo
        CTID = ?   or ...
      (more items)

My last set of patches extended the tidqual extraction to pull out lists (with AND semantics) of range quals of the form CTID < ?, etc.  Each list of more than one item was converted into an AND clause before being added to the tidqual list; a single range qual can be added to tidquals as is.

This required looking across multiple RestrictInfos at the top level, for example:

  - "WHERE ctid > ? AND ctid < ?" would arrive at tidpath as a list of two RestrictInfos, from which we extract a single tidqual in the form of an AND clause.
  - "WHERE ctid = ? OR (ctid > ? AND ctid < ?)" arrives as only one RestrictInfo, but we extract two tidquals (an OpExpr, and an AND clause).

The code could also ignore additional unusable quals from a list of top-level RestrictInfos, or from a list of quals from an AND clause, for example:

  - "WHERE foo = ? AND ctid > ? AND ctid < ?" gives us the single tidqual "ctid > ? AND ctid < ?".
  - "WHERE (ctid = ? AND bar = ?) OR (foo = ? AND ctid > ? AND ctid < ?)" gives us the two tidquals "ctid = ?" and "ctid > ? AND ctid < ?".

As the extracted tidquals no longer match the original query quals, they aren't removed from scan_clauses in createplan.c, and hence are correctly checked by the filter.
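As a rough illustration (plain Python over a toy qual representation, not the actual RestrictInfo machinery), the top-level extraction behaves like:

```python
# Rough sketch of the extraction described above, using a toy
# (operator, column) representation: keep the CTID range quals,
# ignore everything else, and AND the survivors together.

RANGE_OPS = {'<', '<=', '>', '>='}

def extract_tid_range_quals(quals):
    """Return the quals usable for a TID range scan; the remaining
    quals would stay behind as ordinary filter conditions."""
    return [(op, col) for (op, col) in quals
            if col == 'ctid' and op in RANGE_OPS]

# "WHERE foo = ? AND ctid > ? AND ctid < ?" keeps just the range part:
quals = [('=', 'foo'), ('>', 'ctid'), ('<', 'ctid')]
print(extract_tid_range_quals(quals))  # [('>', 'ctid'), ('<', 'ctid')]
```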

Aside: The analogous situation with an indexed user attribute "x" behaves a bit differently:
  - "WHERE x = ? OR (x > ? AND x < ?)" won't use a regular index scan, but might use a bitmap index scan.

My patch uses the same path type and executor for all extractable tidquals.

This worked pretty well, but I am finding it difficult to reimplement it in the new tidpath.c code.

In the query information given to the path generator, there is no existing RestrictInfo relating to the whole expression "ctid > ? AND ctid < ?".  I am still learning about RestrictInfos, but my understanding is it doesn't make sense to have a RestrictInfo for an AND clause, anyway; you're supposed to have them for the sub-expressions of it.

And it doesn't seem a good idea to try to create new RestrictInfos in the path generation just to pass the tidquals back to plan creation.  They're complicated objects.

There's also the generation of scan_clauses in create_tidscan_plan (createplan.c:3107).  This now uses RestrictInfos -- I'd imagine we'd need each AND clause to be wrapped in a RestrictInfo to be able to check it properly.

To summarise, I'm not sure what kind of structure I should add to the tidquals list to represent a compound range expression.  Maybe it's better to create a different path (either a new path type, or a flag in TidPath to say what kind of quals are attached) ?

Edmund

Re: Tid scan improvements

From
Tom Lane
Date
Edmund Horner <ejrh00@gmail.com> writes:
> My patch uses the same path type and executor for all extractable tidquals.

> This worked pretty well, but I am finding it difficult to reimplement it in
> the new tidpath.c code.

I didn't like that approach to begin with, and would suggest that you go
over to using a separate path type and executor node.  I don't think the
amount of commonality for the two cases is all that large, and doing it
as you had it required some ugly ad-hoc conventions about the semantics
of the tidquals list.  Where I think this should go is that the tidquals
list still has OR semantics in the existing path type, but you use AND
semantics in the new path type, so that "ctid > ? AND ctid < ?" is just
represented as an implicit-AND list of two simple RestrictInfos.

Now admittedly, this wouldn't give us an efficient way to execute
queries with conditions like "WHERE ctid = X OR (ctid > Y AND ctid < Z)",
but I find myself quite unable to get excited about supporting that.
I see no reason for the new code to worry about any cases more complex
than one or two TID inequalities at top level of the restriction list.

> In the query information given to the path generator, there is no existing
> RestrictInfo relating to the whole expression "ctid > ? AND ctid < ?".  I
> am still learning about RestrictInfos, but my understanding is it doesn't
> make sense to have a RestrictInfo for an AND clause, anyway; you're
> supposed to have them for the sub-expressions of it.

FWIW, the actual data structure for cases like that is that there's
a RestrictInfo for the whole clause ctid = X OR (ctid > Y AND ctid < Z),
and if you look into its "orclause" field, you will find RestrictInfos
attached to the primitive clauses ctid = X, ctid > Y, ctid < Z.  (The
old code in tidpath.c didn't know that, because it'd never been rewritten
since RestrictInfos were invented.)  However, I think this new code should
not worry about OR cases at all, but just pull out top-level TID
comparison clauses.

> And it doesn't seem a good idea to try to create new RestrictInfos in the
> path generation just to pass the tidquals back to plan creation.

No, you should avoid that.  There are places that assume there's only
one RestrictInfo for any given original clause (or sub-clause).

            regards, tom lane


Re: Tid scan improvements

From
Edmund Horner
Date
On Sat, 19 Jan 2019 at 05:35, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Edmund Horner <ejrh00@gmail.com> writes:
> My patch uses the same path type and executor for all extractable tidquals.

> This worked pretty well, but I am finding it difficult to reimplement it in
> the new tidpath.c code.

I didn't like that approach to begin with, and would suggest that you go
over to using a separate path type and executor node.  I don't think the
amount of commonality for the two cases is all that large, and doing it
as you had it required some ugly ad-hoc conventions about the semantics
of the tidquals list.  Where I think this should go is that the tidquals
list still has OR semantics in the existing path type, but you use AND
semantics in the new path type, so that "ctid > ? AND ctid < ?" is just
represented as an implicit-AND list of two simple RestrictInfos.

Thanks for the advice.  This approach resembles my first draft, which had a separate executor type.  However, it did have a combined path type, with an enum TidPathMethod to determine how tidquals was interpreted.  At this point, I think a different path type is clearer, though generation of both types can live in tidpath.c (just as indxpath.c generates different index path types).
 
Now admittedly, this wouldn't give us an efficient way to execute
queries with conditions like "WHERE ctid = X OR (ctid > Y AND ctid < Z)",
but I find myself quite unable to get excited about supporting that.
I see no reason for the new code to worry about any cases more complex
than one or two TID inequalities at top level of the restriction list.

I'm a bit sad to see support for multiple ranges go, though I never saw such queries as being particularly common.  (And there was always a nagging feeling that tidpath.c was beginning to perform feats of boolean acrobatics out of proportion to its importance.  Perhaps in some distant future, TID quals will become another way of supplying TIDs to a bitmap heap scan, which would enable complicated boolean queries using both indexes and TID scans.  But that's just musing, not a proposal.)

> In the query information given to the path generator, there is no existing
> RestrictInfo relating to the whole expression "ctid > ? AND ctid < ?".  I
> am still learning about RestrictInfos, but my understanding is it doesn't
> make sense to have a RestrictInfo for an AND clause, anyway; you're
> supposed to have them for the sub-expressions of it.

FWIW, the actual data structure for cases like that is that there's
a RestrictInfo for the whole clause ctid = X OR (ctid > Y AND ctid < Z),
and if you look into its "orclause" field, you will find RestrictInfos
attached to the primitive clauses ctid = X, ctid > Y, ctid < Z.  (The
old code in tidpath.c didn't know that, because it'd never been rewritten
since RestrictInfos were invented.)  However, I think this new code should
not worry about OR cases at all, but just pull out top-level TID
comparison clauses.

Thanks for the explanation.

> And it doesn't seem a good idea to try to create new RestrictInfos in the
> path generation just to pass the tidquals back to plan creation.

No, you should avoid that.  There are places that assume there's only
one RestrictInfo for any given original clause (or sub-clause).
 

Re: Tid scan improvements

From
Andres Freund
Date
Hi,

On 2019-01-19 17:04:13 +1300, Edmund Horner wrote:
> [...]

The commitfest has ended, and you've not updated the patch to address
the feedback yet. Are you planning to do so soon? Otherwise I think we
ought to mark the patch as returned with feedback?

Greetings,

Andres Freund


Re: Tid scan improvements

From
Edmund Horner
Date
Hi, my apologies for the delay.

I've finished rebasing and rewriting it for Tom's changes to tidpath.c and his recommendations for tid range scans, but I then found a bug with cursor interaction.  Specifically, FETCH LAST scans through the whole range, and then proceeds to scan backwards to get the last row.  It worked in both my very first draft, and in the most recent draft before the changes to tidpath, but I haven't got it working yet for the new version.

I'm hoping to get that fixed in the next 24 hours, and I'll then post the new patch.

Edmund


On Sun, 3 Feb 2019 at 23:34, Andres Freund <andres@anarazel.de> wrote:
Hi,

[...]

The commitfest has ended, and you've not updated the patch to address
the feedback yet. Are you planning to do so soon? Otherwise I think we
ought to mark the patch as returned with feedback?

Greetings,

Andres Freund

Re: Tid scan improvements

From
Edmund Horner
Date
On Sat, 19 Jan 2019 at 17:04, Edmund Horner <ejrh00@gmail.com> wrote:
[...]

Hi, here's a new set of patches.  This one adds a new path type called TidRangePath and a new execution node called TidRangeScan.  I haven't included any of the patches for adding pathkeys to TidPaths or TidRangePaths.

1. v6-0001-Add-selectivity-estimate-for-CTID-system-variables.patch
2. v6-0002-Support-backward-scans-over-restricted-ranges-in-hea.patch
3. v6-0003-Support-range-quals-in-Tid-Scan.patch
4. v6-0004-TID-selectivity-reduce-the-density-of-the-last-page-.patch

Patches 1, 2, and 4 are basically unchanged from my previous post.  Patch 4 is an optional tweak to the CTID selectivity estimates.

Patch 3 is a substantial rewrite from what I had before.  I've checked David's most recent review and tried to make sure the new code meets his suggestions where applicable, although there is one spot where I left the code as "if (tidrangequals) ..." instead of the preferred "if (tidrangequals != NIL) ...", just for consistency with the surrounding code.

Questions --

1. Tid Range Paths are costed as random_page_cost for the first page, and sequential page cost for the remaining pages.  It made sense when there could be multiple non-overlapping ranges.  Now that there's only one range, it might not, but it has the benefit of making Tid Range Scans a little bit more expensive than Sequential Scans, so that they are less likely to be picked when a Seq Scan will do just as well.  Is there a better cost formula to use?

2. Is it worth trying to get rid of some of the code duplication between the TidPath and TidRangePath handling, such as in costsize.c or createplan.c?

3. TidRangeRecheck (copied from TidRecheck) has an existing comment asking whether it should actually be performing a check on the returned tuple.  It seems to me that as long as TidRangeNext doesn't return a tuple outside the requested range, then the check shouldn't be necessary (and we'd simplify the comment to "nothing to check").  If a range key can change at runtime, it should never have been included in the TidRangePath.  Is my understanding correct?

4. I'm a little uncomfortable with the way heapam.c changes the scan limits ("--scan->rs_numblocks") as it progresses through the pages.  I have the executor node reset the scan limits after scanning all the tuples, which seems to work for the tests I have, but I'm using the heap_setscanlimits feature in a slightly different way from the only existing use, which is the one-off scans when building a BRIN index.  I have added some tests for cursor fetches which seem to exercise the code, but I'd still appreciate close review of how I'm using heapam.
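The rs_numblocks concern in question 4 can be shown with a toy model: the scan consumes its block budget as it advances, so the executor must re-establish the limits before rescanning the same range.  Field and function names below follow heapam's style but this is an illustration, not the real code:

```c
#include <assert.h>

/*
 * Toy model of the heap_setscanlimits interaction described above.  The
 * scan decrements rs_numblocks per page visited ("--scan->rs_numblocks"),
 * so after a full scan the limits are used up and must be reset.
 */
typedef struct ToyScan
{
    unsigned int rs_startblock;
    unsigned int rs_numblocks;
} ToyScan;

static void
toy_setscanlimits(ToyScan *scan, unsigned int startBlk, unsigned int numBlks)
{
    scan->rs_startblock = startBlk;
    scan->rs_numblocks = numBlks;
}

/* Scan until the block budget runs out; return pages visited. */
static unsigned int
toy_run_scan(ToyScan *scan)
{
    unsigned int visited = 0;

    while (scan->rs_numblocks > 0)
    {
        --scan->rs_numblocks;   /* mirrors the in-place decrement above */
        ++visited;
    }
    return visited;
}
```

Without the executor-node reset described above, a second pass over the same cursor range would see rs_numblocks == 0 and visit nothing.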

Edmund

Attachments

Re: Tid scan improvements

From
David Rowley
Date
On Mon, 4 Feb 2019 at 18:37, Edmund Horner <ejrh00@gmail.com> wrote:
> 1. v6-0001-Add-selectivity-estimate-for-CTID-system-variables.patch

I think 0001 is good to go. It's a clear improvement over what we do today.

(t1 = 1 million row table with a single int column.)

Patched:
# explain (analyze, timing off) select * from t1 where ctid < '(1, 90)';
 Seq Scan on t1  (cost=0.00..16925.00 rows=315 width=4) (actual
rows=315 loops=1)

# explain (analyze, timing off) select * from t1 where ctid <= '(1, 90)';
 Seq Scan on t1  (cost=0.00..16925.00 rows=316 width=4) (actual
rows=316 loops=1)

Master:
# explain (analyze, timing off) select * from t1 where ctid < '(1, 90)';
 Seq Scan on t1  (cost=0.00..16925.00 rows=333333 width=4) (actual
rows=315 loops=1)

# explain (analyze, timing off) select * from t1 where ctid <= '(1, 90)';
 Seq Scan on t1  (cost=0.00..16925.00 rows=333333 width=4) (actual
rows=316 loops=1)

The only possible risk I can foresee is that it may be more likely we
underestimate the selectivity and that causes something like a nested
loop join due to the estimation being, say 1 row.

It could happen in a case like:

SELECT * FROM bloated_table WHERE ctid >= <last ctid that would exist
without bloat>

but I don't think we should keep using DEFAULT_INEQ_SEL just in case
this happens. We could probably fix 90% of those cases by returning 2
rows instead of 1.
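The estimate being validated above can be sketched as follows.  This is an illustrative reconstruction of the idea, not the committed selfuncs code: assume tuples are evenly distributed, turn the TID into a fractional block position, and divide by the table's block count:

```c
#include <assert.h>

/*
 * Illustrative sketch of a "ctid < constant" selectivity estimate
 * (hypothetical names, not the actual patch code): with an assumed even
 * tuple distribution, the fraction of the table before a TID is its
 * fractional block position divided by the number of blocks.
 */
static double
ctid_lt_selectivity(double block, double offset,
                    double tuples_per_block, double nblocks)
{
    double      fraction;

    if (nblocks <= 0.0 || tuples_per_block <= 0.0)
        return 0.5;             /* fall back to a default guess */

    fraction = (block + offset / tuples_per_block) / nblocks;

    /* Clamp so out-of-range TIDs can't yield nonsense estimates. */
    if (fraction < 0.0)
        fraction = 0.0;
    if (fraction > 1.0)
        fraction = 1.0;
    return fraction;
}
```

For the 1-million-row example above, this yields a small fraction for ctid <= '(1, 90)' instead of the hardcoded DEFAULT_INEQ_SEL third of the table that master estimates.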

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From
David Rowley
Date
On Mon, 4 Feb 2019 at 18:37, Edmund Horner <ejrh00@gmail.com> wrote:
> 2. v6-0002-Support-backward-scans-over-restricted-ranges-in-hea.patch
> 3. v6-0003-Support-range-quals-in-Tid-Scan.patch
> 4. v6-0004-TID-selectivity-reduce-the-density-of-the-last-page-.patch

These ones need a rebase.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From
Edmund Horner
Date
On Thu, 14 Mar 2019 at 16:46, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> The only possible risk I can foresee is that it may be more likely we
> underestimate the selectivity and that causes something like a nested
> loop join due to the estimation being, say 1 row.
>
> It could happen in a case like:
>
> SELECT * FROM bloated_table WHERE ctid >= <last ctid that would exist
> without bloat>
>
> but I don't think we should keep using DEFAULT_INEQ_SEL just in case
> this happens. We could probably fix 90% of those cases by returning 2
> rows instead of 1.

Thanks for looking at the patch David.

I'm not sure how an unreasonable underestimation would occur here.  If
you have a table bloated to say 10x its minimal size, the estimator
still assumes an even distribution of tuples (I don't think we can do
much better than that).  So the selectivity of "ctid >= <last ctid
that would exist without bloat>" is still going to be 0.9.

Edmund


Re: Tid scan improvements

From
David Rowley
Date
On Thu, 14 Mar 2019 at 21:12, Edmund Horner <ejrh00@gmail.com> wrote:

> I'm not sure how an unreasonable underestimation would occur here.  If
> you have a table bloated to say 10x its minimal size, the estimator
> still assumes an even distribution of tuples (I don't think we can do
> much better than that).  So the selectivity of "ctid >= <last ctid
> that would exist without bloat>" is still going to be 0.9.

Okay, I think you're right there.  I guess the only risk there is just
varying tuple density per page, and that seems no greater risk than we
have with the existing stats.

Just looking again, I think the block of code starting:

+ if (density > 0.0)

needs a comment to mention what it's doing. Perhaps:

+ /*
+ * Using the average tuples per page, calculate how far into
+ * the page the itemptr is likely to be and adjust block
+ * accordingly.
+ */
+ if (density > 0.0)

Or some better choice of words.  With that done, I think 0001 is good to go.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

From
Edmund Horner
Date
On Thu, 14 Mar 2019 at 23:06, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On Thu, 14 Mar 2019 at 21:12, Edmund Horner <ejrh00@gmail.com> wrote:
> > I'm not sure how an unreasonable underestimation would occur here.  If
> > you have a table bloated to say 10x its minimal size, the estimator
> > still assumes an even distribution of tuples (I don't think we can do
> > much better than that).  So the selectivity of "ctid >= <last ctid
> > that would exist without bloat>" is still going to be 0.9.
>
> Okay, I think you're right there.  I guess the only risk there is just
> varying tuple density per page, and that seems no greater risk than we
> have with the existing stats.

Yeah that is a risk, and will probably come up in practice.  But at
least we're not just picking a hardcoded selectivity any more.

> Just looking again, I think the block of code starting:
>
> + if (density > 0.0)
>
> needs a comment to mention what it's doing. Perhaps:
>
> + /*
> + * Using the average tuples per page, calculate how far into
> + * the page the itemptr is likely to be and adjust block
> + * accordingly.
> + */
> + if (density > 0.0)
>
> Or some better choice of words.  With that done, I think 0001 is good to go.

Ok, I'll look at it and hopefully get a new patch up soon.

Edmund


Re: Tid scan improvements

From
Edmund Horner
Date
On Thu, 14 Mar 2019 at 23:37, Edmund Horner <ejrh00@gmail.com> wrote:
> On Thu, 14 Mar 2019 at 23:06, David Rowley <david.rowley@2ndquadrant.com> wrote:
> > Just looking again, I think the block of code starting:
> >
> > + if (density > 0.0)
> >
> > needs a comment to mention what it's doing. Perhaps:
> >
> > + /*
> > + * Using the average tuples per page, calculate how far into
> > + * the page the itemptr is likely to be and adjust block
> > + * accordingly.
> > + */
> > + if (density > 0.0)
> >
> > Or some better choice of words.  With that done, I think 0001 is good to go.
>
> Ok, I'll look at it and hopefully get a new patch up soon.

Hullo,

Here's a new set of patches.

It includes new versions of the other patches, which needed to be
rebased because of the introduction of the "tableam" API by
c2fe139c20.

I've had to adapt it to use the table scan API.  I've got it compiling
and passing tests, but I'm uneasy about some things that still use the
heapam API.

1. I call heap_setscanlimits as I'm not sure there is a tableam equivalent.
2. I'm not sure whether non-heap tableam implementations can also be
supported by my TID Range Scan: we need to be able to set the scan
limits.  There may not be any other implementations yet, but when
there are, how do we stop the planner using a TID Range Scan for
non-heap relations?
3. When fetching tuples, I see that nodeSeqscan.c uses
table_scan_getnextslot, which saves dealing with HeapTuples.  But
nodeTidrangescan wants to do some checking of the block and offset
before returning the slot.  So I have it using heap_getnext and
ExecStoreBufferHeapTuple.  Apart from being heapam-specific, it's just
not as clean as the new API calls.

Ideally, we can get this to support general tableam implementations
rather than using heapam-specific calls.  Any advice on how to do
this?

Thanks
Edmund

Attachments

Re: Tid scan improvements

From
David Rowley
Date
On Fri, 15 Mar 2019 at 18:42, Edmund Horner <ejrh00@gmail.com> wrote:
> I've had to adapt it to use the table scan API.  I've got it compiling
> and passing tests, but I'm uneasy about some things that still use the
> heapam API.
>
> 1. I call heap_setscanlimits as I'm not sure there is a tableam equivalent.
> 2. I'm not sure whether non-heap tableam implementations can also be
> supported by my TID Range Scan: we need to be able to set the scan
> limits.  There may not be any other implementations yet, but when
> there are, how do we stop the planner using a TID Range Scan for
> non-heap relations?
> 3. When fetching tuples, I see that nodeSeqscan.c uses
> table_scan_getnextslot, which saves dealing with HeapTuples.  But
> nodeTidrangescan wants to do some checking of the block and offset
> before returning the slot.  So I have it using heap_getnext and
> ExecStoreBufferHeapTuple.  Apart from being heapam-specific, it's just
> not as clean as the new API calls.
>
> Ideally, we can get this to support general tableam implementations
> rather than using heapam-specific calls.  Any advice on how to do
> this?

The commit message in 8586bf7ed mentions:

> Subsequent commits will incrementally abstract table access
> functionality to be routed through table access methods. That change
> is too large to be reviewed & committed at once, so it'll be done
> incrementally.

and looking at [1] I see patch 0004 introduces some changes in
nodeTidscan.c to call a new tableam API function named
heapam_fetch_row_version. I see this function does have a ItemPointer
argument, so I guess we must be keeping those as unique row
identifiers in the API.

Patch 0001 does change the signature of heap_setscanlimits() (appears
to be committed already), and then in 0010 the only code that calls
heap_setscanlimits() (IndexBuildHeapRangeScan()) is moved and renamed
to heapam_index_build_range_scan() and set to be called via the
index_build_range_scan TableAmRoutine method.  So it looks like out of
that patch series nothing is there to allow you to access
heap_setscanlimits() directly via the TableAmRoutine API, so perhaps
for this to work heap_setscanlimits will need to be interfaced,
however, I'm unsure if that'll violate any assumptions that Andres
wants to keep out of the API...  Andres?

[1] https://www.postgresql.org/message-id/20190311193746.hhv4e4e62nxtq3k6@alap3.anarazel.de

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Re: Tid scan improvements

From
David Steele
Date
On 3/18/19 1:35 PM, David Rowley wrote:
> On Fri, 15 Mar 2019 at 18:42, Edmund Horner <ejrh00@gmail.com> wrote:
> 
>> Subsequent commits will incrementally abstract table access
>> functionality to be routed through table access methods. That change
>> is too large to be reviewed & committed at once, so it'll be done
>> incrementally.
> 
> and looking at [1] I see patch 0004 introduces some changes in
> nodeTidscan.c to call a new tableam API function named
> heapam_fetch_row_version. I see this function does have a ItemPointer
> argument, so I guess we must be keeping those as unique row
> identifiers in the API.
> 
> Patch 0001 does change the signature of heap_setscanlimits() (appears
> to be committed already), and then in 0010 the only code that calls
> heap_setscanlimits() (IndexBuildHeapRangeScan()) is moved and renamed
> to heapam_index_build_range_scan() and set to be called via the
> index_build_range_scan TableAmRoutine method.  So it looks like out of
> that patch series nothing is there to allow you to access
> heap_setscanlimits() directly via the TableAmRoutine API, so perhaps
> for this to work heap_setscanlimits will need to be interfaced,
> however, I'm unsure if that'll violate any assumptions that Andres
> wants to keep out of the API...  Andres?

Thoughts on this, Andres?

Regards,
-- 
-David
david@pgmasters.net


Re: Tid scan improvements

From
Andres Freund
Date
Hi,

On 2019-03-15 18:42:40 +1300, Edmund Horner wrote:
> I've had to adapt it to use the table scan API.  I've got it compiling
> and passing tests, but I'm uneasy about some things that still use the
> heapam API.
> 
> 1. I call heap_setscanlimits as I'm not sure there is a tableam
> equivalent.

There used to be, but it wasn't clear that it was useful. In core pg the
only caller are index range scans, and those are - in a later patch in
the series - moved into the AM as well, as they need to deal with things
like HOT.


> 2. I'm not sure whether non-heap tableam implementations can also be
> supported by my TID Range Scan: we need to be able to set the scan
> limits.  There may not be any other implementations yet, but when
> there are, how do we stop the planner using a TID Range Scan for
> non-heap relations?

I've not yet looked through your code, but if required we'd probably
need to add a new tableam callback. It'd be marked optional, and the
planner could just check for its presence. A later part of the pluggable
storage series does that for bitmap scans, perhaps it's worth looking at
that?


> 3. When fetching tuples, I see that nodeSeqscan.c uses
> table_scan_getnextslot, which saves dealing with HeapTuples.  But
> nodeTidrangescan wants to do some checking of the block and offset
> before returning the slot.  So I have it using heap_getnext and
> ExecStoreBufferHeapTuple.  Apart from being heapam-specific, it's just
> not as clean as the new API calls.

Yea, that's not ok.  Note that, since yesterday, nodeTidscan doesn't
call heap_fetch() anymore (there's still a heap dependency, but that's
just for heap_get_latest_tid(), which I'll move into execMain or such).


> Ideally, we can get this to support general tableam implementations
> rather than using heapam-specific calls.  Any advice on how to do
> this?

Not yet - could you perhaps look at the bitmap scan patch in the tableam
queue, and see if that gives you inspiration?

- Andres


Re: Tid scan improvements

From
Andres Freund
Date
Hi,

On 2019-03-18 22:35:05 +1300, David Rowley wrote:
> The commit message in 8586bf7ed mentions:
> 
> > Subsequent commits will incrementally abstract table access
> > functionality to be routed through table access methods. That change
> > is too large to be reviewed & committed at once, so it'll be done
> > incrementally.
> 
> and looking at [1] I see patch 0004 introduces some changes in
> nodeTidscan.c to call a new tableam API function named
> heapam_fetch_row_version. I see this function does have a ItemPointer
> argument, so I guess we must be keeping those as unique row
> identifiers in the API.

Right, we are. At least for now - there's some discussions around
allowing different format for TIDs, to allow things like index organized
tables, but that's for later.


> Patch 0001 does change the signature of heap_setscanlimits() (appears
> to be committed already), and then in 0010 the only code that calls
> heap_setscanlimits() (IndexBuildHeapRangeScan()) is moved and renamed
> to heapam_index_build_range_scan() and set to be called via the
> index_build_range_scan TableAmRoutine method.  So it looks like out of
> that patch series nothing is there to allow you to access
> heap_setscanlimits() directly via the TableAmRoutine API, so perhaps
> for this to work heap_setscanlimits will need to be interfaced,
> however, I'm unsure if that'll violate any assumptions that Andres
> wants to keep out of the API...

I was kinda hoping to keep block numbers out of the "main" APIs, to
avoid assuming everything is BLCKSZ based. I don't have a particular
problem allowing an optional setscanlimits type callback that works with
block numbers. The planner could check its presence and just not build
tid range scans if not present.  Alternatively a bespoke scan API for
tid range scans, like the later patches in the tableam series for
bitmap, sample, analyze scans, might be an option.

Greetings,

Andres Freund


Re: Tid scan improvements

From
Tom Lane
Date
Andres Freund <andres@anarazel.de> writes:
> I was kinda hoping to keep block numbers out of the "main" APIs, to
> avoid assuming everything is BLCKSZ based. I don't have a particular
> problem allowing an optional setscanlimits type callback that works with
> block numbers. The planner could check its presence and just not build
> tid range scans if not present.  Alternatively a bespoke scan API for
> tid range scans, like the later patches in the tableam series for
> bitmap, sample, analyze scans, might be an option.

Given Andres' API concerns, and the short amount of time remaining in
this CF, I'm not sure how much of this patch set we can expect to land
in v12.  It seems like it might be a good idea to scale back our ambitions
and see whether there's a useful subset we can push in easily.

With that in mind, I went ahead and pushed 0001+0004, since improving
the planner's selectivity estimate for a "ctid vs constant" qual is
likely to be helpful whether the executor is smart about it or not.

FWIW, I don't really see the point of treating 0002 as a separate patch.
If it had some utility on its own, then it'd be sensible, but what
would that be?  Also, it looks from 0002 like you are trying to overload
rs_startblock with a different meaning than it has for syncscans, and
I think that might be a bad idea. 

            regards, tom lane


Re: Tid scan improvements

From
Edmund Horner
Date
On Tue, 26 Mar 2019 at 11:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andres Freund <andres@anarazel.de> writes:
> > I was kinda hoping to keep block numbers out of the "main" APIs, to
> > avoid assuming everything is BLCKSZ based. I don't have a particular
> > problem allowing an optional setscanlimits type callback that works with
> > block numbers. The planner could check its presence and just not build
> > tid range scans if not present.  Alternatively a bespoke scan API for
> > tid range scans, like the later patches in the tableam series for
> > bitmap, sample, analyze scans, might be an option.
>
> Given Andres' API concerns, and the short amount of time remaining in
> this CF, I'm not sure how much of this patch set we can expect to land
> in v12.  It seems like it might be a good idea to scale back our ambitions
> and see whether there's a useful subset we can push in easily.

I agree.  It'll take some time to digest Andres' advice and write a
better patch.

Should I update the CF app to a) set the target version to 13, and/or
b) move it to the next commitfest?

> With that in mind, I went ahead and pushed 0001+0004, since improving
> the planner's selectivity estimate for a "ctid vs constant" qual is
> likely to be helpful whether the executor is smart about it or not.

Cool.

> FWIW, I don't really see the point of treating 0002 as a separate patch.
> If it had some utility on its own, then it'd be sensible, but what
> would that be?  Also, it looks from 0002 like you are trying to overload
> rs_startblock with a different meaning than it has for syncscans, and
> I think that might be a bad idea.

Yeah I don't think either patch is useful without the other.  They
were separate because, initially, only some of the TidRangeScan
functionality depended on it, and I was particularly uncomfortable
with what I was doing to heapam.c.

The changes in heapam.c were required for backward scan support, as
used by ORDER BY ctid DESC and MAX(ctid); and also for FETCH LAST and
FETCH PRIOR.  I have removed the backward scans functionality from the
current set of patches, but support for backward cursor fetches
remains.

I guess to brutally simplify the patch further, we could give up
backward cursor fetches entirely?  This means such cursors that end up
using a TidRangeScan will require SCROLL to go backwards (which is a
small pain for user experience), but TBH I don't think backwards-going
cursors on CTID will be hugely common.

I'm still not familiar enough with heapam.c to have any better ideas
on how to support backward scanning a limited range.


Re: Tid scan improvements

From
David Steele
Date
On 3/26/19 8:11 AM, Edmund Horner wrote:
> On Tue, 26 Mar 2019 at 11:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Andres Freund <andres@anarazel.de> writes:
>>> I was kinda hoping to keep block numbers out of the "main" APIs, to
>>> avoid assuming everything is BLCKSZ based. I don't have a particular
>>> problem allowing an optional setscanlimits type callback that works with
>>> block numbers. The planner could check its presence and just not build
>>> tid range scans if not present.  Alternatively a bespoke scan API for
>>> tid range scans, like the later patches in the tableam series for
>>> bitmap, sample, analyze scans, might be an option.
>>
>> Given Andres' API concerns, and the short amount of time remaining in
>> this CF, I'm not sure how much of this patch set we can expect to land
>> in v12.  It seems like it might be a good idea to scale back our ambitions
>> and see whether there's a useful subset we can push in easily.
> 
> I agree.  It'll take some time to digest Andres' advice and write a
> better patch.
> 
> Should I update the CF app to a) set the target version to 13, and/or
> b) move it to the next commitfest?

If you plan to continue working on it in this CF then you can just 
change the target to PG13.  If you plan to take a break and pick up the 
work later then go ahead and push it to the next CF.

Regards,
-- 
-David
david@pgmasters.net


Re: Tid scan improvements

From
Thomas Munro
Date
On Tue, Mar 26, 2019 at 7:25 PM David Steele <david@pgmasters.net> wrote:
> On 3/26/19 8:11 AM, Edmund Horner wrote:
> > Should I update the CF app to a) set the target version to 13, and/or
> > b) move it to the next commitfest?
>
> If you plan to continue working on it in this CF then you can just
> change the target to PG13.  If you plan to take a break and pick up the
> work later then go ahead and push it to the next CF.

Hi Edmund,

The new CF is here.  I'm going through poking threads for submissions
that don't apply, but it sounds like this needs more than a rebase?
Perhaps this belongs in the next CF?

--
Thomas Munro
https://enterprisedb.com



Re: Tid scan improvements

From
David Rowley
Date
On Mon, 1 Jul 2019 at 23:29, Thomas Munro <thomas.munro@gmail.com> wrote:
> The new CF is here.  I'm going through poking threads for submissions
> that don't apply, but it sounds like this needs more than a rebase?
> Perhaps this belongs in the next CF?

0001 and 0004 of v7 got pushed in PG12. The CFbot will be trying to
apply 0001 still, but on testing 0002, no joy there either.

It would be good to see this back in PG13. For now, I'll mark it as
waiting on author.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Tid scan improvements

From
Edmund Horner
Date
On Thu, 4 Jul 2019 at 15:43, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On Mon, 1 Jul 2019 at 23:29, Thomas Munro <thomas.munro@gmail.com> wrote:
> > The new CF is here.  I'm going through poking threads for submissions
> > that don't apply, but it sounds like this needs more than a rebase?
> > Perhaps this belongs in the next CF?
>
> 0001 and 0004 of v7 got pushed in PG12. The CFbot will be trying to
> apply 0001 still, but on testing 0002, no joy there either.
>
> It would be good to see this back in PG13. For now, I'll mark it as
> waiting on author.

Hi,

I'm not really sure how to proceed.  I started with a fairly pragmatic
solution to "WHERE ctid > ? AND ctid < ?" for tables, and then tableam
came along.

The options I see are:

A.  Continue to target only heapam tables, making the bare minimum
changes necessary for the new tableam api.
B.  Try to do something more general that works on all tableam
implementations for which it may be useful.

There may not be much different between them, but B. means a bit more
research into zheap, zstore and other possible tableams.

Next question, how will the executor access the table?

1. Continue to use the seqscan tableam methods, by setting limits.
2. Use the bitmap scan methods, for instance by faking a BitmapIteratorResult.
3. Add new tableam methods specially for scanning a range of TIDs.

Any thoughts?



Re: Tid scan improvements

From
David Rowley
Date
On Sun, 7 Jul 2019 at 15:32, Edmund Horner <ejrh00@gmail.com> wrote:
> I'm not really sure how to proceed.  I started with a fairly pragmatic
> solution to "WHERE ctid > ? AND ctid < ?" for tables, and then tableam
> came along.
>
> The options I see are:
>
> A.  Continue to target only heapam tables, making the bare minimum
> changes necessary for the new tableam api.
> B.  Try to do something more general that works on all tableam
> implementations for which it may be useful.

Going by the conversation with Andres above:

On Tue, 26 Mar 2019 at 10:52, Andres Freund <andres@anarazel.de> wrote:
>
> On 2019-03-18 22:35:05 +1300, David Rowley wrote:
> > The commit message in 8586bf7ed mentions:
> >
> > > Subsequent commits will incrementally abstract table access
> > > functionality to be routed through table access methods. That change
> > > is too large to be reviewed & committed at once, so it'll be done
> > > incrementally.
> >
> > and looking at [1] I see patch 0004 introduces some changes in
> > nodeTidscan.c to call a new tableam API function named
> > heapam_fetch_row_version. I see this function does have a ItemPointer
> > argument, so I guess we must be keeping those as unique row
> > identifiers in the API.
>
> Right, we are. At least for now - there's some discussions around
> allowing different format for TIDs, to allow things like index organized
> tables, but that's for later.

So it seems that the plan is to insist that TIDs are tuple identifiers
for all table AMs, for now.  If that changes in the future, then so be
it, but I don't think that's cause for delaying any work on TID Range
Scans.  Also from scanning around tableam.h, I see that there's no
shortage of usages of BlockNumber, so it seems reasonable to assume
table AMs must use blocks... It's hard to imagine moving away from
that given that we have shared buffers.

We do appear to have some table AM methods that are optional, although
I'm not sure where the documentation is about that. For example, in
get_relation_info() I see:

info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
relation->rd_tableam->scan_bitmap_next_block != NULL;

so it appears that at least scan_bitmap_next_block is optional.

I think what I'd do would be to add a table_setscanlimits API method
for table AM and perhaps have the planner only add TID range scan
paths if the relation has a non-NULL function pointer for that API
function.  It would be good to stick a comment at least in tableam.h
that mentions that the callback is optional.  That might help a bit
when it comes to writing documentation on each API function and what
they do.

> There may not be much different between them, but B. means a bit more
> research into zheap, zstore and other possible tableams.
>
> Next question, how will the executor access the table?
>
> 1. Continue to use the seqscan tableam methods, by setting limits.
> 2. Use the bitmap scan methods, for instance by faking a BitmapIteratorResult.
> 3. Add new tableam methods specially for scanning a range of TIDs.
>
> Any thoughts?

I think #1 is fine for now. #3 might be slightly more efficient since
you'd not need to read each tuple in the given page before the given
offset and throw it away, but I don't really think it's worth adding a
new table AM method for.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Tid scan improvements

From
Edmund Horner
Date
On Thu, 11 Jul 2019 at 10:22, David Rowley <david.rowley@2ndquadrant.com> wrote:
> > A.  Continue to target only heapam tables, making the bare minimum
> > changes necessary for the new tableam api.
> > B.  Try to do something more general that works on all tableam
> > implementations for which it may be useful.
>
> Going by the conversation with Andres above:
>
> On Tue, 26 Mar 2019 at 10:52, Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2019-03-18 22:35:05 +1300, David Rowley wrote:
> > > The commit message in 8586bf7ed mentions:
> > >
> > > > Subsequent commits will incrementally abstract table access
> > > > functionality to be routed through table access methods. That change
> > > > is too large to be reviewed & committed at once, so it'll be done
> > > > incrementally.
> > >
> > > and looking at [1] I see patch 0004 introduces some changes in
> > > nodeTidscan.c to call a new tableam API function named
> > > heapam_fetch_row_version. I see this function does have a ItemPointer
> > > argument, so I guess we must be keeping those as unique row
> > > identifiers in the API.
> >
> > Right, we are. At least for now - there's some discussions around
> > allowing different format for TIDs, to allow things like index organized
> > tables, but that's for later.
>
> So it seems that the plan is to insist that TIDs are tuple identifiers
> for all table AMs, for now.  If that changes in the future, then so be
> it, but I don't think that's cause for delaying any work on TID Range
> Scans.  Also from scanning around tableam.h, I see that there's no
> shortage of usages of BlockNumber, so it seems reasonable to assume
> table AMs must use blocks... It's hard to imagine moving away from
> that given that we have shared buffers.
>
> We do appear to have some table AM methods that are optional, although
> I'm not sure where the documentation is about that. For example, in
> get_relation_info() I see:
>
> info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
> relation->rd_tableam->scan_bitmap_next_block != NULL;
>
> so it appears that at least scan_bitmap_next_block is optional.

> I think what I'd do would be to add a table_setscanlimits API method
> for table AM and perhaps have the planner only add TID range scan
> paths if the relation has a non-NULL function pointer for that API
> function.  It would be good to stick a comment at least in tableam.h
> that mentions that the callback is optional.  That might help a bit
> when it comes to writing documentation on each API function and what
> they do.

This seems like a good path forward.

> > There may not be much different between them, but B. means a bit more
> > research into zheap, zstore and other possible tableams.
> >
> > Next question, how will the executor access the table?
> >
> > 1. Continue to use the seqscan tableam methods, by setting limits.
> > 2. Use the bitmap scan methods, for instance by faking a BitmapIteratorResult.
> > 3. Add new tableam methods specially for scanning a range of TIDs.
> >
> > Any thoughts?
>
> I think #1 is fine for now. #3 might be slightly more efficient since
> you'd not need to read each tuple in the given page before the given
> offset and throw it away, but I don't really think it's worth adding a
> new table AM method for.

Yeah, it's not perfectly efficient in that regard.  But it's at least
a step in the right direction.

Thanks for the guidance David.  I think I'll be able to make some
progress now before hitting the next obstacle!

Edmund



Re: Tid scan improvements

From
Edmund Horner
Date
On Thu, 11 Jul 2019 at 10:22, David Rowley <david.rowley@2ndquadrant.com> wrote:
> So it seems that the plan is to insist that TIDs are tuple identifiers
> for all table AMs, for now.  If that changes in the future, then so be
> it, but I don't think that's cause for delaying any work on TID Range
> Scans.  Also from scanning around tableam.h, I see that there's no
> shortage of usages of BlockNumber, so it seems reasonable to assume
> table AMs must use blocks... It's hard to imagine moving away from
> that given that we have shared buffers.
>
> We do appear to have some table AM methods that are optional, although
> I'm not sure where the documentation is about that. For example, in
> get_relation_info() I see:
>
> info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
> relation->rd_tableam->scan_bitmap_next_block != NULL;
>
> so it appears that at least scan_bitmap_next_block is optional.
>
> I think what I'd do would be to add a table_setscanlimits API method
> for table AM and perhaps have the planner only add TID range scan
> paths if the relation has a non-NULL function pointer for that API
> function.  It would be good to stick a comment at least in tableam.h
> that mentions that the callback is optional.  That might help a bit
> when it comes to writing documentation on each API function and what
> they do.

Hi.  Here's a new patch.

Summary of changes compared to last time:
  - I've added the additional "scan_setlimits" table AM method.  To
check whether it's implemented in the planner, I have added an
additional "has_scan_setlimits" flag to RelOptInfo.  It seems to work.
  - I've also changed nodeTidrangescan to not require anything from heapam now.
  - To simplify the patch and avoid changing heapam, I've removed the
backward scan support (which was needed for FETCH LAST/PRIOR) and made
ExecSupportsBackwardScan return false for this plan type.
  - I've removed the vestigial passing of "direction" through
nodeTidrangescan.  If my understanding is correct, the direction
passed to TidRangeNext will always be forward.  But I did leave FETCH
LAST/PRIOR in the regression tests (after adding SCROLL to the
cursor).
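
The planner-side check in the first item follows the same pattern as the
amhasgetbitmap test quoted up-thread: an optional table AM callback is just a
possibly-NULL function pointer.  A self-contained sketch of that pattern (all
names here are illustrative stand-ins, not the actual PostgreSQL structs):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in; the real TableAmRoutine has many more members. */
typedef struct TableAmRoutineExample
{
    /* Optional: may be NULL if the AM cannot limit a scan by block range. */
    void        (*scan_setlimits) (void *scan, uint32_t startBlk,
                                   uint32_t numBlks);
} TableAmRoutineExample;

static void
heap_setlimits_stub(void *scan, uint32_t startBlk, uint32_t numBlks)
{
    (void) scan;
    (void) startBlk;
    (void) numBlks;             /* a heap-like AM would record the limits */
}

/*
 * Mirrors the amhasgetbitmap test quoted up-thread: decide whether TID
 * range paths are possible by checking the callback pointer for NULL, and
 * record the answer in a RelOptInfo-style flag for the planner.
 */
static bool
rel_has_scan_setlimits(const TableAmRoutineExample *am)
{
    return am != NULL && am->scan_setlimits != NULL;
}
```

A heap-like AM would fill in the pointer; an AM without block-range support
would leave it NULL and the planner would simply not build TID range paths.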

The patch now only targets the simple use case of restricting the
range of a table to scan.  I think it would be nice to eventually
support the other major use cases of ORDER BY ctid ASC/DESC and
MIN/MAX(ctid), but that can be another feature...

Edmund

Attachments

Re: Tid scan improvements

From
David Rowley
Date:
On Mon, 15 Jul 2019 at 17:54, Edmund Horner <ejrh00@gmail.com> wrote:
> Summary of changes compared to last time:
>   - I've added the additional "scan_setlimits" table AM method.  To
> check whether it's implemented in the planner, I have added an
> additional "has_scan_setlimits" flag to RelOptInfo.  It seems to work.
>   - I've also changed nodeTidrangescan to not require anything from heapam now.
>   - To simply the patch and avoid changing heapam, I've removed the
> backward scan support (which was needed for FETCH LAST/PRIOR) and made
> ExecSupportsBackwardScan return false for this plan type.
>   - I've removed the vestigial passing of "direction" through
> nodeTidrangescan.  If my understanding is correct, the direction
> passed to TidRangeNext will always be forward.  But I did leave FETCH
> LAST/PRIOR in the regression tests (after adding SCROLL to the
> cursor).

I spent some time today hacking at this.  I fixed a bug in how
has_scan_setlimits is set, rewrote a few comments and simplified some of
the code.

When I mentioned up-thread about the optional scan_setlimits table AM
callback, I'd forgotten that you'd not have access to check that
directly during planning. As you mention above, you've added
RelOptInfo has_scan_setlimits so the planner knows if it can use TID
Range scans or not. It would be nice to not have to add this flag, but
that would require either:

1. Making scan_setlimits a non-optional callback function in table AM, or;
2. Allowing the planner to have access to the opened Relation.

#2 is not for this patch, but there has been some talk about it. It
was done for the executor last year in d73f4c74dd3.

I wonder if Andres has any thoughts on #1?

The other thing I was thinking about was if enable_tidscan should be
in charge of TID Range scans too. I see you have it that way, but
should we be adding enable_tidrangescan?  The docs claim that
enable_tidscan: "Enables or disables the query planner's use of TID
scan plan types.". Note: "types" is  plural.  Maybe we could call that
fate and keep it the way the patch has it already.  Does anyone have
another idea about that?

I've attached a delta of the changes I made and also a complete v9 patch.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments

Re: Tid scan improvements

From
Edmund Horner
Date:
Thanks for the edits and fixing that pretty glaring copy-paste bug.

Regarding enable_tidscan, I couldn't decide whether we really need it,
and erred on the side of not adding yet another setting.

The current patch only creates a tid range path if there's at least
one ctid qual.  But during development of earlier patches I was a bit
concerned about the possibility of tid range scan being picked instead
of seq scan when the whole table is scanned, perhaps due to a tiny
discrepancy in costing.  Both scans will scan the whole table, but seq
scan is preferred since it can be parallelised, synchronised with
other scans, and has a bit less overhead with tuple checking.  If a
future change creates tid range paths for more queries, for instance
for MIN/MAX(ctid) or ORDER BY ctid, then it might be more important to
have a separate setting for it.

On Wed, 17 Jul 2019 at 23:11, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On Mon, 15 Jul 2019 at 17:54, Edmund Horner <ejrh00@gmail.com> wrote:
> > Summary of changes compared to last time:
> >   - I've added the additional "scan_setlimits" table AM method.  To
> > check whether it's implemented in the planner, I have added an
> > additional "has_scan_setlimits" flag to RelOptInfo.  It seems to work.
> >   - I've also changed nodeTidrangescan to not require anything from heapam now.
> >   - To simply the patch and avoid changing heapam, I've removed the
> > backward scan support (which was needed for FETCH LAST/PRIOR) and made
> > ExecSupportsBackwardScan return false for this plan type.
> >   - I've removed the vestigial passing of "direction" through
> > nodeTidrangescan.  If my understanding is correct, the direction
> > passed to TidRangeNext will always be forward.  But I did leave FETCH
> > LAST/PRIOR in the regression tests (after adding SCROLL to the
> > cursor).
>
> I spent some time today hacking at this.  I fixed a bug in how
> has_scan_setlimits set, rewrite a few comments and simplified some of
> the code.
>
> When I mentioned up-thread about the optional scan_setlimits table AM
> callback, I'd forgotten that you'd not have access to check that
> directly during planning. As you mention above, you've added
> RelOptInfo has_scan_setlimits so the planner knows if it can use TID
> Range scans or not. It would be nice to not have to add this flag, but
> that would require either:
>
> 1. Making scan_setlimits a non-optional callback function in table AM, or;
> 2. Allowing the planner to have access to the opened Relation.
>
> #2 is not for this patch, but there has been some talk about it. It
> was done for the executor last year in d73f4c74dd3.
>
> I wonder if Andres has any thoughts on #1?
>
> The other thing I was thinking about was if enable_tidscan should be
> in charge of TID Range scans too. I see you have it that way, but
> should we be adding enable_tidrangescan?  The docs claim that
> enable_tidscan: "Enables or disables the query planner's use of TID
> scan plan types.". Note: "types" is  plural.  Maybe we could call that
> fate and keep it the way the patch has it already.  Does anyone have
> another idea about that?
>
> I've attached a delta of the changes I made and also a complete v9 patch.
>
> --
>  David Rowley                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



Re: Tid scan improvements

From
Andres Freund
Date:
Hi,

On 2019-07-17 23:10:52 +1200, David Rowley wrote:
> When I mentioned up-thread about the optional scan_setlimits table AM
> callback, I'd forgotten that you'd not have access to check that
> directly during planning. As you mention above, you've added
> RelOptInfo has_scan_setlimits so the planner knows if it can use TID
> Range scans or not. It would be nice to not have to add this flag, but
> that would require either:

Is it really a problem to add that flag?  We've obviously so far not
cared about space in RelOptInfo; otherwise it'd have union members for
the per-RelOptInfo contents...


> 1. Making scan_setlimits a non-optional callback function in table AM, or;
> 2. Allowing the planner to have access to the opened Relation.

> #2 is not for this patch, but there has been some talk about it. It
> was done for the executor last year in d73f4c74dd3.
> 
> I wonder if Andres has any thoughts on #1?

I'm inclined to think that 1) isn't a good idea. I'd very much like to
avoid adding further dependencies on BlockNumber in non-optional APIs
(we ought to go the other way). Most of the current ones are at least
semi-reasonably implementable for most AMs (e.g. converting to PG
pagesize for relation_estimate_size isn't a problem), but it doesn't
seem to make sense to implement this for scan limits: Many AMs won't use
the BlockNumber/Offset split as heap does.

I think the AM part of this patch might be the wrong approach - it won't
do anything meaningful for an AM that doesn't directly map the ctid to a
specific location in a block (e.g. zedstore).  To me it seems the
callback ought to be to get a range of tids, and the tidrange scan
shouldn't do anything but determine the range of tids the AM should
return.
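
If the scan node's only job is to determine the TID range, it would fold the
quals down to a single [mintid, maxtid] pair before handing it to the AM.  A
self-contained sketch of that folding (types and names are illustrative, not
taken from the patch):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t    block;
    uint16_t    offset;
} TidExample;

typedef enum
{
    TID_LOWER_BOUND,            /* from a >= / > style qual */
    TID_UPPER_BOUND             /* from a <= / < style qual */
} TidBoundKindExample;

typedef struct
{
    TidBoundKindExample kind;
    TidExample  tid;            /* bound value, already made inclusive */
} TidBoundExample;

static int
tid_cmp(TidExample a, TidExample b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/*
 * Fold a list of inclusive bounds into one [*mintid, *maxtid] range: the
 * tightest lower bound wins, as does the tightest upper bound.  Returns
 * false if the resulting range is empty.
 */
static bool
compute_tid_range(const TidBoundExample *bounds, size_t nbounds,
                  TidExample *mintid, TidExample *maxtid)
{
    *mintid = (TidExample) {0, 0};
    *maxtid = (TidExample) {UINT32_MAX, UINT16_MAX};

    for (size_t i = 0; i < nbounds; i++)
    {
        if (bounds[i].kind == TID_LOWER_BOUND)
        {
            if (tid_cmp(bounds[i].tid, *mintid) > 0)
                *mintid = bounds[i].tid;
        }
        else if (tid_cmp(bounds[i].tid, *maxtid) < 0)
            *maxtid = bounds[i].tid;
    }
    return tid_cmp(*mintid, *maxtid) <= 0;
}
```

The AM then receives only the folded pair and is free to decide which pages
(if it even has pages) contain tuples in that range.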

- Andres



Re: Tid scan improvements

From
David Rowley
Date:
On Thu, 18 Jul 2019 at 14:30, Andres Freund <andres@anarazel.de> wrote:
> I think the AM part of this patch might be the wrong approach - it won't
> do anything meaningful for an AM that doesn't directly map the ctid to a
> specific location in a block (e.g. zedstore).  To me it seems the
> callback ought to be to get a range of tids, and the tidrange scan
> shouldn't do anything but determine the range of tids the AM should
> return.

Sounds like that's going to require adding some new fields to
HeapScanDescData, then some callback similar to heap_setscanlimits to
set those fields.

Then, we'd either need to:

1. Make the table_scan_getnextslot() implementations check the tuple
falls within the range, or
2. add another callback that pays attention to the set TID range.

The problem with #1 is that would add overhead to normal seqscans,
which seems like a bad idea.

Did you imagined two additional callbacks, 1 to set the TID range,
then one to scan it?  Duplicating the logic in heapgettup_pagemode()
and heapgettup() looks pretty horrible, but I guess we could add a
wrapper around it that loops until it gets the first tuple and bails
once it scans beyond the final tuple.

Is that what you had in mind?
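
A wrapper of that shape, boiled down to a self-contained sketch (the array of
TIDs stands in for the underlying forward heap scan; none of these names are
the real heapam functions):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t    block;
    uint16_t    offset;
} TidExample;

static int
tid_cmp(TidExample a, TidExample b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/*
 * Loop over an underlying forward scan (here just TIDs in ascending order)
 * until the first tuple in [mintid, maxtid], and bail out once the scan
 * goes beyond maxtid.  Returns the index of the next in-range tuple, or -1
 * when the scan is done.
 */
static int
getnext_in_range(const TidExample *tids, size_t ntids, size_t *pos,
                 TidExample mintid, TidExample maxtid)
{
    while (*pos < ntids)
    {
        size_t      i = (*pos)++;

        if (tid_cmp(tids[i], mintid) < 0)
            continue;           /* before the range: skip and keep going */
        if (tid_cmp(tids[i], maxtid) > 0)
            return -1;          /* past the range: no more matches */
        return (int) i;
    }
    return -1;
}
```

This is exactly the "read and throw away" cost mentioned earlier: tuples
before the range are fetched only to be discarded.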

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Tid scan improvements

From
Andres Freund
Date:
Hi,

On 2019-07-19 13:54:59 +1200, David Rowley wrote:
> On Thu, 18 Jul 2019 at 14:30, Andres Freund <andres@anarazel.de> wrote:
> > I think the AM part of this patch might be the wrong approach - it won't
> > do anything meaningful for an AM that doesn't directly map the ctid to a
> > specific location in a block (e.g. zedstore).  To me it seems the
> > callback ought to be to get a range of tids, and the tidrange scan
> > shouldn't do anything but determine the range of tids the AM should
> > return.
> 
> Sounds like that's going to require adding some new fields to
> HeapScanDescData, then some callback similar to heap_setscanlimits to
> set those fields.
> 
> Then, we'd either need to:
> 
> 1. Make the table_scan_getnextslot() implementations check the tuple
> falls within the range, or
> 2. add another callback that pays attention to the set TID range.

> The problem with #1 is that would add overhead to normal seqscans,
> which seems like a bad idea.
> 
> Did you imagined two additional callbacks, 1 to set the TID range,
> then one to scan it?  Duplicating the logic in heapgettup_pagemode()
> and heapgettup() looks pretty horrible, but I guess we could add a
> wrapper around it that loops until it gets the first tuple and bails
> once it scans beyond the final tuple.
> 
> Is that what you had in mind?

Yea, I was thinking of something like 2.  We already have a few extra
types of scan nodes (bitmap heap and sample scan), it'd not be bad to
add another. And as you say, they can just share most of the guts: For
heap I'd just implement pagemode, and perhaps split heapgettup_pagemode
into two parts (one to do the page processing, the other to determine
the relevant page).

You say that we'd need new fields in HeapScanDescData - not so sure
about that, it seems feasible to just provide the boundaries in the
call? But I think it'd also just be fine to have the additional fields.

Greetings,

Andres Freund



Re: Tid scan improvements

From
David Rowley
Date:
On Sat, 20 Jul 2019 at 06:21, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-07-19 13:54:59 +1200, David Rowley wrote:
> > Did you imagined two additional callbacks, 1 to set the TID range,
> > then one to scan it?  Duplicating the logic in heapgettup_pagemode()
> > and heapgettup() looks pretty horrible, but I guess we could add a
> > wrapper around it that loops until it gets the first tuple and bails
> > once it scans beyond the final tuple.
> >
> > Is that what you had in mind?
>
> Yea, I was thinking of something like 2.  We already have a few extra
> types of scan nodes (bitmap heap and sample scan), it'd not be bad to
> add another. And as you say, they can just share most of the guts: For
> heap I'd just implement pagemode, and perhaps split heapgettup_pagemode
> into two parts (one to do the page processing, the other to determine
> the relevant page).
>
> You say that we'd need new fields in HeapScanDescData - not so sure
> about that, it seems feasible to just provide the boundaries in the
> call? But I think it'd also just be fine to have the additional fields.

Thanks for explaining.

I've set the CF entry for the patch back to waiting on author.

I think if we get this part the way Andres would like it, then we're
pretty close.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Tid scan improvements

From
Edmund Horner
Date:
On Mon, 22 Jul 2019 at 19:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On Sat, 20 Jul 2019 at 06:21, Andres Freund <andres@anarazel.de> wrote:
> > Yea, I was thinking of something like 2.  We already have a few extra
> > types of scan nodes (bitmap heap and sample scan), it'd not be bad to
> > add another. And as you say, they can just share most of the guts: For
> > heap I'd just implement pagemode, and perhaps split heapgettup_pagemode
> > into two parts (one to do the page processing, the other to determine
> > the relevant page).
> >
> > You say that we'd need new fields in HeapScanDescData - not so sure
> > about that, it seems feasible to just provide the boundaries in the
> > call? But I think it'd also just be fine to have the additional fields.
>
> Thanks for explaining.
>
> I've set the CF entry for the patch back to waiting on author.
>
> I think if we get this part the way Andres would like it, then we're
> pretty close.

I've moved the code in question into heapam, with:

  * a new scan type SO_TYPE_TIDRANGE (renumbering some of the other
    enums).

  * a wrapper table_beginscan_tidrange(Relation rel, Snapshot snapshot);
    I'm not sure whether we need scankeys and the other flags?

  * a new optional callback scan_settidrange(TableScanDesc scan,
    ItemPointer startTid, ItemPointer endTid) with wrapper
table_scan_settidrange.
    I thought about combining it with table_beginscan_tidrange but we're not
    really doing that with any of the other beginscan methods.

  * another optional callback scan_getnexttidrangeslot.  The presence of
    these two callbacks indicates that TidRangeScan is supported for
this relation.

Let me know if you can think of better names.

However, the heap_getnexttidrangeslot function is just the same
iterative code calling heap_getnextslot and checking the tuples
against the tid range (two fields added to the ScanDesc).

I'll have to spend a bit of time looking at heapgettup_pagemode to
figure how to split it as Andres suggests.

Thanks,
Edmund



Re: Tid scan improvements

From
Edmund Horner
Date:
On Mon, 22 Jul 2019 at 19:44, Edmund Horner <ejrh00@gmail.com> wrote:
> On Mon, 22 Jul 2019 at 19:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
> > On Sat, 20 Jul 2019 at 06:21, Andres Freund <andres@anarazel.de> wrote:
> > > Yea, I was thinking of something like 2.  We already have a few extra
> > > types of scan nodes (bitmap heap and sample scan), it'd not be bad to
> > > add another. And as you say, they can just share most of the guts: For
> > > heap I'd just implement pagemode, and perhaps split heapgettup_pagemode
> > > into two parts (one to do the page processing, the other to determine
> > > the relevant page).
> > >
> > > You say that we'd need new fields in HeapScanDescData - not so sure
> > > about that, it seems feasible to just provide the boundaries in the
> > > call? But I think it'd also just be fine to have the additional fields.
> >
> > Thanks for explaining.
> >
> > I've set the CF entry for the patch back to waiting on author.
> >
> > I think if we get this part the way Andres would like it, then we're
> > pretty close.

> [...]

> I'll have to spend a bit of time looking at heapgettup_pagemode to
> figure how to split it as Andres suggests.

Hi everyone,

Sadly it is the end of the CF and I have not had much time to work on
this.  I'll probably get some time in the next month to look at the
heapam changes.

Should we move to CF?  We have been in the CF cycle for almost a year now...

Edmund



Re: Tid scan improvements

From
Thomas Munro
Date:
On Thu, Aug 1, 2019 at 5:34 PM Edmund Horner <ejrh00@gmail.com> wrote:
> Sadly it is the end of the CF and I have not had much time to work on
> this.  I'll probably get some time in the next month to look at the
> heapam changes.
>
> Should we move to CF?  We have been in the CF cycle for almost a year now...

Hi Edmund,

No worries, that's how it goes sometimes.  Please move it to the next
CF if you think you'll find some time before September.  Don't worry
if it might have to be moved again.  We want the feature, and good
things take time!

-- 
Thomas Munro
https://enterprisedb.com



Re: Tid scan improvements

From
Edmund Horner
Date:
On Thu, 1 Aug 2019 at 17:47, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Aug 1, 2019 at 5:34 PM Edmund Horner <ejrh00@gmail.com> wrote:
> > Should we move to CF?  We have been in the CF cycle for almost a year now...
>
> Hi Edmund,
>
> No worries, that's how it goes sometimes.  Please move it to the next
> CF if you think you'll find some time before September.  Don't worry
> if it might have to be moved again.  We want the feature, and good
> things take time!

Ok thanks.

I tried moving it to the new commitfest, but it seems I cannot from
its current state.

If it's ok, I'll just leave it to you in 7 hours' time!

Edmund



Re: Tid scan improvements

From
Thomas Munro
Date:
On Thu, Aug 1, 2019 at 5:51 PM Edmund Horner <ejrh00@gmail.com> wrote:
> On Thu, 1 Aug 2019 at 17:47, Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Thu, Aug 1, 2019 at 5:34 PM Edmund Horner <ejrh00@gmail.com> wrote:
> > > Should we move to CF?  We have been in the CF cycle for almost a year now...
> >
> > Hi Edmund,
> >
> > No worries, that's how it goes sometimes.  Please move it to the next
> > CF if you think you'll find some time before September.  Don't worry
> > if it might have to be moved again.  We want the feature, and good
> > things take time!
>
> Ok thanks.
>
> I tried moving it to the new commitfest, but it seems I cannot from
> its current state.

Done.  You have to move it to "Needs review" first, and then move to
next.  (Perhaps we should change that... I don't think that obstacle
achieves anything?)

-- 
Thomas Munro
https://enterprisedb.com



Re: Tid scan improvements

From
David Rowley
Date:
On Thu, 1 Aug 2019 at 17:57, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Aug 1, 2019 at 5:51 PM Edmund Horner <ejrh00@gmail.com> wrote:
> > I tried moving it to the new commitfest, but it seems I cannot from
> > its current state.
>
> Done.  You have to move it to "Needs review" first, and then move to
> next.  (Perhaps we should change that... I don't think that obstacle
> achieves anything?)

I think it's there as a measure to try to trim down the number of
patches that are constantly bounced to the next 'fest that are still
waiting on author.  In my experience, it's a little annoying since if
you don't set it to "needs review" first, then it means closing the
patch and having to create a new CF entry when you're ready, with all
the history of the previous one lost.

It seems reasonable to me to keep the patch in the queue if the author
is still actively working on the patch and it seems pretty unfair if a
last-minute review came in just before the end of the CF and that
means their patch must be "returned with feedback" instead of pushed
to the next 'fest.

Perhaps there are other measures we could take to reduce the number of
patches getting kicked out to the next CF all the time. Maybe some
icons that appear if it's been waiting on author for more than 2
months, or if it went through an entire CF as waiting on author.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Tid scan improvements

From
Alvaro Herrera
Date:
On 2019-Aug-01, Edmund Horner wrote:

> Hi everyone,
> 
> Sadly it is the end of the CF and I have not had much time to work on
> this.  I'll probably get some time in the next month to look at the
> heapam changes.
> 
> Should we move to CF?  We have been in the CF cycle for almost a year now...

Hello, do we have an update on this patch?  Last version that was posted
was v9 from David on July 17th; you said you had made some changes on
July 22nd but didn't attach any patch.  v9 doesn't apply anymore.

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Tid scan improvements

From
Edmund Horner
Date:
On Wed, 4 Sep 2019 at 10:34, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2019-Aug-01, Edmund Horner wrote:
>
> > Hi everyone,
> >
> > Sadly it is the end of the CF and I have not had much time to work on
> > this.  I'll probably get some time in the next month to look at the
> > heapam changes.
> >
> > Should we move to CF?  We have been in the CF cycle for almost a year now...
>
> Hello, do we have an update on this patch?  Last version that was posted
> was v9 from David on July 17th; you said you had made some changes on
> July 22nd but didn't attach any patch.  v9 doesn't apply anymore.

Hi pgsql-hackers,

I have had a lot of difficulty making the changes to heapam.c and I
think it's becoming obvious I won't get them done by myself.

The last *working* patch pushed the work into heapam.c, but it was
just a naive loop over the whole table.  I haven't figured out how to
rewrite heapgettup_pagemode in the way Andres suggests.

So, I think we need to either get some help from someone familiar with
heapam.c, or maybe shelve the patch.  It has been work in progress for
over a year now.

Edmund



Re: Tid scan improvements

From
Michael Paquier
Date:
On Thu, Sep 05, 2019 at 01:06:56PM +1200, Edmund Horner wrote:
> So, I think we need to either get some help from someone familiar with
> heapam.c, or maybe shelve the patch.  It has been work in progress for
> over a year now.

Okay, still nothing has happened after two months.  Once this is
solved a new patch submission could be done.  For now I have marked
the entry as returned with feedback.
--
Michael

Attachments

Re: Tid scan improvements

From
David Fetter
Date:
On Sun, Dec 01, 2019 at 11:34:16AM +0900, Michael Paquier wrote:
> On Thu, Sep 05, 2019 at 01:06:56PM +1200, Edmund Horner wrote:
> > So, I think we need to either get some help from someone familiar with
> > heapam.c, or maybe shelve the patch.  It has been work in progress for
> > over a year now.
> 
> Okay, still nothing has happened after two months.  Once this is
> solved a new patch submission could be done.  For now I have marked
> the entry as returned with feedback.  

I dusted off the last version of the patch without implementing the
suggestions in this thread between here and there.

I think this is a capability worth having, as I was surprised when it
turned out that it didn't exist when I was looking to make an
improvement to pg_dump. My idea, which I'll get back to when this is
in, was to use special knowledge of heap AM tables to make it possible
to parallelize dumps of large tables by working separately on each
underlying file, of which there could be quite a few for a large one.

Will try to understand the suggestions upthread better and implement
same.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachments

Re: Tid scan improvements

From
Edmund Horner
Date:
On Fri, 1 Jan 2021 at 14:30, David Fetter <david@fetter.org> wrote:
> On Sun, Dec 01, 2019 at 11:34:16AM +0900, Michael Paquier wrote:
> > Okay, still nothing has happened after two months.  Once this is
> > solved a new patch submission could be done.  For now I have marked
> > the entry as returned with feedback.
>
> I dusted off the last version of the patch without implementing the
> suggestions in this thread between here and there.
>
> I think this is a capability worth having, as I was surprised when it
> turned out that it didn't exist when I was looking to make an
> improvement to pg_dump. My idea, which I'll get back to when this is
> in, was to use special knowledge of heap AM tables to make it possible
> to parallelize dumps of large tables by working separately on each
> underlying file, of which there could be quite a few for a large one.
>
> Will try to understand the suggestions upthread better and implement
> same.

Hi David,

Thanks for updating the patch.  I'd be very happy if this got picked up again, and I'd certainly follow along and do some review.

+                               splan->tidrangequals =
+                                       fix_scan_list(root, splan->tidrangequals,
+                                                                 rtoffset, 1); /* v9_tid XXX Not sure this is right */

I'm pretty sure the parameter num_exec = 1 is fine; it seems to only affect correlated subselects, which shouldn't really make their way into the tidrangequals expressions.  It's more or less the same situation as tidquals for TidPath, anyway.  We could put a little comment:  /* correlated subselects shouldn't get into tidquals/tidrangequals */ or something to that effect.

Edmund

Re: Tid scan improvements

From
David Rowley
Date:
On Wed, 13 Jan 2021 at 15:38, Edmund Horner <ejrh00@gmail.com> wrote:
> Thanks for updating the patch.  I'd be very happy if this got picked up again, and I'd certainly follow along and do
> some review.


Likewise here. I think this patch was pretty close, so it seems a shame to
let it slip through the cracks.

I spoke to Andres off-list about this patch. He mentioned that he
wasn't too keen on seeing the setscanlimits being baked into the table
AM API.  He mentioned that he'd rather not assume too much about table
AMs having all of their tids in a given range consecutively on a set
of pages.  That seems reasonable to me.  He suggested that we add a
new callback that just specifies a range of ItemPointers to scan and
leaves it up to the implementation to decide which pages should be
scanned to fetch the tuples within the given range.  I didn't argue. I
just went and coded it all, hopefully to Andres' description. The new
table AM callback is optional.

I've implemented this in the attached.

I also took the time to support backwards TID Range scans and added a
few more tests to test rescans.  I just found it annoying that TID
Scans supported backwards scans and TID Range scans did not.

The 0002 patch is the guts of it. The 0001 patch is an existing bug
that needs to be fixed before 0002 could go in (backwards TID Range
Scans are broken without this). I've posted separately about this bug
in [1].

I also didn't really like the idea of adding possibly lots of bool
fields to RelOptInfo to describe what the planner can do in regards to
what the given table AM supports.   I know that IndexOptInfo has such
a set of bool fields.  I'd rather not repeat that, so I just went with
a single int field named "amflags" and just added a single constant to
define a flag that specifies if the rel supports scanning ranges of
TIDs.
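
That is, a single bit-flag field rather than one boolean per capability;
roughly like this (the flag name follows the patch, while the struct here is
a cut-down illustration of RelOptInfo):

```c
#include <stdbool.h>

/* One int of bit flags instead of a bool field per table AM capability. */
#define AMFLAG_HAS_TID_RANGE (1 << 0)

/* Cut-down stand-in for the planner's RelOptInfo. */
typedef struct RelOptInfoExample
{
    int         amflags;        /* capabilities of the rel's table AM */
} RelOptInfoExample;

/* The planner would only build TID Range paths when this returns true. */
static bool
rel_supports_tid_range_scan(const RelOptInfoExample *rel)
{
    return (rel->amflags & AMFLAG_HAS_TID_RANGE) != 0;
}
```

Further capabilities can later claim the remaining bits without growing the
struct.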

Edmund, will you get time to a look at this?

David

[1] https://www.postgresql.org/message-id/CAApHDvpGc9h0_oVD2CtgBcxCS1N-qDYZSeBRnUh+0CWJA9cMaA@mail.gmail.com

Attachments

Re: Tid scan improvements

From
David Rowley
Date:
On Thu, 21 Jan 2021 at 18:16, David Rowley <dgrowleyml@gmail.com> wrote:
> I've implemented this in the attached.

The bug fix in 0001 is now committed, so I'm just attaching the 0002
patch again after having rebased... This is mostly just to keep the
CFbot happy.

David

Attachments

Re: Tid scan improvements

From
Zhihong Yu
Date:
Hi,

bq. within this range.  Table AMs where scanning ranges of TIDs does not make
sense or is difficult to implement efficiently may choose to not implement

Is there a criterion for judging efficiency?

+       if (tidopexpr->exprtype == TIDEXPR_LOWER_BOUND)
...
+       if (tidopexpr->exprtype == TIDEXPR_UPPER_BOUND)

The if statement for the upper bound should be prefixed with 'else', right?

+ * TidRecheck -- access method routine to recheck a tuple in EvalPlanQual
...
+TidRangeRecheck(TidRangeScanState *node, TupleTableSlot *slot)

The method name in the comment doesn't match the real method name.

+ *     ExecTidRangeScan(node)
...
+ExecTidRangeScan(PlanState *pstate)

Parameter names don't match.

Cheers

On Mon, Jan 25, 2021 at 5:23 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Thu, 21 Jan 2021 at 18:16, David Rowley <dgrowleyml@gmail.com> wrote:
> > I've implemented this in the attached.
>
> The bug fix in 0001 is now committed, so I'm just attaching the 0002
> patch again after having rebased... This is mostly just to keep the
> CFbot happy.
>
> David

Re: Tid scan improvements

From
David Rowley
Date:
Thanks for having a look at this.

On Tue, 26 Jan 2021 at 15:48, Zhihong Yu <zyu@yugabyte.com> wrote:
> bq. within this range.  Table AMs where scanning ranges of TIDs does not make
> sense or is difficult to implement efficiently may choose to not implement
>
> Is there criterion on how to judge efficiency ?

For example, if the AM had no way to index such a column and the
method needed to scan the entire table to find TIDs in that range. The
planner may as well just pick a SeqScan. If that's the case, then the
table AM may as well not bother implementing that function.

> +       if (tidopexpr->exprtype == TIDEXPR_LOWER_BOUND)
> ...
> +       if (tidopexpr->exprtype == TIDEXPR_UPPER_BOUND)
>
> The if statement for upper bound should be prefixed with 'else', right ?

Yeah, thanks.

> + * TidRecheck -- access method routine to recheck a tuple in EvalPlanQual
> ...
> +TidRangeRecheck(TidRangeScanState *node, TupleTableSlot *slot)
>
> The method name in the comment doesn't match the real method name.

Well noticed.

> + *     ExecTidRangeScan(node)
> ...
> +ExecTidRangeScan(PlanState *pstate)
>
> Parameter names don't match.

hmm. Looking around, it seems there are lots of other places that do
this.  I think the comment is really just indicating that the
function is taking an executor node state as a parameter.

Have a look at: git grep -E "\s\*.*\(node\)$"
That shows the other places. Some of these have the parameter named
"node", and many others have some other name.

I've made the two changes locally. Since the two issues were mostly
cosmetic, I'll hold off posting an updated patch until some bigger
changes are required.

Thanks again for looking at this.

David



Re: Tid scan improvements

From:
David Rowley
Date:
Hi,

On 2021-01-26 14:22:42 +1300, David Rowley wrote:
> diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
> index 33bffb6815..d1c608b176 100644
> --- a/src/include/access/tableam.h
> +++ b/src/include/access/tableam.h
> @@ -325,6 +325,26 @@ typedef struct TableAmRoutine
>                                       ScanDirection direction,
>                                       TupleTableSlot *slot);
>  
> +    /*
> +     * Return next tuple from `scan` where TID is within the defined range.
> +     * This behaves like scan_getnextslot but only returns tuples from the
> +     * given range of TIDs.  Ranges are inclusive.  This function is optional
> +     * and may be set to NULL if TID range scans are not supported by the AM.
> +     *
> +     * Implementations of this function must themselves handle ItemPointers
> +     * of any value, i.e., they must handle each of the following:
> +     *
> +     * 1) mintid or maxtid is beyond the end of the table; and
> +     * 2) mintid is above maxtid; and
> +     * 3) item offset for mintid or maxtid is beyond the maximum offset
> +     * allowed by the AM.
> +     */
> +    bool        (*scan_getnextslot_inrange) (TableScanDesc scan,
> +                                             ScanDirection direction,
> +                                             TupleTableSlot *slot,
> +                                             ItemPointer mintid,
> +                                             ItemPointer maxtid);
> +
>  
>      /* ------------------------------------------------------------------------
>       * Parallel table scan related functions.
> @@ -1015,6 +1035,36 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
>      return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
>  }
>  
> +/*
> + * Return next tuple from defined TID range from `scan` and store in slot.
> + */
> +static inline bool
> +table_scan_getnextslot_inrange(TableScanDesc sscan, ScanDirection direction,
> +                               TupleTableSlot *slot, ItemPointer mintid,
> +                               ItemPointer maxtid)
> +{
> +    /*
> +     * The planner should never make a plan which uses this function when the
> +     * table AM has not defined any function for this callback.
> +     */
> +    Assert(sscan->rs_rd->rd_tableam->scan_getnextslot_inrange != NULL);
> +
> +    slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
> +
> +    /*
> +     * We don't expect direct calls to table_scan_getnextslot_inrange with
> +     * valid CheckXidAlive for catalog or regular tables.  See detailed
> +     * comments in xact.c where these variables are declared.
> +     */
> +    if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
> +        elog(ERROR, "unexpected table_scan_getnextslot_inrange call during logical decoding");
> +
> +    return sscan->rs_rd->rd_tableam->scan_getnextslot_inrange(sscan,
> +                                                              direction,
> +                                                              slot,
> +                                                              mintid,
> +                                                              maxtid);
> +}


I don't really like that this API relies on mintid/maxtid to stay the
same across multiple scan_getnextslot_inrange() calls. I think we'd at
least need to document that that's required and what needs to be done to
scan a different set of min/max tid (or re-scan the same min/max from
scratch).

Perhaps something like

typedef struct TableScanTidRange TableScanTidRange;

TableScanTidRange* table_scan_tid_range_start(TableScanDesc sscan, ItemPointer mintid, ItemPointer maxtid);
bool table_scan_tid_range_nextslot(TableScanDesc sscan, TableScanTidRange *range, TupleTableSlot *slot);
void table_scan_tid_range_end(TableScanDesc sscan, TableScanTidRange* range);

would work better? That'd allow an AM to have arbitrarily large state
for a tid range scan, would make it clear what the lifetime of the
ItemPointer mintid, ItemPointer maxtid are etc.  Wouldn't, on the API
level, prevent multiple tid range scans from being in progress at the
same times though :(. Perhaps we could add a TableScanTidRange* pointer
to TableScanDesc which'd be checked/set by tableam.h which'd prevent that?

Greetings,

Andres Freund



Re: Tid scan improvements

From:
Andres Freund
Date:
Thanks for looking at this.

On Thu, 4 Feb 2021 at 10:19, Andres Freund <andres@anarazel.de> wrote:
> Perhaps something like
>
> typedef struct TableScanTidRange TableScanTidRange;
>
> TableScanTidRange* table_scan_tid_range_start(TableScanDesc sscan, ItemPointer mintid, ItemPointer maxtid);
> bool table_scan_tid_range_nextslot(TableScanDesc sscan, TableScanTidRange *range, TupleTableSlot *slot);
> void table_scan_tid_range_end(TableScanDesc sscan, TableScanTidRange* range);
>
> would work better? That'd allow an AM to have arbitrarily large state
> for a tid range scan, would make it clear what the lifetime of the
> ItemPointer mintid, ItemPointer maxtid are etc.  Wouldn't, on the API
> level, prevent multiple tid range scans from being in progress at the
> same times though :(. Perhaps we could add a TableScanTidRange* pointer
> to TableScanDesc which'd be checked/set by tableam.h which'd prevent that?

Maybe the TableScanTidRange can just have a field to store the
TableScanDesc. That way table_scan_tid_range_nextslot and
table_scan_tid_range_end can just pass the TableScanTidRange pointer.

That way it seems like it would be ok for multiple scans to be going
on concurrently as nobody should be reusing the TableScanDesc.

Does that seem ok?

David



Re: Tid scan improvements

From:
David Rowley
Date:
On Thu, 4 Feb 2021 at 10:31, David Rowley <dgrowleyml@gmail.com> wrote:
>
> Thanks for looking at this.
>
> On Thu, 4 Feb 2021 at 10:19, Andres Freund <andres@anarazel.de> wrote:
> > Perhaps something like
> >
> > typedef struct TableScanTidRange TableScanTidRange;
> >
> > TableScanTidRange* table_scan_tid_range_start(TableScanDesc sscan, ItemPointer mintid, ItemPointer maxtid);
> > bool table_scan_tid_range_nextslot(TableScanDesc sscan, TableScanTidRange *range, TupleTableSlot *slot);
> > void table_scan_tid_range_end(TableScanDesc sscan, TableScanTidRange* range);
> >
> > would work better? That'd allow an AM to have arbitrarily large state
> > for a tid range scan, would make it clear what the lifetime of the
> > ItemPointer mintid, ItemPointer maxtid are etc.  Wouldn't, on the API
> > level, prevent multiple tid range scans from being in progress at the
> > same times though :(. Perhaps we could add a TableScanTidRange* pointer
> > to TableScanDesc which'd be checked/set by tableam.h which'd prevent that?
>
> Maybe the TableScanTidRange can just have a field to store the
> TableScanDesc. That way table_scan_tid_range_nextslot and
> table_scan_tid_range_end can just pass the TableScanTidRange pointer.
>
> That way it seems like it would be ok for multiple scans to be going
> on concurrently as nobody should be reusing the TableScanDesc.

I ended up adding just two new API functions to table AM.

void (*scan_set_tid_range) (TableScanDesc sscan,
                            ItemPointer mintid,
                            ItemPointer maxtid);

and
bool (*scan_tid_range_nextslot) (TableScanDesc sscan,
                                 ScanDirection direction,
                                 TupleTableSlot *slot);

I added an additional function in tableam.h that does not have a
corresponding API function:

static inline TableScanDesc
table_tid_range_start(Relation rel, Snapshot snapshot,
  ItemPointer mintid,
  ItemPointer maxtid)

This just calls the standard scan_begin then calls scan_set_tid_range
setting the specified mintid and maxtid.

I also added 2 new fields to TableScanDesc:

ItemPointerData rs_mintid;
ItemPointerData rs_maxtid;

I didn't quite see a need to have a new start and end scan API function.

Updated patch attached.

David

Attachments

Re: Tid scan improvements

From:
David Rowley
Date:
On Thu, 4 Feb 2021 at 23:51, David Rowley <dgrowleyml@gmail.com> wrote:
> Updated patch attached.

I made another pass over this patch and did a bit of renaming work
around the heap_* functions and the tableam functions. I think the new
names are a bit more aligned to the existing names.

I don't really see anything else that I'm unhappy with about this
patch, so pending any objections or last-minute reviews, I plan to
push it later this week.

David

Attachments

Re: Tid scan improvements

From:
David Fetter
Date:
On Tue, Feb 16, 2021 at 10:22:41PM +1300, David Rowley wrote:
> On Thu, 4 Feb 2021 at 23:51, David Rowley <dgrowleyml@gmail.com> wrote:
> > Updated patch attached.
> 
> I made another pass over this patch and did a bit of renaming work
> around the heap_* functions and the tableam functions. I think the new
> names are a bit more aligned to the existing names.

Thanks! I'm looking forward to making use of this :)

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Tid scan improvements

From:
Andres Freund
Date:
Hi,

On 2021-02-04 23:51:39 +1300, David Rowley wrote:

> I ended up adding just two new API functions to table AM.
> 
> void (*scan_set_tid_range) (TableScanDesc sscan,
>    ItemPointer mintid,
>    ItemPointer maxtid);
> 
> and
> bool (*scan_tid_range_nextslot) (TableScanDesc sscan,
> ScanDirection direction,
> TupleTableSlot *slot);
> 
> I added an additional function in tableam.h that does not have a
> corresponding API function:
> 
> static inline TableScanDesc
> table_tid_range_start(Relation rel, Snapshot snapshot,
>   ItemPointer mintid,
>   ItemPointer maxtid)
> 
> This just calls the standard scan_begin then calls scan_set_tid_range
> setting the specified mintid and maxtid.

Hm. But that means we can't rescan?


> I also added 2 new fields to TableScanDesc:
> 
> ItemPointerData rs_mintid;
> ItemPointerData rs_maxtid;
> 
> I didn't quite see a need to have a new start and end scan API function.

Yea. I guess they're not that large. Avoiding that was one of the two
reasons to have a separate scan state somewhere. The other was that it
seemed like it'd possibly be a bit cleaner API-wise to deal with rescan.


> +bool
> +heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
> +                          TupleTableSlot *slot)
> +{
> +    HeapScanDesc scan = (HeapScanDesc) sscan;
> +    ItemPointer mintid = &sscan->rs_mintid;
> +    ItemPointer maxtid = &sscan->rs_maxtid;
> +
> +    /* Note: no locking manipulations needed */
> +    for (;;)
> +    {
> +        if (sscan->rs_flags & SO_ALLOW_PAGEMODE)
> +            heapgettup_pagemode(scan, direction, sscan->rs_nkeys, sscan->rs_key);
> +        else
> +            heapgettup(scan, direction, sscan->rs_nkeys, sscan->rs_key);
> +
> +        if (scan->rs_ctup.t_data == NULL)
> +        {
> +            ExecClearTuple(slot);
> +            return false;
> +        }
> +
> +        /*
> +         * heap_set_tidrange will have used heap_setscanlimits to limit the
> +         * range of pages we scan to only ones that can contain the TID range
> +         * we're scanning for.  Here we must filter out any tuples from these
> +         * pages that are outwith that range.
> +         */
> +        if (ItemPointerCompare(&scan->rs_ctup.t_self, mintid) < 0)
> +        {
> +            ExecClearTuple(slot);
> +
> +            /*
> +             * When scanning backwards, the TIDs will be in descending order.
> +             * Future tuples in this direction will be lower still, so we can
> +             * just return false to indicate there will be no more tuples.
> +             */
> +            if (ScanDirectionIsBackward(direction))
> +                return false;
> +
> +            continue;
> +        }
> +
> +        /*
> +         * Likewise for the final page, we must filter out TIDs greater than
> +         * maxtid.
> +         */
> +        if (ItemPointerCompare(&scan->rs_ctup.t_self, maxtid) > 0)
> +        {
> +            ExecClearTuple(slot);
> +
> +            /*
> +             * When scanning forward, the TIDs will be in ascending order.
> +             * Future tuples in this direction will be higher still, so we can
> +             * just return false to indicate there will be no more tuples.
> +             */
> +            if (ScanDirectionIsForward(direction))
> +                return false;
> +            continue;
> +        }
> +
> +        break;
> +    }
> +
> +    /*
> +     * if we get here it means we have a new current scan tuple, so point to
> +     * the proper return buffer and return the tuple.
> +     */
> +    pgstat_count_heap_getnext(scan->rs_base.rs_rd);

I wonder if there's an argument for counting the misses above via
pgstat_count_heap_fetch()? Probably not, right?


> +#define IsCTIDVar(node)  \
> +    ((node) != NULL && \
> +     IsA((node), Var) && \
> +     ((Var *) (node))->varattno == SelfItemPointerAttributeNumber && \
> +     ((Var *) (node))->varlevelsup == 0)
> +
> +typedef enum
> +{
> +    TIDEXPR_UPPER_BOUND,
> +    TIDEXPR_LOWER_BOUND
> +} TidExprType;
> +
> +/* Upper or lower range bound for scan */
> +typedef struct TidOpExpr
> +{
> +    TidExprType exprtype;        /* type of op; lower or upper */
> +    ExprState  *exprstate;        /* ExprState for a TID-yielding subexpr */
> +    bool        inclusive;        /* whether op is inclusive */
> +} TidOpExpr;
> +
> +/*
> + * For the given 'expr', build and return an appropriate TidOpExpr taking into
> + * account the expr's operator and operand order.
> + */
> +static TidOpExpr *
> +MakeTidOpExpr(OpExpr *expr, TidRangeScanState *tidstate)
> +{
> +    Node       *arg1 = get_leftop((Expr *) expr);
> +    Node       *arg2 = get_rightop((Expr *) expr);
> +    ExprState  *exprstate = NULL;
> +    bool        invert = false;
> +    TidOpExpr  *tidopexpr;
> +
> +    if (IsCTIDVar(arg1))
> +        exprstate = ExecInitExpr((Expr *) arg2, &tidstate->ss.ps);
> +    else if (IsCTIDVar(arg2))
> +    {
> +        exprstate = ExecInitExpr((Expr *) arg1, &tidstate->ss.ps);
> +        invert = true;
> +    }
> +    else
> +        elog(ERROR, "could not identify CTID variable");
> +
> +    tidopexpr = (TidOpExpr *) palloc(sizeof(TidOpExpr));
> +    tidopexpr->inclusive = false;    /* for now */
> +
> +    switch (expr->opno)
> +    {
> +        case TIDLessEqOperator:
> +            tidopexpr->inclusive = true;
> +            /* fall through */
> +        case TIDLessOperator:
> +            tidopexpr->exprtype = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
> +            break;
> +        case TIDGreaterEqOperator:
> +            tidopexpr->inclusive = true;
> +            /* fall through */
> +        case TIDGreaterOperator:
> +            tidopexpr->exprtype = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
> +            break;
> +        default:
> +            elog(ERROR, "could not identify CTID operator");
> +    }
> +
> +    tidopexpr->exprstate = exprstate;
> +
> +    return tidopexpr;
> +}
> +
> +/*
> + * Extract the qual subexpressions that yield TIDs to search for,
> + * and compile them into ExprStates if they're ordinary expressions.
> + */
> +static void
> +TidExprListCreate(TidRangeScanState *tidrangestate)
> +{
> +    TidRangeScan *node = (TidRangeScan *) tidrangestate->ss.ps.plan;
> +    List       *tidexprs = NIL;
> +    ListCell   *l;
> +
> +    foreach(l, node->tidrangequals)
> +    {
> +        OpExpr       *opexpr = lfirst(l);
> +        TidOpExpr  *tidopexpr;
> +
> +        if (!IsA(opexpr, OpExpr))
> +            elog(ERROR, "could not identify CTID expression");
> +
> +        tidopexpr = MakeTidOpExpr(opexpr, tidrangestate);
> +        tidexprs = lappend(tidexprs, tidopexpr);
> +    }
> +
> +    tidrangestate->trss_tidexprs = tidexprs;
> +}

Architecturally it feels like this is something that really belongs more
into plan time?


> +/*
> + * table_set_tidrange resets the minimum and maximum TID range to scan for a
> + * TableScanDesc created by table_beginscan_tidrange.
> + */
> +static inline void
> +table_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
> +                   ItemPointer maxtid)
> +{
> +    /* Ensure table_beginscan_tidrange() was used. */
> +    Assert((sscan->rs_flags & SO_TYPE_TIDRANGESCAN) != 0);
> +
> +    sscan->rs_rd->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
> +}

How does this interact with rescans?

Greetings,

Andres Freund



Re: Tid scan improvements

From:
David Rowley
Date:
Thanks for having a look at this.

On Wed, 17 Feb 2021 at 11:05, Andres Freund <andres@anarazel.de> wrote:
>
> On 2021-02-04 23:51:39 +1300, David Rowley wrote:
> > and
> > bool (*scan_tid_range_nextslot) (TableScanDesc sscan,
> > ScanDirection direction,
> > TupleTableSlot *slot);
> >
> > I added an additional function in tableam.h that does not have a
> > corresponding API function:
> >
> > static inline TableScanDesc
> > table_tid_range_start(Relation rel, Snapshot snapshot,
> >   ItemPointer mintid,
> >   ItemPointer maxtid)
> >
> > This just calls the standard scan_begin then calls scan_set_tid_range
> > setting the specified mintid and maxtid.
>
> Hm. But that means we can't rescan?

It might not be perfect, but to rescan, we must call table_rescan()
then table_set_tidrange() before calling the
table_scan_getnextslot_tidrange() function.

> > +bool
> > +heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
> > +                                               TupleTableSlot *slot)
> > +{
> > +     HeapScanDesc scan = (HeapScanDesc) sscan;
> > +     ItemPointer mintid = &sscan->rs_mintid;
> > +     ItemPointer maxtid = &sscan->rs_maxtid;
> > +
> > +     /* Note: no locking manipulations needed */
> > +     for (;;)
> > +     {
> > +             if (sscan->rs_flags & SO_ALLOW_PAGEMODE)
> > +                     heapgettup_pagemode(scan, direction, sscan->rs_nkeys, sscan->rs_key);
> > +             else
> > +                     heapgettup(scan, direction, sscan->rs_nkeys, sscan->rs_key);
> > +
> > +             if (scan->rs_ctup.t_data == NULL)
> > +             {
> > +                     ExecClearTuple(slot);
> > +                     return false;
> > +             }
> > +
> > +             /*
> > +              * heap_set_tidrange will have used heap_setscanlimits to limit the
> > +              * range of pages we scan to only ones that can contain the TID range
> > +              * we're scanning for.  Here we must filter out any tuples from these
> > +              * pages that are outwith that range.
> > +              */
> > +             if (ItemPointerCompare(&scan->rs_ctup.t_self, mintid) < 0)
> > +             {
> > +                     ExecClearTuple(slot);
> > +
> > +                     /*
> > +                      * When scanning backwards, the TIDs will be in descending order.
> > +                      * Future tuples in this direction will be lower still, so we can
> > +                      * just return false to indicate there will be no more tuples.
> > +                      */
> > +                     if (ScanDirectionIsBackward(direction))
> > +                             return false;
> > +
> > +                     continue;
> > +             }
> > +
> > +             /*
> > +              * Likewise for the final page, we must filter out TIDs greater than
> > +              * maxtid.
> > +              */
> > +             if (ItemPointerCompare(&scan->rs_ctup.t_self, maxtid) > 0)
> > +             {
> > +                     ExecClearTuple(slot);
> > +
> > +                     /*
> > +                      * When scanning forward, the TIDs will be in ascending order.
> > +                      * Future tuples in this direction will be higher still, so we can
> > +                      * just return false to indicate there will be no more tuples.
> > +                      */
> > +                     if (ScanDirectionIsForward(direction))
> > +                             return false;
> > +                     continue;
> > +             }
> > +
> > +             break;
> > +     }
> > +
> > +     /*
> > +      * if we get here it means we have a new current scan tuple, so point to
> > +      * the proper return buffer and return the tuple.
> > +      */
> > +     pgstat_count_heap_getnext(scan->rs_base.rs_rd);
>
> I wonder if there's an argument for counting the misses above via
> pgstat_count_heap_fetch()? Probably not, right?

I'm a bit undecided about that. In theory, we're doing the heap
fetches of tuples on the target page which are outside of the range so
we should maybe count them. On the other hand, it might be a little
confusing for very observant users if they see the fetches going up
for the tuples we skip over in TID Range scans.

> > +#define IsCTIDVar(node)  \
> > +     ((node) != NULL && \
> > +      IsA((node), Var) && \
> > +      ((Var *) (node))->varattno == SelfItemPointerAttributeNumber && \
> > +      ((Var *) (node))->varlevelsup == 0)
> > +
> > +typedef enum
> > +{
> > +     TIDEXPR_UPPER_BOUND,
> > +     TIDEXPR_LOWER_BOUND
> > +} TidExprType;
> > +
> > +/* Upper or lower range bound for scan */
> > +typedef struct TidOpExpr
> > +{
> > +     TidExprType exprtype;           /* type of op; lower or upper */
> > +     ExprState  *exprstate;          /* ExprState for a TID-yielding subexpr */
> > +     bool            inclusive;              /* whether op is inclusive */
> > +} TidOpExpr;
> > +
> > +/*
> > + * For the given 'expr', build and return an appropriate TidOpExpr taking into
> > + * account the expr's operator and operand order.
> > + */
> > +static TidOpExpr *
> > +MakeTidOpExpr(OpExpr *expr, TidRangeScanState *tidstate)
> > +{
> > +     Node       *arg1 = get_leftop((Expr *) expr);
> > +     Node       *arg2 = get_rightop((Expr *) expr);
> > +     ExprState  *exprstate = NULL;
> > +     bool            invert = false;
> > +     TidOpExpr  *tidopexpr;
> > +
> > +     if (IsCTIDVar(arg1))
> > +             exprstate = ExecInitExpr((Expr *) arg2, &tidstate->ss.ps);
> > +     else if (IsCTIDVar(arg2))
> > +     {
> > +             exprstate = ExecInitExpr((Expr *) arg1, &tidstate->ss.ps);
> > +             invert = true;
> > +     }
> > +     else
> > +             elog(ERROR, "could not identify CTID variable");
> > +
> > +     tidopexpr = (TidOpExpr *) palloc(sizeof(TidOpExpr));
> > +     tidopexpr->inclusive = false;   /* for now */
> > +
> > +     switch (expr->opno)
> > +     {
> > +             case TIDLessEqOperator:
> > +                     tidopexpr->inclusive = true;
> > +                     /* fall through */
> > +             case TIDLessOperator:
> > +                     tidopexpr->exprtype = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
> > +                     break;
> > +             case TIDGreaterEqOperator:
> > +                     tidopexpr->inclusive = true;
> > +                     /* fall through */
> > +             case TIDGreaterOperator:
> > +                     tidopexpr->exprtype = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
> > +                     break;
> > +             default:
> > +                     elog(ERROR, "could not identify CTID operator");
> > +     }
> > +
> > +     tidopexpr->exprstate = exprstate;
> > +
> > +     return tidopexpr;
> > +}
> > +
> > +/*
> > + * Extract the qual subexpressions that yield TIDs to search for,
> > + * and compile them into ExprStates if they're ordinary expressions.
> > + */
> > +static void
> > +TidExprListCreate(TidRangeScanState *tidrangestate)
> > +{
> > +     TidRangeScan *node = (TidRangeScan *) tidrangestate->ss.ps.plan;
> > +     List       *tidexprs = NIL;
> > +     ListCell   *l;
> > +
> > +     foreach(l, node->tidrangequals)
> > +     {
> > +             OpExpr     *opexpr = lfirst(l);
> > +             TidOpExpr  *tidopexpr;
> > +
> > +             if (!IsA(opexpr, OpExpr))
> > +                     elog(ERROR, "could not identify CTID expression");
> > +
> > +             tidopexpr = MakeTidOpExpr(opexpr, tidrangestate);
> > +             tidexprs = lappend(tidexprs, tidopexpr);
> > +     }
> > +
> > +     tidrangestate->trss_tidexprs = tidexprs;
> > +}
>
> Architecturally it feels like this is something that really belongs more
> into plan time?

Possibly. It would mean TidOpExpr would have to become a Node type.
TID Range scan is really just following the lead set by TID Scan here.

I'm not sure if it's worth the trouble making these Node types for the
small gains there'd be in the performance of having the planner make
them once for prepared queries rather than them having to be built
during each execution.

Do you think it is?

> > +/*
> > + * table_set_tidrange resets the minimum and maximum TID range to scan for a
> > + * TableScanDesc created by table_beginscan_tidrange.
> > + */
> > +static inline void
> > +table_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
> > +                                ItemPointer maxtid)
> > +{
> > +     /* Ensure table_beginscan_tidrange() was used. */
> > +     Assert((sscan->rs_flags & SO_TYPE_TIDRANGESCAN) != 0);
> > +
> > +     sscan->rs_rd->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
> > +}
>
> How does this interact with rescans?

We must call table_rescan() before calling table_set_tidrange() again.
That perhaps could be documented better. I'm just unsure if that
should be documented in tableam.h or if it's a restriction that only
needs to exist in heapam.c

David



Re: Tid scan improvements

From:
David Rowley
Date:
On Thu, 18 Feb 2021 at 09:45, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 17 Feb 2021 at 11:05, Andres Freund <andres@anarazel.de> wrote:
> > Architecturally it feels like this is something that really belongs more
> > into plan time?
>
> Possibly. It would mean TidOpExpr would have to become a Node type.
> TID Range scan is really just following the lead set by TID Scan here.
>
> I'm not sure if it's worth the trouble making these Node types for the
> small gains there'd be in the performance of having the planner make
> them once for prepared queries rather than them having to be built
> during each execution.

I changed the code around and added a new Node type to the planner and
made it create the TidRangeExpr during planning.

However, I'm pretty much set on this being pretty horrible and I ended
up ripping it back out again.  The reason is that there's quite a bit
of extra boilerplate code that goes with the new node type. e.g it
must be handled in setrefs.c.  EXPLAIN also needs to know about the
new Node. That either means teaching the deparse code about
TidRangeExprs or having the Plan node carry along the OpExprs just so
we can make EXPLAIN work.  Translating between the two might be
possible, but it just seemed like too much code, and I started feeling
pretty bad about the whole idea.

> > > +/*
> > > + * table_set_tidrange resets the minimum and maximum TID range to scan for a
> > > + * TableScanDesc created by table_beginscan_tidrange.
> > > + */
> > > +static inline void
> > > +table_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
> > > +                                ItemPointer maxtid)
> > > +{
> > > +     /* Ensure table_beginscan_tidrange() was used. */
> > > +     Assert((sscan->rs_flags & SO_TYPE_TIDRANGESCAN) != 0);
> > > +
> > > +     sscan->rs_rd->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
> > > +}
> >
> > How does this interact with rescans?
>
> We must call table_rescan() before calling table_set_tidrange() again.
> That perhaps could be documented better. I'm just unsure if that
> should be documented in tableam.h or if it's a restriction that only
> needs to exist in heapam.c

I've changed things around so that we no longer explicitly call
table_rescan() in nodeTidrangescan.c. Instead table_set_tidrange()
does a rescan call. I also adjusted the documentation to mention that
changing the tid range starts the scan again.  This does mean we'll do
a ->scan_rescan() the first time we do table_set_tidrange(). I'm not
all that sure that matters.

v15 attached.

David

Attachments

Re: Tid scan improvements

From:
Peter Eisentraut
Date:
This patch didn't add _outTidRangePath() even though we have outNode() 
coverage for most/all path nodes.  Was this just forgotten?  See 
attached patch.

Attachments

Re: Tid scan improvements

From:
Edmund Horner
Date:
On Mon, 7 Jun 2021 at 22:11, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
This patch didn't add _outTidRangePath() even though we have outNode()
coverage for most/all path nodes.  Was this just forgotten?  See
attached patch.

Yes, it looks like an omission.  Thanks for spotting it.  Patch looks good to me.

Edmund


Re: Tid scan improvements

From:
David Rowley
Date:
On Mon, 7 Jun 2021 at 23:46, Edmund Horner <ejrh00@gmail.com> wrote:
>
> On Mon, 7 Jun 2021 at 22:11, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>>
>> This patch didn't add _outTidRangePath() even though we have outNode()
>> coverage for most/all path nodes.  Was this just forgotten?  See
>> attached patch.
>
>
> Yes, it looks like an omission.  Thanks for spotting it.  Patch looks good to me.

Yeah. That was forgotten.  Patch also looks fine to me.  Do you want
to push it, Peter?

David



Re: Tid scan improvements

From:
Peter Eisentraut
Date:
On 07.06.21 13:50, David Rowley wrote:
> On Mon, 7 Jun 2021 at 23:46, Edmund Horner <ejrh00@gmail.com> wrote:
>>
>> On Mon, 7 Jun 2021 at 22:11, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>>>
>>> This patch didn't add _outTidRangePath() even though we have outNode()
>>> coverage for most/all path nodes.  Was this just forgotten?  See
>>> attached patch.
>>
>>
>> Yes, it looks like an omission.  Thanks for spotting it.  Patch looks good to me.
> 
> Yeah. That was forgotten.  Patch also looks fine to me.  Do you want
> to push it, Peter?

done