Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Matthias van de Meent
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date:
Msg-id: CAEze2WgRqVrB7G2-oT5iFzyv3u2aiOQD4GjJBiQZ7BrcC6hXJQ@mail.gmail.com
In reply to: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan  (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan  (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Thu, 28 Dec 2023 at 18:28, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Sat, Dec 9, 2023 at 10:38 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > Attached is v8, which pretty much rips all of this stuff out.
>
> Attached is v9, which I'm posting just to fix bitrot. The patch
> stopped cleanly applying against HEAD due to recent bugfix commit
> 7e6fb5da. No real changes here compared to v8.

I found two major issues: a correctness issue in the array key
processing code of _bt_preprocess_array_keys, and an assertion failure
in _bt_advance_array_keys; both are discussed in the relevant sections
below.

> Subject: [PATCH v9] Enhance nbtree ScalarArrayOp execution.
> [...]
> Bugfix commit 807a40c5 taught the planner to avoid generating unsafe
> path keys: path keys on a multicolumn index path, with a SAOP clause on
> any attribute beyond the first/most significant attribute.  These cases
> are now all safe, so we go back to generating path keys without regard
> for the presence of SAOP clauses (just like with any other clause type).
> Also undo changes from follow-up bugfix commit a4523c5a, which taught
> the planner to produce alternative index paths without any low-order
> ScalarArrayOpExpr quals (making the SAOP quals into filter quals).
> We'll no longer generate these alternative paths, which can no longer
> offer any advantage over the index qual paths that we do still generate.

Can you pull these planner changes into their own commit(s)?
As mentioned upthread, this is a significant change in behavior that
should get separate consideration and its own reference in the commit
log. I really don't think it should be buried in the 5th paragraph of
an "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the
btree changes are arguably independent of the planner changes, as the
btree changes improve performance even if we ignore that they
implement strict result ordering.

An independent thought from reviewing the finaltup / HIKEY scan
optimization parts of this patch:
The 'use the high key to check the next page's data range'
optimization is useful, but it can currently only be used for scans
that go to the right. Could we use a similar optimization in the
reverse direction if we marked the right page of a split with "the
left page has a HIKEY based on this page's (un)truncated leftmost
value" or "the left page's HIKEY was truncated to N key attributes"?
That would give us one bit of information, specifically whether the
left page does (or does not) share some (or all) key attributes with
this page's minimum value, which would allow earlier scan termination
in boundary cases.

> +++ b/src/include/access/nbtree.h
> +#define SK_BT_RDDNARRAY    0x00040000    /* redundant in array preprocessing */

I think "made redundant" is more appropriate than just "redundant";
the array key is not itself redundant in array preprocessing (indeed,
we actually work hard on that key during preprocessing precisely so
that we can mark it redundant).

> +++ b/src/backend/access/nbtree/nbtsearch.c
>      * We skip this for the first page in the scan to evade the possible
> -     * slowdown of the point queries.
> +     * slowdown of point queries.  Never apply the optimization with a scan
> +     * that uses array keys, either, since that breaks certain assumptions.
> +     * (Our search-type scan keys change whenever _bt_checkkeys advances the
> +     * arrays, invalidating any precheck.  Tracking all that would be tricky.)

I think this was mentioned before, but can't we add an argument to (or
a variant of) _bt_checkkeys that doesn't advance the array keys, so
that we can keep this optimization? Or isn't that what
_bt_check_compare now does?
For an orderlines table with 1000s of lines per order, a query like
`order_id = ANY ('{1, 3, 5}')` would greatly benefit from handling
scan keys more efficiently, which is exactly the optimization being
disabled here.

> +++ b/src/backend/access/nbtree/nbtutils.c
> _bt_preprocess_array_keys(IndexScanDesc scan)
The _bt_preprocess_array_keys code is now broken due to type
confusion: it assumes there is only one array element type being
compared per attribute in so->orderProcs. Reproducer:
CREATE TABLE test AS
SELECT generate_series(1, 10000, 1::bigint) num;
CREATE INDEX ON test (num); /* bigint typed */

SELECT num FROM test
WHERE num = ANY ('{1}'::smallint[])
  AND num = ANY ('{1}'::int[]) /* ensures matching lastEqualityArrayAtt,
                                  lastOrderProc for the next qual */
  AND num = ANY ('{65537}'::int[]); /* this qual is broken by applying the
                                       smallint compare operator to int
                                       values that are equal mod 2^16, but
                                       not equal in their own type */
 num
-----
   1

I'm also concerned about the implications of this in
_bt_binsrch_array_skey, as this too assumes the same compare operator
is always used for all array operations on each attribute. We may not
need one so->orderProcs entry for every array key, but we do need at
least one per sk_subtype of each array key.
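One way to picture that suggestion (an illustrative Python sketch, not
the patch's actual data structures; all names here are made up): key
the cached compare procedures by (attribute, element subtype) rather
than by attribute alone, so that a smallint[] qual and an int[] qual
on the same column can never share a comparator.

```python
def smallint_eq(a, b):
    # Models an equality that only sees 16 bits: 65537 "equals" 1.
    return (a & 0xFFFF) == (b & 0xFFFF)

def int_eq(a, b):
    return a == b

procs_by_subtype = {'smallint': smallint_eq, 'int': int_eq}
order_procs = {}  # (attno, subtype) -> compare proc

def get_order_proc(attno, subtype):
    # One cached proc per (attribute, subtype), not per attribute.
    key = (attno, subtype)
    if key not in order_procs:
        order_procs[key] = procs_by_subtype[subtype]
    return order_procs[key]

# Reusing the smallint proc for an int[] qual reproduces the bug:
assert smallint_eq(65537, 1)                  # wrongly "equal" mod 2^16
# Keying the cache on the subtype as well avoids it:
assert not get_order_proc(1, 'int')(65537, 1)
```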

I also notice that the merging of values doesn't seem to be applied
optimally with mixed-type array operations: num = int[] AND num =
bigint[] AND num = int[] doesn't seem to merge the first and last
array ops. I'm also concerned about being (un)able to merge such
cross-type arrays at all.

> +/*
> + * _bt_merge_arrays() -- merge together duplicate array keys
> + *
> + * Both scan keys have array elements that have already been sorted and
> + * deduplicated.
> + */

As I mentioned upthread, I find this function very wasteful: it uses
N binary searches to merge join two already sorted arrays, resulting
in O(n log(m)) complexity, whereas a normal merge join should be
O(n + m) once the input datasets are sorted.
Please fix this, as it shows up in profiling of large array merges.
Additionally, as it merges two arrays of unique items into one,
storing only the matching entries, the extra allocation here feels
quite wasteful. Why not reuse the original allocation immediately?
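A linear merge along those lines can be sketched as follows (a
standalone Python model for illustration only, not the patch's C code;
`merge_arrays` is a hypothetical name):

```python
def merge_arrays(a, b):
    """Intersect two sorted, deduplicated arrays in a single linear
    pass, O(n + m), instead of one binary search per element,
    O(n log m).  Matches are written back into `a`, reusing its
    allocation instead of building a new array."""
    i = j = k = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            a[k] = a[i]  # keep the match, in place
            k += 1
            i += 1
            j += 1
    del a[k:]            # truncate to the merged length
    return a
```

For example, merge_arrays([1, 3, 5, 7], [2, 3, 4, 5]) leaves [3, 5]
in the first array's storage.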

> +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate,
> +                             IndexTuple tuple, int sktrig, bool validtrig)

I don't quite understand what the 'validtrig' argument is used for.
There is an assertion that it is false under some conditions in this
code, but it's not at all clear to me why that must be the case; it
is called with `true` at one of the three call sites. Could the
meaning of this be clarified?

I also feel that this patch includes several optimizations, such as
this sktrig argument, which aren't easy to understand. Could you pull
those into a separately reviewable patch?

Additionally, could you try to create a single point of entry for the
array key stuff that covers the new systems? I've been trying to wrap
my head around this, and it's taking a lot of effort.

> _bt_advance_array_keys

Thinking about the implementation here:
We require transitivity for btree opclasses, where A < B implies NOT
A = B, etc. Does this also carry over into cross-type operators? E.g.
a type 'truncatedint4' that compares only the highest 16 bits of an
integer would still be strictly sorted, and could compare
0::truncatedint4 = 0::int4 as true, as well as 0::truncatedint4 =
2::int4, while 0::int4 = 2::int4 is false.
Would it be valid to add support functions for truncatedint4 to an
int4 btree opclass, or is transitivity also required across all
operators in the opfamily? I.e. must all values that one operator
class considers distinct within an opfamily also be considered
distinct by every other operator in the opfamily, or is that not
required?
If not, then that would pose problems for this patch, as the ordering
of A = ANY ('{1, 2, 3}'::int4[]) AND A = ANY
('{0,65536}'::truncatedint4[]) could potentially skip results.
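The hazard can be modeled in a few lines (a toy Python model of the
hypothetical 'truncatedint4' semantics described above, not anything
that exists in core):

```python
def truncatedint4_cmp(a, b):
    # Hypothetical cross-type comparison: look only at the highest
    # 16 bits of each 32-bit value.  This is a strict total order on
    # the truncated values, so the operator is self-consistent...
    return (a >> 16) - (b >> 16)

def int4_cmp(a, b):
    return a - b

# ...but "equality" is not transitive across the two operators:
assert truncatedint4_cmp(0, 0) == 0   # 0::truncatedint4 = 0::int4
assert truncatedint4_cmp(0, 2) == 0   # 0::truncatedint4 = 2::int4
assert int4_cmp(0, 2) != 0            # yet 0::int4 <> 2::int4

# A scan that trusts one operator's notion of equality to order and
# skip by the other's could therefore skip matching rows.
```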

I'm also no fan of the (tail) recursion. I agree that it is unlikely
to consume a lot of stack, but it does consume stack space
nonetheless, and I'd prefer that it not be done this way.

I notice an assertion error here:
> +            Assert(cur->sk_strategy != BTEqualStrategyNumber);
> +            Assert(all_required_sk_satisfied);
> +            Assert(!foundRequiredOppositeDirOnly);
> +
> +            foundRequiredOppositeDirOnly = true;

This assertion can be hit with the following test case:

CREATE TABLE test AS
SELECT i a, i b, i c FROM generate_series(1, 1000) i;
CREATE INDEX ON test(a, b, c); ANALYZE;
SELECT count(*) FROM test
WHERE a = ANY ('{1,2,3}') AND b > 1 AND c > 1
AND b = ANY ('{1,2,3}');

> +_bt_update_keys_with_arraykeys(IndexScanDesc scan)

I keep getting confused by the mixing of integer increments and
pointer increments. Could you explain why this code increments a
pointer for "ScanKey cur" while using array indexing for the other
fields? It feels very arbitrary to me, and that makes the code
difficult to follow.

> +++ b/src/test/regress/sql/btree_index.sql
> +-- Add tests to give coverage of various subtle issues.
> +--
> +-- XXX This may not be suitable for commit, due to taking up too many cycles.
> +--
> +-- Here we don't remember the scan's array keys before processing a page, only
> +-- after processing a page (which is implicit, it's just the scan's current
> +-- keys).  So when we move the scan backwards we think that the top-level scan
> +-- should terminate, when in reality it should jump backwards to the leaf page
> +-- that we last visited.

I notice this adds a complex test case that outputs many rows. Could
we do with fewer rows if we built the index after inserting the data,
and used a lower (non-default) fillfactor?

Note: I did not yet do any in-depth review of the planner changes in
indxpath.c/selfuncs.c.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


