Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Matthias van de Meent
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date:
Msg-id: CAEze2WiR5eJgVqDHCEaO0=mDiUyXY_o-8aK-PTKVPCKba_E4Vg@mail.gmail.com
In reply to: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan  (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan  (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Wed, 8 Nov 2023 at 02:53, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Nov 7, 2023 at 4:20 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan <pg@bowt.ie> wrote:
> > > I should be able to post v6 later this week. My current plan is to
> > > commit the other nbtree patch first (the backwards scan "boundary
> > > cases" one from the ongoing CF) -- since I saw your review earlier
> > > today. I think that you should probably wait for this v6 before
> > > starting your review.
> >
> > Okay, thanks for the update, then I'll wait for v6 to be posted.
>
> On second thought, I'll just post v6 now (there won't be conflicts
> against the master branch once the other patch is committed anyway).

Thanks. Here's my review of the btree-related code:

> +++ b/src/backend/access/nbtree/nbtsearch.c
> @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
>          * set flag to true if all required keys are satisfied and false
>          * otherwise.
>          */
> -        (void) _bt_checkkeys(scan, itup, indnatts, dir,
> -                             &requiredMatchedByPrecheck, false);
> +        _bt_checkkeys(scan, &pstate, itup, false, false);
> +        requiredMatchedByPrecheck = pstate.continuescan;
> +        pstate.continuescan = true; /* reset */

The comment above the updated section needs to be updated.

> @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
>          * set flag to true if all required keys are satisfied and false
>          * otherwise.
>          */
> -        (void) _bt_checkkeys(scan, itup, indnatts, dir,
> -                             &requiredMatchedByPrecheck, false);
> +        _bt_checkkeys(scan, &pstate, itup, false, false);

This 'false' finaltup argument is surely wrong for the rightmost
page's rightmost tuple, no?

> +++ b/src/backend/access/nbtree/nbtutils.c
> @@ -357,6 +431,46 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
> +            /* We could pfree(elem_values) after, but not worth the cycles */
> +            num_elems = _bt_merge_arrays(scan, cur,
> +                                         (indoption[cur->sk_attno - 1] & INDOPTION_DESC) != 0,
> +                                         prev->elem_values, prev->num_elems,
> +                                         elem_values, num_elems);

This code can get hit several times when there are multiple = ANY
clauses, which may result in repeated leakage of these arrays during
this scan. I think cleaning up may well be worth the cycles when the
total size of the arrays is large enough.
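To make the suggested cleanup concrete, here is a minimal runnable sketch, under the assumption that the merge runs once per extra "= ANY" clause on the same column. Plain ints and malloc/free stand in for Datums and palloc/pfree; `contains`, `merge_arrays`, and `merge_all` are illustrative names, not the patch's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Binary search in a sorted array, mirroring the patch's current
 * loop-plus-binary-search merge strategy. */
static bool
contains(const int *arr, int n, int key)
{
    int low = 0, high = n - 1;

    while (low <= high)
    {
        int mid = low + (high - low) / 2;

        if (arr[mid] == key)
            return true;
        if (arr[mid] < key)
            low = mid + 1;
        else
            high = mid - 1;
    }
    return false;
}

/* Keep only the elements of "orig" that also appear in "next". */
static int *
merge_arrays(const int *orig, int norig, const int *next, int nnext, int *nout)
{
    int *out = malloc(sizeof(int) * norig);
    int  n = 0;

    for (int i = 0; i < norig; i++)
        if (contains(next, nnext, orig[i]))
            out[n++] = orig[i];
    *nout = n;
    return out;
}

/* Merge all clause arrays, freeing each superseded intermediate rather
 * than leaking it for the duration of the scan. */
static int *
merge_all(const int *const *arrays, const int *nelems, int narrays, int *nout)
{
    int  ncur = nelems[0];
    int *cur = malloc(sizeof(int) * ncur);

    for (int i = 0; i < ncur; i++)
        cur[i] = arrays[0][i];

    for (int k = 1; k < narrays; k++)
    {
        int  nmerged;
        int *merged = merge_arrays(cur, ncur, arrays[k], nelems[k], &nmerged);

        free(cur);              /* the pfree() the review argues is worth it */
        cur = merged;
        ncur = nmerged;
    }
    *nout = ncur;
    return cur;
}
```

With three clauses, two intermediate arrays get allocated; without the free() each would stay allocated until the scan ends.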

> @@ -496,6 +627,48 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
>                        _bt_compare_array_elements, &cxt);
> +_bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse,
> +                 Datum *elems_orig, int nelems_orig,
> +                 Datum *elems_next, int nelems_next)
> [...]
> +    /*
> +     * Incrementally copy the original array into a temp buffer, skipping over
> +     * any items that are missing from the "next" array
> +     */

Given that we only keep the members that both arrays have in common,
the result array will be a strict subset of the original array. So, I
don't quite see why we need the temporary buffer here - we can reuse
the entries of the elems_orig array that we've already compared
against the elems_next array.

We may want to optimize this further by iterating over only the
smallest array: with the current code, [1, 2] + [1..1000] is faster
to merge than [1..1000] + [1000, 1001], because 2 * log(1000) is much
smaller than 1000 * log(2). In practice this may matter very little,
though.
An even better-optimized version would do a merge join on the two
arrays, rather than a loop + binary search.
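For illustration, a merge-join intersection needs only one linear pass over the two sorted inputs. A minimal runnable sketch, with plain ints standing in for Datums and the ORDER-proc comparison (the name `merge_intersect` is illustrative, not from the patch):

```c
#include <assert.h>

/* Intersect two sorted, de-duplicated arrays in one linear pass:
 * O(na + nb) comparisons, versus na * log(nb) for a loop of binary
 * searches.  Writes the common elements to "out" and returns how many. */
static int
merge_intersect(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, n = 0;

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            i++;                /* a[i] missing from b: skip it */
        else if (a[i] > b[j])
            j++;                /* b[j] missing from a: skip it */
        else
        {
            out[n++] = a[i];    /* present in both: keep it */
            i++;
            j++;
        }
    }
    return n;
}
```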


> @@ -515,6 +688,161 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
> [...]
> +_bt_binsrch_array_skey(FmgrInfo *orderproc,

Is there a reason for this complex initialization of high/low_elem,
rather than this easier-to-understand and more compact
initialization?:

+ low_elem = 0;
+ high_elem = array->num_elems - 1;
+ if (cur_elem_start)
+ {
+     if (ScanDirectionIsForward(dir))
+         low_elem = array->cur_elem;
+     else
+         high_elem = array->cur_elem;
+ }
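To illustrate, the compact bounds setup above drops straight into an ordinary bounded binary search. A runnable sketch with ints standing in for the array elements (`binsrch_ge` and its exact semantics are illustrative assumptions, not the patch's function):

```c
#include <assert.h>
#include <stdbool.h>

/* Find the first element >= target within bounds that are optionally
 * narrowed to the scan's current array element.  Returns nelems when no
 * in-bounds element qualifies.  Plain ints stand in for Datums and the
 * ORDER-proc comparison. */
static int
binsrch_ge(const int *elems, int nelems, int target,
           int cur_elem, bool forward, bool cur_elem_start)
{
    int low = 0;
    int high = nelems - 1;
    int res = nelems;

    if (cur_elem_start)
    {
        if (forward)
            low = cur_elem;     /* forward scans never move backwards */
        else
            high = cur_elem;    /* backward scans never move forwards */
    }

    while (low <= high)
    {
        int mid = low + (high - low) / 2;

        if (elems[mid] >= target)
        {
            res = mid;
            high = mid - 1;
        }
        else
            low = mid + 1;
    }
    return res;
}
```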


> @@ -661,20 +1008,691 @@ _bt_restore_array_keys(IndexScanDesc scan)
> [...]
> + _bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir)
> [...]
> +    if (scan->parallel_scan != NULL)
> +        _bt_parallel_done(scan);
> +
> +    /*
> +     * No more primitive index scans.  Terminate the top-level scan.
> +     */
> +    return false;

I think the conditional _bt_parallel_done(scan) call feels misplaced
here, as the comment immediately below it says the scan is to be
terminated. So either move the _bt_parallel_done call out of this
function (whose name would imply it is read-only, without side effects
like this), or move it below the "terminate the top-level scan"
comment.

> +_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
> [...]
> +         * Set up ORDER 3-way comparison function and array state
> [...]
> +         * Optimization: Skip over non-required scan keys when we know that

These two sections should probably be swapped, as the skip makes the
setup useless.
Also, the comment here is wrong; the scan keys that are skipped are
'required', not 'non-required'.


> +++ b/src/test/regress/expected/join.out
> @@ -8620,10 +8620,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]);
>    Merge Cond: (j1.id1 = j2.id1)
>    Join Filter: (j2.id2 = j1.id2)
>    ->  Index Scan using j1_id1_idx on j1
> -   ->  Index Only Scan using j2_pkey on j2
> +   ->  Index Scan using j2_id1_idx on j2
>          Index Cond: (id1 >= ANY ('{1,5}'::integer[]))
> -         Filter: ((id1 % 1000) = 1)
> -(7 rows)
> +(6 rows)

I'm a bit surprised that we don't have the `id1 % 1000 = 1` filter
anymore. The output otherwise matches (quite possibly because the
other join conditions don't match) and I don't have time to
investigate the intricacies of index-only scans vs. plain index scans
here, but this feels off.

----

As for the planner changes, I don't think I'm familiar enough with the
planner to make any authoritative comments on this. However, it does
look like you've changed the meaning of 'amsearcharray', and I'm not
sure it's OK to assume that all indexes supporting amsearcharray will
also support this new assumption of ordered retrieval of SAOPs.
For one, the pgroonga extension [0] does mark
amcanorder+amsearcharray.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://github.com/pgroonga/pgroonga/blob/115414723c7eb8ce9eb667da98e008bd10fbae0a/src/pgroonga.c#L8782-L8788


