Thread: Return rows in input array's order?

Return rows in input array's order?

From:
Dominique Devienne
Date:
Hi. With an integer identity primary key table,
we fetch a number of rows with WHERE id = ANY($1),
with $1 an int[] array. The API using that query must return
rows in the input int[] array order, and uses a client-side
mapping to achieve that currently.

Is it possible to maintain $1's order directly in SQL? Efficiently?

Currently the code assumes all PKs will be found. I.e. no "holes".
Would that assumption change the way it would be done in SQL?

Thanks, --DD

Re: Return rows in input array's order?

From:
David Wheeler
Date:
> Hi. With an integer identity primary key table,
> we fetch a number of rows with WHERE id = ANY($1),
> with $1 an int[] array. The API using that query must return
> rows in the input int[] array order, and uses a client-side
> mapping to achieve that currently.
>
> Is it possible to maintain $1's order directly in SQL? Efficiently?

We’ve done this before with an “order by array_index(id, input_array)”. I forget the actual function; consider that pseudocode.

It was only used for small arrays, but we never noticed any performance issues.
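The actual built-in is array_position() (PostgreSQL 9.5+); spelled out against a hypothetical yourtable, the idea looks like:

```sql
-- "yourtable" is a stand-in name; $1 is the int[] parameter.
SELECT t.*
FROM yourtable t
WHERE t.id = ANY($1)
ORDER BY array_position($1, t.id);
```

Each array_position() call scans the array, so this is simple but O(N) per row.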


Re: Return rows in input array's order?

From:
Dominique Devienne
Date:
On Tue, May 9, 2023 at 11:23 AM David Wheeler <hippysoyboy@gmail.com> wrote:
> Hi. With an integer identity primary key table,
> we fetch a number of rows with WHERE id = ANY($1),
> with $1 an int[] array. The API using that query must return
> rows in the input int[] array order, and uses a client-side
> mapping to achieve that currently.
>
> Is it possible to maintain $1's order directly in SQL? Efficiently?

> We’ve done this before with an “order by array_index(id, input_array)”. I forget the actual function; consider that pseudocode.

Thanks David. I see how this would work. 

> It was only used for small arrays, but we never noticed any performance issues.

Hmmm, sounds like this would be quadratic though...

Each call to array_index() will be O(N), which turns the sort into O(N^2) from the array_index() calls alone,
without even considering the sorting itself (which I assume is O(N log N)).

I wonder whether the int[] can be turned into a pseudo table with a ROWNUM extra generated column that
would then be (LEFT) JOIN'd to the accessed table, so that the original array index is readily accessible.
Would something like this be possible in Postgres' SQL?

I could then skip the sort, return that original index as part of the select,
and thus be able to read the other columns directly in the correct client-side re-allocated vector-slot / structure...

Re: Return rows in input array's order?

From:
negora
Date:

Hi Dominique:

Take a look at the "unnest()" function. It transforms an array into a set of rows. I believe I used it in the past to do something similar to what you need.

Another option is to use a "values" expression (in a subquery) instead of an array, and build the query dynamically.
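For example, with literal ids 42, 7, 19 and a hypothetical yourtable, the VALUES variant carries its own ordinal column:

```sql
-- Each (id, ord) pair records the id's position in the input list.
SELECT t.*
FROM (VALUES (42, 1), (7, 2), (19, 3)) AS v(id, ord)
JOIN yourtable t ON t.id = v.id
ORDER BY v.ord;
```

The downside is that the statement text changes with the input, so it can't be a reusable prepared statement the way the array form can.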

Best regards.



Re: Return rows in input array's order?

From:
David Wheeler
Date:
>> It was only used for small arrays but never noticed any performance issues
>
> Hmmm, sounds like this would be quadratic though...

True, but it’s CPU time, not I/O, and I/O tends to be orders of magnitude slower.

> I wonder whether the int[] can be turned into a pseudo table with a ROWNUM extra generated column that
> would then be (LEFT) JOIN'd to the accessed table, so that the original array index is readily accessible.
> Would something like this be possible in Postgres' SQL?

The unnest function mentioned above has a “WITH ORDINALITY” option, which gives easy access to the array index, so a simple join would do the trick.
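For instance (yourtable being a placeholder):

```sql
-- ord is the 1-based position of each id in the $1 array.
SELECT t.*
FROM unnest($1) WITH ORDINALITY AS u(id, ord)
JOIN yourtable t ON t.id = u.id
ORDER BY u.ord;
```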



Re: Return rows in input array's order?

From:
Steven Lembark
Date:
On Tue, 9 May 2023 11:37:29 +0200
Dominique Devienne <ddevienne@gmail.com> wrote:

> On Tue, May 9, 2023 at 11:23 AM David Wheeler <hippysoyboy@gmail.com>
> wrote:
>
> > > Hi. With an integer identity primary key table,
> > > we fetch a number of rows with WHERE id = ANY($1),
> > > with $1 an int[] array. The API using that query must return
> > > rows in the input int[] array order, and uses a client-side
> > > mapping to achieve that currently.
> > >
> > > Is it possible to maintain $1's order directly in SQL?
> > > Efficiently?
> >
> > We’ve done this before with an “order by array_index(id,
> > input_array)”. I forget the actual function consider that pseudo
> > code
>
> Thanks David. I see how this would work.
>
> It was only used for small arrays but never noticed any performance
> issues
>

Depending on your PG version:

    Create a temp table via unnest, join that with what you need
    and order by tmp.seq.

I forget which version allows inlining of CTEs (it was PostgreSQL 12), but you can
use a CTE:

    with int_seq
    as
    (
        select  u.val, u.seq
        from    whatever
              , unnest( int_array_col ) with ordinality as u( val, seq )
        where   blah
    )
    select
        <whatever>
    from
        foobar  a
        join
        int_seq b
        on
        a.foo = b.val
    order by
        b.seq
      , <whatever else>


This dodges the temp table, and the optimizer can inline the
results, which probably gets you the fastest result.


--
Steven Lembark
Workhorse Computing
lembark@wrkhors.com
+1 888 359 3508



Re: Return rows in input array's order?

From:
Kirk Wolak
Date:
On Tue, May 9, 2023 at 6:36 AM David Wheeler <hippysoyboy@gmail.com> wrote:

>> It was only used for small arrays but never noticed any performance issues
>
> Hmmm, sounds like this would be quadratic though...

> True, but it’s CPU time, not I/O, and I/O tends to be orders of magnitude slower.

> I wonder whether the int[] can be turned into a pseudo table with a ROWNUM extra generated column that
> would then be (LEFT) JOIN'd to the accessed table, so that the original array index is readily accessible.
> Would something like this be possible in Postgres' SQL?

> The unnest function mentioned above has a “WITH ORDINALITY” option, which gives easy access to the array index, so a simple join would do the trick.

I've actually used this approach (array_index) for hundreds of items without ever noticing a slowdown.
The other approach would be to sort your $1 list, and then just do ORDER BY id?

Re: Return rows in input array's order?

From:
Andrew Gierth
Date:
>>>>> "Dominique" == Dominique Devienne <ddevienne@gmail.com> writes:

 Dominique> Hi. With an integer identity primary key table,
 Dominique> we fetch a number of rows with WHERE id = ANY($1),
 Dominique> with $1 an int[] array. The API using that query must return
 Dominique> rows in the input int[] array order, and uses a client-side
 Dominique> mapping to achieve that currently.

 Dominique> Is it possible to maintain $1's order directly in SQL?
 Dominique> Efficiently?

This is the correct way:

SELECT ... FROM unnest($1) WITH ORDINALITY AS u(id,ord)
           JOIN yourtable t ON t.id=u.id
 ORDER BY u.ord;

This doesn't assume there won't be holes (if you want, you can change it
to a left join to get a null row instead for missing ids).
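That left-join variant would be:

```sql
-- Missing ids still produce a row, with all of t.* NULL.
SELECT u.ord, u.id, t.*
FROM unnest($1) WITH ORDINALITY AS u(id, ord)
LEFT JOIN yourtable t ON t.id = u.id
ORDER BY u.ord;
```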

The query plan you get for this should be something like:

  Nested Loop
    Function Scan on unnest
    Index Scan on yourtable_pkey

(less likely, depending on table sizes, would be a Merge Join with
similar inputs. If your table is very small you might get a hashjoin and
separate sort, but that shouldn't happen with realistic data sizes.)

Notice that this is entirely deterministic about the output ordering
without needing to do any sorting. (The planner knows that the output of
WITH ORDINALITY function scans is automatically ordered by the ordinal
column, so it will usually generate plans that take advantage of that.)
The presence of "ORDER BY u.ord" ensures that the output order is
correct regardless of plan choice.

-- 
Andrew (irc:RhodiumToad)



Re: Return rows in input array's order?

From:
Dominique Devienne
Date:
On Wed, May 10, 2023 at 9:49 AM Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
 Dominique> Is it possible to maintain $1's order directly in SQL?

> This is the correct way:
>
> SELECT ... FROM unnest($1) WITH ORDINALITY AS u(id,ord)
>            JOIN yourtable t ON t.id=u.id
>  ORDER BY u.ord;

Thanks Andrew, for spelling it out for me. Appreciated.
Also thanks to others who chimed in.

I assume that if the PK is composite, and I pass the PK tuples as separate
same-cardinality "parallel" arrays, I can "zip" those arrays back via a multi-join
using their ordinals before joining with the composite-PK table? --DD

PS: I guess the ideal plan depends both on the table itself, but also the cardinality
  of the array(s) passed in as bind variable(s) at runtime to the prepared statement, right?
  But from past posts, I got the impression the plan of a prepared statement is "fixed",
  and does not depend on "bind peeking" like it can in Oracle, to take those bound
  array's cardinality into account at PQexecPrepared-time?

PPS: This is something I actually failed to do in Oracle in the past...

Re: Return rows in input array's order?

From:
Andrew Gierth
Date:
>>>>> "Dominique" == Dominique Devienne <ddevienne@gmail.com> writes:

 Dominique> Is it possible to maintain $1's order directly in SQL?

 >> This is the correct way:
 >> 
 >> SELECT ... FROM unnest($1) WITH ORDINALITY AS u(id,ord)
 >> JOIN yourtable t ON t.id=u.id
 >> ORDER BY u.ord;

 Dominique> Thanks Andrew, for spelling it out for me. Appreciated.
 Dominique> Also thanks to others who chimed in.

 Dominique> I assume that if the PK is composite, and I pass the PK
 Dominique> tuples as separate same-cardinality "parallel" arrays, I can
 Dominique> "zip" those arrays back via a multi-join using their
 Dominique> ordinals before joining with the composite-PK table?

You don't need to, because unnest can do it for you:

SELECT ... FROM unnest($1,$2,$3) WITH ORDINALITY AS u(id1,id2,id3,ord)
           JOIN yourtable t ON t.id1=u.id1
                               AND t.id2=u.id2
                               AND t.id3=u.id3
 ORDER BY u.ord;

(I did actually consider using a join on the ordinal column to implement
multi-arg unnest internally, but the overhead was too much. So instead
the executor knows how to do the zipping itself.)

 Dominique> PS: I guess the ideal plan depends both on the table itself,
 Dominique> but also the cardinality of the array(s) passed in as bind
 Dominique> variable(s) at runtime to the prepared statement, right?

Yes, in the sense that what matters is what proportion of the table is
being fetched. Is it likely that you'll be passing in very long lists of
ids relative to the table size?

 Dominique> But from past posts, I got the impression the plan of a
 Dominique> prepared statement is "fixed", and does not depend on "bind
 Dominique> peeking" like it can in Oracle, to take those bound array's
 Dominique> cardinality into account at PQexecPrepared-time?

It's a bit more complicated than that and it often depends on what the
client library is doing; many clients don't do a protocol-level named
prepare until after a client-side prepared statement has been used
several times; and even after doing a named prepare, the planner won't
try a generic plan until after several more uses.

We distinguish between "generic plan" and "custom plan"; a generic plan
is one produced without knowledge of the parameter values and must work
for any parameter values, while a custom plan only works for one
specific set of parameter values and can't usually be re-used. Custom
plans take the parameter values into account both for estimation and for
constant-folding optimizations. Generic plans are used after about the
5th use of a statement if the cost of the generic plan isn't worse than
the average costs of the custom plans from the previous uses, plus a
fudge factor representing the CPU cost of custom planning.

The planning hazard in cases like this is that when doing a generic
plan, the planner has no idea at all what the array cardinalities will
be; it doesn't try and cache information like that from the custom
plans. So it will make a zeroth-order approximation (i.e. a constant)
derived by the time-honoured method of rectal extraction, and this may
make the generic plan look a lot cheaper than it should.
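One escape hatch worth knowing: since PostgreSQL 12, the plan_cache_mode setting lets a session opt out of generic plans entirely:

```sql
-- Default is 'auto'; force_custom_plan replans with the actual
-- parameter values on every execution of a prepared statement.
SET plan_cache_mode = force_custom_plan;
```

That trades repeated planning cost for never being stuck with a generic plan built on a guessed array cardinality.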

-- 
Andrew (irc:RhodiumToad)



Re: Return rows in input array's order?

From:
Dominique Devienne
Date:
On Wed, May 10, 2023 at 1:08 PM Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>>>>> "Dominique" == Dominique Devienne <ddevienne@gmail.com> writes:
 Dominique> I assume that if the PK is composite, and I pass the PK
 Dominique> tuples as separate same-cardinality "parallel" arrays, I can
 Dominique> "zip" those arrays back via a multi-join using their
 Dominique> ordinals before joining with the composite-PK table?

> You don't need to, because unnest can do it for you:

Wow, that's fantastic. Thanks!

 Dominique> PS: I guess the ideal plan depends both on the table itself,
 Dominique> but also the cardinality of the array(s) passed in as bind
 Dominique> variable(s) at runtime to the prepared statement, right?

> Yes, in the sense that what matters is what proportion of the table is
> being fetched. Is it likely that you'll be passing in very long lists of
> ids relative to the table size?

I'm writing a new mid-tier implementation of an existing protocol / API,
so I don't decide "how much" the clients ask for. The API certainly allows
a small request to return a large amount of data / rows from several tables.

Although the queries using lists of IDs (SKs) in where-clauses are typically
internal implementation details, and not per se client requests.
 
 Dominique> But from past posts, I got the impression the plan of a
 Dominique> prepared statement is "fixed", and does not depend on "bind
 Dominique> peeking" like it can in Oracle, to take those bound array's
 Dominique> cardinality into account at PQexecPrepared-time?

> It's a bit more complicated than that and it often depends on what the
> client library is doing; many clients don't do a protocol-level named
> prepare until after a client-side prepared statement has been used
> several times; and even after doing a named prepare, the planner won't
> try a generic plan until after several more uses.

I'm in C++ using my own thin wrapper on top of libpq directly.

And I do tend to PQprepare extensively, since my mid-tier implementation
is long-lived and many queries will be used many, many times. This used to
matter a lot with Oracle OCI, but maybe the lack of bind-peeking to re-plan or
select among a choice of plans makes always preparing statements a bad
choice with PostgreSQL / libpq?
 
> We distinguish between "generic plan" and "custom plan"; a generic plan
> is one produced without knowledge of the parameter values and must work
> for any parameter values, while a custom plan only works for one
> specific set of parameter values and can't usually be re-used. Custom
> plans take the parameter values into account both for estimation and for
> constant-folding optimizations. Generic plans are used after about the
> 5th use of a statement if the cost of the generic plan isn't worse than
> the average costs of the custom plans from the previous uses, plus a
> fudge factor representing the CPU cost of custom planning.

Indeed it's more complicated than I thought... Interesting though.
 
> The planning hazard in cases like this is that when doing a generic
> plan, the planner has no idea at all what the array cardinalities will
> be; it doesn't try and cache information like that from the custom
> plans. So it will make a zeroth-order approximation (i.e. a constant)
> derived by the time-honoured method of rectal extraction, and this may
> make the generic plan look a lot cheaper than it should.

Funny colorful language :). Thanks again, you've been tremendously helpful. --DD