Обсуждение: Re: Proposal to introduce a shuffle function to intarray extension

Поиск

Список

Период

Сортировка

Re: Proposal to introduce a shuffle function to intarray extension

От

Thomas Munro

Дата:

17 июля 2022 г., 00:56:27

On Fri, Jul 15, 2022 at 8:36 PM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:
> I would like to see a function like this inside the intarray extension.
> Is there any way to get to this point? How is the process to deal with
> such proposals?

Hi Martin,

I'm redirecting this to the pgsql-hackers@ mailing list, where we talk
about code.  For the archives, Martin's initial message to -general
was:

https://www.postgresql.org/message-id/9d160a44-7675-51e8-60cf-6d64b76db831%40aboutsource.net

The first question is whether such a thing belongs in an external
extension, or in the contrib/intarray module.  The latter seems like a
reasonable thing to want to me.  The first step towards that will be
to get your code into .patch format, as in git format-patch or git
diff, that can be applied to the master branch.

https://wiki.postgresql.org/wiki/Submitting_a_Patch

Some initial feedback from me: I'd recommend adding a couple of
comments to the code, like the algorithm name for someone who wants to
read more about it  (I think it's a Fisher-Yates shuffle?). You'll
need to have the contrib/intarrays/intarray--1.5--1.6.sql file that
creates the function.  You might want to add something to
control/intarray/sql/_int.sql that invokes the function when you run
make check in there (but doesn't display the results, since that'd be
unstable across machines?), just to have 'code coverage' (I mean, it'd
prove it doesn't crash at least).  Once details are settled, you'd
also want to add documentation in doc/src/sgml/intarray.sgml.  I
understand that this is a specialised int[] shuffle, but I wonder if
someone would ever want to have a general array shuffle, and how that
would work, in terms of naming convention etc.

Re: Proposal to introduce a shuffle function to intarray extension

От

Martin Kalcher

Дата:

17 июля 2022 г., 05:25:27

Am 16.07.22 um 23:56 schrieb Thomas Munro:
> On Fri, Jul 15, 2022 at 8:36 PM Martin Kalcher
> <martin.kalcher@aboutsource.net> wrote:
>> I would like to see a function like this inside the intarray extension.
>> Is there any way to get to this point? How is the process to deal with
>> such proposals?
> 
> Hi Martin,
> 
> I'm redirecting this to the pgsql-hackers@ mailing list, where we talk
> about code.  For the archives, Martin's initial message to -general
> was:
> 
> https://www.postgresql.org/message-id/9d160a44-7675-51e8-60cf-6d64b76db831%40aboutsource.net
> 
> The first question is whether such a thing belongs in an external
> extension, or in the contrib/intarray module.  The latter seems like a
> reasonable thing to want to me.  The first step towards that will be
> to get your code into .patch format, as in git format-patch or git
> diff, that can be applied to the master branch.
> 
> https://wiki.postgresql.org/wiki/Submitting_a_Patch
> 
> Some initial feedback from me: I'd recommend adding a couple of
> comments to the code, like the algorithm name for someone who wants to
> read more about it  (I think it's a Fisher-Yates shuffle?). You'll
> need to have the contrib/intarrays/intarray--1.5--1.6.sql file that
> creates the function.  You might want to add something to
> control/intarray/sql/_int.sql that invokes the function when you run
> make check in there (but doesn't display the results, since that'd be
> unstable across machines?), just to have 'code coverage' (I mean, it'd
> prove it doesn't crash at least).  Once details are settled, you'd
> also want to add documentation in doc/src/sgml/intarray.sgml.  I
> understand that this is a specialised int[] shuffle, but I wonder if
> someone would ever want to have a general array shuffle, and how that
> would work, in terms of naming convention etc.

Hello Thomas,

Thank you for pointing me towards the correct list.

I do not feel qualified to answer the question wether this belongs in an 
external extension or in contrib/intarray. That reason i would like to 
see it in contrib/intarray is, that it is lot easier for me to get our 
operations team to upgrade our database system, because of new features 
we need, than to get them to install a self-maintained extension on our 
database servers.

Thank you for your feedback. I tried to address all your points and made 
a first patch. Some points are still open:

- Documentation is postponed until further feedback.

- I am not shure about the naming. intarray has generic names like
   sort() and uniq() and specialised names like icount(). I guess in case
   someone wants to have a general array shuffle it could be accomplished
   with function overloading. Or am i wrong here?

- I added a second function sample(), because it is a lot faster to take
   some elements from an array than to shuffle the whole array and
   slice it. This function can be removed if it is not wanted. The
   important one is shuffle().

Martin

Вложения

0001-Introduce-shuffle-and-sample-to-intarray-extension.patch

Re: Proposal to introduce a shuffle function to intarray extension

От

Tom Lane

Дата:

17 июля 2022 г., 06:18:29

Martin Kalcher <martin.kalcher@aboutsource.net> writes:
> Am 16.07.22 um 23:56 schrieb Thomas Munro:
>> I understand that this is a specialised int[] shuffle, but I wonder if
>> someone would ever want to have a general array shuffle, and how that
>> would work, in terms of naming convention etc.

> - I am not shure about the naming. intarray has generic names like
>    sort() and uniq() and specialised names like icount(). I guess in case
>    someone wants to have a general array shuffle it could be accomplished
>    with function overloading. Or am i wrong here?

I suppose this is exactly the point Thomas was wondering about: if we
use a generic function name for this, will it cause problems for someone
trying to add a generic function later?

We can investigate that question with a couple of toy functions:

regression=# create function foo(int[]) returns text as 'select ''int[] version''' language sql;
CREATE FUNCTION
regression=# create function foo(anyarray) returns text as 'select ''anyarray version''' language sql;
CREATE FUNCTION
regression=# select foo('{1,2,3}');
ERROR:  function foo(unknown) is not unique
LINE 1: select foo('{1,2,3}');
               ^
HINT:  Could not choose a best candidate function. You might need to add explicit type casts.

OK, that's not too surprising: with an unknown input there's just not
anything that the parser can use to disambiguate.  But this happens
with just about any overloaded name, so I don't think it's a showstopper.

regression=# select foo('{1,2,3}'::int[]);
      foo
---------------
 int[] version
(1 row)

regression=# select foo('{1,2,3}'::int8[]);
       foo
------------------
 anyarray version
(1 row)

Good, that's more or less the minimum functionality we should expect.

regression=# select foo('{1,2,3}'::int2[]);
ERROR:  function foo(smallint[]) is not unique
LINE 1: select foo('{1,2,3}'::int2[]);
               ^
HINT:  Could not choose a best candidate function. You might need to add explicit type casts.

Oh, that's slightly annoying ...

regression=# select foo('{1,2,3}'::oid[]);
       foo
------------------
 anyarray version
(1 row)

regression=# select foo('{1,2,3}'::text[]);
       foo
------------------
 anyarray version
(1 row)

regression=# select foo('{1,2,3}'::float8[]);
       foo
------------------
 anyarray version
(1 row)

I couldn't readily find any case that misbehaves except for int2[].
You can force that to work, at least in one direction:

regression=# select foo('{1,2,3}'::int2[]::int[]);
      foo
---------------
 int[] version
(1 row)

On the whole, I'd vote for calling it shuffle(), and expecting that
we'd also use that name for any future generic version.  That's
clearly the easiest-to-remember definition, and it doesn't seem
like the gotchas are bad enough to justify using separate names.

> - I added a second function sample(), because it is a lot faster to take
>    some elements from an array than to shuffle the whole array and
>    slice it. This function can be removed if it is not wanted.

I have no opinion about whether this one is valuable enough to include in
intarray, but I do feel like sample() is a vague name, and easily confused
with marginally-related operations like TABLESAMPLE.  Can we think of a
more on-point name?  Something like "random_subset" would be pretty
clear, but it's also clunky.  It's too late here for me to think of
le mot juste...

            regards, tom lane

Re: Proposal to introduce a shuffle function to intarray extension

От

"David G. Johnston"

Дата:

17 июля 2022 г., 06:32:34

On Sat, Jul 16, 2022 at 7:25 PM Martin Kalcher <martin.kalcher@aboutsource.net> wrote:

- I added a second function sample(), because it is a lot faster to take
some elements from an array than to shuffle the whole array and
slice it. This function can be removed if it is not wanted. The
important one is shuffle().

+SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) != sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6);

+ ?column?
+----------
+ t
+(1 row)
+

While small, there is a non-zero chance for both samples to be equal. This test should probably just go, I don't see what it tests that isn't covered by other tests or even trivial usage.

Same goes for:

+SELECT shuffle('{1,2,3,4,5,6,7,8,9,10,11,12}') != shuffle('{1,2,3,4,5,6,7,8,9,10,11,12}');
+ ?column?
+----------
+ t
+(1 row)
+

David J.

Re: Proposal to introduce a shuffle function to intarray extension

От

"David G. Johnston"

Дата:

17 июля 2022 г., 06:36:48

On Sat, Jul 16, 2022 at 8:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Martin Kalcher <martin.kalcher@aboutsource.net> writes:

> - I added a second function sample(), because it is a lot faster to take
> some elements from an array than to shuffle the whole array and
> slice it. This function can be removed if it is not wanted.

I have no opinion about whether this one is valuable enough to include in
intarray, but I do feel like sample() is a vague name, and easily confused
with marginally-related operations like TABLESAMPLE. Can we think of a
more on-point name? Something like "random_subset" would be pretty
clear, but it's also clunky. It's too late here for me to think of
le mot juste...

choose(input anyarray, size integer, with_replacement boolean default false, algorithm text default 'default')?

David J.

Re: Proposal to introduce a shuffle function to intarray extension

От

Tom Lane

Дата:

17 июля 2022 г., 06:37:26

I wrote:
> On the whole, I'd vote for calling it shuffle(), and expecting that
> we'd also use that name for any future generic version.

Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.

            regards, tom lane

Re: Proposal to introduce a shuffle function to intarray extension

От

Thomas Munro

Дата:

17 июля 2022 г., 09:00:09

On Sun, Jul 17, 2022 at 3:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > On the whole, I'd vote for calling it shuffle(), and expecting that
> > we'd also use that name for any future generic version.
>
> Actually ... is there a reason to bother with an intarray version
> at all, rather than going straight for an in-core anyarray function?
> It's not obvious to me that an int4-only version would have
> major performance advantages.

Yeah, that seems like a good direction.  If there is a performance
advantage to specialising, then perhaps we only have to specialise on
size, not type.  Perhaps there could be a general function that
internally looks out for typbyval && typlen == 4, and dispatches to a
specialised 4-byte, and likewise for 8, if it can, and that'd already
be enough to cover int, bigint, float etc, without needing
specialisations for each type.

I went to see what Professor Lemire would have to say about all this,
expecting to find a SIMD rabbit hole to fall down for some Sunday
evening reading, but the main thing that jumped out was this article
about the modulo operation required by textbook Fisher-Yates to get a
bounded random number, the random() % n that appears in the patch.  He
talks about shuffling twice as fast by using a no-division trick to
get bounded random numbers[1].  I guess you might need to use our
pg_prng_uint32() for that trick because random()'s 0..RAND_MAX might
introduce bias.  Anyway, file that under go-faster ideas for later.

[1] https://lemire.me/blog/2016/06/30/fast-random-shuffling/

Re: Proposal to introduce a shuffle function to intarray extension

От

Martin Kalcher

Дата:

17 июля 2022 г., 12:16:28

Am 17.07.22 um 05:37 schrieb Tom Lane:
> 
> Actually ... is there a reason to bother with an intarray version
> at all, rather than going straight for an in-core anyarray function?
> It's not obvious to me that an int4-only version would have
> major performance advantages.
> 
>             regards, tom lane

Hi Tom,

thank you for your thoughts. There are two reasons for choosing an 
int4-only version. I am not familiar with postgres development (yet) and 
i was not sure how open you are about such changes to core and if the 
proposed feature is considered valuable enough to go into core. The 
second reason was ease of implementation. The intarray extension does 
not allow any NULL elements in arrays and treats multidimensional arrays 
as though they were linear. Which makes the implementation straight 
forward, because there are fewer cases to consider.

However, i will take a look at an implementation for anyarray in core.

Martin

Re: Proposal to introduce a shuffle function to intarray extension

От

Martin Kalcher

Дата:

17 июля 2022 г., 12:38:55

Am 17.07.22 um 08:00 schrieb Thomas Munro:
> 
> I went to see what Professor Lemire would have to say about all this,
> expecting to find a SIMD rabbit hole to fall down for some Sunday
> evening reading, but the main thing that jumped out was this article
> about the modulo operation required by textbook Fisher-Yates to get a
> bounded random number, the random() % n that appears in the patch.  He
> talks about shuffling twice as fast by using a no-division trick to
> get bounded random numbers[1].  I guess you might need to use our
> pg_prng_uint32() for that trick because random()'s 0..RAND_MAX might
> introduce bias.  Anyway, file that under go-faster ideas for later.
> 
> [1] https://lemire.me/blog/2016/06/30/fast-random-shuffling/

Hi Thomas,

the small bias of random() % n is not a problem for my use case, but 
might be for others. Its easily replaceable with

   (int) pg_prng_uint64_range(&pg_global_prng_state, 0, n-1)

Unfortunately it is a bit slower (on my machine), but thats negligible.

Martin

Re: Proposal to introduce a shuffle function to intarray extension

От

Martin Kalcher

Дата:

17 июля 2022 г., 12:54:22

Am 17.07.22 um 05:32 schrieb David G. Johnston:
>
>   +SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) !=
> sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6);
> + ?column?
> +----------
> + t
> +(1 row)
> +
> 
> While small, there is a non-zero chance for both samples to be equal.  This
> test should probably just go, I don't see what it tests that isn't covered
> by other tests or even trivial usage.
> 

Hey David,

you are right. There is a small chance for this test to fail. I wanted 
to test, that two invocations produce different results (after all the 
main feature of the function). But it can probably go.

Martin

Re: Proposal to introduce a shuffle function to intarray extension

От

Martin Kalcher

Дата:

17 июля 2022 г., 19:15:51

Am 17.07.22 um 08:00 schrieb Thomas Munro:
>> Actually ... is there a reason to bother with an intarray version
>> at all, rather than going straight for an in-core anyarray function?
>> It's not obvious to me that an int4-only version would have
>> major performance advantages.
> 
> Yeah, that seems like a good direction.  If there is a performance
> advantage to specialising, then perhaps we only have to specialise on
> size, not type.  Perhaps there could be a general function that
> internally looks out for typbyval && typlen == 4, and dispatches to a
> specialised 4-byte, and likewise for 8, if it can, and that'd already
> be enough to cover int, bigint, float etc, without needing
> specialisations for each type.

I played around with the idea of an anyarray shuffle(). The hard part 
was to deal with arrays with variable length elements, as they can not 
be swapped easily in place. I solved it by creating an intermediate 
array of references to the elements. I'll attach a patch with the proof 
of concept. Unfortunatly it is already about 5 times slower than the 
specialised version and i am not sure if it is worth going down that road.

Martin

Вложения

0001-introduce-array_shuffle.patch

Re: Proposal to introduce a shuffle function to intarray extension

От

Thomas Munro

Дата:

18 июля 2022 г., 01:37:04

On Mon, Jul 18, 2022 at 4:15 AM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:
> Am 17.07.22 um 08:00 schrieb Thomas Munro:
> >> Actually ... is there a reason to bother with an intarray version
> >> at all, rather than going straight for an in-core anyarray function?
> >> It's not obvious to me that an int4-only version would have
> >> major performance advantages.
> >
> > Yeah, that seems like a good direction.  If there is a performance
> > advantage to specialising, then perhaps we only have to specialise on
> > size, not type.  Perhaps there could be a general function that
> > internally looks out for typbyval && typlen == 4, and dispatches to a
> > specialised 4-byte, and likewise for 8, if it can, and that'd already
> > be enough to cover int, bigint, float etc, without needing
> > specialisations for each type.
>
> I played around with the idea of an anyarray shuffle(). The hard part
> was to deal with arrays with variable length elements, as they can not
> be swapped easily in place. I solved it by creating an intermediate
> array of references to the elements. I'll attach a patch with the proof
> of concept. Unfortunatly it is already about 5 times slower than the
> specialised version and i am not sure if it is worth going down that road.

Seems OK for a worst case.  It must still be a lot faster than doing
it in SQL.  Now I wonder what the exact requirements would be to
dispatch to a faster version that would handle int4.  I haven't
studied this in detail but perhaps to dispatch to a fast shuffle for
objects of size X, the requirement would be something like typlen == X
&& align_bytes <= typlen && typlen % align_bytes == 0, where
align_bytes is typalign converted to ALIGNOF_{CHAR,SHORT,INT,DOUBLE}?
Or in English, 'the data consists of densely packed objects of fixed
size X, no padding'.  Or perhaps you can work out the padded size and
use that, to catch a few more types.  Then you call
array_shuffle_{2,4,8}() as appropriate, which should be as fast as
your original int[] proposal, but work also for float, date, ...?

About your experimental patch, I haven't reviewed it properly or tried
it but I wonder if uint32 dat_offset, uint32 size (= half size
elements) would be enough due to limitations on varlenas.

Re: Proposal to introduce a shuffle function to intarray extension

От

Tom Lane

Дата:

18 июля 2022 г., 01:46:27

Martin Kalcher <martin.kalcher@aboutsource.net> writes:
> Am 17.07.22 um 08:00 schrieb Thomas Munro:
>>> Actually ... is there a reason to bother with an intarray version
>>> at all, rather than going straight for an in-core anyarray function?

> I played around with the idea of an anyarray shuffle(). The hard part 
> was to deal with arrays with variable length elements, as they can not 
> be swapped easily in place. I solved it by creating an intermediate 
> array of references to the elements. I'll attach a patch with the proof 
> of concept.

This does not look particularly idiomatic, or even type-safe.  What you
should have done was use deconstruct_array to get an array of Datums and
isnull flags, then shuffled those, then used construct_array to build the
output.

(Or, perhaps, use construct_md_array to replicate the input's
precise dimensionality.  Not sure if anyone would care.)

            regards, tom lane

Re: Proposal to introduce a shuffle function to intarray extension

От

Tom Lane

Дата:

18 июля 2022 г., 01:53:31

Thomas Munro <thomas.munro@gmail.com> writes:
> Seems OK for a worst case.  It must still be a lot faster than doing
> it in SQL.  Now I wonder what the exact requirements would be to
> dispatch to a faster version that would handle int4.

I find it impossible to believe that it's worth micro-optimizing
shuffle() to that extent.  Now, maybe doing something in that line
in deconstruct_array and construct_array would be worth our time,
as that'd benefit a pretty wide group of functions.

            regards, tom lane

Re: Proposal to introduce a shuffle function to intarray extension

От

Martin Kalcher

Дата:

18 июля 2022 г., 02:05:19

Am 18.07.22 um 00:46 schrieb Tom Lane:
> 
> This does not look particularly idiomatic, or even type-safe.  What you
> should have done was use deconstruct_array to get an array of Datums and
> isnull flags, then shuffled those, then used construct_array to build the
> output.
> 
> (Or, perhaps, use construct_md_array to replicate the input's
> precise dimensionality.  Not sure if anyone would care.)
> 
>             regards, tom lane

deconstruct_array() would destroy the arrays dimensions. I would expect 
that shuffle() only shuffles the first dimension and keeps the inner 
arrays intact.

Martin

Re: Proposal to introduce a shuffle function to intarray extension

От

Martin Kalcher

Дата:

18 июля 2022 г., 02:07:22

Am 18.07.22 um 00:37 schrieb Thomas Munro:
> Seems OK for a worst case.  It must still be a lot faster than doing
> it in SQL.  Now I wonder what the exact requirements would be to
> dispatch to a faster version that would handle int4.  I haven't
> studied this in detail but perhaps to dispatch to a fast shuffle for
> objects of size X, the requirement would be something like typlen == X
> && align_bytes <= typlen && typlen % align_bytes == 0, where
> align_bytes is typalign converted to ALIGNOF_{CHAR,SHORT,INT,DOUBLE}?
> Or in English, 'the data consists of densely packed objects of fixed
> size X, no padding'.  Or perhaps you can work out the padded size and
> use that, to catch a few more types.  Then you call
> array_shuffle_{2,4,8}() as appropriate, which should be as fast as
> your original int[] proposal, but work also for float, date, ...?
> 
> About your experimental patch, I haven't reviewed it properly or tried
> it but I wonder if uint32 dat_offset, uint32 size (= half size
> elements) would be enough due to limitations on varlenas.

I made another experimental patch with fast tracks for typelen4 and 
typelen8. alignments are not yet considered.

Вложения

0001-introduce-array_shuffle.patch

Re: Proposal to introduce a shuffle function to intarray extension

От

Tom Lane

Дата:

18 июля 2022 г., 02:20:33

Martin Kalcher <martin.kalcher@aboutsource.net> writes:
> Am 18.07.22 um 00:46 schrieb Tom Lane:
>> This does not look particularly idiomatic, or even type-safe.  What you
>> should have done was use deconstruct_array to get an array of Datums and
>> isnull flags, then shuffled those, then used construct_array to build the
>> output.
>> 
>> (Or, perhaps, use construct_md_array to replicate the input's
>> precise dimensionality.  Not sure if anyone would care.)

> deconstruct_array() would destroy the arrays dimensions.

As I said, you can use construct_md_array if you want to preserve the
array shape.

> I would expect that shuffle() only shuffles the first dimension and
> keeps the inner arrays intact.

This argument is based on a false premise, ie that Postgres thinks
multidimensional arrays are arrays-of-arrays.  They aren't, and
we're not going to start making them so by defining shuffle()
at variance with every other array-manipulating function.  Shuffling
the individual elements regardless of array shape is the definition
that's consistent with our existing functionality.

(Having said that, even if we were going to implement it with that
definition, I should think that it'd be easiest to do so on the
array-of-Datums representation produced by deconstruct_array.
That way you don't need to do different things for different element
types.)

            regards, tom lane

Re: Proposal to introduce a shuffle function to intarray extension

От

Martin Kalcher

Дата:

18 июля 2022 г., 10:12:50

Am 18.07.22 um 01:20 schrieb Tom Lane:
>> I would expect that shuffle() only shuffles the first dimension and
>> keeps the inner arrays intact.
> 
> This argument is based on a false premise, ie that Postgres thinks
> multidimensional arrays are arrays-of-arrays.  They aren't, and
> we're not going to start making them so by defining shuffle()
> at variance with every other array-manipulating function.  Shuffling
> the individual elements regardless of array shape is the definition
> that's consistent with our existing functionality.

Hey Tom,

thank you for clarification. I did not know that. I will make a patch 
that is using deconstruct_array().

Martin

Re: Proposal to introduce a shuffle function to intarray extension

От

Martin Kalcher

Дата:

18 июля 2022 г., 10:47:30

Am 18.07.22 um 01:20 schrieb Tom Lane:
> (Having said that, even if we were going to implement it with that
> definition, I should think that it'd be easiest to do so on the
> array-of-Datums representation produced by deconstruct_array.
> That way you don't need to do different things for different element
> types.)

Thank you Tom, here is a patch utilising deconstruct_array(). If we 
agree, that this is the way to go, i would like to add array_sample() 
(good name?), some test cases, and documentation.

One more question. How do i pick a Oid for the functions?

Martin

Вложения

0001-introduce-array_shuffle.patch

Re: Proposal to introduce a shuffle function to intarray extension

От

John Naylor

Дата:

18 июля 2022 г., 11:18:44

On Mon, Jul 18, 2022 at 2:47 PM Martin Kalcher <martin.kalcher@aboutsource.net> wrote:
> One more question. How do i pick a Oid for the functions?

For this, we recommend running src/include/catalog/unused_oids, and it will give you a random range to pick from. That reduces the chance of different patches conflicting with each other. It doesn't really matter what the oid here is, since at feature freeze a committer will change them anyway.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Proposal to introduce a shuffle function to intarray extension

От

Tom Lane

Дата:

18 июля 2022 г., 16:48:16

John Naylor <john.naylor@enterprisedb.com> writes:
> On Mon, Jul 18, 2022 at 2:47 PM Martin Kalcher <
> martin.kalcher@aboutsource.net> wrote:
>> One more question. How do i pick a Oid for the functions?

> For this, we recommend running src/include/catalog/unused_oids, and it will
> give you a random range to pick from. That reduces the chance of different
> patches conflicting with each other. It doesn't really matter what the oid
> here is, since at feature freeze a committer will change them anyway.

If you want the nitty gritty details here, see

https://www.postgresql.org/docs/devel/system-catalog-initial-data.html#SYSTEM-CATALOG-OID-ASSIGNMENT

            regards, tom lane

[PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

18 июля 2022 г., 22:03:35

Thanks for all your feedback and help. I got a patch that i consider 
ready for review. It introduces two new functions:

   array_shuffle(anyarray) -> anyarray
   array_sample(anyarray, integer) -> anyarray

array_shuffle() shuffles an array (obviously). array_sample() picks n 
random elements from an array.

Is someone interested in looking at it? What are the next steps?

Martin

Вложения

0001-Introduce-array_shuffle-and-array_sample.patch

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Tom Lane

Дата:

18 июля 2022 г., 22:29:37

Martin Kalcher <martin.kalcher@aboutsource.net> writes:
> Is someone interested in looking at it? What are the next steps?

The preferred thing to do is to add it to our "commitfest" queue,
which will ensure that it gets looked at eventually.  The currently
open cycle is 2022-09 [1] (see the "New Patch" button there).

I believe you have to have signed up as a community member[2]
in order to put yourself in as the patch author.  I think "New Patch"
will work anyway, but it's better to have an author listed.

            regards, tom lane

[1] https://commitfest.postgresql.org/39/
[2] https://www.postgresql.org/account/login/

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

18 июля 2022 г., 23:15:43

Am 18.07.22 um 21:29 schrieb Tom Lane:
> The preferred thing to do is to add it to our "commitfest" queue,
> which will ensure that it gets looked at eventually.  The currently
> open cycle is 2022-09 [1] (see the "New Patch" button there).

Thanks Tom, did that. I am not sure if "SQL Commands" is the correct 
topic for this patch, but i guess its not that important. I am impressed 
by all the infrastructure build around this project.

Martin

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Robert Haas

Дата:

18 июля 2022 г., 23:27:30

On Mon, Jul 18, 2022 at 3:03 PM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:
> Thanks for all your feedback and help. I got a patch that i consider
> ready for review. It introduces two new functions:
>
>    array_shuffle(anyarray) -> anyarray
>    array_sample(anyarray, integer) -> anyarray
>
> array_shuffle() shuffles an array (obviously). array_sample() picks n
> random elements from an array.

I like this idea.

I think it's questionable whether the behavior of array_shuffle() is
correct for a multi-dimensional array. The implemented behavior is to
keep the dimensions as they were, but permute the elements across all
levels at random. But there are at least two other behaviors that seem
potentially defensible: (1) always return a 1-dimensional array, (2)
shuffle the sub-arrays at the top-level without the possibility of
moving elements within or between sub-arrays. What behavior we decide
is best here should be documented.

array_sample() will return elements in random order when sample_size <
array_size, but in the original order when sample_size >= array_size.
Similarly, it will always return a 1-dimensional array in the former
case, but will keep the original dimensions in the latter case. That
seems pretty hard to defend. I think it should always return a
1-dimensional array with elements in random order, and I think this
should be documented.

I also think you should add test cases involving multi-dimensional
arrays, as well as arrays with non-default bounds. e.g. trying
shuffling or sampling some values like
'[8:10][-6:-5]={{1,2},{3,4},{5,6}}'::int[]

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Tom Lane

Дата:

18 июля 2022 г., 23:34:50

Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jul 18, 2022 at 3:03 PM Martin Kalcher
> <martin.kalcher@aboutsource.net> wrote:
>> array_shuffle(anyarray) -> anyarray
>> array_sample(anyarray, integer) -> anyarray

> I think it's questionable whether the behavior of array_shuffle() is
> correct for a multi-dimensional array. The implemented behavior is to
> keep the dimensions as they were, but permute the elements across all
> levels at random. But there are at least two other behaviors that seem
> potentially defensible: (1) always return a 1-dimensional array, (2)
> shuffle the sub-arrays at the top-level without the possibility of
> moving elements within or between sub-arrays. What behavior we decide
> is best here should be documented.

Martin had originally proposed (2), which I rejected on the grounds
that we don't treat multi-dimensional arrays as arrays-of-arrays for
any other purpose.  Maybe we should have, but that ship sailed decades
ago, and I doubt that shuffle() is the place to start changing it.

I could get behind your option (1) though, to make it clearer that
the input array's dimensionality is not of interest.  Especially since,
as you say, (1) is pretty much the only sensible choice for array_sample.

> I also think you should add test cases involving multi-dimensional
> arrays, as well as arrays with non-default bounds. e.g. trying
> shuffling or sampling some values like
> '[8:10][-6:-5]={{1,2},{3,4},{5,6}}'::int[]

This'd only matter if we decide not to ignore the input's dimensionality.

            regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Tom Lane

Дата:

19 июля 2022 г., 00:03:01

I wrote:
> Martin had originally proposed (2), which I rejected on the grounds
> that we don't treat multi-dimensional arrays as arrays-of-arrays for
> any other purpose.

Actually, after poking at it for awhile, that's an overstatement.
It's true that the type system doesn't think N-D arrays are
arrays-of-arrays, but there are individual functions/operators that do.

For instance, @> is in the its-a-flat-list-of-elements camp:

regression=# select array[[1,2],[3,4]] @> array[1,3];
 ?column?
----------
 t
(1 row)

but || wants to preserve dimensionality:

regression=# select array[[1,2],[3,4]] || array[1];
ERROR:  cannot concatenate incompatible arrays
DETAIL:  Arrays with differing dimensions are not compatible for concatenation.

Various other functions dodge the issue by refusing to work on arrays
of more than one dimension.

There seem to be more functions that think arrays are flat than
otherwise, but it's not as black-and-white as I was thinking.

            regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

19 июля 2022 г., 00:48:55

Am 18.07.22 um 23:03 schrieb Tom Lane:
> I wrote:
>> Martin had originally proposed (2), which I rejected on the grounds
>> that we don't treat multi-dimensional arrays as arrays-of-arrays for
>> any other purpose.
> 
> Actually, after poking at it for awhile, that's an overstatement.
> It's true that the type system doesn't think N-D arrays are
> arrays-of-arrays, but there are individual functions/operators that do.
> Thanks Robert for pointing out the inconsistent behavior of 
array_sample(). That needs to be fixed.

As Tom's investigation showed, there is no consensus in the code if 
multi-dimensional arrays are treated as arrays-of-arrays or not. We need 
to decide what should be the correct treatment.

If we go with (1) array_shuffle() and array_sample() should shuffle each 
element individually and always return a one-dimensional array.

   select array_shuffle('{{1,2},{3,4},{5,6}}');
   -----------
    {1,4,3,5,6,2}

   select array_sample('{{1,2},{3,4},{5,6}}', 3);
   ----------
    {1,4,3}

If we go with (2) both functions should only operate on the first 
dimension and shuffle whole subarrays and keep the dimensions intact.

   select array_shuffle('{{1,2},{3,4},{5,6}}');
   ---------------------
    {{3,4},{1,2},{5,6}}

   select array_sample('{{1,2},{3,4},{5,6}}', 2);
   ---------------
    {{3,4},{1,2}}

I do not feel qualified to make that decision. (2) complicates the code 
a bit, but that should not be the main argument here.

Martin

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Tom Lane

Дата:

19 июля 2022 г., 01:18:40

Martin Kalcher <martin.kalcher@aboutsource.net> writes:
> If we go with (1) array_shuffle() and array_sample() should shuffle each 
> element individually and always return a one-dimensional array.

>    select array_shuffle('{{1,2},{3,4},{5,6}}');
>    -----------
>     {1,4,3,5,6,2}

>    select array_sample('{{1,2},{3,4},{5,6}}', 3);
>    ----------
>     {1,4,3}

Independently of the dimensionality question --- I'd imagined that
array_sample would select a random subset of the array elements
but keep their order intact.  If you want the behavior shown
above, you can do array_shuffle(array_sample(...)).  But if we
randomize it, and that's not what the user wanted, she has no
recourse.

Now, if you're convinced that the set of people wanting
sampling-without-shuffling is the empty set, then making everybody
else call two functions is a loser.  But I'm not convinced.
At the least, I'd like to see the argument made why nobody
would want that.

            regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

"David G. Johnston"

Дата:

19 июля 2022 г., 01:32:00

On Mon, Jul 18, 2022 at 3:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Independently of the dimensionality question --- I'd imagined that
array_sample would select a random subset of the array elements
but keep their order intact. If you want the behavior shown
above, you can do array_shuffle(array_sample(...)). But if we
randomize it, and that's not what the user wanted, she has no
recourse.

And for those that want to know in what order those elements were chosen they have no recourse in the other setup.

I really think this function needs to grow an algorithm argument that can be used to specify stuff like ordering, replacement/without-replacement, etc...just some enums separated by commas that can be added to the call.

David J.

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Tom Lane

Дата:

19 июля 2022 г., 01:43:32

"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Mon, Jul 18, 2022 at 3:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Independently of the dimensionality question --- I'd imagined that
>> array_sample would select a random subset of the array elements
>> but keep their order intact.  If you want the behavior shown
>> above, you can do array_shuffle(array_sample(...)).  But if we
>> randomize it, and that's not what the user wanted, she has no
>> recourse.

> And for those that want to know in what order those elements were chosen
> they have no recourse in the other setup.

Um ... why is "the order in which the elements were chosen" a concept
we want to expose?  ISTM sample() is a black box in which notionally
the decisions could all be made at once.

> I really think this function needs to grow an algorithm argument that can
> be used to specify stuff like ordering, replacement/without-replacement,
> etc...just some enums separated by commas that can be added to the call.

I think you might run out of gold paint somewhere around here.  I'm
still not totally convinced we should bother with the sample() function
at all, let alone that it needs algorithm variants.  At some point we
say to the user "here's a PL, write what you want for yourself".

            regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

19 июля 2022 г., 01:52:47

Am 19.07.22 um 00:18 schrieb Tom Lane:
> 
> Independently of the dimensionality question --- I'd imagined that
> array_sample would select a random subset of the array elements
> but keep their order intact.  If you want the behavior shown
> above, you can do array_shuffle(array_sample(...)).  But if we
> randomize it, and that's not what the user wanted, she has no
> recourse.
> 
> Now, if you're convinced that the set of people wanting
> sampling-without-shuffling is the empty set, then making everybody
> else call two functions is a loser.  But I'm not convinced.
> At the least, I'd like to see the argument made why nobody
> would want that.
> 

On the contrary! I am pretty sure there are people out there wanting 
sampling-without-shuffling. I will think about that.

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Thomas Munro

Дата:

19 июля 2022 г., 04:09:12

On Tue, Jul 19, 2022 at 8:15 AM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:
> Am 18.07.22 um 21:29 schrieb Tom Lane:
> > The preferred thing to do is to add it to our "commitfest" queue,
> > which will ensure that it gets looked at eventually.  The currently
> > open cycle is 2022-09 [1] (see the "New Patch" button there).
>
> Thanks Tom, did that.

FYI that brought your patch to the attention of this CI robot, which
is showing a couple of warnings.  See the FAQ link there for an
explanation of that infrastructure.

http://cfbot.cputube.org/martin-kalcher.html

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

19 июля 2022 г., 10:29:03

Am 19.07.22 um 00:52 schrieb Martin Kalcher:
> 
> On the contrary! I am pretty sure there are people out there wanting 
> sampling-without-shuffling. I will think about that.

I gave it some thought. Even though there might be use cases, where a 
stable order is desired, i would consider them edge cases, not worth the 
additional complexity. I personally would not expect array_sample() to 
return elements in any specific order. I looked up some sample() 
implementations. None of them makes guarantees about the order of the 
resulting array or explicitly states that the resulting array is in 
random or selection order.

- Python random.sample [0]
- Ruby Array#sample [1]
- Rust rand::seq::SliceRandom::choose_multiple [2]
- Julia StatsBase.sample [3] stable order needs explicit request

[0] https://docs.python.org/3/library/random.html#random.sample
[1] https://ruby-doc.org/core-3.0.0/Array.html#method-i-sample
[2] 
https://docs.rs/rand/0.6.5/rand/seq/trait.SliceRandom.html#tymethod.choose_multiple
[3] https://juliastats.org/StatsBase.jl/stable/sampling/#StatsBase.sample

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Aleksander Alekseev

Дата:

19 июля 2022 г., 13:07:29

Hi Martin,

I didn't look at the code yet but I very much like the idea. Many
thanks for working on this!

It's a pity your patch was too late for the July commitfest. In
September, please keep an eye on cfbot [1] to make sure your patch
applies properly.

> As Tom's investigation showed, there is no consensus in the code if
> multi-dimensional arrays are treated as arrays-of-arrays or not. We need
> to decide what should be the correct treatment.

Here are my two cents.

From the user perspective I would expect that by default a
multidimensional array should be treated as an array of arrays. So for
instance:

```
array_shuffle([ [1,2], [3,4], [5,6] ])
```

... should return something like:

```
[ [3,4], [1,2], [5,6] ]
```

Note that the order of the elements in the internal arrays is preserved.

However, I believe there should be an optional argument that overrides
this behavior. For instance:

```
array_shuffle([ [1,2], [3,4], [5,6] ], depth => 2)
```

BTW, while on it, shouldn't we add similar functions for JSON and/or
JSONB? Or is this going to be too much for a single discussion?

[1]: http://cfbot.cputube.org/

-- 
Best regards,
Aleksander Alekseev

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

19 июля 2022 г., 14:15:42

Am 18.07.22 um 23:48 schrieb Martin Kalcher:
> 
> If we go with (1) array_shuffle() and array_sample() should shuffle each 
> element individually and always return a one-dimensional array.
> 
>    select array_shuffle('{{1,2},{3,4},{5,6}}');
>    -----------
>     {1,4,3,5,6,2}
> 
>    select array_sample('{{1,2},{3,4},{5,6}}', 3);
>    ----------
>     {1,4,3}
> 
> If we go with (2) both functions should only operate on the first 
> dimension and shuffle whole subarrays and keep the dimensions intact.
> 
>    select array_shuffle('{{1,2},{3,4},{5,6}}');
>    ---------------------
>     {{3,4},{1,2},{5,6}}
> 
>    select array_sample('{{1,2},{3,4},{5,6}}', 2);
>    ---------------
>     {{3,4},{1,2}}
> 

Having thought about it, i would go with (2). It gives the user the 
ability to decide wether or not array-of-arrays behavior is desired. If 
he wants the behavior of (1) he can flatten the array before applying 
array_shuffle(). Unfortunately there is no array_flatten() function (at 
the moment) and the user would have to work around it with unnest() and 
array_agg().

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Robert Haas

Дата:

19 июля 2022 г., 16:05:23

On Mon, Jul 18, 2022 at 6:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Um ... why is "the order in which the elements were chosen" a concept
> we want to expose?  ISTM sample() is a black box in which notionally
> the decisions could all be made at once.

I agree with that. But I also think it's fine for the elements to be
returned in a shuffled order rather than the original order.

> > I really think this function needs to grow an algorithm argument that can
> > be used to specify stuff like ordering, replacement/without-replacement,
> > etc...just some enums separated by commas that can be added to the call.
>
> I think you might run out of gold paint somewhere around here.  I'm
> still not totally convinced we should bother with the sample() function
> at all, let alone that it needs algorithm variants.  At some point we
> say to the user "here's a PL, write what you want for yourself".

I don't know what gold paint has to do with anything here, but I agree
that David's proposal seems to be moving the goalposts a very long
way.

The thing is, as Martin points out, these functions already exist in a
bunch of other systems. For one example I've used myself, see
https://underscorejs.org/

I probably wouldn't have called a function to put a list into a random
order "shuffle" in a vacuum, but it seems to be common nomenclature
these days. I believe that if you don't make reference to Fisher-Yates
in the documentation, they kick you out of the cool programmers club.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Andrew Dunstan

Дата:

19 июля 2022 г., 16:53:44

On 2022-07-19 Tu 07:15, Martin Kalcher wrote:
> Am 18.07.22 um 23:48 schrieb Martin Kalcher:
>>
>> If we go with (1) array_shuffle() and array_sample() should shuffle
>> each element individually and always return a one-dimensional array.
>>
>>    select array_shuffle('{{1,2},{3,4},{5,6}}');
>>    -----------
>>     {1,4,3,5,6,2}
>>
>>    select array_sample('{{1,2},{3,4},{5,6}}', 3);
>>    ----------
>>     {1,4,3}
>>
>> If we go with (2) both functions should only operate on the first
>> dimension and shuffle whole subarrays and keep the dimensions intact.
>>
>>    select array_shuffle('{{1,2},{3,4},{5,6}}');
>>    ---------------------
>>     {{3,4},{1,2},{5,6}}
>>
>>    select array_sample('{{1,2},{3,4},{5,6}}', 2);
>>    ---------------
>>     {{3,4},{1,2}}
>>
>
> Having thought about it, i would go with (2). It gives the user the
> ability to decide wether or not array-of-arrays behavior is desired.
> If he wants the behavior of (1) he can flatten the array before
> applying array_shuffle(). Unfortunately there is no array_flatten()
> function (at the moment) and the user would have to work around it
> with unnest() and array_agg().
>
>


Why not have an optional second parameter for array_shuffle that
indicates whether or not to flatten? e.g. array_shuffle(my_array,
flatten => true)


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Robert Haas

Дата:

19 июля 2022 г., 16:56:41

On Tue, Jul 19, 2022 at 9:53 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> > Having thought about it, i would go with (2). It gives the user the
> > ability to decide wether or not array-of-arrays behavior is desired.
> > If he wants the behavior of (1) he can flatten the array before
> > applying array_shuffle(). Unfortunately there is no array_flatten()
> > function (at the moment) and the user would have to work around it
> > with unnest() and array_agg().
>
> Why not have an optional second parameter for array_shuffle that
> indicates whether or not to flatten? e.g. array_shuffle(my_array,
> flatten => true)

IMHO, if we think that's something many people are going to want, it
would be better to add an array_flatten() function, because that could
be used for a variety of purposes, whereas an option to this function
can only be used for this function.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Tom Lane

Дата:

19 июля 2022 г., 17:20:23

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jul 19, 2022 at 9:53 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> Why not have an optional second parameter for array_shuffle that
>> indicates whether or not to flatten? e.g. array_shuffle(my_array,
>> flatten => true)

> IMHO, if we think that's something many people are going to want, it
> would be better to add an array_flatten() function, because that could
> be used for a variety of purposes, whereas an option to this function
> can only be used for this function.

Agreed.  Whether it's really needed, I dunno --- I don't recall the
issue having come up before.

After taking a second look through
https://www.postgresql.org/docs/current/functions-array.html
it seems like the things that treat arrays as flat even when they
are multi-D are just

* unnest(), which is more or less forced into that position by our
type system: it has to be anyarray returning anyelement, not
anyarray returning anyelement-or-anyarray-depending.

* array_to_string(), which maybe could have done it differently but
can reasonably be considered a variant of unnest().

* The containment/overlap operators, which are kind of their own
thing anyway.  Arguably they got this wrong, though I suppose it's
too late to rethink that.

Everything else either explicitly rejects more-than-one-D arrays
or does something that is compatible with thinking of them as
arrays-of-arrays.

So I withdraw my original position.  These functions should just
shuffle or select in the array's first dimension, preserving
subarrays.  Or else be lazy and reject more-than-one-D arrays;
but it's probably not that hard to handle them.

            regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

19 июля 2022 г., 23:20:57

Am 19.07.22 um 16:20 schrieb Tom Lane:
> 
> So I withdraw my original position.  These functions should just
> shuffle or select in the array's first dimension, preserving
> subarrays.  Or else be lazy and reject more-than-one-D arrays;
> but it's probably not that hard to handle them.
> 

Here is a patch with dimension aware array_shuffle() and array_sample().

If you think array_flatten() is desirable, i can take a look at it. 
Maybe a second parameter would be nice to specify the target dimension:

   select array_flatten(array[[[1,2],[3,4]],[[5,6],[7,8]]], 1);
   -------------------
    {1,2,3,4,5,6,7,8}

   select array_flatten(array[[[1,2],[3,4]],[[5,6],[7,8]]], 2);
   -----------------------
    {{1,2,3,4},{5,6,7,8}}

Martin

Вложения

0001-Introduce-array_shuffle-and-array_sample.patch

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Dean Rasheed

Дата:

21 июля 2022 г., 11:41:30

On Tue, 19 Jul 2022 at 21:21, Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:
>
> Here is a patch with dimension aware array_shuffle() and array_sample().
>

+1 for this feature, and this way of handling multi-dimensional arrays.

> If you think array_flatten() is desirable, i can take a look at it.

That's not something I've ever wanted -- personally, I rarely use
multi-dimensional arrays in Postgres.

A couple of quick comments on the current patch:

It's important to mark these new functions as VOLATILE, not IMMUTABLE,
otherwise they won't work as expected in queries. See
https://www.postgresql.org/docs/current/xfunc-volatility.html

It would be better to use pg_prng_uint64_range() rather than rand() to
pick elements. Partly, that's because it uses a higher quality PRNG,
with a larger internal state, and it ensures that the results are
unbiased across the range. But more importantly, it interoperates with
setseed(), allowing predictable sequences of "random" numbers to be
generated -- something that's useful in writing repeatable regression
tests.

Assuming these new functions are made to interoperate with setseed(),
which I think they should be, then they also need to be marked as
PARALLEL RESTRICTED, rather than PARALLEL SAFE. See
https://www.postgresql.org/docs/current/parallel-safety.html, which
explains why setseed() and random() are parallel restricted.

In my experience, the requirement for sampling with replacement is
about as common as the requirement for sampling without replacement,
so it seems odd to provide one but not the other. Of course, we can
always add a with-replacement function later, and give it a different
name. But maybe array_sample() could be used for both, with a
"with_replacement" boolean parameter?

Regards,
Dean

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

21 июля 2022 г., 14:15:43

Am 21.07.22 um 10:41 schrieb Dean Rasheed:
> 
> A couple of quick comments on the current patch:

Thank you for your feedback!

> It's important to mark these new functions as VOLATILE, not IMMUTABLE,
> otherwise they won't work as expected in queries. See
> https://www.postgresql.org/docs/current/xfunc-volatility.html

CREATE FUNCTION marks functions as VOLATILE by default if not explicitly 
marked otherwise. I assumed function definitions in pg_proc.dat have the 
same behavior. I will fix that.
https://www.postgresql.org/docs/current/sql-createfunction.html

> It would be better to use pg_prng_uint64_range() rather than rand() to
> pick elements. Partly, that's because it uses a higher quality PRNG,
> with a larger internal state, and it ensures that the results are
> unbiased across the range. But more importantly, it interoperates with
> setseed(), allowing predictable sequences of "random" numbers to be
> generated -- something that's useful in writing repeatable regression
> tests.

I agree that we should use pg_prng_uint64_range(). However, in order to 
achieve interoperability with setseed() we would have to use 
drandom_seed (rather than pg_global_prng_state) as rng state, which is 
declared statically in float.c and exclusively used by random(). Do we 
want to expose drandom_seed to other functions?

> Assuming these new functions are made to interoperate with setseed(),
> which I think they should be, then they also need to be marked as
> PARALLEL RESTRICTED, rather than PARALLEL SAFE. See
> https://www.postgresql.org/docs/current/parallel-safety.html, which
> explains why setseed() and random() are parallel restricted.

As mentioned above, i assumed the default here is PARALLEL UNSAFE. I'll 
fix that.

> In my experience, the requirement for sampling with replacement is
> about as common as the requirement for sampling without replacement,
> so it seems odd to provide one but not the other. Of course, we can
> always add a with-replacement function later, and give it a different
> name. But maybe array_sample() could be used for both, with a
> "with_replacement" boolean parameter?

We can also add a "with_replacement" boolean parameter which is false by 
default to array_sample() later. I do not have a strong opinion about 
that and will implement it, if desired. Any opinions about 
default/no-default?

Martin

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Dean Rasheed

Дата:

21 июля 2022 г., 15:25:27

On Thu, 21 Jul 2022 at 12:15, Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:
>
> I agree that we should use pg_prng_uint64_range(). However, in order to
> achieve interoperability with setseed() we would have to use
> drandom_seed (rather than pg_global_prng_state) as rng state, which is
> declared statically in float.c and exclusively used by random(). Do we
> want to expose drandom_seed to other functions?
>

Ah, I didn't realise that setseed() and random() were bound up so
tightly. It does feel as though, if we're adding more user-facing
functions that return random sequences, there ought to be a way to
seed them, and I wouldn't want to have separate setseed functions for
each one.

I'm inclined to say that we want a new pg_global_prng_user_state that
is updated by setseed(), and used by random(), array_shuffle(),
array_sample(), and any other user-facing random functions we add
later.

I can also see that others might not like expanding the scope of
setseed() in this way.

Regards,
Dean

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

21 июля 2022 г., 18:43:02

Am 21.07.22 um 14:25 schrieb Dean Rasheed:
> 
> I'm inclined to say that we want a new pg_global_prng_user_state that
> is updated by setseed(), and used by random(), array_shuffle(),
> array_sample(), and any other user-facing random functions we add
> later.
> 

I like the idea. How would you organize the code? I imagine some sort of 
user prng that is encapsulated in something like utils/adt/random.c and 
is guaranteed to always be seeded.

Martin

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

21 июля 2022 г., 20:29:12

Am 21.07.22 um 10:41 schrieb Dean Rasheed:
> 
> It's important to mark these new functions as VOLATILE, not IMMUTABLE,
> otherwise they won't work as expected in queries. See
> https://www.postgresql.org/docs/current/xfunc-volatility.html
> 
> It would be better to use pg_prng_uint64_range() rather than rand() to
> pick elements. Partly, that's because it uses a higher quality PRNG,
> with a larger internal state, and it ensures that the results are
> unbiased across the range. But more importantly, it interoperates with
> setseed(), allowing predictable sequences of "random" numbers to be
> generated -- something that's useful in writing repeatable regression
> tests.
> 
> Assuming these new functions are made to interoperate with setseed(),
> which I think they should be, then they also need to be marked as
> PARALLEL RESTRICTED, rather than PARALLEL SAFE. See
> https://www.postgresql.org/docs/current/parallel-safety.html, which
> explains why setseed() and random() are parallel restricted.
> 

Here is an updated patch that marks the functions VOLATILE PARALLEL 
RESTRICTED and uses pg_prng_uint64_range() rather than rand().

Вложения

0001-Introduce-array_shuffle-and-array_sample.patch

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Dean Rasheed

Дата:

22 июля 2022 г., 10:59:20

On Thu, 21 Jul 2022 at 16:43, Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:
>
> Am 21.07.22 um 14:25 schrieb Dean Rasheed:
> >
> > I'm inclined to say that we want a new pg_global_prng_user_state that
> > is updated by setseed(), and used by random(), array_shuffle(),
> > array_sample(), and any other user-facing random functions we add
> > later.
> >
>
> I like the idea. How would you organize the code? I imagine some sort of
> user prng that is encapsulated in something like utils/adt/random.c and
> is guaranteed to always be seeded.
>

Something like that, perhaps. I can see 2 ways it could go:

Option 1:
 Keep random.c small
 - Responsible for initialisation of the user prng on demand
 - Expose the user prng state to other code like float.c and arrayfuncs.c

Option 2:
 Move all random functions wanting to use the user prng to random.c
 - Starting with drandom() and setseed()
 - Then, array_shuffle() and array_sample()
 - Later, any new SQL-callable random functions we might add
 - Keep the user prng state local to random.c

Option 2 seems quite appealing, because it keeps all SQL-callable
random functions together in one place, and keeps the state local,
making it easier to keep track of which functions are using it.

Code like the Fisher-Yates algorithm is more to do with random than it
is to do with arrays, so putting it in random.c seems to make more
sense.

It's possible that some hypothetical random function might need access
to type-specific internals. For example, if we wanted to add a
function to return a random numeric, it would probably have to go in
numeric.c, but that could be solved by having random.c call numeric.c,
passing it the prng state.

Regards,
Dean

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

22 июля 2022 г., 12:31:07

Am 22.07.22 um 09:59 schrieb Dean Rasheed:>
> Option 1:
>   Keep random.c small
>   - Responsible for initialisation of the user prng on demand
>   - Expose the user prng state to other code like float.c and arrayfuncs.c
> 
> Option 2:
>   Move all random functions wanting to use the user prng to random.c
>   - Starting with drandom() and setseed()
>   - Then, array_shuffle() and array_sample()
>   - Later, any new SQL-callable random functions we might add
>   - Keep the user prng state local to random.c
> 

Hey Dean,

i came to the same conclusions and went with Option 1 (see patch). 
Mainly because most code in utils/adt is organized by type and this way 
it is clear, that this is a thin wrapper around pg_prng.

What do you think?

Вложения

0001-Introduce-array_shuffle-and-array_sample.patch

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Dean Rasheed

Дата:

22 июля 2022 г., 14:04:06

On Fri, 22 Jul 2022 at 10:31, Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:
>
> i came to the same conclusions and went with Option 1 (see patch).
> Mainly because most code in utils/adt is organized by type and this way
> it is clear, that this is a thin wrapper around pg_prng.
>
> What do you think?

Looks fairly neat, on a quick read-through. There's certainly
something to be said for preserving the organisation by type.

Regards,
Dean

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Joe Conway

Дата:

22 июля 2022 г., 19:49:10

On 7/19/22 10:20, Tom Lane wrote:
> Everything else either explicitly rejects more-than-one-D arrays
> or does something that is compatible with thinking of them as
> arrays-of-arrays.

I think I am responsible for at least some of those, and I agree that 
thinking of MD arrays as arrays-of-arrays is preferable even though they 
are not actually that. Long ago[1] Peter E asked me to fix that as I 
recall but it was one of those round tuits that I never found.

> So I withdraw my original position.  These functions should just
> shuffle or select in the array's first dimension, preserving
> subarrays.  Or else be lazy and reject more-than-one-D arrays;
> but it's probably not that hard to handle them.

+1

Joe

[1] 

https://www.postgresql.org/message-id/flat/Pine.LNX.4.44.0306281418020.2178-100000%40peter.localdomain#a064d6dd8593993d799db453a3ee04d1

-- 
Joe Conway
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

23 июля 2022 г., 15:20:47

Am 22.07.22 um 11:31 schrieb Martin Kalcher:
> 
> i came to the same conclusions and went with Option 1 (see patch). 
> Mainly because most code in utils/adt is organized by type and this way 
> it is clear, that this is a thin wrapper around pg_prng.
> 

Small patch update. I realized the new functions should live 
array_userfuncs.c (rather than arrayfuncs.c), fixed some file headers 
and added some comments to the code.

Вложения

0001-Introduce-array_shuffle-and-array_sample.patch

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Fabien COELHO

Дата:

24 июля 2022 г., 11:15:22

>> i came to the same conclusions and went with Option 1 (see patch). Mainly 
>> because most code in utils/adt is organized by type and this way it is 
>> clear, that this is a thin wrapper around pg_prng.
>> 
>
> Small patch update. I realized the new functions should live 
> array_userfuncs.c (rather than arrayfuncs.c), fixed some file headers and 
> added some comments to the code.

My 0,02€ about this patch:

Use (void) when declaring no parameters in headers or in functions.

Should the exchange be skipped when i == k?

I do not see the point of having *only* inline functions in a c file. The 
external functions should not be inlined?

The random and array shuffling functions share a common state. I'm 
wondering whether it should ot should not be so. It seems ok, but then 
ISTM that the documentation suggests implicitely that setseed applies to 
random() only, which is not the case anymore after the patch.

If more samples are required than the number of elements, it does not 
error out. I'm wondering whether it should.

Also, the sampling should not return its result in order when the number 
of elements required is the full array, ISTM that it should be shuffled 
there as well.

I must say that without significant knowledge of the array internal 
implementation, the swap code looks pretty strange. ISTM that the code 
would be clearer if pointers and array references style were not 
intermixed.

Maybe you could add a test with a 3D array? Some sample with NULLs?

Unrelated: I notice again that (postgre)SQL does not provide a way to 
generate random integers. I do not see why not. Should we provide one?

-- 
Fabien.

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

24 июля 2022 г., 22:29:39

Am 24.07.22 um 10:15 schrieb Fabien COELHO:
> 
> My 0,02€ about this patch:

Thank you for your feedback. I attached a patch, that addresses most of 
your points.

> Use (void) when declaring no parameters in headers or in functions.

Done.

> Should the exchange be skipped when i == k?

The additional branch is actually slower (on my machine, tested with an 
10M element int array) for 1D-arrays, which i assume is the most common 
case. Even with a 2D-array with a subarray size of 1M there is no 
difference in execution time. i == k should be relatively rare for large 
arrays, given a good prng.

> I do not see the point of having *only* inline functions in a c file. 
> The external functions should not be inlined?

Done.

> The random and array shuffling functions share a common state. I'm 
> wondering whether it should ot should not be so. It seems ok, but then 
> ISTM that the documentation suggests implicitely that setseed applies to 
> random() only, which is not the case anymore after the patch.

I really like the idea of a prng state owned by the user, that is used 
by all user facing random functions, but see that their might be 
concerns about this change. I would update the setseed() documentation, 
if this proposal is accepted. Do you think my patch should already 
include the documentation update?

> If more samples are required than the number of elements, it does not 
> error out. I'm wondering whether it should.

The second argument to array_sample() is treated like a LIMIT, rather 
than the actual number of elements. I updated the documentation. My 
use-case is: take max random items from an array of unknown size.

> Also, the sampling should not return its result in order when the number 
> of elements required is the full array, ISTM that it should be shuffled 
> there as well.

You are the second person asking for this. It's done. I am thinking 
about ditching array_sample() and replace it with a second signature for 
array_shuffle():

   array_shuffle(array anyarray) -> anyarray
   array_shuffle(array anyarray, limit int) -> anyarray

What do you think?

> I must say that without significant knowledge of the array internal 
> implementation, the swap code looks pretty strange. ISTM that the code 
> would be clearer if pointers and array references style were not 
> intermixed.

Done. Went with pointers.

> Maybe you could add a test with a 3D array? Some sample with NULLs?

Done.

> Unrelated: I notice again that (postgre)SQL does not provide a way to 
> generate random integers. I do not see why not. Should we provide one?

Maybe. I think it is outside the scope of this patch.

Thank you, Martin

Вложения

0001-Introduce-array_shuffle-and-array_sample.patch

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Fabien COELHO

Дата:

24 июля 2022 г., 22:42:48

Hello,

> Thank you for your feedback. I attached a patch, that addresses most of your 
> points.

I'll look into it. It would help if the patch could include a version 
number at the end.

>> Should the exchange be skipped when i == k?
>
> The additional branch is actually slower (on my machine, tested with an 10M 
> element int array) for 1D-arrays, which i assume is the most common case. 
> Even with a 2D-array with a subarray size of 1M there is no difference in 
> execution time. i == k should be relatively rare for large arrays, given a 
> good prng.

Ok, slower is bad.

>> The random and array shuffling functions share a common state. I'm 
>> wondering whether it should ot should not be so. It seems ok, but then ISTM 
>> that the documentation suggests implicitely that setseed applies to 
>> random() only, which is not the case anymore after the patch.
>
> I really like the idea of a prng state owned by the user, that is used by all 
> user facing random functions, but see that their might be concerns about this 
> change. I would update the setseed() documentation, if this proposal is 
> accepted. Do you think my patch should already include the documentation 
> update?

Duno. I'm still wondering what it should do. I'm pretty sure that the 
documentation should be clear about a shared seed, if any. I do not think 
that departing from the standard is a good thing, either.

>> If more samples are required than the number of elements, it does not error 
>> out. I'm wondering whether it should.
>
> The second argument to array_sample() is treated like a LIMIT, rather than 
> the actual number of elements. I updated the documentation. My use-case is: 
> take max random items from an array of unknown size.

Hmmm. ISTM that the use case of wanting "this many" stuff would make more 
sense because it is strictier so less likely to result in unforseen 
consequences. On principle I do not like not knowing the output size.

If someone wants a limit, they can easily "LEAST(#1 dim size, other 
limit)" to get it, it is easy enough with a strict function.

>> Also, the sampling should not return its result in order when the number of 
>> elements required is the full array, ISTM that it should be shuffled there 
>> as well.
>
> You are the second person asking for this. It's done. I am thinking about 
> ditching array_sample() and replace it with a second signature for 
> array_shuffle():
>
>  array_shuffle(array anyarray) -> anyarray
>  array_shuffle(array anyarray, limit int) -> anyarray
>
> What do you think?

I think that shuffle means shuffle, not partial shuffle, so a different 
name seems better to me.

-- 
Fabien.

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

25 июля 2022 г., 10:34:31

Am 24.07.22 um 21:42 schrieb Fabien COELHO:
> 
> Duno. I'm still wondering what it should do. I'm pretty sure that the 
> documentation should be clear about a shared seed, if any. I do not 
> think that departing from the standard is a good thing, either.

Are sure it violates the standard? I could not find anything about it. 
The private prng state for random() was introduced in 2018 [0]. Neither 
commit nor discussion mentions any standard compliance.

[0] 
https://www.postgresql.org/message-id/E1gdNAo-00036g-TB%40gemulon.postgresql.org

I updated the documentation for setseed().

> If someone wants a limit, they can easily "LEAST(#1 dim size, other 
> limit)" to get it, it is easy enough with a strict function.

Convinced. It errors out now if n is out of bounds.

Martin

Вложения

v2-0001-Introduce-array_shuffle-and-array_sample.patch

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

04 августа 2022 г., 08:46:10

Patch update without merge conflicts.

Martin

Вложения

v3-0001-Introduce-array_shuffle-and-array_sample.patch

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Andres Freund

Дата:

22 сентября 2022 г., 18:23:20

Hi,

On 2022-08-04 07:46:10 +0200, Martin Kalcher wrote:
> Patch update without merge conflicts.

Due to the merge of the meson based build, this patch needs to be adjusted. See
https://cirrus-ci.com/build/6580671765282816
Looks like it'd just be adding user_prng.c to
src/backend/utils/adt/meson.build's list.

Greetings,

Andres Freund

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

22 сентября 2022 г., 18:40:17

Am 22.09.22 um 17:23 schrieb Andres Freund:
> Hi,
> 
> On 2022-08-04 07:46:10 +0200, Martin Kalcher wrote:
>> Patch update without merge conflicts.
> 
> Due to the merge of the meson based build, this patch needs to be adjusted. See
> https://cirrus-ci.com/build/6580671765282816
> Looks like it'd just be adding user_prng.c to
> src/backend/utils/adt/meson.build's list.
> 
> Greetings,
> 
> Andres Freund

Hi Andres,

thanks for the heads up. Adjusted patch is attached.

- Martin

Вложения

v4-0001-Introduce-array_shuffle-and-array_sample.patch

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Tom Lane

Дата:

26 сентября 2022 г., 23:16:25

Martin Kalcher <martin.kalcher@aboutsource.net> writes:
> [ v4-0001-Introduce-array_shuffle-and-array_sample.patch ]

I think this idea of exporting drandom()'s PRNG for all and sundry
to use is completely misguided.  If we go down that path we'll
be right back in the swamp that we were in when we used random(3),
namely that (a) it's not clear what affects what, and (b) to the
extent that there are such interferences, it could be bad, maybe
even a security problem.  We very intentionally decoupled drandom()
from the rest of the world at commit 6645ad6bd, and I'm not ready
to unlearn that lesson.

With our current PRNG infrastructure it doesn't cost much to have
a separate PRNG for every purpose.  I don't object to having
array_shuffle() and array_sample() share one PRNG, but I don't
think it should go much further than that.

            regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

28 сентября 2022 г., 13:40:31

Am 26.09.22 um 22:16 schrieb Tom Lane:
> 
> With our current PRNG infrastructure it doesn't cost much to have
> a separate PRNG for every purpose.  I don't object to having
> array_shuffle() and array_sample() share one PRNG, but I don't
> think it should go much further than that.
> 

Thanks for your thoughts, Tom. I have a couple of questions. Should we 
introduce a new seed function for the new PRNG state, used by 
array_shuffle() and array_sample()? What would be a good name? Or should 
those functions use pg_global_prng_state? Is it safe to assume, that 
pg_global_prng_state is seeded?

Martin

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Fabien COELHO

Дата:

28 сентября 2022 г., 16:07:06

>> With our current PRNG infrastructure it doesn't cost much to have
>> a separate PRNG for every purpose.  I don't object to having
>> array_shuffle() and array_sample() share one PRNG, but I don't
>> think it should go much further than that.
>
> Thanks for your thoughts, Tom. I have a couple of questions. Should we 
> introduce a new seed function for the new PRNG state, used by array_shuffle() 
> and array_sample()? What would be a good name? Or should those functions use 
> pg_global_prng_state? Is it safe to assume, that pg_global_prng_state is 
> seeded?

I'd suggest to use the existing global state. The global state should be 
seeded at the process start, AFAIKR.

-- 
Fabien.

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Tom Lane

Дата:

28 сентября 2022 г., 17:18:45

Fabien COELHO <coelho@cri.ensmp.fr> writes:
>> Thanks for your thoughts, Tom. I have a couple of questions. Should we
>> introduce a new seed function for the new PRNG state, used by array_shuffle()
>> and array_sample()? What would be a good name? Or should those functions use
>> pg_global_prng_state? Is it safe to assume, that pg_global_prng_state is
>> seeded?

> I'd suggest to use the existing global state. The global state should be
> seeded at the process start, AFAIKR.

It is seeded at process start, yes.  If you don't feel a need for
user control over the sequence used by these functions, then using
pg_global_prng_state is fine.  (Basically the point to be made
here is that we need to keep a tight rein on what can be affected
by setseed().)

            regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Martin Kalcher

Дата:

29 сентября 2022 г., 12:39:28

Am 28.09.22 um 16:18 schrieb Tom Lane:
> It is seeded at process start, yes.  If you don't feel a need for
> user control over the sequence used by these functions, then using
> pg_global_prng_state is fine.  (Basically the point to be made
> here is that we need to keep a tight rein on what can be affected
> by setseed().)
> 
>             regards, tom lane

New patch: array_shuffle() and array_sample() use pg_global_prng_state now.

Martin

Вложения

v5-0001-Introduce-array_shuffle-and-array_sample.patch

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Tom Lane

Дата:

29 сентября 2022 г., 22:33:26

Martin Kalcher <martin.kalcher@aboutsource.net> writes:
> New patch: array_shuffle() and array_sample() use pg_global_prng_state now.

I took a closer look at the patch today.  I find this behavior a bit
surprising:

+SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
+ array_dims
+-------------
+ [-1:1][2:3]
+(1 row)

I can buy preserving the lower bound in array_shuffle(), but
array_sample() is not preserving the first-dimension indexes of
the array, so ISTM it ought to reset the first lower bound to 1.

Some other comments:

+        Returns <parameter>n</parameter> randomly chosen elements from <parameter>array</parameter> in selection
order.

What's "selection order"?  And this probably shouldn't just rely
on the example to describe what happens with multi-D arrays.
Writing "elements" seems somewhere between confusing and wrong.

* Personally I think I'd pass the TypeCacheEntry pointer to array_shuffle_n,
and let it pull out what it needs.  Less duplicate code that way.

* I find array_shuffle_n drastically undercommented, and what comments
it has are pretty misleading, eg

+        /* Swap all elements in item (i) with elements in item (j). */

j is *not* the index of the second item to be swapped.  You could make
it so, and that might be more readable:

        j = (int) pg_prng_uint64_range(&pg_global_prng_state, i, nitem - 1);
        jelms = elms + j * nelm;
        jnuls = nuls + j * nelm;

But if you want the code to stay as it is, this comment needs work.

* I think good style for SQL-callable C functions is to make the arguments
clear at the top:

+array_sample(PG_FUNCTION_ARGS)
+{
+    ArrayType  *array = PG_GETARG_ARRAYTYPE_P(0);
+    int         n = PG_GETARG_INT32(1);
+    ArrayType  *result;
+    ... other declarations as needed ...

We can't quite make normal C declaration style work, but that's a poor
excuse for not making the argument list visible as directly as possible.

* I wouldn't bother with the PG_FREE_IF_COPY calls.  Those are generally
only used in btree comparison functions, in which there's a need to not
leak memory.

* I wonder if we really need these to be proparallel = 'r'.  If we let
a worker process execute them, they will take numbers from the worker's
pg_global_prng_state seeding not the parent process's seeding, but why
is that a problem?  In the prior version where you were tying them
to the parent's drandom() sequence, proparallel = 'r' made sense;
but now I think it's unnecessary.

            regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

"Gregory Stark (as CFM)"

Дата:

20 марта 2023 г., 22:34:55

On Thu, 29 Sept 2022 at 15:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Martin Kalcher <martin.kalcher@aboutsource.net> writes:
> > New patch: array_shuffle() and array_sample() use pg_global_prng_state now.
>
> I took a closer look at the patch today.  I find this behavior a bit
> surprising:
>

It looks like this patch received useful feedback and it wouldn't take
much to push it over the line. But it's been Waiting On Author since
last September.

Martin, any chance of getting these last bits of feedback resolved so
it can be Ready for Commit?

-- 
Gregory Stark
As Commitfest Manager

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

"Gregory Stark (as CFM)"

Дата:

04 апреля 2023 г., 00:16:22

Given that there's been no updates since September 22 I'm going to
make this patch Returned with Feedback. The patch can be resurrected
when there's more work done -- don't forget to move it to the new CF
when you do that.


-- 
Gregory Stark
As Commitfest Manager

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Daniel Gustafsson

Дата:

04 апреля 2023 г., 00:25:39

> On 29 Sep 2022, at 21:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Martin Kalcher <martin.kalcher@aboutsource.net> writes:
>> New patch: array_shuffle() and array_sample() use pg_global_prng_state now.
>
> I took a closer look at the patch today.

Since this seems pretty close to going in, and seems like quite useful
functions, I took a look to see if I could get it across the line (although I
noticed that CFM beat me to the clock in sending this =)).

>  I find this behavior a bit surprising:
>
> +SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
> + array_dims
> +-------------
> + [-1:1][2:3]
> +(1 row)
>
> I can buy preserving the lower bound in array_shuffle(), but
> array_sample() is not preserving the first-dimension indexes of
> the array, so ISTM it ought to reset the first lower bound to 1.

I might be daft but I'm not sure I follow why not preserving here, can you
explain?

The rest of your comments have been addressed in the attached v6 I think
(although I'm pretty sure the docs part is just as bad now, explaining these in
concise words is hard, will take another look with fresh eyes tomorrow).

--
Daniel Gustafsson

Вложения

v6-0001-Introduce-array_shuffle-and-array_sample.patch

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Tom Lane

Дата:

04 апреля 2023 г., 00:46:57

Daniel Gustafsson <daniel@yesql.se> writes:
> On 29 Sep 2022, at 21:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I find this behavior a bit surprising:
>>
>> +SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
>> + array_dims
>> +-------------
>> + [-1:1][2:3]
>> +(1 row)
>>
>> I can buy preserving the lower bound in array_shuffle(), but
>> array_sample() is not preserving the first-dimension indexes of
>> the array, so ISTM it ought to reset the first lower bound to 1.

> I might be daft but I'm not sure I follow why not preserving here, can you
> explain?

Because array_sample selects only some of the (first level) array
elements, those elements are typically not going to have the same
indexes in the output as they did in the input.  So I don't see why
it would be useful to preserve the same lower-bound index.  It does
make sense to preserve the lower-order index bounds ([2:3] in this
example) because we are including or not including those array
slices as a whole.

            regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Daniel Gustafsson

Дата:

04 апреля 2023 г., 00:52:31

> On 3 Apr 2023, at 23:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Daniel Gustafsson <daniel@yesql.se> writes:
>> On 29 Sep 2022, at 21:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I find this behavior a bit surprising:
>>>
>>> +SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
>>> + array_dims
>>> +-------------
>>> + [-1:1][2:3]
>>> +(1 row)
>>>
>>> I can buy preserving the lower bound in array_shuffle(), but
>>> array_sample() is not preserving the first-dimension indexes of
>>> the array, so ISTM it ought to reset the first lower bound to 1.
>
>> I might be daft but I'm not sure I follow why not preserving here, can you
>> explain?
>
> Because array_sample selects only some of the (first level) array
> elements, those elements are typically not going to have the same
> indexes in the output as they did in the input.  So I don't see why
> it would be useful to preserve the same lower-bound index.  It does
> make sense to preserve the lower-order index bounds ([2:3] in this
> example) because we are including or not including those array
> slices as a whole.

Ah, ok, now I see what you mean, thanks!  I'll try to fix up the patch like
this tomorrow.

--
Daniel Gustafsson

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Tom Lane

Дата:

07 апреля 2023 г., 18:47:54

Daniel Gustafsson <daniel@yesql.se> writes:
> Ah, ok, now I see what you mean, thanks!  I'll try to fix up the patch like
> this tomorrow.

Since we're running out of time, I took the liberty of fixing and
pushing this.

            regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Daniel Gustafsson

Дата:

07 апреля 2023 г., 19:01:15

> On 7 Apr 2023, at 17:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Daniel Gustafsson <daniel@yesql.se> writes:
>> Ah, ok, now I see what you mean, thanks!  I'll try to fix up the patch like
>> this tomorrow.
>
> Since we're running out of time, I took the liberty of fixing and
> pushing this.

Great, thanks!

--
Daniel Gustafsson

Re: [PATCH] Introduce array_shuffle() and array_sample()

От

Salek Talangi

Дата:

19 апреля 2023 г., 15:04:45

Hi all,

reading this blog post https://www.depesz.com/2023/04/18/waiting-for-postgresql-16-add-array_sample-and-array_shuffle-functions/ I became aware of the new feature and had a look at it and the commit https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=888f2ea0a81ff171087bdd1c5c1eeda3b78d73d4

To me the description

   /*
    * Shuffle array using Fisher-Yates algorithm. Scan the array and swap
    * current item (nelm datums starting at ielms) with a randomly chosen
    * later item (nelm datums starting at jelms) in each iteration. We can
    * stop once we've done n iterations; then first n items are the result.
    */

seems wrong. For n = 1 the returned item could never be the 1st item of the array (see "randomly chosen later item").

If this really is the case then the result is not really random. But to me it seems j later can be 0 (making it not really "later"), so this might only be a documentation issue.

Best regards

Salek Talangi

Am Mi., 19. Apr. 2023 um 13:48 Uhr schrieb Daniel Gustafsson <daniel@yesql.se>:

> On 7 Apr 2023, at 17:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Daniel Gustafsson <daniel@yesql.se> writes:
>> Ah, ok, now I see what you mean, thanks! I'll try to fix up the patch like
>> this tomorrow.
>
> Since we're running out of time, I took the liberty of fixing and
> pushing this.

Great, thanks!

--
Daniel Gustafsson

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Re: Proposal to introduce a shuffle function to intarray extension

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения