Обсуждение: Proposal to introduce a shuffle function to intarray extension
Dear list, i am dealing with an application that processes fairly large arrays of integers. It makes heavy use of the intarray extension, which works great in most cases. However, there are two requirements that cannot be addressed by the extension and are rather slow with plain SQL. Both can be met with shuffling: - Taking n random members from an integer array - Splitting an array into n chunks, where each member is assigned to a random chunk Shuffling is currently implemented by unnesting the array, ordering the members by random() and aggregating them again. create table numbers (arr int[]); insert into numbers (arr) select array_agg(i) from generate_series(1, 4000000) i; select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text from ( select array_agg(n order by random()) arr from ( select unnest(arr) n from numbers ) plain ) shuffled; --------------------------------------------------------- {2717290,3093757,2426384} ... {3011871,1402540,1613647} Time: 2348.961 ms (00:02.349) I wrote a small extension (see source code below) to see how much we can gain, when the shuffling is implemented in C and the results speak for themselves: select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text from ( select shuffle(arr) arr from numbers ) shuffled; ---------------------------------------------------- {1313971,3593627,86630} ... {50764,430410,3901128} Time: 132.151 ms I would like to see a function like this inside the intarray extension. Is there any way to get to this point? How is the process to deal with such proposals? Best regards, Martin Kalcher Source code of extension mentioned above: #include "postgres.h" #include "port.h" #include "utils/array.h" PG_MODULE_MAGIC; PG_FUNCTION_INFO_V1(shuffle); void _shuffle(int32 *a, int len); Datum shuffle(PG_FUNCTION_ARGS) { ArrayType *a = PG_GETARG_ARRAYTYPE_P_COPY(0); int len; if (array_contains_nulls(a)) ereport(ERROR, (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED), errmsg("array must not contain nulls"))); len = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a)); if (len > 1) _shuffle((int32 *) ARR_DATA_PTR(a), len); PG_RETURN_POINTER(a); } void _shuffle(int32 *a, int len) { int i, j; int32 tmp; for (i = len - 1; i > 0; i--) { j = random() % (i + 1); tmp = a[i]; a[i] = a[j]; a[j] = tmp; } }
On 7/15/22 04:36, Martin Kalcher wrote:
Dear list,
i am dealing with an application that processes fairly large arrays of integers. It makes heavy use of the intarray extension, which works great in most cases. However, there are two requirements that cannot be addressed by the extension and are rather slow with plain SQL. Both can be met with shuffling:
- Taking n random members from an integer array
- Splitting an array into n chunks, where each member is assigned to a random chunk
Shuffling is currently implemented by unnesting the array, ordering the members by random() and aggregating them again.
Martin, have you considered PL/Python and NumPy module?
-- Mladen Gogala Database Consultant Tel: (347) 321-1217 https://dbwhisperer.wordpress.com
Am 16.07.22 um 18:53 schrieb Mladen Gogala: > On 7/15/22 04:36, Martin Kalcher wrote: >> Dear list, >> >> i am dealing with an application that processes fairly large arrays of >> integers. It makes heavy use of the intarray extension, which works >> great in most cases. However, there are two requirements that cannot >> be addressed by the extension and are rather slow with plain SQL. Both >> can be met with shuffling: >> >> - Taking n random members from an integer array >> - Splitting an array into n chunks, where each member is assigned to a >> random chunk >> >> Shuffling is currently implemented by unnesting the array, ordering >> the members by random() and aggregating them again. > > > Martin, have you considered PL/Python and NumPy module? Hey Mladen, thank you for your advice. Unfortunately the performance of shuffling with NumPy is about the same as with SQL. create function numpy_shuffle(arr int[]) returns int[] as $$ import numpy numpy.random.shuffle(arr) return arr $$ language 'plpython3u'; select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text from ( select numpy_shuffle(arr) arr from numbers ) shuffled; ------------------------------------------------------- {674026,3306457,1727170} ... {343875,3825484,1235246} Time: 2315.431 ms (00:02.315) Am i doing something wrong? Martin
On 7/16/22 16:21, Martin Kalcher wrote:
Hey Mladen,
thank you for your advice. Unfortunately the performance of shuffling with NumPy is about the same as with SQL.
create function numpy_shuffle(arr int[])
returns int[]
as $$
import numpy
numpy.random.shuffle(arr)
return arr
$$ language 'plpython3u';
select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text
from (
select numpy_shuffle(arr) arr from numbers
) shuffled;
-------------------------------------------------------
{674026,3306457,1727170} ... {343875,3825484,1235246}
Time: 2315.431 ms (00:02.315)
Am i doing something wrong?
Martin
Hi Martin,
No, you're doing everything right. I have no solution for you. You may need to do some C programming or throw a stronger hardware at the problem. The performance of your processors may be the problem. Good luck!
-- Mladen Gogala Database Consultant Tel: (347) 321-1217 https://dbwhisperer.wordpress.com