Thread: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.


Optimize numeric multiplication for one and two base-NBASE digit multiplicands.

From:
"Joel Jacobson"
Date:
Hello hackers,

Attached patch introduces an optimization of mul_var() in numeric.c, targeting cases where the multiplicands consist of
only one or two base-NBASE digits. Such small multiplicands can fit into an int64 and thus be computed directly,
resulting in a significant performance improvement, between 26% - 34% benchmarked on Intel Core i9-14900K.

This optimization is similar to commit d1b307eef2, that also targeted one and two base-NBASE digit operands, but
optimized div_var().
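
For readers who want the gist without opening the patch, here is a minimal, hypothetical sketch of the idea (NBASE = 10000; the function and type names are assumptions for illustration, not the attached patch):

```c
#include <stdint.h>

/*
 * Sketch only: an operand with one or two base-NBASE digits (NBASE = 10000)
 * is < 10^8, so the product of two such operands is < 10^16 and fits in an
 * int64. The result digits can then be peeled off with % and / by NBASE.
 */
static int64_t
mul_small_operands(const int16_t *d1, int n1, const int16_t *d2, int n2)
{
    int64_t a = (n1 == 1) ? d1[0] : (int64_t) d1[0] * 10000 + d1[1];
    int64_t b = (n2 == 1) ? d2[0] : (int64_t) d2[0] * 10000 + d2[1];

    return a * b;
}
```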
 

Regards,
Joel
Attachments
On Mon, Jul 1, 2024, at 08:04, Joel Jacobson wrote:
> * 0001-optimize-numeric-mul_var-small-factors.patch

New version to silence maybe-uninitialized error reported by cfbot.

/Joel
Attachments

Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.

From:
Dagfinn Ilmari Mannsåker
Date:
"Joel Jacobson" <joel@compiler.org> writes:

> Hello hackers,
>
> Attached patch introduces an optimization of mul_var() in numeric.c,
> targeting cases where the multiplicands consist of only one or two
> base-NBASE digits. Such small multiplicands can fit into an int64 and
> thus be computed directly, resulting in a significant performance
> improvement, between 26% - 34% benchmarked on Intel Core i9-14900K.
>
> This optimization is similar to commit d1b307eef2, that also targeted
> one and two base-NBASE digit operands, but optimized div_var().

div_var() also has an optimisation for 3- and 4-digit operands under
HAVE_INT128 (added in commit 0aa38db56bf), would that make sense in
mul_var() too?

> Regards,
> Joel

- ilmari



On Mon, Jul 1, 2024, at 14:25, Dagfinn Ilmari Mannsåker wrote:
> div_var() also has an optimisation for 3- and 4-digit operands under
> HAVE_INT128 (added in commit 0aa38db56bf), would that make sense in
> mul_var() too?

I considered it, but it only gives a marginal speed-up on Intel Core i9-14900K,
and is actually slower on Apple M3 Max.
Not really sure why. Maybe the code I tried can be optimized further:

```
#ifdef HAVE_INT128
    /*
     * If var1 and var2 are up to four digits, their product will fit in
     * an int128 and can be computed directly, which is significantly faster.
     */
    if (var2ndigits <= 4)
    {
        int128        product = 0;

        switch (var1ndigits)
        {
            case 1:
                product = var1digits[0];
                break;
            case 2:
                product = var1digits[0] * NBASE + var1digits[1];
                break;
            case 3:
                product = ((int128) var1digits[0] * NBASE + var1digits[1])
                        * NBASE + var1digits[2];
                break;
            case 4:
                product = (((int128) var1digits[0] * NBASE + var1digits[1])
                        * NBASE + var1digits[2]) * NBASE + var1digits[3];
                break;
        }

        switch (var2ndigits)
        {
            case 1:
                product *= var2digits[0];
                break;
            case 2:
                product *= var2digits[0] * NBASE + var2digits[1];
                break;
            case 3:
                product *= ((int128) var2digits[0] * NBASE + var2digits[1])
                        * NBASE + var2digits[2];
                break;
            case 4:
                product *= (((int128) var2digits[0] * NBASE + var2digits[1])
                        * NBASE + var2digits[2]) * NBASE + var2digits[3];
                break;
        }

        alloc_var(result, res_ndigits);
        res_digits = result->digits;
        for (i = res_ndigits - 1; i >= 0; i--)
        {
            res_digits[i] = product % NBASE;
            product /= NBASE;
        }
        Assert(product == 0);

        /*
         * Finally, round the result to the requested precision.
         */
        result->weight = res_weight;
        result->sign = res_sign;

        /* Round to target rscale (and set result->dscale) */
        round_var(result, rscale);

        /* Strip leading and trailing zeroes */
        strip_var(result);

        return;
    }
#endif
```

Benchmark 1, testing 2 ndigits * 2 ndigits:

SELECT
   timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
   timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
   total_time_a::numeric/total_time_b AS performance_ratio
FROM timeit.cmp(
   'numeric_mul',
   'numeric_mul_patched',
   input_values := ARRAY[
      '11112222',
      '33334444'
   ],
   min_time := 1000000,
   timeout := '10 s'
);

Apple M3 Max:

 execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
 32.2 ns          | 20.5 ns          | 1.5700112246809388
(1 row)

Intel Core i9-14900K:

 execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
 30.2 ns          | 21.4 ns          | 1.4113042510107371
(1 row)

So 57% and 41% faster.

Benchmark 2, testing 4 ndigits * 4 ndigits:

SELECT
   timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
   timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
   total_time_a::numeric/total_time_b AS performance_ratio
FROM timeit.cmp(
   'numeric_mul',
   'numeric_mul_patched',
   input_values := ARRAY[
      '1111222233334444',
      '5555666677778888'
   ],
   min_time := 1000000,
   timeout := '10 s'
);

Apple M3 Max:

 execution_time_a | execution_time_b |   performance_ratio
------------------+------------------+------------------------
 41.9 ns          | 51.3 ns          | 0.81733655797170943614
(1 row)

Intel Core i9-14900K:

 execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
 40 ns            | 38 ns            | 1.0515610914706320
(1 row)

So 18% slower on Apple M3 Max and just 5% faster on Intel Core i9-14900K.

/Joel



On Mon, Jul 1, 2024, at 15:11, Joel Jacobson wrote:
> Not really sure why. Maybe the code I tried can be optimized further:

If anyone wants to experiment with the int128 version,
here is a patch that adds a separate numeric_mul_patched() function,
so it's easier to benchmark against the unmodified numeric_mul().

/Joel
Attachments
On Mon, Jul 1, 2024, at 15:14, Joel Jacobson wrote:
> * 0001-Optimize-mul_var-for-var2ndigits-4.patch

Found a typo, fixed in new version.

The int128 version is still slower, though.
I wonder if there is something that can be done to speed it up further.

Below is a more realistic benchmark than just micro-benchmarking mul_var(). It doesn't
test the int128 version, only the code path for up to 2 NBASE-digits:

```
CREATE TABLE bench_mul_var (num1 numeric, num2 numeric);

INSERT INTO bench_mul_var (num1, num2)
SELECT random(0::numeric,1e8::numeric), random(0::numeric,1e8::numeric) FROM generate_series(1,1e8);

SELECT SUM(num1*num2) FROM bench_mul_var;
Time: 8331.953 ms (00:08.332)
Time: 7415.241 ms (00:07.415)
Time: 7298.296 ms (00:07.298)
Time: 7314.754 ms (00:07.315)
Time: 7289.560 ms (00:07.290)

SELECT SUM(numeric_mul_patched(num1,num2)) FROM bench_mul_var;
Time: 6403.426 ms (00:06.403)
Time: 6401.797 ms (00:06.402)
Time: 6366.136 ms (00:06.366)
Time: 6376.049 ms (00:06.376)
Time: 6317.282 ms (00:06.317)
```

Benchmarked on an Intel Core i9-14900K machine.

/Joel
Attachments
On Mon, 1 Jul 2024 at 20:56, Joel Jacobson <joel@compiler.org> wrote:
>
> Below is a more realistic benchmark
>
> CREATE TABLE bench_mul_var (num1 numeric, num2 numeric);
>
> INSERT INTO bench_mul_var (num1, num2)
> SELECT random(0::numeric,1e8::numeric), random(0::numeric,1e8::numeric) FROM generate_series(1,1e8);
>
> SELECT SUM(num1*num2) FROM bench_mul_var;

I had a play with this, and came up with a slightly different way of
doing it that works for var2 of any size, as long as var1 is just 1 or
2 digits.

Repeating your benchmark where both numbers have up to 2 NBASE-digits,
this new approach was slightly faster:

SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
Time: 4762.990 ms (00:04.763)
Time: 4332.166 ms (00:04.332)
Time: 4276.211 ms (00:04.276)
Time: 4247.321 ms (00:04.247)
Time: 4166.738 ms (00:04.167)

SELECT SUM(num1*num2) FROM bench_mul_var; -- v2 patch
Time: 4398.812 ms (00:04.399)
Time: 3672.668 ms (00:03.673)
Time: 3650.227 ms (00:03.650)
Time: 3611.420 ms (00:03.611)
Time: 3534.218 ms (00:03.534)

SELECT SUM(num1*num2) FROM bench_mul_var; -- this patch
Time: 3350.596 ms (00:03.351)
Time: 3336.224 ms (00:03.336)
Time: 3335.599 ms (00:03.336)
Time: 3336.990 ms (00:03.337)
Time: 3351.453 ms (00:03.351)

(This was on an older Intel Core i9-9900K, so I'm not sure why all the
timings are faster. What compiler settings are you using?)

The approach taken in this patch only uses 32-bit integers, so in
theory it could be extended to work for var1ndigits = 3, 4, or even
more, but the code would get increasingly complex, and I ran out of
steam at 2 digits. It might be worth trying though.
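
To illustrate why 32-bit arithmetic suffices here, below is a hedged, standalone sketch for the single-digit-var1 case (names and types are illustrative, not the patch itself): each column holds one digit product below NBASE^2 = 10^8 plus a carry below NBASE, which stays well inside a 32-bit int.

```c
#include <stdint.h>

/*
 * Sketch only: multiply a base-NBASE (NBASE = 10000) digit array by a single
 * base-NBASE digit using 32-bit arithmetic. Each step computes
 * digit * multiplier + carry < 10^8 + 10^4, which fits easily in int32_t.
 * result must have ndigits + 1 slots.
 */
static void
mul_digits_by_digit(const int16_t *digits, int ndigits, int32_t multiplier,
                    int16_t *result)
{
    int32_t carry = 0;

    for (int i = ndigits - 1; i >= 0; i--)
    {
        int32_t prod = (int32_t) digits[i] * multiplier + carry;

        result[i + 1] = (int16_t) (prod % 10000);
        carry = prod / 10000;
    }
    result[0] = (int16_t) carry;
}
```

With two var1 digits, each column has at most two such products plus a carry, which still fits comfortably in 32 bits.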

Regards,
Dean

Attachments
On Tue, Jul 2, 2024, at 00:19, Dean Rasheed wrote:
> I had a play with this, and came up with a slightly different way of
> doing it that works for var2 of any size, as long as var1 is just 1 or
> 2 digits.
>
> Repeating your benchmark where both numbers have up to 2 NBASE-digits,
> this new approach was slightly faster:
>
...
>
> (This was on an older Intel Core i9-9900K, so I'm not sure why all the
> timings are faster. What compiler settings are you using?)

Strange. I just did `./configure` with a --prefix.

Compiler settings on my Intel Core i9-14900K machine:

$ pg_config | grep -E '^(CC|CFLAGS|CPPFLAGS|LDFLAGS)'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wshadow=compatible-local -Wformat-security
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -O2
CFLAGS_SL = -fPIC
LDFLAGS = -Wl,--as-needed -Wl,-rpath,'/home/joel/pg-dev/lib',--enable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =

> The approach taken in this patch only uses 32-bit integers, so in
> theory it could be extended to work for var1ndigits = 3, 4, or even
> more, but the code would get increasingly complex, and I ran out of
> steam at 2 digits. It might be worth trying though.
>
> Regards,
> Dean
>
> Attachments:
> * optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt

Really nice!

I've benchmarked your patch on my three machines with great results.
I added a setseed() step, to make the benchmarks reproducible,
shouldn't matter much since it should statistically average out, but I thought why not.

CREATE TABLE bench_mul_var (num1 numeric, num2 numeric);
SELECT setseed(0.12345);
INSERT INTO bench_mul_var (num1, num2)
SELECT random(0::numeric,1e8::numeric), random(0::numeric,1e8::numeric) FROM generate_series(1,1e8);
\timing

/*
 * Apple M3 Max
 */

SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
Time: 3622.342 ms (00:03.622)
Time: 3029.786 ms (00:03.030)
Time: 3046.497 ms (00:03.046)
Time: 3035.910 ms (00:03.036)
Time: 3034.073 ms (00:03.034)

SELECT SUM(num1*num2) FROM bench_mul_var; -- optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
Time: 2484.685 ms (00:02.485)
Time: 2478.341 ms (00:02.478)
Time: 2494.397 ms (00:02.494)
Time: 2470.987 ms (00:02.471)
Time: 2490.215 ms (00:02.490)

/*
 * Intel Core i9-14900K
 */

SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
Time: 2555.569 ms (00:02.556)
Time: 2523.145 ms (00:02.523)
Time: 2518.671 ms (00:02.519)
Time: 2514.501 ms (00:02.515)
Time: 2516.919 ms (00:02.517)

SELECT SUM(num1*num2) FROM bench_mul_var; -- optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
Time: 2246.441 ms (00:02.246)
Time: 2243.900 ms (00:02.244)
Time: 2245.350 ms (00:02.245)
Time: 2245.080 ms (00:02.245)
Time: 2247.856 ms (00:02.248)

/*
 * AMD Ryzen 9 7950X3D
 */

SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
Time: 3037.497 ms (00:03.037)
Time: 3010.037 ms (00:03.010)
Time: 3000.956 ms (00:03.001)
Time: 2989.424 ms (00:02.989)
Time: 2984.911 ms (00:02.985)

SELECT SUM(num1*num2) FROM bench_mul_var; -- optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
Time: 2645.530 ms (00:02.646)
Time: 2640.472 ms (00:02.640)
Time: 2638.613 ms (00:02.639)
Time: 2637.889 ms (00:02.638)
Time: 2638.054 ms (00:02.638)

/Joel



On Tue, 2 Jul 2024 at 08:50, Joel Jacobson <joel@compiler.org> wrote:
>
> On Tue, Jul 2, 2024, at 00:19, Dean Rasheed wrote:
>
> > Attachments:
> > * optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
>

Shortly after posting that, I realised that there was a small bug. This bit:

            case 2:
                newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];

isn't quite right in the case where rscale is less than the full
result. In that case, the least significant result digit has a
contribution from one more var2 digit, so it needs to be:

                newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
                if (res_ndigits - 3 < var2ndigits)
                    newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];

That doesn't noticeably affect the performance though. Update attached.

> I've benchmarked your patch on my three machines with great results.
> I added a setseed() step, to make the benchmarks reproducible,
> shouldn't matter much since it should statistically average out, but I thought why not.

Nice. The results on the Apple M3 Max look particularly impressive.

I think it'd probably be worth trying to extend this to 3 or maybe 4
var1 digits, since that would cover a lot of "everyday" sized numeric
values that a lot of people might be using. I don't think we should go
beyond that though.

Regards,
Dean

Attachments
On Tue, Jul 2, 2024, at 10:22, Dean Rasheed wrote:
> Shortly after posting that, I realised that there was a small bug. This bit:
>
>             case 2:
>                 newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
>
> isn't quite right in the case where rscale is less than the full
> result. In that case, the least significant result digit has a
> contribution from one more var2 digit, so it needs to be:
>
>                 newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
>                 if (res_ndigits - 3 < var2ndigits)
>                     newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
>
> That doesn't noticeably affect the performance though. Update attached.

Nice catch. Could we somehow add a test that exercises mul_var()
with rscale less than the full result, to catch bugs like this one
and others?

I created a new benchmark, that specifically tests different var1ndigits.
I've only run it on Apple M3 Max yet. More to come.

\timing
SELECT setseed(0.12345);
CREATE TABLE bench_mul_var_var1ndigits_1 (var1 numeric, var2 numeric);
INSERT INTO bench_mul_var_var1ndigits_1 (var1, var2)
SELECT random(0::numeric,9999::numeric), random(10000000::numeric,1e32::numeric) FROM generate_series(1,1e8);
CREATE TABLE bench_mul_var_var1ndigits_2 (var1 numeric, var2 numeric);
INSERT INTO bench_mul_var_var1ndigits_2 (var1, var2)
SELECT random(10000000::numeric,99999999::numeric), random(100000000000::numeric,1e32::numeric) FROM
generate_series(1,1e8);
CREATE TABLE bench_mul_var_var1ndigits_3 (var1 numeric, var2 numeric);
INSERT INTO bench_mul_var_var1ndigits_3 (var1, var2)
SELECT random(100000000000::numeric,999999999999::numeric), random(1000000000000000::numeric,1e32::numeric) FROM
generate_series(1,1e8);
CREATE TABLE bench_mul_var_var1ndigits_4 (var1 numeric, var2 numeric);
INSERT INTO bench_mul_var_var1ndigits_4 (var1, var2)
SELECT random(1000000000000000::numeric,9999999999999999::numeric), random(10000000000000000000::numeric,1e32::numeric)
FROM generate_series(1,1e8);

/*
 * Apple M3 Max
 */

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD
Time: 2986.952 ms (00:02.987)
Time: 2991.765 ms (00:02.992)
Time: 2987.253 ms (00:02.987)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v2-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 2874.841 ms (00:02.875)
Time: 2883.070 ms (00:02.883)
Time: 2899.973 ms (00:02.900)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD
Time: 3459.556 ms (00:03.460)
Time: 3304.983 ms (00:03.305)
Time: 3299.728 ms (00:03.300)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v2-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3053.140 ms (00:03.053)
Time: 3065.227 ms (00:03.065)
Time: 3069.995 ms (00:03.070)

/*
 * Just for completeness, also testing var1ndigits 3 and 4,
 * although no change is expected since not yet implemented.
 */

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 3809.005 ms (00:03.809)
Time: 3438.260 ms (00:03.438)
Time: 3453.920 ms (00:03.454)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v2-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3437.592 ms (00:03.438)
Time: 3457.586 ms (00:03.458)
Time: 3474.344 ms (00:03.474)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 4133.193 ms (00:04.133)
Time: 3554.477 ms (00:03.554)
Time: 3560.855 ms (00:03.561)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v2-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3508.540 ms (00:03.509)
Time: 3566.721 ms (00:03.567)
Time: 3524.083 ms (00:03.524)

> I think it'd probably be worth trying to extend this to 3 or maybe 4
> var1 digits, since that would cover a lot of "everyday" sized numeric
> values that a lot of people might be using. I don't think we should go
> beyond that though.

I think so too. I'm working on var1ndigits=3 now.

/Joel



On Tue, Jul 2, 2024, at 11:05, Joel Jacobson wrote:
> On Tue, Jul 2, 2024, at 10:22, Dean Rasheed wrote:
>> Shortly after posting that, I realised that there was a small bug. This bit:
>>
>>             case 2:
>>                 newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
>>
>> isn't quite right in the case where rscale is less than the full
>> result. In that case, the least significant result digit has a
>> contribution from one more var2 digit, so it needs to be:
>>
>>                 newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
>>                 if (res_ndigits - 3 < var2ndigits)
>>                     newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
>>

Just to be able to verify mul_var() is working as expected when
rscale is less than the full result, I've added a numeric_mul_patched()
function that takes a third rscale_adjustment parameter.

I've tried to get a different result with and without the fix,
but I'm failing to hit the bug.

Can you think of an example that should trigger the bug?

Example usage:

SELECT numeric_mul_patched(9999.999,2,0);
var1ndigits=1 var2ndigits=2 rscale=3
 numeric_mul_patched
---------------------
           19999.998
(1 row)

SELECT numeric_mul_patched(9999.999,2,1);
var1ndigits=1 var2ndigits=2 rscale=4
 numeric_mul_patched
---------------------
          19999.9980
(1 row)

SELECT numeric_mul_patched(9999.999,2,-1);
var1ndigits=1 var2ndigits=2 rscale=2
 numeric_mul_patched
---------------------
            20000.00
(1 row)

SELECT numeric_mul_patched(9999.999,2,-2);
var1ndigits=1 var2ndigits=2 rscale=1
 numeric_mul_patched
---------------------
             20000.0
(1 row)

/Joel



On Tue, 2 Jul 2024 at 11:23, Joel Jacobson <joel@compiler.org> wrote:
>
> Just to be able to verify mul_var() is working as expected when
> rscale is less than the full result, I've added a numeric_mul_patched()
> function that takes a third rscale_adjustment parameter:

Yeah, we could expose such a function, and maybe it would be generally
useful, not just for testing, but that's really a separate proposal.
In general, mul_var() with reduced rscale doesn't guarantee correctly
rounded results though, which might make it less generally acceptable.

For this patch though, the aim is just to ensure the results are the
same as before.

> I've tried to get a different result with and without the fix,
> but I'm failing to hit the bug.
>
> Can you think of an example that should trigger the bug?

9999.0001 * 5000.9999_9999 with rscale = 0 triggers it (returned
50004999 instead of 50005000).

Regards,
Dean



On Tue, Jul 2, 2024, at 13:44, Dean Rasheed wrote:
>> Can you think of an example that should trigger the bug?
>
> 9999.0001 * 5000.9999_9999 with rscale = 0 triggers it (returned
> 50004999 instead of 50005000).

Thanks, helpful.

Attached patch adds the var1ndigits=3 case.

Benchmark:

/*
 * Apple M3 Max
 */

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 3809.005 ms (00:03.809)
Time: 3438.260 ms (00:03.438)
Time: 3453.920 ms (00:03.454)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v3-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3230.017 ms (00:03.230)
Time: 3244.094 ms (00:03.244)
Time: 3246.370 ms (00:03.246)

/*
 * Intel Core i9-14900K
 */

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 4082.759 ms (00:04.083)
Time: 4077.372 ms (00:04.077)
Time: 4080.830 ms (00:04.081)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v3-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3989.021 ms (00:03.989)
Time: 3978.263 ms (00:03.978)
Time: 3955.331 ms (00:03.955)

/*
 * AMD Ryzen 9 7950X3D
 */

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 3791.271 ms (00:03.791)
Time: 3760.138 ms (00:03.760)
Time: 3758.773 ms (00:03.759)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v3-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3400.824 ms (00:03.401)
Time: 3373.714 ms (00:03.374)
Time: 3345.505 ms (00:03.346)

Regards,
Joel
Attachments
On Tue, Jul 2, 2024, at 18:20, Joel Jacobson wrote:
> * v3-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch

Hmm, v3 contains a bug which I haven't been able to solve yet.
Reporting it now to avoid wasting review time on a buggy patch.

The attached patch is how I tested and found the bug.
It contains a file test-mul-var.sql, which tests mul_var()
with varying rscale, using the SQL-callable numeric_mul_patched(),
whose third argument is the rscale_adjustment.

Out of 2481600 random tests, all passed except 4:

SELECT
     result = numeric_mul_patched(var1,var2,rscale_adjustment),
     COUNT(*)
FROM test_numeric_mul_patched
GROUP BY 1
ORDER BY 1;
 ?column? |  count
----------+---------
 f        |       4
 t        | 2481596
(2 rows)

SELECT
    var1,
    var2,
    var1*var2 AS full_resolution,
    rscale_adjustment,
    result AS expected,
    numeric_mul_patched(var1,var2,rscale_adjustment) AS actual
FROM test_numeric_mul_patched
WHERE result <> numeric_mul_patched(var1,var2,rscale_adjustment);
       var1        |      var2      |      full_resolution      | rscale_adjustment | expected | actual
-------------------+----------------+---------------------------+-------------------+----------+--------
     676.797214075 | 0.308068877759 | 208.500158210502929257925 |               -21 |      209 |    208
         555.07029 | 0.381033094735 |     211.50015039415392315 |               -17 |      212 |    211
    0.476637718921 |      66.088276 |     31.500165120061470196 |               -18 |       32 |     31
 0.060569165063082 |      998.85933 |   60.50007563356949425506 |               -20 |       61 |     60
(4 rows)

Trying to wrap my head around what could cause this.

It's rounding down instead of up, and these cases all end with decimal .500XXXX.

Regards,
Joel
Attachments
On Tue, Jul 2, 2024, at 20:53, Joel Jacobson wrote:
> Trying to wrap my head around what could cause this.
>
> It's rounding down instead of up, and these cases all end with decimal .500XXXX.

Interesting, I actually think there is a bug in the normal mul_var() code.
Found a case that rounds down, when it should round up:

Calling mul_var() with:
var1=51.2945442386599
var2=0.828548712212
rscale=0

returns 42, but I think it should return 43,
since 51.2945442386599*0.828548712212=42.5000285724431241296446988

But maybe this is expected and OK, having to do with MUL_GUARD_DIGITS?

/Joel



On Tue, Jul 2, 2024, at 21:55, Joel Jacobson wrote:
> On Tue, Jul 2, 2024, at 20:53, Joel Jacobson wrote:
>> Trying to wrap my head around what could cause this.

I found the bug in the case 3 code,
and it turns out the same type of bug also exists in the case 2 code:

            case 2:
                newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];

The problem here is that res_ndigits could become less than 4,
if rscale is low enough,
and then we would try to access a negative array index in var2digits.

Fix:

            case 2:
                if (res_ndigits - 4 >= 0 && res_ndigits - 4 < var2ndigits)
                    newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
                else
                    newdig = 0;

Another problem in the case 2 code:

                if (res_ndigits - 3 < var2ndigits)
                    newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];

Fix:

                if (res_ndigits - 3 >= 0 && res_ndigits - 3 < var2ndigits)
                    newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];

New version attached that fixes both the case 2 and case 3 code.

Regards,
Joel
Attachments
On Tue, Jul 2, 2024, at 22:10, Joel Jacobson wrote:
> * v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch

Instead of these boundary checks, maybe it would be cleaner to just
skip this optimization if there are too few res_ndigits?

/Joel



On Tue, 2 Jul 2024 at 20:55, Joel Jacobson <joel@compiler.org> wrote:
>
> Interesting, I actually think there is a bug in the normal mul_var() code.
> Found a case that rounds down, when it should round up:
>
> Calling mul_var() with:
> var1=51.2945442386599
> var2=0.828548712212
> rscale=0
>
> returns 42, but I think it should return 43,
> since 51.2945442386599*0.828548712212=42.5000285724431241296446988
>
> But maybe this is expected and OK, having to do with MUL_GUARD_DIGITS?
>

No, that's not a bug. It's to be expected. The point of using only
MUL_GUARD_DIGITS extra digits is to get the correctly rounded result
*most of the time*, while saving a lot of effort by not computing the
full result.

The callers of mul_var() that ask for reduced rscale results are the
transcendental functions like ln_var() and exp_var(), which are
working to some limited precision intended to compute the final result
reasonably accurately to a particular rscale.

Regards,
Dean



On Tue, 2 Jul 2024 at 21:10, Joel Jacobson <joel@compiler.org> wrote:
>
> I found the bug in the case 3 code,
> and it turns out the same type of bug also exists in the case 2 code:
>
>                         case 2:
>                                 newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
>
> The problem here is that res_ndigits could become less than 4,

Yes. It can't be less than 3 though (per an earlier test), so the case
2 code was correct.

I've been hacking on this a bit and trying to tidy it up. Firstly, I
moved it to a separate function, because it was starting to look messy
having so much extra code in mul_var(). Then I added a bunch more
comments to explain what's going on, and the limits of the various
variables. Note that most of the boundary checks are actually
unnecessary -- in particular all the ones in or after the main loop,
provided you pull out the first 2 result digits from the main loop in
the 3-digit case. That does seem to work very well, but...

I wasn't entirely happy with how messy that code is getting, so I
tried a different approach. Similar to div_var_int(), I tried writing
a mul_var_int() function instead. This can be used for 1 and 2 digit
factors, and we could add a similar mul_var_int64() function on
platforms with 128-bit integers. The code looks quite a lot neater, so
it's probably less likely to contain bugs (though I have just written
it in a hurry,so it might still have bugs). In testing, it seemed to
give a decent speedup, but perhaps a little less than before. But
that's to be balanced against having more maintainable code, and also
a function that might be useful elsewhere in numeric.c.
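
As a rough illustration of the mul_var_int() idea (a hedged sketch with assumed names and types, not the attached patch): the one- or two-digit factor is collapsed into a single integer below NBASE^2, and the other operand's digit array is multiplied by it with 64-bit carries.

```c
#include <stdint.h>

/*
 * Sketch only: multiply a base-NBASE (NBASE = 10000) digit array by an
 * integer factor < NBASE^2, i.e. what one or two NBASE digits can hold.
 * digit * factor + carry < 10^12 + 10^8, so int64_t is ample.
 * result must have ndigits + 2 slots.
 */
static void
mul_digits_by_int(const int16_t *digits, int ndigits, int64_t factor,
                  int16_t *result)
{
    int64_t carry = 0;

    for (int i = ndigits - 1; i >= 0; i--)
    {
        int64_t prod = (int64_t) digits[i] * factor + carry;

        result[i + 2] = (int16_t) (prod % 10000);
        carry = prod / 10000;
    }
    result[1] = (int16_t) (carry % 10000);
    result[0] = (int16_t) (carry / 10000);
}
```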

Anyway, here are both patches for comparison. I'll stop hacking for a
while and let you see what you make of these.

Regards,
Dean

Attachments
On Wed, 3 Jul 2024 at 08:18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> Anyway, here are both patches for comparison. I'll stop hacking for a
> while and let you see what you make of these.

I liked v5-add-mul_var_int.patch better.

I think that *var_digits can be const too.
+ const NumericDigit *var_digits = var->digits;
 
Typo in the comments:
- by procssing
+ by processing

best regards,
Ranier Vilela
On Mon, Jul 1, 2024 at 6:19 PM Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> Repeating your benchmark where both numbers have up to 2 NBASE-digits,
> this new approach was slightly faster:
>
> SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
> Time: 4762.990 ms (00:04.763)
> Time: 4332.166 ms (00:04.332)
> Time: 4276.211 ms (00:04.276)
> Time: 4247.321 ms (00:04.247)
> Time: 4166.738 ms (00:04.167)
>
> SELECT SUM(num1*num2) FROM bench_mul_var; -- v2 patch
> Time: 4398.812 ms (00:04.399)
> Time: 3672.668 ms (00:03.673)
> Time: 3650.227 ms (00:03.650)
> Time: 3611.420 ms (00:03.611)
> Time: 3534.218 ms (00:03.534)
>
> SELECT SUM(num1*num2) FROM bench_mul_var; -- this patch
> Time: 3350.596 ms (00:03.351)
> Time: 3336.224 ms (00:03.336)
> Time: 3335.599 ms (00:03.336)
> Time: 3336.990 ms (00:03.337)
> Time: 3351.453 ms (00:03.351)

I don't have any particular technical insight on this topic, but I
just want to mention that I'm excited about the work. Numeric
performance can be painfully slow, and these seem like quite
significant speedups that will affect lots of real-world cases.

--
Robert Haas
EDB: http://www.enterprisedb.com



On Wed, Jul 3, 2024, at 13:17, Dean Rasheed wrote:
> On Tue, 2 Jul 2024 at 21:10, Joel Jacobson <joel@compiler.org> wrote:
>>
>> I found the bug in the case 3 code,
>> and it turns out the same type of bug also exists in the case 2 code:
>>
>>                         case 2:
>>                                 newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
>>
>> The problem here is that res_ndigits could become less than 4,
>
> Yes. It can't be less than 3 though (per an earlier test), so the case
> 2 code was correct.

Hmm, I don't see how the case 2 code can be correct?
If, like you say, res_ndigits can't be less than 3, that means it can be 3, right?
And if res_ndigits=3 then `var2digits[res_ndigits - 4]` would try to access `var2digits[-1]`.

> I've been hacking on this a bit and trying to tidy it up. Firstly, I
> moved it to a separate function, because it was starting to look messy
> having so much extra code in mul_var(). Then I added a bunch more
> comments to explain what's going on, and the limits of the various
> variables. Note that most of the boundary checks are actually
> unnecessary -- in particular all the ones in or after the main loop,
> provided you pull out the first 2 result digits from the main loop in
> the 3-digit case. That does seem to work very well, but...

Nice, I was starting to feel a bit uncomfortable with the level of increased complexity.

> I wasn't entirely happy with how messy that code is getting, so I
> tried a different approach. Similar to div_var_int(), I tried writing
> a mul_var_int() function instead. This can be used for 1 and 2 digit
> factors, and we could add a similar mul_var_int64() function on
> platforms with 128-bit integers. The code looks quite a lot neater, so
> it's probably less likely to contain bugs (though I have just written
> it in a hurry,so it might still have bugs). In testing, it seemed to
> give a decent speedup, but perhaps a little less than before. But
> that's to be balanced against having more maintainable code, and also
> a function that might be useful elsewhere in numeric.c.
>
> Anyway, here are both patches for comparison. I'll stop hacking for a
> while and let you see what you make of these.

I've tested both patches, and they produce the same output as HEAD given the
same input, when rscale is unmodified (full precision).

However, for a reduced rscale, there are some differences:

mul_var_small() seems more resilient to rscale reductions than mul_var_int().

The previous version we worked on, I've called "mul_var inlined" in the output below.

```
CREATE TABLE test_numeric_mul_patched (
  var1 numeric,
  var2 numeric,
  rscale_adjustment int,
  result numeric
);

DO $$
DECLARE
var1 numeric;
var2 numeric;
BEGIN
  FOR i IN 1..1000 LOOP
    RAISE NOTICE '%', i;
    FOR var1ndigits IN 1..4 LOOP
      FOR var2ndigits IN 1..4 LOOP
        FOR var1dscale IN 0..(var1ndigits*4) LOOP
          FOR var2dscale IN 0..(var2ndigits*4) LOOP
            FOR rscale_adjustment IN 0..(var1dscale+var2dscale) LOOP
              var1 := round(random(
                format('1%s',repeat('0',(var1ndigits-1)*4-1))::numeric,
                format('%s',repeat('9',var1ndigits*4))::numeric
              ) / 10::numeric^var1dscale, var1dscale);
              var2 := round(random(
                format('1%s',repeat('0',(var2ndigits-1)*4-1))::numeric,
                format('%s',repeat('9',var2ndigits*4))::numeric
              ) / 10::numeric^var2dscale, var2dscale);
              INSERT INTO test_numeric_mul_patched
                (var1, var2, rscale_adjustment)
              VALUES
                (var1, var2, -rscale_adjustment);
            END LOOP;
          END LOOP;
        END LOOP;
      END LOOP;
    END LOOP;
  END LOOP;
END $$;

UPDATE test_numeric_mul_patched SET result = numeric_mul_head(var1, var2, rscale_adjustment);

SELECT
    rscale_adjustment,
    COUNT(*),
    COUNT(*) FILTER (WHERE result IS DISTINCT FROM numeric_mul_patch_int(var1,var2,rscale_adjustment)) AS "mul_var_int",
    COUNT(*) FILTER (WHERE result IS DISTINCT FROM numeric_mul_patch_small(var1,var2,rscale_adjustment)) AS "mul_var_small",
    COUNT(*) FILTER (WHERE result IS DISTINCT FROM numeric_mul_patch_inline(var1,var2,rscale_adjustment)) AS "mul_var inlined"
FROM test_numeric_mul_patched
GROUP BY 1
ORDER BY 1;

 rscale_adjustment |  count  | mul_var_int | mul_var_small | mul_var inlined
-------------------+---------+-------------+---------------+-----------------
               -32 |    1000 |           0 |             0 |               0
               -31 |    3000 |           0 |             0 |               0
               -30 |    6000 |           0 |             0 |               0
               -29 |   10000 |           0 |             0 |               0
               -28 |   17000 |           0 |             0 |               0
               -27 |   27000 |           0 |             0 |               0
               -26 |   40000 |           0 |             1 |               0
               -25 |   56000 |           1 |            11 |               0
               -24 |   78000 |         316 |           119 |               1
               -23 |  106000 |         498 |          1696 |               0
               -22 |  140000 |         531 |          2480 |               1
               -21 |  180000 |         591 |          3145 |               0
               -20 |  230000 |        1956 |          5309 |               1
               -19 |  290000 |        2189 |          5032 |               0
               -18 |  360000 |        2314 |          4868 |               0
               -17 |  440000 |        2503 |          4544 |               1
               -16 |  533000 |        5201 |          3633 |               0
               -15 |  631000 |        5621 |          3006 |               0
               -14 |  734000 |        5907 |          2631 |               0
               -13 |  842000 |        6268 |          2204 |               0
               -12 |  957000 |        9558 |           778 |               0
               -11 | 1071000 |       10597 |           489 |               0
               -10 | 1184000 |       10765 |           193 |               0
                -9 | 1296000 |        9452 |             0 |               0
                -8 | 1408000 |        1142 |             0 |               0
                -7 | 1512000 |         391 |             0 |               0
                -6 | 1608000 |         235 |             0 |               0
                -5 | 1696000 |           0 |             0 |               0
                -4 | 1776000 |           0 |             0 |               0
                -3 | 1840000 |           0 |             0 |               0
                -2 | 1888000 |           0 |             0 |               0
                -1 | 1920000 |           0 |             0 |               0
                 0 | 1936000 |           0 |             0 |               0
(33 rows)

SELECT
    result - numeric_mul_patch_int(var1,var2,rscale_adjustment),
    COUNT(*)
FROM test_numeric_mul_patched
GROUP BY 1
ORDER BY 1;

    ?column?    |  count
----------------+----------
              0 | 24739964
 0.000000000001 |     2170
  0.00000000001 |      234
   0.0000000001 |       18
    0.000000001 |        4
     0.00000001 |     8927
      0.0000001 |      882
       0.000001 |       90
        0.00001 |        6
         0.0001 |    21963
          0.001 |     2174
           0.01 |      214
            0.1 |       18
              1 |    39336
(14 rows)

SELECT
    result - numeric_mul_patch_small(var1,var2,rscale_adjustment),
    COUNT(*)
FROM test_numeric_mul_patched
GROUP BY 1
ORDER BY 1;

     ?column?      |  count
-------------------+----------
                -1 |     1233
             -0.01 |        9
            -0.001 |       73
           -0.0001 |      647
         -0.000001 |        2
        -0.0000001 |        9
       -0.00000001 |      116
 0.000000000000000 | 24775861
        0.00000001 |     1035
        0.00000002 |        2
         0.0000001 |       96
          0.000001 |        9
            0.0001 |     8771
            0.0002 |        3
             0.001 |      952
              0.01 |       69
               0.1 |       10
                 1 |    27098
                 2 |        5
(19 rows)


SELECT
    result - numeric_mul_patch_inline(var1,var2,rscale_adjustment),
    COUNT(*)
FROM test_numeric_mul_patched
GROUP BY 1
ORDER BY 1;

 ?column? |  count
----------+----------
       -1 |        4
        0 | 24815996
(2 rows)
```

I found these two interesting to look closer at:
```
        0.00000002 |        2
            0.0002 |        3

SELECT
  *,
  numeric_mul_patch_small(var1,var2,rscale_adjustment)
FROM test_numeric_mul_patched
WHERE result - numeric_mul_patch_small(var1,var2,rscale_adjustment) IN (0.00000002, 0.0002);

       var1        |      var2      | rscale_adjustment |   result   | numeric_mul_patch_small
-------------------+----------------+-------------------+------------+-------------------------
     8952.12658563 | 0.902315486665 |               -16 |  8077.6425 |               8077.6423
    0.881715409579 | 0.843165739371 |               -16 | 0.74343223 |              0.74343221
    0.905322758954 | 0.756905996850 |               -16 | 0.68524423 |              0.68524421
 8464.043170546608 | 0.518100129611 |               -20 |  4385.2219 |               4385.2217
 5253.006296984449 | 0.989308019355 |               -20 |  5196.8413 |               5196.8411
(5 rows)
```

What can be said about mul_var()'s contract with regards to rscale?
It's the number of decimal digits requested by the caller, and if not
requesting full precision, then the decimal digits might not be accurate,
but can something be said about how far off they can be?

The mul_var_int() patch only produces a difference that is exactly
1 less than the exact result, at the last non-zero decimal digit.

Could the difference be more than 1 at the last non-zero digit,
like in the five cases found above?

It would be nice if we could define mul_var()'s contract with regards to
rscale, in terms of what precision can be expected in the result.

Attaching the hacked together version with all the patches, used to do the testing above.

Regards,
Joel
Attachments
On Wed, 3 Jul 2024 at 14:49, Joel Jacobson <joel@compiler.org> wrote:
>
> Hmm, I don't see how the case 2 code can be correct?
> If, like you say, res_ndigits can't be less than 3, that means it can be 3, right?
> And if res_ndigits=3 then `var2digits[res_ndigits - 4]` would try to access `var2digits[-1]`.
>

Ah yes, I think I was looking at a newer version of the code where I'd
already fixed that bug. Unless you think there are still bugs in any
of the boundary checks, which is entirely possible.

> I've tested both patches, and they produces the same output given the
> same input as HEAD, when rscale is unmodified (full precision).
>
> However, for a reduced rscale, there are some differences:
>
> mul_var_small() seems more resilient to rscale reductions than mul_var_int().
>

Ah, I can see what's going on. It's perhaps best illustrated with a
simple example. Suppose you are multiplying a 4-digit integer x by a
2-digit integer y (both with dscale=0). Then the terms of the full
product computed by mul_var() or mul_var_small() would look something
like this:

            x0     x1     x2     x3
                        * y0     y1
  ---------------------------------
  x0*y0  x1*y0  x2*y0  x3*y0
         x0*y1  x1*y1  x2*y1  x3*y1

In the reduced-rscale case, it might perform a truncated computation,
computing just the first 3 columns (say), and discarding the last two
columns. Therefore it would skip the 3 rightmost digit products.

However, in mul_var_int(), y0 and y1 have been combined into a single
integer equal to y0*NBASE+y1, and the terms of full product are
computed as follows:

  x0*(y0*NBASE+y1)  x1*(y0*NBASE+y1)  x2*(y0*NBASE+y1)  x3*(y0*NBASE+y1)

In the full product, that gives the same result, but if you follow the
same rule in the reduced-rscale case, skipping the last two terms, it
would actually discard 4 digit products, making it less accurate.

That could be avoided by increasing maxdigits by 1 in mul_var_int() so
it would always be at least as accurate as it was before, but that
might not really be necessary. However, if we implemented
mul_var_int64() in the same way, it would be discarding much
higher-order digit products, and so we probably would have to increase
maxdigits to get sufficiently accurate results. But there's an even
bigger problem: the results would be different between platforms that
did and didn't have 128-bit integers, which I really don't like. We
could avoid that by not using it in reduced-rscale cases, but that
would involve another test condition.

By contrast, mul_var_small() is intended to replicate the arithmetic
in mul_var() exactly (but in a different order) for all rscales. So if
you've found any cases where they give different results, that's a
bug.

In light of that, it might be that mul_var_small() is the better
option, rather than mul_var_int(), but it'd be interesting to see how
they compare in terms of performance first.

> What can be said about mul_var()'s contract with regards to rscale?
> It's the number of decimal digits requested by the caller, and if not
> requesting full precision, then the decimal digits might not be accurate,
> but can something be said about how far off they can be?
>

I wouldn't expect it to ever be off by more than 1, given that
MUL_GUARD_DIGITS = 2, which corresponds to 8 decimal digits, and the
number of digits in the smaller input (and hence the number of digit
products in each column) is limited to something like 16,000 NBASE
digits.

Regards,
Dean



On Wed, Jul 3, 2024, at 15:48, Joel Jacobson wrote:
> On Wed, Jul 3, 2024, at 13:17, Dean Rasheed wrote:
>> On Tue, 2 Jul 2024 at 21:10, Joel Jacobson <joel@compiler.org> wrote:
>>>
>>> I found the bug in the case 3 code,
>>> and it turns out the same type of bug also exists in the case 2 code:
>>>
>>>                         case 2:
>>>                                 newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
>>>
>>> The problem here is that res_ndigits could become less than 4,
>>
>> Yes. It can't be less than 3 though (per an earlier test), so the case
>> 2 code was correct.
>
> Hmm, I don't see how the case 2 code can be correct?
> If, like you say, res_ndigits can't be less than 3, that means it can 
> be 3, right?
> And if res_ndigits=3 then `var2digits[res_ndigits - 4]` would try to 
> access `var2digits[-1]`.

Here is an example on how to trigger the bug:

```
            case 2:
                if (res_ndigits - 4 < 0)
                {
                    printf("var1=%s\n",get_str_from_var(var1));
                    printf("var2=%s\n",get_str_from_var(var2));
                    printf("rscale=%d\n", rscale);
                    printf("res_ndigits - 4 < 0 => var2digits[%d]=%d\n", res_ndigits - 4, var2digits[res_ndigits -
4]);
                }
```

Running through my tests, I hit lots of cases, including:

var1=0.10968501
var2=0.903728177113
rscale=0
res_ndigits - 4 < 0 => var2digits[-1]=-31105

All of the spotted cases had rscale=0.

If we know that mul_var() will never be called with rscale=0 when dealing with decimal inputs, perhaps we should
enforce this with an Assert(), to prevent the otherwise possible out-of-bounds access (negative indexing) and provide
early detection?

Regards,
Joel



On Wed, Jul 3, 2024, at 20:57, Dean Rasheed wrote:
> Ah yes, I think I was looking at a newer version of the code where I'd
> already fixed that bug. Unless you think there are still bugs in any
> of the boundary checks, which is entirely possible.

Ah, that explains it.
And no, I can't find any other bugs in the boundary checks.

> Ah, I can see what's going on. It's perhaps best illustrated with a
> simple example. Suppose you are multiplying a 4-digit integer x by a
> 2-digit integer y (both with dscale=0). Then the terms of the full
> product computed by mul_var() or mul_var_small() would look something
> like this:
>
>             x0     x1     x2     x3
>                         * y0     y1
>   ---------------------------------
>   x0*y0  x1*y0  x2*y0  x3*y0
>          x0*y1  x1*y1  x2*y1  x3*y1
>
> In the reduced-rscale case, it might perform a truncated computation,
> computing just the first 3 columns (say), and discarding the last two
> columns. Therefore it would skip the 3 rightmost digit products.
>
> However, in mul_var_int(), y0 and y1 have been combined into a single
> integer equal to y0*NBASE+y1, and the terms of full product are
> computed as follows:
>
>   x0*(y0*NBASE+y1)  x1*(y0*NBASE+y1)  x2*(y0*NBASE+y1)  x3*(y0*NBASE+y1)
>
> In the full product, that gives the same result, but if you follow the
> same rule in the reduced-rscale case, skipping the last two terms, it
> would actually discard 4 digit products, making it less accurate.
>
> That could be avoided by increasing maxdigits by 1 in mul_var_int() so
> it would always be at least as accurate as it was before, but that
> might not really be necessary. However, if we implemented
> mul_var_int64() in the same way, it would be discarding much
> higher-order digit products, and so we probably would have to increase
> maxdigits to get sufficiently accurate results. But there's an even
> bigger problem: the results would be different between platforms that
> did and didn't have 128-bit integers, which I really don't like. We
> could avoid that by not using it in reduced-rscale cases, but that
> would involve another test condition.
>
> By contrast, mul_var_small() is intended to replicate the arithmetic
> in mul_var() exactly (but in a different order) for all rscales. So if
> you've found any cases where they give different results, that's a
> bug.
>
> In light of that, it might be that mul_var_small() is the better
> option, rather than mul_var_int(), but it'd be interesting to see how
> they compare in terms of performance first.

Thanks for explaining, very helpful.

I agree on your reasoning about the pros and cons.
Not sure yet which version I prefer. Let's see how it evolves.

I've done some benchmarks.
Haven't tested Intel and AMD yet, but this is what I get on my Apple M3 Max:

--
-- varndigits=1
--

-- HEAD
SELECT SUM(numeric_mul_head(var1,var2,0)) FROM bench_mul_var_var1ndigits_1;
Time: 2976.896 ms (00:02.977)
Time: 2984.759 ms (00:02.985)
Time: 2970.364 ms (00:02.970)

-- mul_var_int() patch
SELECT SUM(numeric_mul_patch_int(var1,var2,0)) FROM bench_mul_var_var1ndigits_1;
Time: 2790.227 ms (00:02.790)
Time: 2786.338 ms (00:02.786)
Time: 2784.957 ms (00:02.785)

-- mul_var_small() patch
SELECT SUM(numeric_mul_patch_small(var1,var2,0)) FROM bench_mul_var_var1ndigits_1;
Time: 2770.211 ms (00:02.770)
Time: 2760.685 ms (00:02.761)
Time: 2773.221 ms (00:02.773)

--
-- varndigits=2
--

-- HEAD
SELECT SUM(numeric_mul_head(var1,var2,0)) FROM bench_mul_var_var1ndigits_2;
Time: 3353.258 ms (00:03.353)
Time: 3273.055 ms (00:03.273)
Time: 3266.392 ms (00:03.266)

-- mul_var_int() patch
SELECT SUM(numeric_mul_patch_int(var1,var2,0)) FROM bench_mul_var_var1ndigits_2;
Time: 2694.169 ms (00:02.694)
Time: 2687.935 ms (00:02.688)
Time: 2692.398 ms (00:02.692)

-- mul_var_small() patch
SELECT SUM(numeric_mul_patch_small(var1,var2,0)) FROM bench_mul_var_var1ndigits_2;
Time: 2997.685 ms (00:02.998)
Time: 2984.418 ms (00:02.984)
Time: 2986.976 ms (00:02.987)

--
-- varndigits=3
--

-- HEAD
SELECT SUM(numeric_mul_head(var1,var2,0)) FROM bench_mul_var_var1ndigits_3;
Time: 3471.391 ms (00:03.471)
Time: 3384.114 ms (00:03.384)
Time: 3387.031 ms (00:03.387)

-- mul_var_int() patch
SELECT SUM(numeric_mul_patch_int(var1,var2,0)) FROM bench_mul_var_var1ndigits_3;
Time: 3384.428 ms (00:03.384)
Time: 3398.044 ms (00:03.398)
Time: 3393.727 ms (00:03.394)

-- mul_var_small() patch
SELECT SUM(numeric_mul_patch_small(var1,var2,0)) FROM bench_mul_var_var1ndigits_3;
Time: 3100.567 ms (00:03.101)
Time: 3114.225 ms (00:03.114)
Time: 3116.137 ms (00:03.116)

Interesting, mul_var_small() seems to be the winner for var1ndigits=3
and mul_var_int() to be the winner for var1ndigits=2,
and about the same for var1ndigits=1.

>> What can be said about mul_var()'s contract with regards to rscale?
>> It's the number of decimal digits requested by the caller, and if not
>> requesting full precision, then the decimal digits might not be accurate,
>> but can something be said about how far off they can be?
>>
>
> I wouldn't expect it to ever be off by more than 1, given that
> MUL_GUARD_DIGITS = 2, which corresponds to 8 decimal digits, and the
> number of digits in the smaller input (and hence the number of digit
> products in each column) is limited to something like 16,000 NBASE
> digits.

OK, so then the cases I found where it was off by 2 for the mul_var_int() patch
are unexpected?

Regards,
Joel



On Wed, Jul 3, 2024, at 22:27, Joel Jacobson wrote:
> On Wed, Jul 3, 2024, at 20:57, Dean Rasheed wrote:
>> I wouldn't expect it to ever be off by more than 1, given that
>> MUL_GUARD_DIGITS = 2, which corresponds to 8 decimal digits, and the
>> number of digits in the smaller input (and hence the number of digit
>> products in each column) is limited to something like 16,000 NBASE
>> digits.
>
> OK, so then the cases I found where it was off by 2 for the mul_var_int() patch
> are unexpected?

Sorry, I meant off by 2 for the mul_var_small() patch. These are the cases I found:

       var1        |      var2      | rscale_adjustment |   result   | numeric_mul_patch_small
-------------------+----------------+-------------------+------------+-------------------------
     8952.12658563 | 0.902315486665 |               -16 |  8077.6425 |               8077.6423
    0.881715409579 | 0.843165739371 |               -16 | 0.74343223 |              0.74343221
    0.905322758954 | 0.756905996850 |               -16 | 0.68524423 |              0.68524421
 8464.043170546608 | 0.518100129611 |               -20 |  4385.2219 |               4385.2217
 5253.006296984449 | 0.989308019355 |               -20 |  5196.8413 |               5196.8411
(5 rows)

Regards,
Joel



On Wed, Jul 3, 2024, at 13:17, Dean Rasheed wrote:
> Anyway, here are both patches for comparison. I'll stop hacking for a
> while and let you see what you make of these.
>
> Regards,
> Dean
>
> Attachments:
> * v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
> * v5-add-mul_var_int.patch

I've now benchmarked the patches on all my machines;
see bench_mul_var.sql for details.

Summary of benchmark results:

         cpu          | var1ndigits |                           winner
----------------------+-------------+-------------------------------------------------------------
 AMD Ryzen 9 7950X3D  |           1 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
 AMD Ryzen 9 7950X3D  |           2 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
 AMD Ryzen 9 7950X3D  |           3 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
 Apple M3 Max         |           1 | v5-add-mul_var_int.patch
 Apple M3 Max         |           2 | v5-add-mul_var_int.patch
 Apple M3 Max         |           3 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
 Intel Core i9-14900K |           1 | v5-add-mul_var_int.patch
 Intel Core i9-14900K |           2 | v5-add-mul_var_int.patch
 Intel Core i9-14900K |           3 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
(9 rows)

Performance ratio against HEAD per CPU and var1ndigits:

         cpu          | var1ndigits |                           version                           | performance_ratio
----------------------+-------------+-------------------------------------------------------------+-------------------
 AMD Ryzen 9 7950X3D  |           1 | HEAD                                                        |              1.00
 AMD Ryzen 9 7950X3D  |           1 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.11
 AMD Ryzen 9 7950X3D  |           1 | v5-add-mul_var_int.patch                                    |              1.07
 AMD Ryzen 9 7950X3D  |           1 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.12
 AMD Ryzen 9 7950X3D  |           2 | HEAD                                                        |              1.00
 AMD Ryzen 9 7950X3D  |           2 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.10
 AMD Ryzen 9 7950X3D  |           2 | v5-add-mul_var_int.patch                                    |              1.11
 AMD Ryzen 9 7950X3D  |           2 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.13
 AMD Ryzen 9 7950X3D  |           3 | HEAD                                                        |              1.00
 AMD Ryzen 9 7950X3D  |           3 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.10
 AMD Ryzen 9 7950X3D  |           3 | v5-add-mul_var_int.patch                                    |              0.98
 AMD Ryzen 9 7950X3D  |           3 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.15
 Apple M3 Max         |           1 | HEAD                                                        |              1.00
 Apple M3 Max         |           1 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.07
 Apple M3 Max         |           1 | v5-add-mul_var_int.patch                                    |              1.08
 Apple M3 Max         |           1 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.07
 Apple M3 Max         |           2 | HEAD                                                        |              1.00
 Apple M3 Max         |           2 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.09
 Apple M3 Max         |           2 | v5-add-mul_var_int.patch                                    |              1.21
 Apple M3 Max         |           2 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.10
 Apple M3 Max         |           3 | HEAD                                                        |              1.00
 Apple M3 Max         |           3 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.09
 Apple M3 Max         |           3 | v5-add-mul_var_int.patch                                    |              0.99
 Apple M3 Max         |           3 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.09
 Intel Core i9-14900K |           1 | HEAD                                                        |              1.00
 Intel Core i9-14900K |           1 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.05
 Intel Core i9-14900K |           1 | v5-add-mul_var_int.patch                                    |              1.07
 Intel Core i9-14900K |           1 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.06
 Intel Core i9-14900K |           2 | HEAD                                                        |              1.00
 Intel Core i9-14900K |           2 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.06
 Intel Core i9-14900K |           2 | v5-add-mul_var_int.patch                                    |              1.08
 Intel Core i9-14900K |           2 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.06
 Intel Core i9-14900K |           3 | HEAD                                                        |              1.00
 Intel Core i9-14900K |           3 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.04
 Intel Core i9-14900K |           3 | v5-add-mul_var_int.patch                                    |              1.00
 Intel Core i9-14900K |           3 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch |              1.04
(36 rows)

The queries to produce the above are in bench_csv_queries.txt

/Joel
Attachments
On Thu, Jul 4, 2024, at 09:38, Joel Jacobson wrote:
> Summary of benchmark results:
>
>          cpu          | var1ndigits |                           winner
> ----------------------+-------------+-------------------------------------------------------------
..
>  AMD Ryzen 9 7950X3D  |           3 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
...
>  Apple M3 Max         |           3 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
...
>  Intel Core i9-14900K |           3 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch

Since v5-add-mul_var_int.patch only implements the (var1ndigits <= 2) case,
it can't possibly win the var1ndigits=3 competition.

/Joel



On Wed, 3 Jul 2024 at 21:45, Joel Jacobson <joel@compiler.org> wrote:
>
> > On Wed, Jul 3, 2024, at 20:57, Dean Rasheed wrote:
> >> I wouldn't expect it to ever be off by more than 1
> >
> > OK, so then the cases I found where it was off by 2 for the mul_var_int() patch
> > are unexpected?
>
> Sorry, I meant off by 2 for the mul_var_small() patch; these are the cases I found:
>

Yeah, so that was another bug in mul_var_small(). If rscale is made
small enough, the result index for the digits computed before the main
loop overlaps the ones after, so it would overwrite digits already
computed.

Of course, that's fairly easy to fix, but at this point I think the
better solution is to only use mul_var_small() when an exact product
is requested. We would have to do that for mul_var_int() anyway,
because of its accuracy issues discussed earlier. I think this is a
reasonable thing to do because only functions like ln_var() and
exp_var() will ask mul_var() for a reduced-rscale result, and those
functions are likely to be dominated by computations involving larger
numbers, for which this patch wouldn't help anyway. Also those
functions are probably less widely used.

If we make that decision, a lot of the complexity in mul_var_small()
goes away, including all the conditional array accesses, making it
simpler and more efficient. v6 patch attached.
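
To make that concrete, the gating check in mul_var() ends up being of roughly
this shape (a sketch of the idea only -- the digit bound and the exact
mul_var_small() signature are whatever the patch says, not necessarily what is
written here):

```
	/*
	 * Fast path for short var1 inputs, but only when the caller wants the
	 * exact, full-scale product, so that no rounding or reduced-rscale
	 * handling is ever needed inside mul_var_small().
	 */
	if (var1ndigits <= 3 && rscale == var1->dscale + var2->dscale)
	{
		mul_var_small(var1, var2, result, rscale);
		return;
	}
```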

I also updated the mul_var_int() patch so that it is also only invoked
when an exact product is requested, and I noticed a couple of other
minor optimisations that could be made. Then I decided to try
implementing mul_var_int64(). This gives a pretty decent speedup for
3-digit inputs, but unfortunately it is much slower for 4-digit inputs
(for which most values will go through the 128-bit code path). I'm
attaching that too, just for information, but it's clearly not going
to be acceptable as-is.
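
For reference, the core idea of mul_var_int64() is simply to pack the short
input into a single 64-bit integer (a sketch only; "factor" is an illustrative
name, not the patch text):

```
	int64		factor = 0;
	int			i;

	/*
	 * Up to three base-NBASE digits give factor < 10^12, so each partial
	 * product factor * var2digit stays below 10^16 and fits in an int64.
	 * A 4-digit factor can approach 10^16, so its partial products can
	 * overflow 64 bits, forcing most values onto the slower 128-bit path.
	 */
	for (i = 0; i < var1ndigits; i++)
		factor = factor * NBASE + var1digits[i];
```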

Running your benchmark queries, I got these results:

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1;
Time: 4520.874 ms (00:04.521)  -- HEAD
Time: 3937.536 ms (00:03.938)  -- v5-mul_var_int.patch
Time: 3919.495 ms (00:03.919)  -- v5-mul_var_small.patch
Time: 3916.964 ms (00:03.917)  -- v6-mul_var_int64.patch
Time: 3811.118 ms (00:03.811)  -- v6-mul_var_small.patch

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2;
Time: 4762.528 ms (00:04.763)  -- HEAD
Time: 4075.546 ms (00:04.076)  -- v5-mul_var_int.patch
Time: 4055.180 ms (00:04.055)  -- v5-mul_var_small.patch
Time: 4037.866 ms (00:04.038)  -- v6-mul_var_int64.patch
Time: 4018.488 ms (00:04.018)  -- v6-mul_var_small.patch

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3;
Time: 5387.514 ms (00:05.388)  -- HEAD
Time: 5350.736 ms (00:05.351)  -- v5-mul_var_int.patch
Time: 4648.449 ms (00:04.648)  -- v5-mul_var_small.patch
Time: 4655.204 ms (00:04.655)  -- v6-mul_var_int64.patch
Time: 4645.962 ms (00:04.646)  -- v6-mul_var_small.patch

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4;
Time: 5617.150 ms (00:05.617)  -- HEAD
Time: 5505.913 ms (00:05.506)  -- v5-mul_var_int.patch
Time: 5486.441 ms (00:05.486)  -- v5-mul_var_small.patch
Time: 8203.081 ms (00:08.203)  -- v6-mul_var_int64.patch
Time: 5598.909 ms (00:05.599)  -- v6-mul_var_small.patch

So v6-mul_var_int64 improves on v5-mul_var_int in the 3-digit case,
but is terrible in the 4-digit case. None of the other patches touch
the 4-digit case, but it might be interesting to try mul_var_small()
with 4 digits.

Regards,
Dean

Attachments
On Thu, Jul 4, 2024, at 20:43, Dean Rasheed wrote:
> On Wed, 3 Jul 2024 at 21:45, Joel Jacobson <joel@compiler.org> wrote:
>>
>> > On Wed, Jul 3, 2024, at 20:57, Dean Rasheed wrote:
>> >> I wouldn't expect it to ever be off by more than 1
>> >
>> > OK, so then the cases I found where it was off by 2 for the mul_var_int() patch
>> > are unexpected?
>>
>> Sorry, I meant off by 2 for the mul_var_small() patch; these are the cases I found:
>>
>
> Yeah, so that was another bug in mul_var_small(). If rscale is made
> small enough, the result index for the digits computed before the main
> loop overlaps the ones after, so it would overwrite digits already
> computed.
>
> Of course, that's fairly easy to fix, but at this point I think the
> better solution is to only use mul_var_small() when an exact product
> is requested. We would have to do that for mul_var_int() anyway,
> because of its accuracy issues discussed earlier. I think this is a
> reasonable thing to do because only functions like ln_var() and
> exp_var() will ask mul_var() for a reduced-rscale result, and those
> functions are likely to be dominated by computations involving larger
> numbers, for which this patch wouldn't help anyway. Also those
> functions are probably less widely used.
>
> If we make that decision, a lot of the complexity in mul_var_small()
> goes away, including all the conditional array accesses, making it
> simpler and more efficient. v6 patch attached.

Nice, I think that looks like the better trade-off.

> I also updated the mul_var_int() patch so that it is also only invoked
> when an exact product is requested, and I noticed a couple of other
> minor optimisations that could be made.

Looks nice.

> Then I decided to try
> implementing mul_var_int64(). This gives a pretty decent speedup for
> 3-digit inputs, but unfortunately it is much slower for 4-digit inputs
> (for which most values will go through the 128-bit code path). I'm
> attaching that too, just for information, but it's clearly not going
> to be acceptable as-is.
>
> Running your benchmark queries, I got these results:
>
> SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1;
> Time: 4520.874 ms (00:04.521)  -- HEAD
> Time: 3937.536 ms (00:03.938)  -- v5-mul_var_int.patch
> Time: 3919.495 ms (00:03.919)  -- v5-mul_var_small.patch
> Time: 3916.964 ms (00:03.917)  -- v6-mul_var_int64.patch
> Time: 3811.118 ms (00:03.811)  -- v6-mul_var_small.patch

My benchmarks also indicate v6-mul_var_small.patch (v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch)
is the winner on all three of my CPUs for var1ndigits=1:

-- Apple M3 Max

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD
Time: 3046.668 ms (00:03.047)
Time: 3053.327 ms (00:03.053)
Time: 3045.517 ms (00:03.046)
Time: 3049.626 ms (00:03.050)
Time: 3075.101 ms (00:03.075)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-add-mul_var_int64.patch
Time: 2781.536 ms (00:02.782)
Time: 2781.324 ms (00:02.781)
Time: 2781.301 ms (00:02.781)
Time: 2786.524 ms (00:02.787)
Time: 2784.494 ms (00:02.784)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 2711.167 ms (00:02.711)
Time: 2723.647 ms (00:02.724)
Time: 2706.372 ms (00:02.706)
Time: 2708.883 ms (00:02.709)
Time: 2704.621 ms (00:02.705)

-- Intel Core i9-14900K

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD
Time: 3496.714 ms (00:03.497)
Time: 3278.909 ms (00:03.279)
Time: 3278.631 ms (00:03.279)
Time: 3277.658 ms (00:03.278)
Time: 3276.121 ms (00:03.276)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-add-mul_var_int64.patch
Time: 3080.751 ms (00:03.081)
Time: 3078.118 ms (00:03.078)
Time: 3079.932 ms (00:03.080)
Time: 3080.668 ms (00:03.081)
Time: 3080.697 ms (00:03.081)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3043.635 ms (00:03.044)
Time: 3043.606 ms (00:03.044)
Time: 3041.391 ms (00:03.041)
Time: 3041.997 ms (00:03.042)
Time: 3045.464 ms (00:03.045)

-- AMD Ryzen 9 7950X3D

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD
Time: 3421.307 ms (00:03.421)
Time: 3400.935 ms (00:03.401)
Time: 3359.839 ms (00:03.360)
Time: 3374.246 ms (00:03.374)
Time: 3375.085 ms (00:03.375)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-add-mul_var_int64.patch
Time: 3002.214 ms (00:03.002)
Time: 3016.042 ms (00:03.016)
Time: 3010.541 ms (00:03.011)
Time: 3009.204 ms (00:03.009)
Time: 3002.088 ms (00:03.002)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 2959.319 ms (00:02.959)
Time: 2957.694 ms (00:02.958)
Time: 2971.559 ms (00:02.972)
Time: 2958.788 ms (00:02.959)
Time: 2958.812 ms (00:02.959)

> SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2;
> Time: 4762.528 ms (00:04.763)  -- HEAD
> Time: 4075.546 ms (00:04.076)  -- v5-mul_var_int.patch
> Time: 4055.180 ms (00:04.055)  -- v5-mul_var_small.patch
> Time: 4037.866 ms (00:04.038)  -- v6-mul_var_int64.patch
> Time: 4018.488 ms (00:04.018)  -- v6-mul_var_small.patch

I get mixed results for var1ndigits=2:
Winner on Apple M3 Max and AMD Ryzen 9 7950X3D is v6-add-mul_var_int64.patch.
Winner on Intel Core i9-14900K is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch

-- Apple M3 Max

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD
Time: 3369.724 ms (00:03.370)
Time: 3340.977 ms (00:03.341)
Time: 3331.407 ms (00:03.331)
Time: 3333.304 ms (00:03.333)
Time: 3332.136 ms (00:03.332)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-add-mul_var_int64.patch
Time: 2768.108 ms (00:02.768)
Time: 2736.996 ms (00:02.737)
Time: 2730.217 ms (00:02.730)
Time: 2743.556 ms (00:02.744)
Time: 2746.725 ms (00:02.747)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 2895.200 ms (00:02.895)
Time: 2894.823 ms (00:02.895)
Time: 2899.475 ms (00:02.899)
Time: 2895.385 ms (00:02.895)
Time: 2898.647 ms (00:02.899)

-- Intel Core i9-14900K

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD
Time: 3385.836 ms (00:03.386)
Time: 3367.739 ms (00:03.368)
Time: 3363.321 ms (00:03.363)
Time: 3365.433 ms (00:03.365)
Time: 3365.301 ms (00:03.365)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-add-mul_var_int64.patch
Time: 3086.253 ms (00:03.086)
Time: 3085.272 ms (00:03.085)
Time: 3085.769 ms (00:03.086)
Time: 3089.128 ms (00:03.089)
Time: 3086.735 ms (00:03.087)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3053.775 ms (00:03.054)
Time: 3058.392 ms (00:03.058)
Time: 3068.113 ms (00:03.068)
Time: 3057.333 ms (00:03.057)
Time: 3057.218 ms (00:03.057)

-- AMD Ryzen 9 7950X3D

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD
Time: 3619.441 ms (00:03.619)
Time: 3609.553 ms (00:03.610)
Time: 3574.277 ms (00:03.574)
Time: 3578.031 ms (00:03.578)
Time: 3558.545 ms (00:03.559)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-add-mul_var_int64.patch
Time: 3061.858 ms (00:03.062)
Time: 3082.174 ms (00:03.082)
Time: 3081.093 ms (00:03.081)
Time: 3093.610 ms (00:03.094)
Time: 3064.507 ms (00:03.065)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3100.287 ms (00:03.100)
Time: 3100.264 ms (00:03.100)
Time: 3097.207 ms (00:03.097)
Time: 3101.477 ms (00:03.101)
Time: 3098.035 ms (00:03.098)

> SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3;
> Time: 5387.514 ms (00:05.388)  -- HEAD
> Time: 5350.736 ms (00:05.351)  -- v5-mul_var_int.patch
> Time: 4648.449 ms (00:04.648)  -- v5-mul_var_small.patch
> Time: 4655.204 ms (00:04.655)  -- v6-mul_var_int64.patch
> Time: 4645.962 ms (00:04.646)  -- v6-mul_var_small.patch

I get mixed results for var1ndigits=3:
Winner on Apple M3 Max and AMD Ryzen 9 7950X3D is v6-add-mul_var_int64.patch.
Winner on Intel Core i9-14900K is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Same winners as for var1ndigits=2.

-- Apple M3 Max

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 3466.932 ms (00:03.467)
Time: 3447.001 ms (00:03.447)
Time: 3457.259 ms (00:03.457)
Time: 3445.938 ms (00:03.446)
Time: 3443.310 ms (00:03.443)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-add-mul_var_int64.patch
Time: 2988.444 ms (00:02.988)
Time: 2976.036 ms (00:02.976)
Time: 2982.756 ms (00:02.983)
Time: 2986.436 ms (00:02.986)
Time: 2973.457 ms (00:02.973)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3026.666 ms (00:03.027)
Time: 3024.458 ms (00:03.024)
Time: 3022.976 ms (00:03.023)
Time: 3029.526 ms (00:03.030)
Time: 3021.543 ms (00:03.022)

-- Intel Core i9-14900K

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 4510.078 ms (00:04.510)
Time: 4222.501 ms (00:04.223)
Time: 4179.509 ms (00:04.180)
Time: 4179.307 ms (00:04.179)
Time: 4183.026 ms (00:04.183)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-add-mul_var_int64.patch
Time: 3811.866 ms (00:03.812)
Time: 3816.664 ms (00:03.817)
Time: 3811.695 ms (00:03.812)
Time: 3811.674 ms (00:03.812)
Time: 3812.265 ms (00:03.812)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 4095.053 ms (00:04.095)
Time: 3896.002 ms (00:03.896)
Time: 3888.999 ms (00:03.889)
Time: 3889.346 ms (00:03.889)
Time: 3889.017 ms (00:03.889)

-- AMD Ryzen 9 7950X3D

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 3946.255 ms (00:03.946)
Time: 3896.110 ms (00:03.896)
Time: 3877.470 ms (00:03.877)
Time: 3854.402 ms (00:03.854)
Time: 3901.218 ms (00:03.901)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-add-mul_var_int64.patch
Time: 3393.572 ms (00:03.394)
Time: 3395.401 ms (00:03.395)
Time: 3395.199 ms (00:03.395)
Time: 3418.555 ms (00:03.419)
Time: 3399.619 ms (00:03.400)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3308.791 ms (00:03.309)
Time: 3309.316 ms (00:03.309)
Time: 3318.238 ms (00:03.318)
Time: 3296.061 ms (00:03.296)
Time: 3320.282 ms (00:03.320)

> SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4;
> Time: 5617.150 ms (00:05.617)  -- HEAD
> Time: 5505.913 ms (00:05.506)  -- v5-mul_var_int.patch
> Time: 5486.441 ms (00:05.486)  -- v5-mul_var_small.patch
> Time: 8203.081 ms (00:08.203)  -- v6-mul_var_int64.patch
> Time: 5598.909 ms (00:05.599)  -- v6-mul_var_small.patch
> So v6-mul_var_int64 improves on v5-mul_var_int in the 3-digit case,
> but is terrible in the 4-digit case. None of the other patches touch
> the 4-digit case, but it might be interesting to try mul_var_small()
> with 4 digits.

Interesting that you got such bad bench results for v6-mul_var_int64.patch
for var1ndigits=4; that patch is actually the winner on AMD Ryzen 9 7950X3D.

On Intel Core i9-14900K the winner is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.
On Apple M3 Max, HEAD is the winner.

-- Apple M3 Max

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3618.530 ms (00:03.619)
Time: 3595.239 ms (00:03.595)
Time: 3600.412 ms (00:03.600)
Time: 3607.500 ms (00:03.607)
Time: 3604.122 ms (00:03.604)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch
Time: 4498.993 ms (00:04.499)
Time: 4527.302 ms (00:04.527)
Time: 4493.613 ms (00:04.494)
Time: 4482.194 ms (00:04.482)
Time: 4493.660 ms (00:04.494)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3628.177 ms (00:03.628)
Time: 3613.975 ms (00:03.614)
Time: 3612.213 ms (00:03.612)
Time: 3614.026 ms (00:03.614)
Time: 3622.959 ms (00:03.623)

-- Intel Core i9-14900K

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 4130.810 ms (00:04.131)
Time: 3836.462 ms (00:03.836)
Time: 3810.604 ms (00:03.811)
Time: 3805.443 ms (00:03.805)
Time: 3803.055 ms (00:03.803)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch
Time: 3952.862 ms (00:03.953)
Time: 3792.272 ms (00:03.792)
Time: 3793.995 ms (00:03.794)
Time: 3790.531 ms (00:03.791)
Time: 3793.647 ms (00:03.794)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3762.828 ms (00:03.763)
Time: 3754.869 ms (00:03.755)
Time: 3756.041 ms (00:03.756)
Time: 3758.719 ms (00:03.759)
Time: 3754.894 ms (00:03.755)

-- AMD Ryzen 9 7950X3D

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 4075.988 ms (00:04.076)
Time: 4067.702 ms (00:04.068)
Time: 4035.191 ms (00:04.035)
Time: 4022.411 ms (00:04.022)
Time: 4016.475 ms (00:04.016)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch
Time: 3830.021 ms (00:03.830)
Time: 3837.947 ms (00:03.838)
Time: 3834.894 ms (00:03.835)
Time: 3832.755 ms (00:03.833)
Time: 3834.512 ms (00:03.835)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 4031.128 ms (00:04.031)
Time: 4001.590 ms (00:04.002)
Time: 4031.212 ms (00:04.031)
Time: 4035.941 ms (00:04.036)
Time: 4031.218 ms (00:04.031)

Regards,
Joel



On Fri, 5 Jul 2024 at 12:56, Joel Jacobson <joel@compiler.org> wrote:
>
> Interesting that you got such bad bench results for v6-mul_var_int64.patch
> for var1ndigits=4; that patch is actually the winner on AMD Ryzen 9 7950X3D.

Interesting.

> On Intel Core i9-14900K the winner is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.

That must be random noise, since
v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch doesn't
invoke mul_var_small() for 4-digit inputs.

> On Apple M3 Max, HEAD is the winner.

Importantly, mul_var_int64() is around 1.25x slower there, and it was
even worse on my machine.

Attached is a v7 mul_var_small() patch adding 4-digit support. For me,
this gives a nice speedup:

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4;
Time: 5617.150 ms (00:05.617)  -- HEAD
Time: 8203.081 ms (00:08.203)  -- v6-mul_var_int64.patch
Time: 4750.212 ms (00:04.750)  -- v7-mul_var_small.patch

The other advantage, of course, is that it doesn't require 128-bit
integer support.
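
To see why no 128-bit arithmetic is needed: each result column in the 4-digit
case is the sum of at most four digit products plus a carry, and
4 * (NBASE - 1)^2 is only about 4e8, which leaves ample headroom even in
64-bit (indeed 32-bit) arithmetic. A sketch of one interior column, assuming
the convention that the product of digits i1 and i2 lands in column
i1 + i2 + 1 (variable names are illustrative, not the patch text):

```
	uint64		term;

	/* carry is the carry in from the next-less-significant column */
	term = (uint64) var1digits[0] * var2digits[3] +
		   (uint64) var1digits[1] * var2digits[2] +
		   (uint64) var1digits[2] * var2digits[1] +
		   (uint64) var1digits[3] * var2digits[0] +
		   carry;
	res_digits[4] = (NumericDigit) (term % NBASE);
	carry = term / NBASE;
```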

Regards,
Dean

Attachments
On Fri, Jul 5, 2024, at 17:41, Dean Rasheed wrote:
> On Fri, 5 Jul 2024 at 12:56, Joel Jacobson <joel@compiler.org> wrote:
>>
>> Interesting that you got such bad bench results for v6-mul_var_int64.patch
>> for var1ndigits=4; that patch is actually the winner on AMD Ryzen 9 7950X3D.
>
> Interesting.

I remeasured just to be sure, and yes, it was the winner among the previous patches,
but the new v7 beats it.

>> On Intel Core i9-14900K the winner is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.
>
> That must be random noise, since
> v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch doesn't
> invoke mul_var_small() for 4-digit inputs.

Yes, something was off with the HEAD measurements for that one.
I remeasured and got almost identical results (as expected)
between HEAD and v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
for 4-digit inputs.

>> On Apple M3 Max, HEAD is the winner.
>
> Importantly, mul_var_int64() is around 1.25x slower there, and it was
> even worse on my machine.
>
> Attached is a v7 mul_var_small() patch adding 4-digit support. For me,
> this gives a nice speedup:
>
> SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4;
> Time: 5617.150 ms (00:05.617)  -- HEAD
> Time: 8203.081 ms (00:08.203)  -- v6-mul_var_int64.patch
> Time: 4750.212 ms (00:04.750)  -- v7-mul_var_small.patch
>
> The other advantage, of course, is that it doesn't require 128-bit
> integer support.

Very nice, v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
is now the winner on all my CPUs:

-- Apple M3 Max

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3574.865 ms (00:03.575)
Time: 3573.678 ms (00:03.574)
Time: 3576.953 ms (00:03.577)
Time: 3580.536 ms (00:03.581)
Time: 3589.007 ms (00:03.589)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3110.171 ms (00:03.110)
Time: 3098.558 ms (00:03.099)
Time: 3105.873 ms (00:03.106)
Time: 3104.484 ms (00:03.104)
Time: 3109.035 ms (00:03.109)

-- Intel Core i9-14900K

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3751.767 ms (00:03.752)
Time: 3745.916 ms (00:03.746)
Time: 3742.542 ms (00:03.743)
Time: 3746.139 ms (00:03.746)
Time: 3745.493 ms (00:03.745)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3747.640 ms (00:03.748)
Time: 3747.231 ms (00:03.747)
Time: 3747.965 ms (00:03.748)
Time: 3748.309 ms (00:03.748)
Time: 3746.498 ms (00:03.746)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3417.924 ms (00:03.418)
Time: 3417.088 ms (00:03.417)
Time: 3415.708 ms (00:03.416)
Time: 3415.453 ms (00:03.415)
Time: 3419.566 ms (00:03.420)

-- AMD Ryzen 9 7950X3D

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3970.131 ms (00:03.970)
Time: 3924.335 ms (00:03.924)
Time: 3927.863 ms (00:03.928)
Time: 3924.761 ms (00:03.925)
Time: 3926.290 ms (00:03.926)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch
Time: 3874.769 ms (00:03.875)
Time: 3858.071 ms (00:03.858)
Time: 3836.698 ms (00:03.837)
Time: 3871.388 ms (00:03.871)
Time: 3844.907 ms (00:03.845)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3397.846 ms (00:03.398)
Time: 3398.050 ms (00:03.398)
Time: 3395.279 ms (00:03.395)
Time: 3393.285 ms (00:03.393)
Time: 3402.570 ms (00:03.403)

Code-wise, I think it's now very nice and clear, with just enough comments.

Also nice to see that the var1ndigits=4 case isn't much more complex
than var1ndigits=3, since it follows the same pattern.

Regards,
Joel



On Fri, Jul 5, 2024, at 18:42, Joel Jacobson wrote:
> Very nice, v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
> is now the winner on all my CPUs:

I thought it would be interesting to also measure the isolated effect
on just numeric_mul() without the query overhead.

I included var1ndigits=5, var2ndigits=5, which should be unaffected,
just to get a sense of the noise level.

SELECT timeit.h('numeric_mul',array['9999','9999'],2,min_time:='1 s'::interval);
SELECT timeit.h('numeric_mul',array['9999_9999','9999_9999'],2,min_time:='1 s'::interval);
SELECT timeit.h('numeric_mul',array['9999_9999_9999','9999_9999_9999'],2,min_time:='1 s'::interval);
SELECT timeit.h('numeric_mul',array['9999_9999_9999_9999','9999_9999_9999_9999'],2,min_time:='1 s'::interval);
SELECT timeit.h('numeric_mul',array['9999_9999_9999_9999_9999','9999_9999_9999_9999_9999'],2,min_time:='1 s'::interval);

CPU                  | var1ndigits | var2ndigits | HEAD  | v7    | HEAD/v7
---------------------+-------------+-------------+-------+-------+---------
Apple M3 Max         |           1 |           1 | 28 ns | 18 ns | 1.56
Apple M3 Max         |           2 |           2 | 32 ns | 18 ns | 1.78
Apple M3 Max         |           3 |           3 | 38 ns | 21 ns | 1.81
Apple M3 Max         |           4 |           4 | 42 ns | 24 ns | 1.75
Intel Core i9-14900K |           1 |           1 | 25 ns | 20 ns | 1.25
Intel Core i9-14900K |           2 |           2 | 28 ns | 20 ns | 1.40
Intel Core i9-14900K |           3 |           3 | 33 ns | 24 ns | 1.38
Intel Core i9-14900K |           4 |           4 | 37 ns | 25 ns | 1.48
AMD Ryzen 9 7950X3D  |           1 |           1 | 37 ns | 29 ns | 1.28
AMD Ryzen 9 7950X3D  |           2 |           2 | 43 ns | 31 ns | 1.39
AMD Ryzen 9 7950X3D  |           3 |           3 | 50 ns | 37 ns | 1.35
AMD Ryzen 9 7950X3D  |           4 |           4 | 55 ns | 39 ns | 1.41

Impressive speed-up, between 25% - 81%.

Regards,
Joel



On Fri, 5 Jul 2024 at 18:37, Joel Jacobson <joel@compiler.org> wrote:
>
> On Fri, Jul 5, 2024, at 18:42, Joel Jacobson wrote:
> > Very nice, v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
> > is now the winner on all my CPUs:
>
> I thought it would be interesting to also measure the isolated effect
> on just numeric_mul() without the query overhead.
>
> Impressive speed-up, between 25% - 81%.
>

Cool. I think we should go with the mul_var_small() patch then, since
it's more generally applicable.

I also did some testing with much larger var2 values, and saw similar
speed-ups. One high-level function that benefits from that is
factorial(), which accepts inputs up to 32177, and so uses both the
1-digit and 2-digit code with very large var2 values. I doubt anyone
actually uses it with such large inputs, but it's interesting
nonetheless:

SELECT factorial(32177);
Time: 923.117 ms  -- HEAD
Time: 534.375 ms  -- mul_var_small() patch

I did one more round of (mostly cosmetic) copy-editing. Aside from
improving some of the comments, it occurred to me that there's no need
to pass rscale to mul_var_small(), or for it to call round_var(),
since it's always computing the exact result. That shaves off a few
more cycles.

Additionally, I didn't like how res_weight and res_ndigits were being
set 1 higher than they needed to be. That makes sense in mul_var()
because it may round the result, causing a non-zero carry to propagate
into the next digit up, but it's just confusing in mul_var_small(). So
I've reduced those by 1, which makes the code look much more logical. To be
clear, this doesn't change how many digits we're calculating. But now
res_ndigits is actually the number of digits being calculated, whereas
before, res_ndigits was 1 larger and we were calculating res_ndigits -
1 digits, which was confusing.
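
In other words, the result sizing in mul_var_small() now boils down to
something like this (a sketch from memory rather than the exact patch text):

```
	/* exact product: no rounding, so no extra guard digit is required */
	res_ndigits = var1ndigits + var2ndigits;
	res_weight = var1->weight + var2->weight + 1;
```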

I think this is good to go, so unless there are any further comments,
I plan to commit it soon.

Possible future work would be to try extending it to larger var1
values. I have a feeling that might work quite well for 5 or 6 digits,
but at some point, we'll start seeing diminishing returns, and the
code bloat won't be worth it.

Regards,
Dean

Attachments
On Sat, Jul 6, 2024, at 11:34, Dean Rasheed wrote:
> On Fri, 5 Jul 2024 at 18:37, Joel Jacobson <joel@compiler.org> wrote:
>>
>> On Fri, Jul 5, 2024, at 18:42, Joel Jacobson wrote:
>> > Very nice, v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
>> > is now the winner on all my CPUs:
>>
>> I thought it would be interesting to also measure the isolated effect
>> on just numeric_mul() without the query overhead.
>>
>> Impressive speed-up, between 25% - 81%.
>>
>
> Cool. I think we should go with the mul_var_small() patch then, since
> it's more generally applicable.

I agree.

> I also did some testing with much larger var2 values, and saw similar
> speed-ups. One high-level function that benefits from that is
> factorial(), which accepts inputs up to 32177, and so uses both the
> 1-digit and 2-digit code with very large var2 values. I doubt anyone
> actually uses it with such large inputs, but it's interesting
> nonetheless:
>
> SELECT factorial(32177);
> Time: 923.117 ms  -- HEAD
> Time: 534.375 ms  -- mul_var_small() patch

Nice!

> I did one more round of (mostly cosmetic) copy-editing. Aside from
> improving some of the comments, it occurred to me that there's no need
> to pass rscale to mul_var_small(), or for it to call round_var(),
> since it's always computing the exact result. That shaves off a few
> more cycles.

Nice, also cleaner.

> Additionally, I didn't like how res_weight and res_ndigits were being
> set 1 higher than they needed to be. That makes sense in mul_var()
> because it may round the result, causing a non-zero carry to propagate
> into the next digit up, but it's just confusing in mul_var_small(). So
> I've reduced those by 1, which makes the code look much more logical. To be
> clear, this doesn't change how many digits we're calculating. But now
> res_ndigits is actually the number of digits being calculated, whereas
> before, res_ndigits was 1 larger and we were calculating res_ndigits -
> 1 digits, which was confusing.

Nice, much cleaner.

> I think this is good to go, so unless there are any further comments,
> I plan to commit it soon.

LGTM.

Benchmark, only on Apple M3 Max:

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD
Time: 3042.157 ms (00:03.042)
Time: 3027.711 ms (00:03.028)
Time: 3078.215 ms (00:03.078)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v8
Time: 2700.676 ms (00:02.701)
Time: 2713.594 ms (00:02.714)
Time: 2704.139 ms (00:02.704)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD
Time: 4506.064 ms (00:04.506)
Time: 3316.204 ms (00:03.316)
Time: 3321.086 ms (00:03.321)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v8
Time: 2904.786 ms (00:02.905)
Time: 2921.996 ms (00:02.922)
Time: 2919.269 ms (00:02.919)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 4636.051 ms (00:04.636)
Time: 3439.951 ms (00:03.440)
Time: 3471.245 ms (00:03.471)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v8
Time: 3034.364 ms (00:03.034)
Time: 3025.351 ms (00:03.025)
Time: 3075.024 ms (00:03.075)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 4978.086 ms (00:04.978)
Time: 3580.283 ms (00:03.580)
Time: 3582.719 ms (00:03.583)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v8
Time: 3147.352 ms (00:03.147)
Time: 3135.903 ms (00:03.136)
Time: 3172.491 ms (00:03.172)

Regards,
Joel



On Sat, 6 Jul 2024 at 12:17, Joel Jacobson <joel@compiler.org> wrote:
>
> > I think this is good to go, so unless there are any further comments,
> > I plan to commit it soon.
>
> LGTM.
>

OK, I have committed this.

At the last minute, I changed the name of the new function to
mul_var_short() because "short" is probably a better term to use in
this context (we already use it in a preceding comment). "Small" is
potentially misleading, because the numbers themselves could be
numerically very large.

Regards,
Dean



On Tue, 9 Jul 2024 at 10:11, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> OK, I have committed this.
>

Before considering the other patches to optimise for larger inputs, I
think it's worth optimising the existing mul_var() code as much as
possible.

One thing I noticed while testing the earlier patches on this thread
was that they were significantly faster if they used unsigned integers
rather than signed integers. I think the reason is that operations
like "x / 10000" and "x % 10000" use fewer CPU instructions (on every
platform, according to godbolt.org) if x is unsigned.

In addition, this reduces the number of times the digit array needs to
be renormalised, which seems to be the biggest factor.
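
As a minimal sketch of the sort of thing I mean (illustrative names, not the
patch text), the final carry-propagation pass over an unsigned accumulator
array looks like this; the unsigned "/ NBASE" and "% NBASE" need no
sign-correction instructions, and the extra overflow headroom of an unsigned
accumulator is what allows fewer renormalisation passes:

```
	uint32		carry = 0;
	int			i;

	for (i = res_ndigits - 1; i >= 0; i--)
	{
		uint32		newdig = dig[i] + carry;

		res_digits[i] = (NumericDigit) (newdig % NBASE);
		carry = newdig / NBASE;
	}
```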

Another small optimisation that seems to be just about worthwhile is
to pull the first digit of var1 out of the main loop, so that its
contributions can be set directly in dig[], rather than being added to
it. This allows palloc() to be used to allocate dig[], rather than
palloc0(), and only requires the upper part of dig[] to be initialised
to zeros, rather than all of it.
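
Roughly speaking, the setup becomes something like this (a sketch assuming the
usual column convention that var1digits[i1] * var2digits[i2] contributes to
dig[i1 + i2 + 1]; the patch itself is the authoritative version):

```
	dig = (uint32 *) palloc(res_ndigits * sizeof(uint32));

	/* contributions of var1's first digit are assigned, not accumulated */
	for (i2 = 0; i2 < var2ndigits; i2++)
		dig[i2 + 1] = (uint32) var1digits[0] * var2digits[i2];

	/* only the entries that pass didn't reach still need zeroing */
	dig[0] = 0;
	memset(&dig[var2ndigits + 1], 0,
		   (res_ndigits - var2ndigits - 1) * sizeof(uint32));

	/* the remaining var1 digits are then accumulated by the main loop */
```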

Together, these seem to give a decent speed-up:

 NBASE digits |  HEAD rate    | patch rate
--------------+---------------+---------------
            5 | 5.8797105e+06 |  6.047134e+06
           12 |  4.140017e+06 | 4.3429845e+06
           25 | 2.5711072e+06 | 2.7530615e+06
           50 | 1.0367389e+06 | 1.3370771e+06
          100 |      367924.1 |     462437.38
          250 |      77231.32 |     104593.95
         2500 |     881.48694 |     1197.4739
        15000 |     25.064987 |      32.78391

The largest gains are above around 50 NBASE digits, where the time
spent renormalising dig[] becomes significant.

Regards,
Dean

Attachments
On Tue, Jul 9, 2024, at 14:01, Dean Rasheed wrote:
> Before considering the other patches to optimise for larger inputs, I
> think it's worth optimising the existing mul_var() code as much as
> possible.
>
> One thing I noticed while testing the earlier patches on this thread
> was that they were significantly faster if they used unsigned integers
> rather than signed integers. I think the reason is that operations
> like "x / 10000" and "x % 10000" use fewer CPU instructions (on every
> platform, according to godbolt.org) if x is unsigned.
>
> In addition, this reduces the number of times the digit array needs to
> be renormalised, which seems to be the biggest factor.
>
> Another small optimisation that seems to be just about worthwhile is
> to pull the first digit of var1 out of the main loop, so that its
> contributions can be set directly in dig[], rather than being added to
> it. This allows palloc() to be used to allocate dig[], rather than
> palloc0(), and only requires the upper part of dig[] to be initialised
> to zeros, rather than all of it.

Nice, really smart!

> Together, these seem to give a decent speed-up:
>
>  NBASE digits |  HEAD rate    | patch rate
> --------------+---------------+---------------
>             5 | 5.8797105e+06 |  6.047134e+06
>            12 |  4.140017e+06 | 4.3429845e+06
>            25 | 2.5711072e+06 | 2.7530615e+06
>            50 | 1.0367389e+06 | 1.3370771e+06
>           100 |      367924.1 |     462437.38
>           250 |      77231.32 |     104593.95
>          2500 |     881.48694 |     1197.4739
>         15000 |     25.064987 |      32.78391
>
> The largest gains are above around 50 NBASE digits, where the time
> spent renormalising dig[] becomes significant.

I added some more ndigits test cases:

/*
 * Intel Core i9-14900K
 */

 NBASE digits |   HEAD rate    |   patch rate   | relative difference
--------------+----------------+----------------+---------------------
            1 |  4.7846890e+07 |  4.7846890e+07 | 0.00%
            2 |  4.9504950e+07 |  4.7393365e+07 | -4.27%
            3 |  4.0816327e+07 |  4.0983607e+07 | 0.41%
            4 |  4.1152263e+07 |  3.9370079e+07 | -4.33%
            5 |  2.2573363e+07 |  2.1978022e+07 | -2.64%
            6 |  2.1739130e+07 |  1.9646365e+07 | -9.63%
            7 |  1.6393443e+07 |  1.6339869e+07 | -0.33%
            8 |  1.6863406e+07 |  1.6778523e+07 | -0.50%
            9 |  1.5105740e+07 |  1.6420361e+07 | 8.70%
           10 |  1.3315579e+07 |  1.5527950e+07 | 16.61%
           11 |  1.2360939e+07 |  1.4124294e+07 | 14.27%
           12 |  1.1764706e+07 |  1.2836970e+07 | 9.11%
           13 |  1.0060362e+07 |  1.1820331e+07 | 17.49%
           14 |  9.0909091e+06 |  1.0000000e+07 | 10.00%
           15 |  7.6923077e+06 |  8.0000000e+06 | 4.00%
           16 |  9.0909091e+06 |  9.4339623e+06 | 3.77%
           17 |  7.2992701e+06 |  9.0909091e+06 | 24.55%
           18 |  7.0921986e+06 |  7.8125000e+06 | 10.16%
           19 |  6.5789474e+06 |  6.6666667e+06 | 1.33%
           20 |  6.2500000e+06 |  6.5789474e+06 | 5.26%
           21 |  5.8479532e+06 |  6.1728395e+06 | 5.56%
           22 |  5.5555556e+06 |  5.9880240e+06 | 7.78%
           24 |  5.2631579e+06 |  5.8823529e+06 | 11.76%
           25 |  5.2083333e+06 |  5.5555556e+06 | 6.67%
           26 |  4.7619048e+06 |  5.2631579e+06 | 10.53%
           27 |  4.5045045e+06 |  5.2083333e+06 | 15.63%
           28 |  4.4247788e+06 |  4.7619048e+06 | 7.62%
           29 |  4.1666667e+06 |  4.5454545e+06 | 9.09%
           30 |  4.0000000e+06 |  4.3478261e+06 | 8.70%
           31 |  3.4482759e+06 |  4.0000000e+06 | 16.00%
           32 |  3.9840637e+06 |  4.2016807e+06 | 5.46%
           50 |  2.0964361e+06 |  2.6595745e+06 | 26.86%
          100 | 666666.67      | 869565.22      | 30.43%
          250 | 142653.35      | 171526.59      | 20.24%
         2500 | 1642.04        | 2197.80        | 33.85%
        15000 | 41.67          | 52.63          | 26.32%
(36 rows)

/*
 * AMD Ryzen 9 7950X3D
 */

 NBASE digits |   HEAD rate    |   patch rate   | relative difference
--------------+----------------+----------------+---------------------
            1 |  3.6900369e+07 |  3.8022814e+07 | 3.04%
            2 |  3.4602076e+07 |  3.5714286e+07 | 3.21%
            3 |  2.8011204e+07 |  2.7777778e+07 | -0.83%
            4 |  2.7932961e+07 |  2.8328612e+07 | 1.42%
            5 |  1.6420361e+07 |  1.7123288e+07 | 4.28%
            6 |  1.4705882e+07 |  1.5313936e+07 | 4.13%
            7 |  1.3192612e+07 |  1.3888889e+07 | 5.28%
            8 |  1.2121212e+07 |  1.2919897e+07 | 6.59%
            9 |  1.1235955e+07 |  1.2135922e+07 | 8.01%
           10 |  1.0000000e+07 |  1.1312217e+07 | 13.12%
           11 |  9.0909091e+06 |  1.0000000e+07 | 10.00%
           12 |  8.1967213e+06 |  8.4033613e+06 | 2.52%
           13 |  7.2463768e+06 |  7.7519380e+06 | 6.98%
           14 |  6.7567568e+06 |  7.1428571e+06 | 5.71%
           15 |  5.5555556e+06 |  5.8823529e+06 | 5.88%
           16 |  6.3291139e+06 |  5.7803468e+06 | -8.67%
           17 |  5.8823529e+06 |  5.9880240e+06 | 1.80%
           18 |  5.5555556e+06 |  5.7142857e+06 | 2.86%
           19 |  5.2356021e+06 |  5.6179775e+06 | 7.30%
           20 |  4.9019608e+06 |  5.1020408e+06 | 4.08%
           21 |  4.5454545e+06 |  4.8543689e+06 | 6.80%
           22 |  4.1841004e+06 |  4.5871560e+06 | 9.63%
           24 |  4.4642857e+06 |  4.4052863e+06 | -1.32%
           25 |  4.1666667e+06 |  4.2194093e+06 | 1.27%
           26 |  4.0000000e+06 |  3.9525692e+06 | -1.19%
           27 |  3.8461538e+06 |  3.8022814e+06 | -1.14%
           28 |  3.9062500e+06 |  3.8759690e+06 | -0.78%
           29 |  3.7878788e+06 |  3.8022814e+06 | 0.38%
           30 |  3.3898305e+06 |  3.7174721e+06 | 9.67%
           31 |  2.7472527e+06 |  2.8571429e+06 | 4.00%
           32 |  3.0395137e+06 |  3.1446541e+06 | 3.46%
           50 |  1.7094017e+06 |  2.0576132e+06 | 20.37%
          100 | 518134.72      | 609756.10      | 17.68%
          250 | 108577.63      | 136612.02      | 25.82%
         2500 | 1264.22        | 1610.31        | 27.38%
        15000 | 34.48          | 43.48          | 26.09%
(36 rows)

/*
 * Apple M3 Max
 */

 NBASE digits |   HEAD rate    |   patch rate   | relative difference
--------------+----------------+----------------+---------------------
            1 |  4.9504950e+07 |  4.9504950e+07 | 0.00%
            2 |  6.1349693e+07 |  5.9171598e+07 | -3.55%
            3 |  5.2631579e+07 |  5.2083333e+07 | -1.04%
            4 |  4.5248869e+07 |  4.5248869e+07 | 0.00%
            5 |  2.1978022e+07 |  2.2727273e+07 | 3.41%
            6 |  1.9920319e+07 |  2.0876827e+07 | 4.80%
            7 |  1.7182131e+07 |  1.8018018e+07 | 4.86%
            8 |  1.5822785e+07 |  1.6051364e+07 | 1.44%
            9 |  1.3368984e+07 |  1.3333333e+07 | -0.27%
           10 |  1.1709602e+07 |  1.1627907e+07 | -0.70%
           11 |  1.0020040e+07 |  1.0526316e+07 | 5.05%
           12 |  9.0909091e+06 |  9.0909091e+06 | 0.00%
           13 |  8.2644628e+06 |  8.2644628e+06 | 0.00%
           14 |  7.6923077e+06 |  7.6335878e+06 | -0.76%
           15 |  7.1428571e+06 |  7.0921986e+06 | -0.71%
           16 |  6.6225166e+06 |  6.5789474e+06 | -0.66%
           17 |  5.9880240e+06 |  6.2111801e+06 | 3.73%
           18 |  5.7803468e+06 |  5.5865922e+06 | -3.35%
           19 |  5.2631579e+06 |  5.2356021e+06 | -0.52%
           20 |  4.6296296e+06 |  4.8543689e+06 | 4.85%
           21 |  4.4444444e+06 |  4.3859649e+06 | -1.32%
           22 |  4.2016807e+06 |  4.0485830e+06 | -3.64%
           24 |  3.7453184e+06 |  3.5714286e+06 | -4.64%
           25 |  3.4843206e+06 |  3.4013605e+06 | -2.38%
           26 |  3.2786885e+06 |  3.2786885e+06 | 0.00%
           27 |  3.0674847e+06 |  3.1055901e+06 | 1.24%
           28 |  2.8818444e+06 |  2.9069767e+06 | 0.87%
           29 |  2.7322404e+06 |  2.7700831e+06 | 1.39%
           30 |  2.5839793e+06 |  2.6246719e+06 | 1.57%
           31 |  2.5062657e+06 |  2.4630542e+06 | -1.72%
           32 |  4.5871560e+06 |  4.6082949e+06 | 0.46%
           50 |  1.7513135e+06 |  1.9880716e+06 | 13.52%
          100 | 714285.71      | 833333.33      | 16.67%
          250 | 124223.60      | 149925.04      | 20.69%
         2500 | 1440.92        | 1760.56        | 22.18%
        15000 | 39.53          | 48.08          | 21.63%
(36 rows)

Regards,
Joel



On Tue, Jul 9, 2024, at 16:11, Joel Jacobson wrote:
> I added some more ndigits test cases:

Oops, please ignore the previous benchmark;
I had forgotten to commit in between the measurements,
so they all ran in the same db txn,
which caused a lot of noise at low ndigits.

New benchmark:

> /*
>  * Intel Core i9-14900K
>  */

 NBASE digits |   HEAD rate    |   patch rate   | relative difference
--------------+----------------+----------------+---------------------
            1 |  5.0251256e+07 |  5.2631579e+07 | 4.74%
            2 |  4.8543689e+07 |  4.9751244e+07 | 2.49%
            3 |  4.1493776e+07 |  4.3478261e+07 | 4.78%
            4 |  4.1493776e+07 |  4.0816327e+07 | -1.63%
            5 |  2.2371365e+07 |  2.3364486e+07 | 4.44%
            6 |  2.1008403e+07 |  2.1186441e+07 | 0.85%
            7 |  1.7152659e+07 |  1.6233766e+07 | -5.36%
            8 |  1.7123288e+07 |  1.8450185e+07 | 7.75%
            9 |  1.5290520e+07 |  1.7271157e+07 | 12.95%
           10 |  1.3351135e+07 |  1.5384615e+07 | 15.23%
           11 |  1.2453300e+07 |  1.4164306e+07 | 13.74%
           12 |  1.1655012e+07 |  1.2936611e+07 | 11.00%
           13 |  1.0373444e+07 |  1.1904762e+07 | 14.76%
           14 |  9.0909091e+06 |  1.0162602e+07 | 11.79%
           15 |  7.7519380e+06 |  8.1300813e+06 | 4.88%
           16 |  9.0909091e+06 |  9.8039216e+06 | 7.84%
           17 |  7.5757576e+06 |  9.0909091e+06 | 20.00%
           18 |  7.2463768e+06 |  8.2644628e+06 | 14.05%
           19 |  6.6225166e+06 |  7.5757576e+06 | 14.39%
           20 |  6.4516129e+06 |  7.0422535e+06 | 9.15%
           21 |  6.0606061e+06 |  6.5789474e+06 | 8.55%
           22 |  5.7142857e+06 |  6.2500000e+06 | 9.38%
           24 |  5.4054054e+06 |  6.0240964e+06 | 11.45%
           25 |  5.2356021e+06 |  5.8139535e+06 | 11.05%
           26 |  5.0251256e+06 |  5.8139535e+06 | 15.70%
           27 |  4.7393365e+06 |  5.7142857e+06 | 20.57%
           28 |  4.6082949e+06 |  5.2083333e+06 | 13.02%
           29 |  4.3478261e+06 |  4.9504950e+06 | 13.86%
           30 |  4.0816327e+06 |  4.6728972e+06 | 14.49%
           31 |  3.4843206e+06 |  3.9682540e+06 | 13.89%
           32 |  4.0000000e+06 |  4.1666667e+06 | 4.17%
           50 |  2.1097046e+06 |  2.8571429e+06 | 35.43%
          100 | 680272.11      | 909090.91      | 33.64%
          250 | 141643.06      | 174216.03      | 23.00%
         2500 | 1626.02        | 2188.18        | 34.57%
        15000 | 41.67          | 52.63          | 26.32%
(36 rows)


> /*
>  * AMD Ryzen 9 7950X3D
>  */

 NBASE digits |   HEAD rate    |   patch rate   | relative difference
--------------+----------------+----------------+---------------------
            1 |  3.7037037e+07 |  3.8910506e+07 | 5.06%
            2 |  3.5587189e+07 |  3.5971223e+07 | 1.08%
            3 |  3.0581040e+07 |  2.9239766e+07 | -4.39%
            4 |  2.7322404e+07 |  3.0303030e+07 | 10.91%
            5 |  1.8050542e+07 |  1.9011407e+07 | 5.32%
            6 |  1.5974441e+07 |  1.6233766e+07 | 1.62%
            7 |  1.3106160e+07 |  1.3071895e+07 | -0.26%
            8 |  1.2285012e+07 |  1.3106160e+07 | 6.68%
            9 |  1.1534025e+07 |  1.2269939e+07 | 6.38%
           10 |  1.1135857e+07 |  1.1507480e+07 | 3.34%
           11 |  9.7943193e+06 |  1.0976948e+07 | 12.07%
           12 |  9.5238095e+06 |  1.0256410e+07 | 7.69%
           13 |  8.6206897e+06 |  8.7719298e+06 | 1.75%
           14 |  7.3529412e+06 |  8.1967213e+06 | 11.48%
           15 |  6.2893082e+06 |  6.7114094e+06 | 6.71%
           16 |  7.2463768e+06 |  7.0422535e+06 | -2.82%
           17 |  6.2893082e+06 |  7.2463768e+06 | 15.22%
           18 |  6.3694268e+06 |  7.4626866e+06 | 17.16%
           19 |  5.6818182e+06 |  6.6225166e+06 | 16.56%
           20 |  5.2083333e+06 |  6.1728395e+06 | 18.52%
           21 |  5.0251256e+06 |  5.7471264e+06 | 14.37%
           22 |  4.5248869e+06 |  5.1282051e+06 | 13.33%
           24 |  4.9261084e+06 |  5.1020408e+06 | 3.57%
           25 |  4.6511628e+06 |  4.9504950e+06 | 6.44%
           26 |  4.2553191e+06 |  4.6082949e+06 | 8.29%
           27 |  3.9682540e+06 |  4.2918455e+06 | 8.15%
           28 |  3.8910506e+06 |  4.1322314e+06 | 6.20%
           29 |  3.8167939e+06 |  3.7593985e+06 | -1.50%
           30 |  3.5842294e+06 |  3.6101083e+06 | 0.72%
           31 |  3.1948882e+06 |  3.1645570e+06 | -0.95%
           32 |  3.4722222e+06 |  3.7174721e+06 | 7.06%
           50 |  1.6474465e+06 |  2.1691974e+06 | 31.67%
          100 | 555555.56      | 653594.77      | 17.65%
          250 | 109409.19      | 140449.44      | 28.37%
         2500 | 1236.09        | 1555.21        | 25.82%
        15000 | 34.48          | 43.48          | 26.09%
(36 rows)

> /*
>  * Apple M3 Max
>  */

 NBASE digits |   HEAD rate    |   patch rate   | relative difference
--------------+----------------+----------------+---------------------
            1 |  4.7169811e+07 |  4.7619048e+07 | 0.95%
            2 |  6.0240964e+07 |  5.8479532e+07 | -2.92%
            3 |  5.2083333e+07 |  5.3191489e+07 | 2.13%
            4 |  4.5871560e+07 |  4.6948357e+07 | 2.35%
            5 |  2.2075055e+07 |  2.3529412e+07 | 6.59%
            6 |  2.0080321e+07 |  2.1505376e+07 | 7.10%
            7 |  1.7301038e+07 |  1.8975332e+07 | 9.68%
            8 |  1.6025641e+07 |  1.6556291e+07 | 3.31%
            9 |  1.3245033e+07 |  1.3717421e+07 | 3.57%
           10 |  1.1709602e+07 |  1.2315271e+07 | 5.17%
           11 |  1.0000000e+07 |  1.0989011e+07 | 9.89%
           12 |  9.0909091e+06 |  9.7276265e+06 | 7.00%
           13 |  8.3333333e+06 |  9.0090090e+06 | 8.11%
           14 |  7.6923077e+06 |  8.0645161e+06 | 4.84%
           15 |  7.0921986e+06 |  7.5187970e+06 | 6.02%
           16 |  6.6666667e+06 |  7.0921986e+06 | 6.38%
           17 |  6.2111801e+06 |  6.3694268e+06 | 2.55%
           18 |  5.7803468e+06 |  5.9523810e+06 | 2.98%
           19 |  5.2910053e+06 |  5.4347826e+06 | 2.72%
           20 |  4.7846890e+06 |  5.0505051e+06 | 5.56%
           21 |  4.5454545e+06 |  4.6728972e+06 | 2.80%
           22 |  4.2372881e+06 |  4.3859649e+06 | 3.51%
           24 |  3.7174721e+06 |  3.8759690e+06 | 4.26%
           25 |  3.4722222e+06 |  3.6231884e+06 | 4.35%
           26 |  3.2894737e+06 |  3.3898305e+06 | 3.05%
           27 |  3.0674847e+06 |  3.1847134e+06 | 3.82%
           28 |  2.9239766e+06 |  3.0120482e+06 | 3.01%
           29 |  2.7548209e+06 |  2.8901734e+06 | 4.91%
           30 |  2.6041667e+06 |  2.7322404e+06 | 4.92%
           31 |  2.5000000e+06 |  2.5773196e+06 | 3.09%
           32 |  4.6082949e+06 |  4.7846890e+06 | 3.83%
           50 |  1.7241379e+06 |  2.0703934e+06 | 20.08%
          100 | 719424.46      | 869565.22      | 20.87%
          250 | 124688.28      | 157977.88      | 26.70%
         2500 | 1455.60        | 1811.59        | 24.46%
        15000 | 40.00          | 50.00          | 25.00%
(36 rows)

Regards,
Joel



On Tue, Jul 9, 2024, at 14:01, Dean Rasheed wrote:
> One thing I noticed while testing the earlier patches on this thread
> was that they were significantly faster if they used unsigned integers
> rather than signed integers. I think the reason is that operations
> like "x / 10000" and "x % 10000" use fewer CPU instructions (on every
> platform, according to godbolt.org) if x is unsigned.
>
> In addition, this reduces the number of times the digit array needs to
> be renormalised, which seems to be the biggest factor.
>
> Another small optimisation that seems to be just about worthwhile is
> to pull the first digit of var1 out of the main loop, so that its
> contributions can be set directly in dig[], rather than being added to
> it. This allows palloc() to be used to allocate dig[], rather than
> palloc0(), and only requires the upper part of dig[] to be initialised
> to zeros, rather than all of it.
>
> Together, these seem to give a decent speed-up:
..
> Attachments:
> * optimise-mul_var.patch

I've reviewed the patch now.
The code is straightforward, and the comments are easy to understand.
LGTM.

Regards,
Joel