Обсуждение: Speedup usages of pg_*toa() functions

Поиск
Список
Период
Сортировка

Speedup usages of pg_*toa() functions

От
David Rowley
Дата:
Hi,

pg_itoa, pg_ltoa and pg_lltoa all have access to the length of the
string that is produced in the function by way of the "len" variable.
These functions don't have a great deal of use in core, but it seems
that most callers do require the len but end up getting it via
strlen(). It seems we could optimise this a little if we just had the
functions return the length instead of making callers do the work
themselves.

This allows us to speed up a few cases. int2vectorout() should be
faster and int8out() becomes a bit faster if we get rid of the
strdup() call and replace it with a palloc()/memcpy() call.

The slight drawback that I can see from this is that on testing
int4out() it gets slightly slower, which I assume is because I'm now
returning the length, but there's no use for it in that function.

create table bi (a bigint);
insert into bi select generate_Series(1,10000000);
vacuum freeze analyze bi;

bench.sql = copy bi to '/dev/null';

BIGINT test

drowley@amd3990x:~$ pgbench -n -f bench.sql -T 120 postgres

Master: latency average = 1791.597 ms
Patched: latency average = 1705.322 ms (95.184%)

INT test

create table i (a int);
insert into i select generate_Series(1,10000000);
vacuum freeze analyze i;

bench.sql = copy i to '/dev/null';

drowley@amd3990x:~$ pgbench -n -f bench.sql -T 120 postgres

Master: latency average = 1631.956 ms
Patched: latency average = 1678.626 ms (102.859%)

As you can see, this squeezes about 5% extra out of a copy of a 10
million row bigint table but costs us almost 3% on an equivalent int
table.  A likely workaround for that is moving the functions into the
header file and making them static inline. It would be nice not to
have to do that though.

These tests were done on modern AMD hardware. I've not tested yet on
anything intel based.

I've copied in Andrew as I know he only recently rewrote these
functions and Andres since he did mention this in [1].

I'm interested to know if that int4out regression exists on other hardware.

David

[1] https://www.postgresql.org/message-id/20190920051857.2fhnvhvx4qdddviz@alap3.anarazel.de

Вложения

Re: Speedup usages of pg_*toa() functions

От
Andrew Gierth
Дата:
>>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes:

 David> As you can see, this squeezes about 5% extra out of a copy of a
 David> 10 million row bigint table but costs us almost 3% on an
 David> equivalent int table.

And once again I have to issue the reminder: you can have gains or
losses of several percent on microbenchmarks of this kind just by
touching unrelated pieces of code that are never used in the test. In
order to demonstrate a consistent difference, you have to do each set of
tests multiple times, with random amounts of padding added to some
unrelated part of the code.

-- 
Andrew (irc:RhodiumToad)



Re: Speedup usages of pg_*toa() functions

От
David Rowley
Дата:
On Tue, 9 Jun 2020 at 19:24, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>
> >>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes:
>
>  David> As you can see, this squeezes about 5% extra out of a copy of a
>  David> 10 million row bigint table but costs us almost 3% on an
>  David> equivalent int table.
>
> And once again I have to issue the reminder: you can have gains or
> losses of several percent on microbenchmarks of this kind just by
> touching unrelated pieces of code that are never used in the test. In
> order to demonstrate a consistent difference, you have to do each set of
> tests multiple times, with random amounts of padding added to some
> unrelated part of the code.

Thanks for the reminder.

Instead of that, I tried with clang 10.0.0.  I was previously using gcc 9.3.

BIGINT test

Master: latency average = 1842.182 ms
Patched: latency average = 1715.418 ms

INT test

Master: latency average = 1650.583 ms
Patched: latency average = 1617.783 ms

There's nothing in the patch that makes the INT test faster, so I
guess that's noise.  The BIGINT test is about 7.3% faster in this
case.

David



Re: Speedup usages of pg_*toa() functions

От
Andrew Gierth
Дата:
>>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes:

 David> This allows us to speed up a few cases. int2vectorout() should
 David> be faster and int8out() becomes a bit faster if we get rid of
 David> the strdup() call and replace it with a palloc()/memcpy() call.

What about removing the memcpy entirely? I don't think we save anything
much useful here by pallocing the exact length, rather than doing what
int4out does and palloc a fixed size and convert the int directly into
it.

i.e.

Datum
int8out(PG_FUNCTION_ARGS)
{
    int64       val = PG_GETARG_INT64(0);
    char       *result = palloc(MAXINT8LEN + 1);

    pg_lltoa(val, result);
    PG_RETURN_CSTRING(result);
}

For pg_ltoa, etc., I don't like adding the extra call to pg_ultoa_n - at
least on my clang, that results in two copies of pg_ultoa_n inlined.
How about doing it like,

int
pg_lltoa(int64 value, char *a)
{
    int         len = 0;
    uint64      uvalue = value;

    if (value < 0)
    {
        uvalue = (uint64) 0 - uvalue;
        a[len++] = '-';
    }
    len += pg_ulltoa_n(uvalue, a + len);
    a[len] = '\0';
    return len;
}

-- 
Andrew (irc:RhodiumToad)



Re: Speedup usages of pg_*toa() functions

От
David Rowley
Дата:
On Tue, 9 Jun 2020 at 22:08, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>
> >>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes:
>
>  David> This allows us to speed up a few cases. int2vectorout() should
>  David> be faster and int8out() becomes a bit faster if we get rid of
>  David> the strdup() call and replace it with a palloc()/memcpy() call.
>
> What about removing the memcpy entirely? I don't think we save anything
> much useful here by pallocing the exact length, rather than doing what
> int4out does and palloc a fixed size and convert the int directly into
> it.

On looking back through git blame, it seems int2out and int4out have
been that way since at least 1996, before int8.c existed. int8out has
been doing it since fa838876e9f -- Include 8-byte integer type. dated
1998.  Quite likely the larger than required palloc size back then was
more of a concern. So perhaps you're right about just doing it that
way instead. With that and the ints I tested with, the int8
performance should be about aligned to int4 performance.

> For pg_ltoa, etc., I don't like adding the extra call to pg_ultoa_n - at
> least on my clang, that results in two copies of pg_ultoa_n inlined.
> How about doing it like,
>
> int
> pg_lltoa(int64 value, char *a)
> {
>     int         len = 0;
>     uint64      uvalue = value;
>
>     if (value < 0)
>     {
>         uvalue = (uint64) 0 - uvalue;
>         a[len++] = '-';
>     }
>     len += pg_ulltoa_n(uvalue, a + len);
>     a[len] = '\0';
>     return len;
> }

Agreed, that seems better.

David



Re: Speedup usages of pg_*toa() functions

От
Ranier Vilela
Дата:
Em ter., 9 de jun. de 2020 às 07:55, David Rowley <dgrowleyml@gmail.com> escreveu:
On Tue, 9 Jun 2020 at 22:08, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>
> >>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes:
>
>  David> This allows us to speed up a few cases. int2vectorout() should
>  David> be faster and int8out() becomes a bit faster if we get rid of
>  David> the strdup() call and replace it with a palloc()/memcpy() call.
>
> What about removing the memcpy entirely? I don't think we save anything
> much useful here by pallocing the exact length, rather than doing what
> int4out does and palloc a fixed size and convert the int directly into
> it.

On looking back through git blame, it seems int2out and int4out have
been that way since at least 1996, before int8.c existed. int8out has
been doing it since fa838876e9f -- Include 8-byte integer type. dated
1998.  Quite likely the larger than required palloc size back then was
more of a concern. So perhaps you're right about just doing it that
way instead. With that and the ints I tested with, the int8
performance should be about aligned to int4 performance.

> For pg_ltoa, etc., I don't like adding the extra call to pg_ultoa_n - at
> least on my clang, that results in two copies of pg_ultoa_n inlined.
> How about doing it like,
>
> int
> pg_lltoa(int64 value, char *a)
> {
>     int         len = 0;
>     uint64      uvalue = value;
>
>     if (value < 0)
>     {
>         uvalue = (uint64) 0 - uvalue;
>         a[len++] = '-';
>     }
>     len += pg_ulltoa_n(uvalue, a + len);
>     a[len] = '\0';
>     return len;
> }

Written like that, wouldn't it get better?

int
pg_lltoa(int64 value, char *a)
{
    if (value < 0)
    {
        int         len = 0;
        uint64      uvalue = (uint64) 0 - uvalue;

        a[len++] = '-';
        len += pg_ulltoa_n(uvalue, a + len);
        a[len] = '\0';
        return len;
    }
else
        return pg_ulltoa_n(value, a);
}
 
regards,
Ranier Vilela

Re: Speedup usages of pg_*toa() functions

От
Andrew Gierth
Дата:
>>>>> "Ranier" == Ranier Vilela <ranier.vf@gmail.com> writes:

 Ranier> Written like that, wouldn't it get better?

 Ranier> int
 Ranier> pg_lltoa(int64 value, char *a)
 Ranier> {
 Ranier>     if (value < 0)
 Ranier>     {
 Ranier>         int         len = 0;
 Ranier>         uint64      uvalue = (uint64) 0 - uvalue;
 Ranier>         a[len++] = '-';
 Ranier>         len += pg_ulltoa_n(uvalue, a + len);
 Ranier>         a[len] = '\0';
 Ranier>         return len;
 Ranier>     }
 Ranier>     else
 Ranier>         return pg_ulltoa_n(value, a);
 Ranier> }

No. While it doesn't matter so much for pg_lltoa since that's unlikely
to inline multiple pg_ulltoa_n calls, if you do pg_ltoa like this it (a)
ends up with two copies of pg_ultoa_n inlined into it, and (b) you don't
actually save any useful amount of time. Your version is also failing to
add the terminating '\0' for the positive case and has other obvious
bugs.

-- 
Andrew (irc:RhodiumToad)



Re: Speedup usages of pg_*toa() functions

От
Ranier Vilela
Дата:
Em ter., 9 de jun. de 2020 às 13:01, Andrew Gierth <andrew@tao11.riddles.org.uk> escreveu:
>>>>> "Ranier" == Ranier Vilela <ranier.vf@gmail.com> writes:

 Ranier> Written like that, wouldn't it get better?

 Ranier> int
 Ranier> pg_lltoa(int64 value, char *a)
 Ranier> {
 Ranier>     if (value < 0)
 Ranier>     {
 Ranier>         int         len = 0;
 Ranier>         uint64      uvalue = (uint64) 0 - uvalue;
 Ranier>         a[len++] = '-';
 Ranier>         len += pg_ulltoa_n(uvalue, a + len);
 Ranier>         a[len] = '\0';
 Ranier>         return len;
 Ranier>     }
 Ranier>     else
 Ranier>         return pg_ulltoa_n(value, a);
 Ranier> }

No. While it doesn't matter so much for pg_lltoa since that's unlikely
to inline multiple pg_ulltoa_n calls, if you do pg_ltoa like this it (a)
ends up with two copies of pg_ultoa_n inlined into it, and (b) you don't
actually save any useful amount of time. Your version is also failing to
add the terminating '\0' for the positive case and has other obvious
bugs.
(a) Sorry, I'm not asm specialist.

#include <stdio.h>

int pg_utoa(unsigned int num, char * a) {
    int len;

    len = sprintf(a, "%lu", num);

    return len;
}

int pg_toa(int num, char * a)
{
    if (num < 0) {
       int len;
       len = pg_utoa(num, a);
       a[len] = '\0';
       return len;
    }
    else
       return pg_utoa(num, a);
}
 

.LC0:
        .string "%lu"
pg_utoa(unsigned int, char*):
        mov     edxedi
        xor     eaxeax
        mov     rdirsi
        mov     esiOFFSET FLAT:.LC0
        jmp     sprintf
pg_toa(int, char*):
        push    rbp
        test    ediedi
        mov     rbprsi
        mov     edxedi
        mov     esiOFFSET FLAT:.LC0
        mov     rdirbp
        mov     eax0
        js      .L7
        pop     rbp
        jmp     sprintf
.L7:
        call    sprintf
        movsx   rdxeax
        mov     BYTE PTR [rbp+0+rdx]0
        pop     rbp
        ret

Where " ends up with two copies of pg_ultoa_n inlined into it", in this simplified example?

(b) I call this tail cut, I believe it saves time, for sure.

Regarding bugs:

(c) your version don't check size of a var, when pg_ulltoa_n
warning about "least MAXINT8LEN bytes."

So in theory, I could blow it up, by calling pg_lltoa.

(d) So I can't trust pg_ulltoa_n, when var a, is it big enough?
If not, there are more bugs.

regards,
Ranier Vilela

Re: Speedup usages of pg_*toa() functions

От
Andrew Gierth
Дата:
>>>>> "Ranier" == Ranier Vilela <ranier.vf@gmail.com> writes:

 Ranier> Where " ends up with two copies of pg_ultoa_n inlined into it",
 Ranier> in this simplified example?

The two references to sprintf are both inlined copies of your pg_utoa.

 Ranier> (b) I call this tail cut, I believe it saves time, for sure.

You seem to have missed the point that the pg_ultoa_n / pg_ulltoa_n
functions DO NOT ADD A TRAILING NUL. Which means that pg_ltoa / pg_lltoa
can't just tail call them, since they must add the NUL after.

 Ranier> Regarding bugs:

 Ranier> (c) your version don't check size of a var, when pg_ulltoa_n
 Ranier> warning about "least MAXINT8LEN bytes."

 Ranier> So in theory, I could blow it up, by calling pg_lltoa.

No. Callers of pg_lltoa are required to provide a buffer of at least
MAXINT8LEN+1 bytes.

-- 
Andrew.



Re: Speedup usages of pg_*toa() functions

От
Ranier Vilela
Дата:
Em ter., 9 de jun. de 2020 às 15:53, Andrew Gierth <andrew@tao11.riddles.org.uk> escreveu:
>>>>> "Ranier" == Ranier Vilela <ranier.vf@gmail.com> writes:

 Ranier> Where " ends up with two copies of pg_ultoa_n inlined into it",
 Ranier> in this simplified example?

The two references to sprintf are both inlined copies of your pg_utoa.

 Ranier> (b) I call this tail cut, I believe it saves time, for sure.

You seem to have missed the point that the pg_ultoa_n / pg_ulltoa_n
functions DO NOT ADD A TRAILING NUL. Which means that pg_ltoa / pg_lltoa
can't just tail call them, since they must add the NUL after.

 Ranier> Regarding bugs:

 Ranier> (c) your version don't check size of a var, when pg_ulltoa_n
 Ranier> warning about "least MAXINT8LEN bytes."

 Ranier> So in theory, I could blow it up, by calling pg_lltoa.

No. Callers of pg_lltoa are required to provide a buffer of at least
MAXINT8LEN+1 bytes.
Thanks for explanations.

So I would change, just the initialization (var uvalue), even though it is irrelevant.

int
pg_lltoa(int64 value, char *a)
{
int len = 0;
uint64 uvalue;

if (value < 0)
{
uvalue = (uint64) 0 - uvalue;
        a[len++] = '-';
}
else
uvalue = value;

len += pg_ulltoa_n(uvalue, a + len);
a[len] = '\0';

return len;
}
 
regards,
Ranier Vilela

Re: Speedup usages of pg_*toa() functions

От
Andrew Gierth
Дата:
>>>>> "Ranier" == Ranier Vilela <ranier.vf@gmail.com> writes:

 Ranier> So I would change, just the initialization (var uvalue), even though it is
 Ranier> irrelevant.

 Ranier> int
 Ranier> pg_lltoa(int64 value, char *a)
 Ranier> {
 Ranier> int len = 0;
 Ranier> uint64 uvalue;

 Ranier> if (value < 0)
 Ranier> {
 Ranier> uvalue = (uint64) 0 - uvalue;

Use of uninitialized variable.

-- 
Andrew (irc:RhodiumToad)



Re: Speedup usages of pg_*toa() functions

От
Ranier Vilela
Дата:
Em ter., 9 de jun. de 2020 às 17:42, Andrew Gierth <andrew@tao11.riddles.org.uk> escreveu:
>>>>> "Ranier" == Ranier Vilela <ranier.vf@gmail.com> writes:

 Ranier> So I would change, just the initialization (var uvalue), even though it is
 Ranier> irrelevant.

 Ranier> int
 Ranier> pg_lltoa(int64 value, char *a)
 Ranier> {
 Ranier> int len = 0;
 Ranier> uint64 uvalue;

 Ranier> if (value < 0)
 Ranier> {
 Ranier> uvalue = (uint64) 0 - uvalue;

Use of uninitialized variable.
Sorry, my mistake.

uvalue = (uint64) 0 - value;

regards,
Ranier Vilela
 

--
Andrew (irc:RhodiumToad)

Re: Speedup usages of pg_*toa() functions

От
David Rowley
Дата:
On Tue, 9 Jun 2020 at 22:08, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>
> >>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes:
>
>  David> This allows us to speed up a few cases. int2vectorout() should
>  David> be faster and int8out() becomes a bit faster if we get rid of
>  David> the strdup() call and replace it with a palloc()/memcpy() call.
>
> What about removing the memcpy entirely? I don't think we save anything
> much useful here by pallocing the exact length, rather than doing what
> int4out does and palloc a fixed size and convert the int directly into
> it.

The attached 0001 patch does this.

create table bi (a bigint);
insert into bi select generate_Series(1,10000000);
vacuum freeze analyze bi;

query = copy bi to '/dev/null';
120 second pgbench run.

The results are:

GCC master: latency average = 1757.556 ms
GCC master+0001: latency average = 1588.793 ms (90.4%)

clang master: latency average = 1818.952 ms
clang master+0001: latency average = 1649.100 ms (90.6%)


> For pg_ltoa, etc., I don't like adding the extra call to pg_ultoa_n - at
> least on my clang, that results in two copies of pg_ultoa_n inlined.
> How about doing it like,
>
> int
> pg_lltoa(int64 value, char *a)
> {
>     int         len = 0;
>     uint64      uvalue = value;
>
>     if (value < 0)
>     {
>         uvalue = (uint64) 0 - uvalue;
>         a[len++] = '-';
>     }
>     len += pg_ulltoa_n(uvalue, a + len);
>     a[len] = '\0';
>     return len;
> }

The 0002 patch does it this way.

David

Вложения

Re: Speedup usages of pg_*toa() functions

От
Andrew Gierth
Дата:
>>>>> "Ranier" == Ranier Vilela <ranier.vf@gmail.com> writes:

 Ranier> Sorry, my mistake.

 Ranier> uvalue = (uint64) 0 - value;

This doesn't gain anything over the original, and it has the downside of
hiding an int64 to uint64 conversion that is actually quite sensitive.
For example, it might tempt someone to rewrite it as

    uvalue = -value;

which is actually incorrect (though our -fwrapv will hide the error).

-- 
Andrew (irc:RhodiumToad)



Re: Speedup usages of pg_*toa() functions

От
David Rowley
Дата:
On Wed, 10 Jun 2020 at 11:57, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Tue, 9 Jun 2020 at 22:08, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> >
> > >>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes:
> >
> >  David> This allows us to speed up a few cases. int2vectorout() should
> >  David> be faster and int8out() becomes a bit faster if we get rid of
> >  David> the strdup() call and replace it with a palloc()/memcpy() call.
> >
> > What about removing the memcpy entirely? I don't think we save anything
> > much useful here by pallocing the exact length, rather than doing what
> > int4out does and palloc a fixed size and convert the int directly into
> > it.
>
> The attached 0001 patch does this.

Pending any objections, I'd like to push both of these patches in the
next few days to master.

Anyone object to changing the signature of these functions in 0002, or
have concerns about allocating the maximum memory that we might
require in int8out()?

David



Re: Speedup usages of pg_*toa() functions

От
Andrew Gierth
Дата:
>>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes:

 David> Pending any objections, I'd like to push both of these patches
 David> in the next few days to master.

For the second patch, can we take the opportunity to remove the
extraneous blank line at the top of pg_ltoa, and add the two missing
"extern"s in builtins.h for pg_ultoa_n and pg_ulltoa_n ?

 David> Anyone object to changing the signature of these functions in
 David> 0002, or have concerns about allocating the maximum memory that
 David> we might require in int8out()?

Changing the function signatures seems safe enough. The memory thing
only seems likely to be an issue if you allocate a lot of text strings
for bigint values without a context reset, and I'm not sure where that
would happen (maybe passing large bigint arrays to pl/perl or pl/python
would do it?)

-- 
Andrew (irc:RhodiumToad)



Re: Speedup usages of pg_*toa() functions

От
David Rowley
Дата:
On Thu, 11 Jun 2020 at 18:52, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> For the second patch, can we take the opportunity to remove the
> extraneous blank line at the top of pg_ltoa, and add the two missing
> "extern"s in builtins.h for pg_ultoa_n and pg_ulltoa_n ?

I think since we've branched for PG14 now that fixing those should be
backpatched to PG13.

David



Re: Speedup usages of pg_*toa() functions

От
David Rowley
Дата:
On Thu, 11 Jun 2020 at 18:52, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>
> >>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes:
>
>  David> Pending any objections, I'd like to push both of these patches
>  David> in the next few days to master.
>
> For the second patch, can we take the opportunity to remove the
> extraneous blank line at the top of pg_ltoa, and add the two missing
> "extern"s in builtins.h for pg_ultoa_n and pg_ulltoa_n ?
>
>  David> Anyone object to changing the signature of these functions in
>  David> 0002, or have concerns about allocating the maximum memory that
>  David> we might require in int8out()?
>
> Changing the function signatures seems safe enough. The memory thing
> only seems likely to be an issue if you allocate a lot of text strings
> for bigint values without a context reset, and I'm not sure where that
> would happen (maybe passing large bigint arrays to pl/perl or pl/python
> would do it?)

I ended up chickening out of doing the larger allocation
unconditionally. Instead, I pushed the original idea of doing the
palloc/memcpy of the length returned by pg_lltoa.  That gets us most
of the gains without the change in memory usage behaviour.

Thanks for your reviews on this.

David



Re: Speedup usages of pg_*toa() functions

От
John Naylor
Дата:
On Sat, Jun 13, 2020 at 8:36 AM David Rowley <dgrowleyml@gmail.com> wrote:
> I ended up chickening out of doing the larger allocation
> unconditionally. Instead, I pushed the original idea of doing the
> palloc/memcpy of the length returned by pg_lltoa.  That gets us most
> of the gains without the change in memory usage behaviour.

This was still marked as needing review in commitfest, so I marked it
as committed.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services