Обсуждение: Speedup usages of pg_*toa() functions
Hi, pg_itoa, pg_ltoa and pg_lltoa all have access to the length of the string that is produced in the function by way of the "len" variable. These functions don't have a great deal of use in core, but it seems that most callers do require the len but end up getting it via strlen(). It seems we could optimise this a little if we just had the functions return the length instead of making callers do the work themselves. This allows us to speed up a few cases. int2vectorout() should be faster and int8out() becomes a bit faster if we get rid of the strdup() call and replace it with a palloc()/memcpy() call. The slight drawback that I can see from this is that on testing int4out() it gets slightly slower, which I assume is because I'm now returning the length, but there's no use for it in that function. create table bi (a bigint); insert into bi select generate_Series(1,10000000); vacuum freeze analyze bi; bench.sql = copy bi to '/dev/null'; BIGINT test drowley@amd3990x:~$ pgbench -n -f bench.sql -T 120 postgres Master: latency average = 1791.597 ms Patched: latency average = 1705.322 ms (95.184%) INT test create table i (a int); insert into i select generate_Series(1,10000000); vacuum freeze analyze i; bench.sql = copy i to '/dev/null'; drowley@amd3990x:~$ pgbench -n -f bench.sql -T 120 postgres Master: latency average = 1631.956 ms Patched: latency average = 1678.626 ms (102.859%) As you can see, this squeezes about 5% extra out of a copy of a 10 million row bigint table but costs us almost 3% on an equivalent int table. A likely workaround for that is moving the functions into the header file and making them static inline. It would be nice not to have to do that though. These tests were done on modern AMD hardware. I've not tested yet on anything intel based. I've copied in Andrew as I know he only recently rewrote these functions and Andres since he did mention this in [1]. I'm interested to know if that int4out regression exists on other hardware. David [1] https://www.postgresql.org/message-id/20190920051857.2fhnvhvx4qdddviz@alap3.anarazel.de
Вложения
>>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes: David> As you can see, this squeezes about 5% extra out of a copy of a David> 10 million row bigint table but costs us almost 3% on an David> equivalent int table. And once again I have to issue the reminder: you can have gains or losses of several percent on microbenchmarks of this kind just by touching unrelated pieces of code that are never used in the test. In order to demonstrate a consistent difference, you have to do each set of tests multiple times, with random amounts of padding added to some unrelated part of the code. -- Andrew (irc:RhodiumToad)
On Tue, 9 Jun 2020 at 19:24, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote: > > >>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes: > > David> As you can see, this squeezes about 5% extra out of a copy of a > David> 10 million row bigint table but costs us almost 3% on an > David> equivalent int table. > > And once again I have to issue the reminder: you can have gains or > losses of several percent on microbenchmarks of this kind just by > touching unrelated pieces of code that are never used in the test. In > order to demonstrate a consistent difference, you have to do each set of > tests multiple times, with random amounts of padding added to some > unrelated part of the code. Thanks for the reminder. Instead of that, I tried with clang 10.0.0. I was previously using gcc 9.3. BIGINT test Master: latency average = 1842.182 ms Patched: latency average = 1715.418 ms INT test Master: latency average = 1650.583 ms Patched: latency average = 1617.783 ms There's nothing in the patch that makes the INT test faster, so I guess that's noise. The BIGINT test is about 7.3% faster in this case. David
>>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes: David> This allows us to speed up a few cases. int2vectorout() should David> be faster and int8out() becomes a bit faster if we get rid of David> the strdup() call and replace it with a palloc()/memcpy() call. What about removing the memcpy entirely? I don't think we save anything much useful here by pallocing the exact length, rather than doing what int4out does and palloc a fixed size and convert the int directly into it. i.e. Datum int8out(PG_FUNCTION_ARGS) { int64 val = PG_GETARG_INT64(0); char *result = palloc(MAXINT8LEN + 1); pg_lltoa(val, result); PG_RETURN_CSTRING(result); } For pg_ltoa, etc., I don't like adding the extra call to pg_ultoa_n - at least on my clang, that results in two copies of pg_ultoa_n inlined. How about doing it like, int pg_lltoa(int64 value, char *a) { int len = 0; uint64 uvalue = value; if (value < 0) { uvalue = (uint64) 0 - uvalue; a[len++] = '-'; } len += pg_ulltoa_n(uvalue, a + len); a[len] = '\0'; return len; } -- Andrew (irc:RhodiumToad)
On Tue, 9 Jun 2020 at 22:08, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote: > > >>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes: > > David> This allows us to speed up a few cases. int2vectorout() should > David> be faster and int8out() becomes a bit faster if we get rid of > David> the strdup() call and replace it with a palloc()/memcpy() call. > > What about removing the memcpy entirely? I don't think we save anything > much useful here by pallocing the exact length, rather than doing what > int4out does and palloc a fixed size and convert the int directly into > it. On looking back through git blame, it seems int2out and int4out have been that way since at least 1996, before int8.c existed. int8out has been doing it since fa838876e9f -- Include 8-byte integer type. dated 1998. Quite likely the larger than required palloc size back then was more of a concern. So perhaps you're right about just doing it that way instead. With that and the ints I tested with, the int8 performance should be about aligned to int4 performance. > For pg_ltoa, etc., I don't like adding the extra call to pg_ultoa_n - at > least on my clang, that results in two copies of pg_ultoa_n inlined. > How about doing it like, > > int > pg_lltoa(int64 value, char *a) > { > int len = 0; > uint64 uvalue = value; > > if (value < 0) > { > uvalue = (uint64) 0 - uvalue; > a[len++] = '-'; > } > len += pg_ulltoa_n(uvalue, a + len); > a[len] = '\0'; > return len; > } Agreed, that seems better. David
Em ter., 9 de jun. de 2020 às 07:55, David Rowley <dgrowleyml@gmail.com> escreveu:
On Tue, 9 Jun 2020 at 22:08, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>
> >>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes:
>
> David> This allows us to speed up a few cases. int2vectorout() should
> David> be faster and int8out() becomes a bit faster if we get rid of
> David> the strdup() call and replace it with a palloc()/memcpy() call.
>
> What about removing the memcpy entirely? I don't think we save anything
> much useful here by pallocing the exact length, rather than doing what
> int4out does and palloc a fixed size and convert the int directly into
> it.
On looking back through git blame, it seems int2out and int4out have
been that way since at least 1996, before int8.c existed. int8out has
been doing it since fa838876e9f -- Include 8-byte integer type. dated
1998. Quite likely the larger than required palloc size back then was
more of a concern. So perhaps you're right about just doing it that
way instead. With that and the ints I tested with, the int8
performance should be about aligned to int4 performance.
> For pg_ltoa, etc., I don't like adding the extra call to pg_ultoa_n - at
> least on my clang, that results in two copies of pg_ultoa_n inlined.
> How about doing it like,
>
> int
> pg_lltoa(int64 value, char *a)
> {
> int len = 0;
> uint64 uvalue = value;
>
> if (value < 0)
> {
> uvalue = (uint64) 0 - uvalue;
> a[len++] = '-';
> }
> len += pg_ulltoa_n(uvalue, a + len);
> a[len] = '\0';
> return len;
> }
Written like that, wouldn't it get better?
int
pg_lltoa(int64 value, char *a)
{
if (value < 0)
{
int len = 0;
uint64 uvalue = (uint64) 0 - uvalue;
a[len++] = '-';
len += pg_ulltoa_n(uvalue, a + len);
a[len] = '\0';
return len;
}
else
return pg_ulltoa_n(value, a);
}
pg_lltoa(int64 value, char *a)
{
if (value < 0)
{
int len = 0;
uint64 uvalue = (uint64) 0 - uvalue;
a[len++] = '-';
len += pg_ulltoa_n(uvalue, a + len);
a[len] = '\0';
return len;
}
else
return pg_ulltoa_n(value, a);
}
regards,
Ranier Vilela
>>>>> "Ranier" == Ranier Vilela <ranier.vf@gmail.com> writes: Ranier> Written like that, wouldn't it get better? Ranier> int Ranier> pg_lltoa(int64 value, char *a) Ranier> { Ranier> if (value < 0) Ranier> { Ranier> int len = 0; Ranier> uint64 uvalue = (uint64) 0 - uvalue; Ranier> a[len++] = '-'; Ranier> len += pg_ulltoa_n(uvalue, a + len); Ranier> a[len] = '\0'; Ranier> return len; Ranier> } Ranier> else Ranier> return pg_ulltoa_n(value, a); Ranier> } No. While it doesn't matter so much for pg_lltoa since that's unlikely to inline multiple pg_ulltoa_n calls, if you do pg_ltoa like this it (a) ends up with two copies of pg_ultoa_n inlined into it, and (b) you don't actually save any useful amount of time. Your version is also failing to add the terminating '\0' for the positive case and has other obvious bugs. -- Andrew (irc:RhodiumToad)
Em ter., 9 de jun. de 2020 às 13:01, Andrew Gierth <andrew@tao11.riddles.org.uk> escreveu:
>>>>> "Ranier" == Ranier Vilela <ranier.vf@gmail.com> writes:
Ranier> Written like that, wouldn't it get better?
Ranier> int
Ranier> pg_lltoa(int64 value, char *a)
Ranier> {
Ranier> if (value < 0)
Ranier> {
Ranier> int len = 0;
Ranier> uint64 uvalue = (uint64) 0 - uvalue;
Ranier> a[len++] = '-';
Ranier> len += pg_ulltoa_n(uvalue, a + len);
Ranier> a[len] = '\0';
Ranier> return len;
Ranier> }
Ranier> else
Ranier> return pg_ulltoa_n(value, a);
Ranier> }
No. While it doesn't matter so much for pg_lltoa since that's unlikely
to inline multiple pg_ulltoa_n calls, if you do pg_ltoa like this it (a)
ends up with two copies of pg_ultoa_n inlined into it, and (b) you don't
actually save any useful amount of time. Your version is also failing to
add the terminating '\0' for the positive case and has other obvious
bugs.
(a) Sorry, I'm not asm specialist.
#include <stdio.h>
int pg_utoa(unsigned int num, char * a) {
int len;
len = sprintf(a, "%lu", num);
return len;
}
int pg_toa(int num, char * a)
{
if (num < 0) {
int len;
len = pg_utoa(num, a);
a[len] = '\0';
return len;
}
else
return pg_utoa(num, a);
}
.LC0:
.string "%lu"
pg_utoa(unsigned int, char*):
mov edx, edi
xor eax, eax
mov rdi, rsi
mov esi, OFFSET FLAT:.LC0
jmp sprintf
pg_toa(int, char*):
push rbp
test edi, edi
mov rbp, rsi
mov edx, edi
mov esi, OFFSET FLAT:.LC0
mov rdi, rbp
mov eax, 0
js .L7
pop rbp
jmp sprintf
.L7:
call sprintf
movsx rdx, eax
mov BYTE PTR [rbp+0+rdx], 0
pop rbp
ret
Where " ends up with two copies of pg_ultoa_n inlined into it", in this simplified example?
(b) I call this tail cut, I believe it saves time, for sure.
Regarding bugs:
(c) your version don't check size of a var, when pg_ulltoa_n
warning about "least MAXINT8LEN bytes."
So in theory, I could blow it up, by calling pg_lltoa.
(d) So I can't trust pg_ulltoa_n, when var a, is it big enough?
If not, there are more bugs.
regards,
Ranier Vilela
>>>>> "Ranier" == Ranier Vilela <ranier.vf@gmail.com> writes: Ranier> Where " ends up with two copies of pg_ultoa_n inlined into it", Ranier> in this simplified example? The two references to sprintf are both inlined copies of your pg_utoa. Ranier> (b) I call this tail cut, I believe it saves time, for sure. You seem to have missed the point that the pg_ultoa_n / pg_ulltoa_n functions DO NOT ADD A TRAILING NUL. Which means that pg_ltoa / pg_lltoa can't just tail call them, since they must add the NUL after. Ranier> Regarding bugs: Ranier> (c) your version don't check size of a var, when pg_ulltoa_n Ranier> warning about "least MAXINT8LEN bytes." Ranier> So in theory, I could blow it up, by calling pg_lltoa. No. Callers of pg_lltoa are required to provide a buffer of at least MAXINT8LEN+1 bytes. -- Andrew.
Em ter., 9 de jun. de 2020 às 15:53, Andrew Gierth <andrew@tao11.riddles.org.uk> escreveu:
>>>>> "Ranier" == Ranier Vilela <ranier.vf@gmail.com> writes:
Ranier> Where " ends up with two copies of pg_ultoa_n inlined into it",
Ranier> in this simplified example?
The two references to sprintf are both inlined copies of your pg_utoa.
Ranier> (b) I call this tail cut, I believe it saves time, for sure.
You seem to have missed the point that the pg_ultoa_n / pg_ulltoa_n
functions DO NOT ADD A TRAILING NUL. Which means that pg_ltoa / pg_lltoa
can't just tail call them, since they must add the NUL after.
Ranier> Regarding bugs:
Ranier> (c) your version don't check size of a var, when pg_ulltoa_n
Ranier> warning about "least MAXINT8LEN bytes."
Ranier> So in theory, I could blow it up, by calling pg_lltoa.
No. Callers of pg_lltoa are required to provide a buffer of at least
MAXINT8LEN+1 bytes.
Thanks for explanations.
So I would change, just the initialization (var uvalue), even though it is irrelevant.
int
pg_lltoa(int64 value, char *a)
{
int len = 0;
uint64 uvalue;
if (value < 0)
{
uvalue = (uint64) 0 - uvalue;
a[len++] = '-';
}
else
uvalue = value;
len += pg_ulltoa_n(uvalue, a + len);
a[len] = '\0';
return len;
}
pg_lltoa(int64 value, char *a)
{
int len = 0;
uint64 uvalue;
if (value < 0)
{
uvalue = (uint64) 0 - uvalue;
a[len++] = '-';
}
else
uvalue = value;
len += pg_ulltoa_n(uvalue, a + len);
a[len] = '\0';
return len;
}
regards,
Ranier Vilela
>>>>> "Ranier" == Ranier Vilela <ranier.vf@gmail.com> writes: Ranier> So I would change, just the initialization (var uvalue), even though it is Ranier> irrelevant. Ranier> int Ranier> pg_lltoa(int64 value, char *a) Ranier> { Ranier> int len = 0; Ranier> uint64 uvalue; Ranier> if (value < 0) Ranier> { Ranier> uvalue = (uint64) 0 - uvalue; Use of uninitialized variable. -- Andrew (irc:RhodiumToad)
Em ter., 9 de jun. de 2020 às 17:42, Andrew Gierth <andrew@tao11.riddles.org.uk> escreveu:
>>>>> "Ranier" == Ranier Vilela <ranier.vf@gmail.com> writes:
Ranier> So I would change, just the initialization (var uvalue), even though it is
Ranier> irrelevant.
Ranier> int
Ranier> pg_lltoa(int64 value, char *a)
Ranier> {
Ranier> int len = 0;
Ranier> uint64 uvalue;
Ranier> if (value < 0)
Ranier> {
Ranier> uvalue = (uint64) 0 - uvalue;
Use of uninitialized variable.
Sorry, my mistake.
uvalue = (uint64) 0 - value;
regards,
Ranier Vilela
--
Andrew (irc:RhodiumToad)
On Tue, 9 Jun 2020 at 22:08, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote: > > >>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes: > > David> This allows us to speed up a few cases. int2vectorout() should > David> be faster and int8out() becomes a bit faster if we get rid of > David> the strdup() call and replace it with a palloc()/memcpy() call. > > What about removing the memcpy entirely? I don't think we save anything > much useful here by pallocing the exact length, rather than doing what > int4out does and palloc a fixed size and convert the int directly into > it. The attached 0001 patch does this. create table bi (a bigint); insert into bi select generate_Series(1,10000000); vacuum freeze analyze bi; query = copy bi to '/dev/null'; 120 second pgbench run. The results are: GCC master: latency average = 1757.556 ms GCC master+0001: latency average = 1588.793 ms (90.4%) clang master: latency average = 1818.952 ms clang master+0001: latency average = 1649.100 ms (90.6%) > For pg_ltoa, etc., I don't like adding the extra call to pg_ultoa_n - at > least on my clang, that results in two copies of pg_ultoa_n inlined. > How about doing it like, > > int > pg_lltoa(int64 value, char *a) > { > int len = 0; > uint64 uvalue = value; > > if (value < 0) > { > uvalue = (uint64) 0 - uvalue; > a[len++] = '-'; > } > len += pg_ulltoa_n(uvalue, a + len); > a[len] = '\0'; > return len; > } The 0002 patch does it this way. David
Вложения
>>>>> "Ranier" == Ranier Vilela <ranier.vf@gmail.com> writes: Ranier> Sorry, my mistake. Ranier> uvalue = (uint64) 0 - value; This doesn't gain anything over the original, and it has the downside of hiding an int64 to uint64 conversion that is actually quite sensitive. For example, it might tempt someone to rewrite it as uvalue = -value; which is actually incorrect (though our -fwrapv will hide the error). -- Andrew (irc:RhodiumToad)
On Wed, 10 Jun 2020 at 11:57, David Rowley <dgrowleyml@gmail.com> wrote: > > On Tue, 9 Jun 2020 at 22:08, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote: > > > > >>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes: > > > > David> This allows us to speed up a few cases. int2vectorout() should > > David> be faster and int8out() becomes a bit faster if we get rid of > > David> the strdup() call and replace it with a palloc()/memcpy() call. > > > > What about removing the memcpy entirely? I don't think we save anything > > much useful here by pallocing the exact length, rather than doing what > > int4out does and palloc a fixed size and convert the int directly into > > it. > > The attached 0001 patch does this. Pending any objections, I'd like to push both of these patches in the next few days to master. Anyone object to changing the signature of these functions in 0002, or have concerns about allocating the maximum memory that we might require in int8out()? David
>>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes: David> Pending any objections, I'd like to push both of these patches David> in the next few days to master. For the second patch, can we take the opportunity to remove the extraneous blank line at the top of pg_ltoa, and add the two missing "extern"s in builtins.h for pg_ultoa_n and pg_ulltoa_n ? David> Anyone object to changing the signature of these functions in David> 0002, or have concerns about allocating the maximum memory that David> we might require in int8out()? Changing the function signatures seems safe enough. The memory thing only seems likely to be an issue if you allocate a lot of text strings for bigint values without a context reset, and I'm not sure where that would happen (maybe passing large bigint arrays to pl/perl or pl/python would do it?) -- Andrew (irc:RhodiumToad)
On Thu, 11 Jun 2020 at 18:52, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote: > For the second patch, can we take the opportunity to remove the > extraneous blank line at the top of pg_ltoa, and add the two missing > "extern"s in builtins.h for pg_ultoa_n and pg_ulltoa_n ? I think since we've branched for PG14 now that fixing those should be backpatched to PG13. David
On Thu, 11 Jun 2020 at 18:52, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote: > > >>>>> "David" == David Rowley <dgrowleyml@gmail.com> writes: > > David> Pending any objections, I'd like to push both of these patches > David> in the next few days to master. > > For the second patch, can we take the opportunity to remove the > extraneous blank line at the top of pg_ltoa, and add the two missing > "extern"s in builtins.h for pg_ultoa_n and pg_ulltoa_n ? > > David> Anyone object to changing the signature of these functions in > David> 0002, or have concerns about allocating the maximum memory that > David> we might require in int8out()? > > Changing the function signatures seems safe enough. The memory thing > only seems likely to be an issue if you allocate a lot of text strings > for bigint values without a context reset, and I'm not sure where that > would happen (maybe passing large bigint arrays to pl/perl or pl/python > would do it?) I ended up chickening out of doing the larger allocation unconditionally. Instead, I pushed the original idea of doing the palloc/memcpy of the length returned by pg_lltoa. That gets us most of the gains without the change in memory usage behaviour. Thanks for your reviews on this. David
On Sat, Jun 13, 2020 at 8:36 AM David Rowley <dgrowleyml@gmail.com> wrote: > I ended up chickening out of doing the larger allocation > unconditionally. Instead, I pushed the original idea of doing the > palloc/memcpy of the length returned by pg_lltoa. That gets us most > of the gains without the change in memory usage behaviour. This was still marked as needing review in commitfest, so I marked it as committed. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services