Thread: Why is pq_begintypsend so slow?

Why is pq_begintypsend so slow?

From: Jeff Janes
Date:
I was using COPY recently and was wondering why BINARY format is not much (if any) faster than the default format.  Once I switched from mostly exporting ints to mostly exporting double precisions (7e6 rows of 100 columns, randomly generated), it was faster, but not by as much as I intuitively thought it should be.  

Running 'perf top' to profile a "COPY BINARY .. TO '/dev/null'" on an AWS m5.large machine running Ubuntu 18.04, with self-compiled PostgreSQL:

PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit

I saw that the hotspot was pq_begintypsend at 20%, which was twice the percentage as the next place winner (AllocSetAlloc).  If I drill down into the function, I see something like the below.  I don't really speak assembly, but usually when I see an assembly instruction being especially hot and not being the innermost instruction in a loop, I blame it on CPU cache misses.  But everything being touched here should already be well cached, since initStringInfo has just got done setting it up.  And if not for that, then by the 2nd invocation of appendStringInfoCharMacro it certainly should be in the cache, yet that one is even slower than the 1st appendStringInfoCharMacro.

Why is this such a bottleneck?
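For reference, appendStringInfoCharMacro (from stringinfo.h) expands to roughly the following; each of the four hot blocks in the annotated assembly below is one expansion of it:

    #define appendStringInfoCharMacro(str,ch) \
        (((str)->len + 1 >= (str)->maxlen) ? \
         appendStringInfoChar(str, ch) : \
         (void)((str)->data[(str)->len] = (ch), (str)->data[++(str)->len] = '\0'))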

pq_begintypsend  /usr/local/pgsql/bin/postgres

 0.15 |    push  %rbx
 0.09 |    mov  %rdi,%rbx
   |       initStringInfo(buf);
 3.03 |    callq initStringInfo
   |       /* Reserve four bytes for the bytea length word */
   |       appendStringInfoCharMacro(buf, '\0');
   |    movslq 0x8(%rbx),%rax
 1.05 |    lea  0x1(%rax),%edx
 0.72 |    cmp  0xc(%rbx),%edx
   |    jge  b0
 2.92 |    mov  (%rbx),%rdx
   |    movb  $0x0,(%rdx,%rax,1)
13.76 |    mov  0x8(%rbx),%eax
 0.81 |    mov  (%rbx),%rdx
 0.52 |    add  $0x1,%eax
 0.12 |    mov  %eax,0x8(%rbx)
 2.85 |    cltq
 0.01 |    movb  $0x0,(%rdx,%rax,1)
   |       appendStringInfoCharMacro(buf, '\0');
10.65 |    movslq 0x8(%rbx),%rax
   |    lea  0x1(%rax),%edx
 0.90 |    cmp  0xc(%rbx),%edx
   |    jge  ca
 0.54 | 42:  mov  (%rbx),%rdx
 1.84 |    movb  $0x0,(%rdx,%rax,1)
13.88 |    mov  0x8(%rbx),%eax
 0.03 |    mov  (%rbx),%rdx
   |    add  $0x1,%eax
 0.33 |    mov  %eax,0x8(%rbx)
 2.60 |    cltq
 0.06 |    movb  $0x0,(%rdx,%rax,1)
   |       appendStringInfoCharMacro(buf, '\0');
 3.21 |    movslq 0x8(%rbx),%rax
 0.23 |    lea  0x1(%rax),%edx
 1.74 |    cmp  0xc(%rbx),%edx
   |    jge  e0
 0.21 | 67:  mov  (%rbx),%rdx
 1.18 |    movb  $0x0,(%rdx,%rax,1)
 9.29 |    mov  0x8(%rbx),%eax
 0.18 |    mov  (%rbx),%rdx
   |    add  $0x1,%eax
 0.19 |    mov  %eax,0x8(%rbx)
 3.14 |    cltq
 0.12 |    movb  $0x0,(%rdx,%rax,1)
   |       appendStringInfoCharMacro(buf, '\0');
 5.29 |    movslq 0x8(%rbx),%rax
 0.03 |    lea  0x1(%rax),%edx
 1.45 |    cmp  0xc(%rbx),%edx
   |    jge  f6
 0.41 | 8c:  mov  (%rbx),%rdx

Cheers,

Jeff

Re: Why is pq_begintypsend so slow?

From: Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> I saw that the hotspot was pq_begintypsend at 20%, which was twice the
> percentage as the next place winner (AllocSetAlloc).

Weird.

> Why is this such a bottleneck?

Not sure, but it seems like a pretty dumb way to push the stringinfo's
len forward.  We're reading/updating the len word in each line, and
if your perf measurements are to be believed, it's the re-fetches of
the len values that are bottlenecked --- maybe your CPU isn't too
bright about that?  The bytes of the string value are getting written
twice too, thanks to uselessly setting up a terminating nul each time.

I'd be inclined to replace the appendStringInfoCharMacro calls with
appendStringInfoSpaces(buf, 4) --- I don't think we care exactly what
is inserted into those bytes at this point.  And maybe
appendStringInfoSpaces could stand some micro-optimization, too.
Use a memset and a single len adjustment, perhaps?
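Something like this, perhaps (untested sketch):

    void
    appendStringInfoSpaces(StringInfo str, int count)
    {
        if (count > 0)
        {
            /* Make more room if needed */
            enlargeStringInfo(str, count);

            /* One memset and a single len adjustment */
            memset(str->data + str->len, ' ', count);
            str->len += count;
            str->data[str->len] = '\0';
        }
    }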

            regards, tom lane



Re: Why is pq_begintypsend so slow?

From: David Fetter
Date:
On Sat, Jan 11, 2020 at 03:19:37PM -0500, Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
> > I saw that the hotspot was pq_begintypsend at 20%, which was twice the
> > percentage as the next place winner (AllocSetAlloc).
> 
> Weird.
> 
> > Why is this such a bottleneck?
> 
> Not sure, but it seems like a pretty dumb way to push the stringinfo's
> len forward.  We're reading/updating the len word in each line, and
> if your perf measurements are to be believed, it's the re-fetches of
> the len values that are bottlenecked --- maybe your CPU isn't too
> bright about that?  The bytes of the string value are getting written
> twice too, thanks to uselessly setting up a terminating nul each time.
> 
> I'd be inclined to replace the appendStringInfoCharMacro calls with
> appendStringInfoSpaces(buf, 4) --- I don't think we care exactly what
> is inserted into those bytes at this point.  And maybe
> appendStringInfoSpaces could stand some micro-optimization, too.
> Use a memset and a single len adjustment, perhaps?

Please find attached a patch that does both of the things you
suggested.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachments

Re: Why is pq_begintypsend so slow?

From: Tom Lane
Date:
David Fetter <david@fetter.org> writes:
> On Sat, Jan 11, 2020 at 03:19:37PM -0500, Tom Lane wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> I saw that the hotspot was pq_begintypsend at 20%, which was twice the
>>> percentage as the next place winner (AllocSetAlloc).

>> I'd be inclined to replace the appendStringInfoCharMacro calls with
>> appendStringInfoSpaces(buf, 4) --- I don't think we care exactly what
>> is inserted into those bytes at this point.  And maybe
>> appendStringInfoSpaces could stand some micro-optimization, too.
>> Use a memset and a single len adjustment, perhaps?

> Please find attached a patch that does both of the things you
> suggested.

I've been fooling around with this here.  On the test case Jeff
describes --- COPY BINARY tab TO '/dev/null' where tab contains
100 float8 columns filled from random() --- I can reproduce his
results.  pq_begintypsend is the top hotspot and if perf's
localization is accurate, it's the instructions that fetch
str->len that hurt the most.  Still not very clear why that is...

Converting pq_begintypsend to use appendStringInfoSpaces helps
a bit; it takes my test case from 91725 ms to 88847 ms, or about
3% savings.  Noodling around with appendStringInfoSpaces doesn't
help noticeably; I tried memset, as well as open-coding (cf
patch below) but the results were all well within the noise
threshold.

I saw at this point that the remaining top spots were
enlargeStringInfo and appendBinaryStringInfo, so I experimented
with inlining them (again, see patch below).  That *did* move
the needle: down to 72691 ms, or 20% better than HEAD.  Of
course, that comes at a code-size cost, but it's only about
13kB growth:

before:
$ size src/backend/postgres 
   text    data     bss     dec     hex filename
7485285   58088  203328 7746701  76348d src/backend/postgres
after:
$ size src/backend/postgres 
   text    data     bss     dec     hex filename
7498652   58088  203328 7760068  7668c4 src/backend/postgres

That's under two-tenths of a percent.  (This'd affect frontend
executables too, and I didn't check them.)

Seems like this is worth pursuing, especially if it can be
shown to improve any other cases noticeably.  It might be
worth inlining some of the other trivial stringinfo functions,
though I'd tread carefully on that.

            regards, tom lane

diff --git a/src/backend/libpq/pqformat.c b/src/backend/libpq/pqformat.c
index a6f990c..0fc8c3f 100644
--- a/src/backend/libpq/pqformat.c
+++ b/src/backend/libpq/pqformat.c
@@ -328,11 +328,8 @@ void
 pq_begintypsend(StringInfo buf)
 {
     initStringInfo(buf);
-    /* Reserve four bytes for the bytea length word */
-    appendStringInfoCharMacro(buf, '\0');
-    appendStringInfoCharMacro(buf, '\0');
-    appendStringInfoCharMacro(buf, '\0');
-    appendStringInfoCharMacro(buf, '\0');
+    /* Reserve four bytes for the bytea length word; contents not important */
+    appendStringInfoSpaces(buf, 4);
 }

 /* --------------------------------
diff --git a/src/common/stringinfo.c b/src/common/stringinfo.c
index 0badc46..8f8bb0d 100644
--- a/src/common/stringinfo.c
+++ b/src/common/stringinfo.c
@@ -207,43 +207,21 @@ appendStringInfoSpaces(StringInfo str, int count)
 {
     if (count > 0)
     {
+        char       *ptr;
+
         /* Make more room if needed */
         enlargeStringInfo(str, count);

         /* OK, append the spaces */
+        ptr = str->data + str->len;
+        str->len += count;
         while (--count >= 0)
-            str->data[str->len++] = ' ';
-        str->data[str->len] = '\0';
+            *ptr++ = ' ';
+        *ptr = '\0';
     }
 }

 /*
- * appendBinaryStringInfo
- *
- * Append arbitrary binary data to a StringInfo, allocating more space
- * if necessary. Ensures that a trailing null byte is present.
- */
-void
-appendBinaryStringInfo(StringInfo str, const char *data, int datalen)
-{
-    Assert(str != NULL);
-
-    /* Make more room if needed */
-    enlargeStringInfo(str, datalen);
-
-    /* OK, append the data */
-    memcpy(str->data + str->len, data, datalen);
-    str->len += datalen;
-
-    /*
-     * Keep a trailing null in place, even though it's probably useless for
-     * binary data.  (Some callers are dealing with text but call this because
-     * their input isn't null-terminated.)
-     */
-    str->data[str->len] = '\0';
-}
-
-/*
  * appendBinaryStringInfoNT
  *
  * Append arbitrary binary data to a StringInfo, allocating more space
@@ -263,7 +241,7 @@ appendBinaryStringInfoNT(StringInfo str, const char *data, int datalen)
 }

 /*
- * enlargeStringInfo
+ * enlargeStringInfo_OOL
  *
  * Make sure there is enough space for 'needed' more bytes
  * ('needed' does not include the terminating null).
@@ -274,13 +252,16 @@ appendBinaryStringInfoNT(StringInfo str, const char *data, int datalen)
  * can save some palloc overhead by enlarging the buffer before starting
  * to store data in it.
  *
+ * We normally reach here only if enlargement is needed, since callers
+ * go through the inline function which doesn't call this otherwise.
+ *
  * NB: In the backend, because we use repalloc() to enlarge the buffer, the
  * string buffer will remain allocated in the same memory context that was
  * current when initStringInfo was called, even if another context is now
  * current.  This is the desired and indeed critical behavior!
  */
 void
-enlargeStringInfo(StringInfo str, int needed)
+enlargeStringInfo_OOL(StringInfo str, int needed)
 {
     int            newlen;

diff --git a/src/include/lib/stringinfo.h b/src/include/lib/stringinfo.h
index 5a2a3db..30ba826 100644
--- a/src/include/lib/stringinfo.h
+++ b/src/include/lib/stringinfo.h
@@ -87,6 +87,25 @@ extern void initStringInfo(StringInfo str);
 extern void resetStringInfo(StringInfo str);

 /*------------------------
+ * enlargeStringInfo
+ * Make sure a StringInfo's buffer can hold at least 'needed' more bytes
+ * ('needed' does not include the terminating null).
+ * The inlined code eliminates the common case where no work is needed.
+ */
+extern void enlargeStringInfo_OOL(StringInfo str, int needed);
+
+static inline void
+enlargeStringInfo(StringInfo str, int needed)
+{
+    /*
+     * Do the test in unsigned arithmetic so that if "needed" is negative,
+     * we'll go to the out-of-line code which will error out.
+     */
+    if ((unsigned) needed >= (unsigned) (str->maxlen - str->len))
+        enlargeStringInfo_OOL(str, needed);
+}
+
+/*------------------------
  * appendStringInfo
  * Format text data under the control of fmt (an sprintf-style format string)
  * and append it to whatever is already in str.  More space is allocated
@@ -139,10 +158,27 @@ extern void appendStringInfoSpaces(StringInfo str, int count);
 /*------------------------
  * appendBinaryStringInfo
  * Append arbitrary binary data to a StringInfo, allocating more space
- * if necessary.
+ * if necessary.  Ensures that a trailing null byte is present.
  */
-extern void appendBinaryStringInfo(StringInfo str,
-                                   const char *data, int datalen);
+static inline void
+appendBinaryStringInfo(StringInfo str, const char *data, int datalen)
+{
+    Assert(str != NULL);
+
+    /* Make more room if needed */
+    enlargeStringInfo(str, datalen);
+
+    /* OK, append the data */
+    memcpy(str->data + str->len, data, datalen);
+    str->len += datalen;
+
+    /*
+     * Keep a trailing null in place, even though it's probably useless for
+     * binary data.  (Some callers are dealing with text but call this because
+     * their input isn't null-terminated.)
+     */
+    str->data[str->len] = '\0';
+}

 /*------------------------
  * appendBinaryStringInfoNT
@@ -152,10 +188,4 @@ extern void appendBinaryStringInfo(StringInfo str,
 extern void appendBinaryStringInfoNT(StringInfo str,
                                      const char *data, int datalen);

-/*------------------------
- * enlargeStringInfo
- * Make sure a StringInfo's buffer can hold at least 'needed' more bytes.
- */
-extern void enlargeStringInfo(StringInfo str, int needed);
-
 #endif                            /* STRINGINFO_H */

Re: Why is pq_begintypsend so slow?

From: Tom Lane
Date:
I wrote:
> I saw at this point that the remaining top spots were
> enlargeStringInfo and appendBinaryStringInfo, so I experimented
> with inlining them (again, see patch below).  That *did* move
> the needle: down to 72691 ms, or 20% better than HEAD.

Oh ... marking the test in the inline part of enlargeStringInfo()
as unlikely() helps quite a bit more: 66100 ms, a further 9% gain.
Might be over-optimizing for this particular case, perhaps, but
I think that's a reasonable marking given that we overallocate
the stringinfo buffer for most uses.
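
That is, the inline test from the patch in my previous message becomes:

    static inline void
    enlargeStringInfo(StringInfo str, int needed)
    {
        /* unsigned compare, so negative "needed" takes the error path too */
        if (unlikely((unsigned) needed >= (unsigned) (str->maxlen - str->len)))
            enlargeStringInfo_OOL(str, needed);
    }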

(But ... I'm not finding these numbers to be super reproducible
across different ASLR layouts.  So take it with a grain of salt.)

            regards, tom lane



Re: Why is pq_begintypsend so slow?

From: Andres Freund
Date:
Hi,

On 2020-01-11 22:32:45 -0500, Tom Lane wrote:
> I wrote:
> > I saw at this point that the remaining top spots were
> > enlargeStringInfo and appendBinaryStringInfo, so I experimented
> > with inlining them (again, see patch below).  That *did* move
> > the needle: down to 72691 ms, or 20% better than HEAD.
>
> Oh ... marking the test in the inline part of enlargeStringInfo()
> as unlikely() helps quite a bit more: 66100 ms, a further 9% gain.
> Might be over-optimizing for this particular case, perhaps
>
> (But ... I'm not finding these numbers to be super reproducible
> across different ASLR layouts.  So take it with a grain of salt.)

FWIW, I've also observed, in another thread (the node func generation
thing [1]), that inlining enlargeStringInfo() helps a lot, especially
when inlining some of its callers. Moving e.g. appendStringInfo() inline
allows the compiler to sometimes optimize away the strlen. But if
e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
unconditionally, successive appends cannot optimize away memory accesses
for ->len/->data.

For the case of send functions, we really ought to have at least
pq_begintypsend(), enlargeStringInfo() and pq_endtypsend() inline. That
way the compiler ought to be able to avoid repeatedly loading/storing
->len, after the initial initStringInfo() call. Might even make sense to
also have initStringInfo() inline, because then the compiler would
probably never actually materialize the StringInfoData (and would
automatically have good aliasing information too).


The commit referenced above is obviously quite WIP-ey, and contains
things that should be split into separate commits. But I think it might
be worth moving more functions into the header, like I've done in that
commit.

The commit also adds appendStringInfoU?Int(32,64) operations - I've
unsurprisingly found these to be *considerably* faster than going through
appendStringInfoString().


> but I think that's a reasonable marking given that we overallocate
> the stringinfo buffer for most uses.

Wonder if it's worth providing a function to initialize the stringinfo
differently for the many cases where we have at least a very good idea
of how long the string will be. It's sad to allocate 1kb just for
e.g. int4send to send an integer plus length.
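
Something like an initStringInfoEx() taking the caller's size estimate,
say (sketch; initsize must be at least 1, for the trailing NUL):

    static inline void
    initStringInfoEx(StringInfo str, int initsize)
    {
        str->data = (char *) palloc(initsize);
        str->maxlen = initsize;
        resetStringInfo(str);
    }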

Greetings,

Andres Freund

[1]
https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=commit;h=127e860cf65f50434e0bb97acbba4b0ea6f38cfd



Re: Why is pq_begintypsend so slow?

From: Andres Freund
Date:
Hi,

On 2020-01-13 15:18:04 -0800, Andres Freund wrote:
> On 2020-01-11 22:32:45 -0500, Tom Lane wrote:
> > I wrote:
> > > I saw at this point that the remaining top spots were
> > > enlargeStringInfo and appendBinaryStringInfo, so I experimented
> > > with inlining them (again, see patch below).  That *did* move
> > > the needle: down to 72691 ms, or 20% better than HEAD.
> >
> > Oh ... marking the test in the inline part of enlargeStringInfo()
> > as unlikely() helps quite a bit more: 66100 ms, a further 9% gain.
> > Might be over-optimizing for this particular case, perhaps
> >
> > (But ... I'm not finding these numbers to be super reproducible
> > across different ASLR layouts.  So take it with a grain of salt.)
>
> FWIW, I've also observed, in another thread (the node func generation
> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
> when inlining some of its callers. Moving e.g. appendStringInfo() inline
> allows the compiler to sometimes optimize away the strlen. But if
> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
> unconditionally, successive appends cannot optimize away memory accesses
> for ->len/->data.

With a set of patches doing so, int4send itself is not a significant
factor for my test benchmark [1] anymore. The assembly looks about as
good as one could hope, I think:

# save rbx on the stack
   0x00000000004b7f90 <+0>:    push   %rbx
   0x00000000004b7f91 <+1>:    sub    $0x20,%rsp
# store integer to be sent into rbx
   0x00000000004b7f95 <+5>:    mov    0x20(%rdi),%rbx
# palloc length argument
   0x00000000004b7f99 <+9>:    mov    $0x9,%edi
   0x00000000004b7f9e <+14>:    callq  0x5d9aa0 <palloc>
# store integer in buffer (ebx is 4 byte portion of rbx)
   0x00000000004b7fa3 <+19>:    movbe  %ebx,0x4(%rax)
# store varlena header
   0x00000000004b7fa8 <+24>:    movl   $0x20,(%rax)
# restore stack and rbx registers
   0x00000000004b7fae <+30>:    add    $0x20,%rsp
   0x00000000004b7fb2 <+34>:    pop    %rbx
   0x00000000004b7fb3 <+35>:    retq

All the $0x20 constants are a bit confusing, but they just happen to be
the same for int4send. It's the size of the stack frame,
offset for FunctionCallInfoBaseData->args[0], the varlena header (and then the stack
frame again) respectively.

Note that I had to annotate palloc with __attribute__((malloc)) to make
the compiler understand that palloc's returned value will not alias with
anything problematic (e.g. the potential of aliasing with fcinfo
prevents optimizing to the above without that annotation).  I think such
annotations would be a good idea anyway, precisely because they allow
the compiler to optimize code significantly better.
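
Concretely, the annotation is just (GCC/clang syntax):

    /* promise: the returned pointer doesn't alias any pre-existing object */
    extern void *palloc(Size size) __attribute__((malloc));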


These together yield about a 1.8x speedup for me. The profile shows
that the overhead now is overwhelmingly elsewhere:
+   26.30%  postgres  postgres          [.] CopyOneRowTo
+   13.40%  postgres  postgres          [.] tts_buffer_heap_getsomeattrs
+   10.61%  postgres  postgres          [.] AllocSetAlloc
+    9.26%  postgres  libc-2.29.so      [.] __memmove_avx_unaligned_erms
+    7.32%  postgres  postgres          [.] SendFunctionCall
+    6.02%  postgres  postgres          [.] palloc
+    4.45%  postgres  postgres          [.] int4send
+    3.68%  postgres  libc-2.29.so      [.] _IO_fwrite
+    2.71%  postgres  postgres          [.] heapgettup_pagemode
+    1.96%  postgres  postgres          [.] AllocSetReset
+    1.83%  postgres  postgres          [.] CopySendEndOfRow
+    1.75%  postgres  libc-2.29.so      [.] _IO_file_xsputn@@GLIBC_2.2.5
+    1.60%  postgres  postgres          [.] ExecStoreBufferHeapTuple
+    1.57%  postgres  postgres          [.] DoCopyTo
+    1.16%  postgres  postgres          [.] memcpy@plt
+    1.07%  postgres  postgres          [.] heapgetpage

Even without using the new pq_begintypsend_ex()/initStringInfoEx(), the
generated code is still considerably better than before, yielding a
1.58x speedup. The allocator overhead unsurprisingly is higher:
+   24.93%  postgres  postgres          [.] CopyOneRowTo
+   17.10%  postgres  postgres          [.] AllocSetAlloc
+   10.09%  postgres  postgres          [.] tts_buffer_heap_getsomeattrs
+    6.50%  postgres  libc-2.29.so      [.] __memmove_avx_unaligned_erms
+    5.99%  postgres  postgres          [.] SendFunctionCall
+    5.11%  postgres  postgres          [.] palloc
+    3.95%  postgres  libc-2.29.so      [.] _int_malloc
+    3.38%  postgres  postgres          [.] int4send
+    2.54%  postgres  postgres          [.] heapgettup_pagemode
+    2.11%  postgres  libc-2.29.so      [.] _int_free
+    2.06%  postgres  postgres          [.] MemoryContextReset
+    2.02%  postgres  postgres          [.] AllocSetReset
+    1.97%  postgres  libc-2.29.so      [.] _IO_fwrite
+    1.47%  postgres  postgres          [.] DoCopyTo
+    1.14%  postgres  postgres          [.] ExecStoreBufferHeapTuple
+    1.06%  postgres  libc-2.29.so      [.] _IO_file_xsputn@@GLIBC_2.2.5
+    1.04%  postgres  libc-2.29.so      [.] malloc


Adding a few pg_restrict*, and using appendBinaryStringInfoNT instead of
appendBinaryStringInfo in CopySend* gains another 1.05x.


This does result in some code growth, but given the size of the
improvements, and that the improvements are significant even without
code changes to callsites, that seems worth it.

before:
   text       data        bss        dec        hex    filename
8482739     172304     204240    8859283     872e93    src/backend/postgres
after:
   text       data        bss        dec        hex    filename
8604300     172304     204240    8980844     89096c    src/backend/postgres

Regards,

Andres

[1]
CREATE TABLE lotsaints4(c01 int4 NOT NULL, c02 int4 NOT NULL, c03 int4 NOT NULL, c04 int4 NOT NULL, c05 int4 NOT NULL,
c06 int4 NOT NULL, c07 int4 NOT NULL, c08 int4 NOT NULL, c09 int4 NOT NULL, c10 int4 NOT NULL);

INSERT INTO lotsaints4 SELECT ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int,
((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int,
((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int FROM generate_series(1, 2000000);

VACUUM FREEZE lotsaints4;
COPY lotsaints4 TO '/dev/null' WITH binary;

CREATE TABLE lotsaints8(c01 int8 NOT NULL, c02 int8 NOT NULL, c03 int8 NOT NULL, c04 int8 NOT NULL, c05 int8 NOT NULL,
c06 int8 NOT NULL, c07 int8 NOT NULL, c08 int8 NOT NULL, c09 int8 NOT NULL, c10 int8 NOT NULL);

INSERT INTO lotsaints8 SELECT ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int,
((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int,
((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int FROM generate_series(1, 2000000);

VACUUM FREEZE lotsaints8;
COPY lotsaints8 TO '/dev/null' WITH binary;

Attachments

Re: Why is pq_begintypsend so slow?

From: Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
>> FWIW, I've also observed, in another thread (the node func generation
>> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
>> when inlining some of its callers. Moving e.g. appendStringInfo() inline
>> allows the compiler to sometimes optimize away the strlen. But if
>> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
>> unconditionally, successive appends cannot optimize away memory accesses
>> for ->len/->data.

> With a set of patches doing so, int4send itself is not a significant
> factor for my test benchmark [1] anymore.

This thread seems to have died out, possibly because the last set of
patches that Andres posted was sufficiently complicated and invasive
that nobody wanted to review it.  I thought about this again after
seeing that [1] is mostly about pq_begintypsend overhead, and had
an epiphany: there isn't really a strong reason for pq_begintypsend
to be inserting bits into the buffer at all.  The bytes will be
filled by pq_endtypsend, and nothing in between should be touching
them.  So I propose 0001 attached.  It's poking into the stringinfo
abstraction a bit more than I would want to do if there weren't a
compelling performance reason to do so, but there evidently is.

With 0001, pq_begintypsend drops from being the top single routine
in a profile of a test case like [1] to being well down the list.
The next biggest cost compared to text-format output is that
printtup() itself is noticeably more expensive.  A lot of the extra
cost there seems to be from pq_sendint32(), which is getting inlined
into printtup(), and there probably isn't much we can do to make that
cheaper. But eliminating a common subexpression as in 0002 below does
help noticeably, at least with the rather old gcc I'm using.

For me, the combination of these two eliminates most but not quite
all of the cost penalty of binary over text output as seen in [1].

            regards, tom lane

[1] https://www.postgresql.org/message-id/CAMovtNoHFod2jMAKQjjxv209PCTJx5Kc66anwWvX0mEiaXwgmA%40mail.gmail.com

diff --git a/src/backend/libpq/pqformat.c b/src/backend/libpq/pqformat.c
index a6f990c..03b7404 100644
--- a/src/backend/libpq/pqformat.c
+++ b/src/backend/libpq/pqformat.c
@@ -328,11 +328,16 @@ void
 pq_begintypsend(StringInfo buf)
 {
     initStringInfo(buf);
-    /* Reserve four bytes for the bytea length word */
-    appendStringInfoCharMacro(buf, '\0');
-    appendStringInfoCharMacro(buf, '\0');
-    appendStringInfoCharMacro(buf, '\0');
-    appendStringInfoCharMacro(buf, '\0');
+
+    /*
+     * Reserve four bytes for the bytea length word.  We don't need to fill
+     * them with anything (pq_endtypsend will do that), and this function is
+     * enough of a hot spot that it's worth cheating to save some cycles. Note
+     * in particular that we don't bother to guarantee that the buffer is
+     * null-terminated.
+     */
+    Assert(buf->maxlen > 4);
+    buf->len = 4;
 }
 
 /* --------------------------------
diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index dd1bac0..a9315c6 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -438,11 +438,12 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
         {
             /* Binary output */
             bytea       *outputbytes;
+            int            outputlen;

             outputbytes = SendFunctionCall(&thisState->finfo, attr);
-            pq_sendint32(buf, VARSIZE(outputbytes) - VARHDRSZ);
-            pq_sendbytes(buf, VARDATA(outputbytes),
-                         VARSIZE(outputbytes) - VARHDRSZ);
+            outputlen = VARSIZE(outputbytes) - VARHDRSZ;
+            pq_sendint32(buf, outputlen);
+            pq_sendbytes(buf, VARDATA(outputbytes), outputlen);
         }
     }


Re: Why is pq_begintypsend so slow?

From: Ranier Vilela
Date:
On Mon, May 18, 2020 at 13:38, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> FWIW, I've also observed, in another thread (the node func generation
> >> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
> >> when inlining some of its callers. Moving e.g. appendStringInfo() inline
> >> allows the compiler to sometimes optimize away the strlen. But if
> >> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
> >> unconditionally, successive appends cannot optimize away memory accesses
> >> for ->len/->data.
>
> > With a set of patches doing so, int4send itself is not a significant
> > factor for my test benchmark [1] anymore.
>
> This thread seems to have died out, possibly because the last set of
> patches that Andres posted was sufficiently complicated and invasive
> that nobody wanted to review it.  I thought about this again after
> seeing that [1] is mostly about pq_begintypsend overhead, and had
> an epiphany: there isn't really a strong reason for pq_begintypsend
> to be inserting bits into the buffer at all.  The bytes will be
> filled by pq_endtypsend, and nothing in between should be touching
> them.  So I propose 0001 attached.  It's poking into the stringinfo
> abstraction a bit more than I would want to do if there weren't a
> compelling performance reason to do so, but there evidently is.
>
> With 0001, pq_begintypsend drops from being the top single routine
> in a profile of a test case like [1] to being well down the list.
> The next biggest cost compared to text-format output is that
> printtup() itself is noticeably more expensive.  A lot of the extra
> cost there seems to be from pq_sendint32(), which is getting inlined
> into printtup(), and there probably isn't much we can do to make that
> cheaper. But eliminating a common subexpression as in 0002 below does
> help noticeably, at least with the rather old gcc I'm using.

Again, I see problems with the types declared in Postgres.
1. pq_sendint32 (StringInfo buf, uint32 i)
2. extern void pq_sendbytes (StringInfo buf, const char * data, int datalen);

Wouldn't it be better to declare outputlen (0002) as uint32?
To avoid converting from (int) to (uint32), even if afterwards there is a conversion from (uint32) to (int)?

regards,
Ranier Vilela

Re: Why is pq_begintypsend so slow?

From: Tom Lane
Date:
Ranier Vilela <ranier.vf@gmail.com> writes:
> Again, I see problems with the types declared in Postgres.
> 1. pq_sendint32 (StringInfo buf, uint32 i)
> 2. extern void pq_sendbytes (StringInfo buf, const char * data, int
> datalen);

We could spend the next ten years cleaning up minor discrepancies
like that, and have nothing much to show for the work.

> To avoid converting from (int) to (uint32), even if afterwards there is a
> conversion from (uint32) to (int)?

You do realize that that conversion costs nothing?

            regards, tom lane



Re: Why is pq_begintypsend so slow?

From: Andres Freund
Date:
Hi,

On 2020-05-18 12:38:05 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> FWIW, I've also observed, in another thread (the node func generation
> >> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
> >> when inlining some of its callers. Moving e.g. appendStringInfo() inline
> >> allows the compiler to sometimes optimize away the strlen. But if
> >> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
> >> unconditionally, successive appends cannot optimize away memory accesses
> >> for ->len/->data.
>
> > With a set of patches doing so, int4send itself is not a significant
> > factor for my test benchmark [1] anymore.
>
> This thread seems to have died out, possibly because the last set of
> patches that Andres posted was sufficiently complicated and invasive
> that nobody wanted to review it.

Well, I wasn't really planning to try to get that patchset into 13, and
it wasn't in the CF...


> I thought about this again after seeing that [1] is mostly about
> pq_begintypsend overhead

I'm not really convinced that that's the whole problem. Using the
benchmark from upthread, I get (median of three):
master: 1181.581
yours: 1171.445
mine: 598.031

That's a very significant difference, imo. It helps a bit with the
benchmark from your [1], but not that much.


> With 0001, pq_begintypsend drops from being the top single routine
> in a profile of a test case like [1] to being well down the list.
> The next biggest cost compared to text-format output is that
> printtup() itself is noticeably more expensive.  A lot of the extra
> cost there seems to be from pq_sendint32(), which is getting inlined
> into printtup(), and there probably isn't much we can do to make that
> cheaper. But eliminating a common subexpression as in 0002 below does
> help noticeably, at least with the rather old gcc I'm using.

I think there's plenty more we can do:

First, it's unnecessary to re-initialize a FunctionCallInfo for every
send/recv/out/in call. Instead we can reuse the same one over and over.
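Roughly like this (sketch, using printtup's per-column FmgrInfo):

    FmgrInfo   *flinfo = &thisState->finfo;   /* set up once via fmgr_info() */
    LOCAL_FCINFO(fcinfo, 1);
    Datum       result;

    InitFunctionCallInfoData(*fcinfo, flinfo, 1, InvalidOid, NULL, NULL);

    /* per datum: overwrite the argument and re-invoke, no re-init needed */
    fcinfo->args[0].value = attr;
    fcinfo->args[0].isnull = false;
    fcinfo->isnull = false;
    result = FunctionCallInvoke(fcinfo);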


After that, the biggest remaining overhead for Jack's test is the palloc
for the stringinfo, as far as I can see.  I've complained about that
before...

I've just hacked up a modification where, for send functions,
fcinfo->context contains a stringinfo set up by printtup/CopyTo. That,
combined with using a FunctionCallInfo set up beforehand, instead of
re-initializing it in every printtup cycle, results in a pretty good
saving.
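
A send function then looks roughly like this (sketch;
pq_begintypsend_reuse() is a made-up name for a variant that resets the
passed-in buffer instead of palloc'ing a fresh one):

    Datum
    int4send(PG_FUNCTION_ARGS)
    {
        int32       arg1 = PG_GETARG_INT32(0);
        /* the hack: printtup/CopyTo smuggle a reusable buffer in here */
        StringInfo  buf = (StringInfo) fcinfo->context;

        pq_begintypsend_reuse(buf);    /* reset + reserve the length word */
        pq_sendint32(buf, arg1);
        PG_RETURN_BYTEA_P(pq_endtypsend(buf));
    }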

Making the binary protocol 20% faster than text, in Jack's testcase. And
my lotsaints4 test goes further down to ~410ms (2.5x faster than where
we started).

Now obviously, the hack with passing a StringInfo in ->context is just
that, a hack. A somewhat gross one even. But I think it pretty clearly
shows the problem and the way out.

I don't know what the best non-gross solution for the overhead of the
out/send functions is. There seems to be at least the following
major options (and a lots of variants thereof):

1) Just continue to incur significant overhead for every datum
2) Accept the uglyness of passing in a buffer via
   FunctionCallInfo->context. Change nearly all in-core send functions
   over to that.
3) Pass string buffer through an new INTERNAL argument to send/output
   function, allow both old/new style send functions. Use a wrapper
   function to adapt the "old style" to the "new style".
4) Like 3, but make the new argument optional, and use ad-hoc
   stringbuffer if not provided. I don't like the unnecessary branches
   this adds.

The biggest problem after that is that we waste a lot of time memcpying
stuff around repeatedly. There is:
1) send function: datum -> per datum stringinfo
2) printtup: per datum stringinfo -> per row stringinfo
3) socket_putmessage: per row stringinfo -> PqSendBuffer
4) send(): PqSendBuffer -> kernel buffer

It's obviously hard to avoid 1) and 4) in the common case, but the
number of other copies seems pretty clearly excessive.


If we change the signature of the out/send function to always target a
string buffer, we could pretty easily avoid 2), and for out functions
we'd not have to redundantly call strlen (as the argument to
pq_sendcountedtext) anymore, which seems substantial too.

As I argued before, I think it's unnecessary to have a separate buffer
between 3-4). We should construct the outgoing message inside the send
buffer. I still don't understand what "recursion" danger there is,
nothing below printtup should ever send protocol messages, no?


Sometimes there's also 0) in the above, when detoasting a datum...

Greetings,

Andres Freund

Attachments

Re: Why is pq_begintypsend so slow?

From: Robert Haas
Date:
On Tue, Jun 2, 2020 at 9:56 PM Andres Freund <andres@anarazel.de> wrote:
> The biggest problem after that is that we waste a lot of time memcpying
> stuff around repeatedly. There is:
> 1) send function: datum -> per datum stringinfo
> 2) printtup: per datum stringinfo -> per row stringinfo
> 3) socket_putmessage: per row stringinfo -> PqSendBuffer
> 4) send(): PqSendBuffer -> kernel buffer
>
> It's obviously hard to avoid 1) and 4) in the common case, but the
> number of other copies seem pretty clearly excessive.

I too have seen recent benchmarking data where this was a big problem.
Basically, you need a workload where the server doesn't have much or
any actual query processing to do, but is just returning a lot of
stuff to a really fast client - e.g. a locally connected client.
That's not necessarily the most common case but, if you have it, all
this extra copying is really pretty expensive.

My first thought was to wonder about changing all of our send/output
functions to write into a buffer passed as an argument rather than
returning something which we then have to copy into a different
buffer, but that would be a somewhat painful change, so it is probably
better to first pursue the idea of getting rid of some of the other
copies that happen in more centralized places (e.g. printtup). I
wonder if we could replace the whole
pq_beginmessage...()/pq_send....()/pq_endmessage...() system with
something a bit better-designed. For instance, suppose we get rid of
the idea that the caller supplies the buffer, and we move the
responsibility for error recovery into the pqcomm layer. So you do
something like:

my_message = xyz_beginmessage('D');
xyz_sendint32(my_message, 42);
xyz_endmessage(my_message);

Maybe what happens here under the hood is we keep a pool of free
message buffers sitting around, and you just grab one and put your
data into it. When you end the message we add it to a list of used
message buffers that are waiting to be sent, and once we send the data
it goes back on the free list. If an error occurs after
xyz_beginmessage() and before xyz_endmessage(), we put the buffer back
on the free list. That would allow us to merge (2) and (3) into a
single copy. To go further, we could allow send/output functions to
opt in to receiving a message buffer rather than returning a value,
and then we could get rid of (1) for types that participate. (4) seems
unavoidable AFAIK.
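
In other words, something like this under the hood (all names invented):

    typedef struct MessageBuffer
    {
        struct MessageBuffer *next;    /* free-list link */
        StringInfoData buf;
    } MessageBuffer;

    static MessageBuffer *free_message_buffers = NULL;

    MessageBuffer *
    xyz_beginmessage(char msgtype)
    {
        MessageBuffer *mb = free_message_buffers;

        if (mb != NULL)
            free_message_buffers = mb->next;
        else
            mb = create_message_buffer();    /* made-up allocator */
        resetStringInfo(&mb->buf);
        appendStringInfoCharMacro(&mb->buf, msgtype);
        return mb;
    }

    void
    xyz_endmessage(MessageBuffer *mb)
    {
        /* queue for sending; goes back on the free list once sent */
        enqueue_outgoing_message(mb);    /* made-up */
    }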

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why is pq_begintypsend so slow?

From: Andres Freund
Date:
Hi,

On 2020-06-03 11:30:42 -0400, Robert Haas wrote:
> I too have seen recent benchmarking data where this was a big problem.
> Basically, you need a workload where the server doesn't have much or
> any actual query processing to do, but is just returning a lot of
> stuff to a really fast client - e.g. a locally connected client.
> That's not necessarily the most common case but, if you have it, all
> this extra copying is really pretty expensive.

Even when the query actually is doing something, it's still quite
possible for the memcpies to be measurable (say > 10% of
cycles). Obviously not in a huge aggregating query. Even in something
like pgbench -M prepared -S, which is obviously spending most of its
cycles elsewhere, the patches upthread improve throughput by ~1.5% (and
that's without eliding several unnecessary copies).


> My first thought was to wonder about changing all of our send/output
> functions to write into a buffer passed as an argument rather than
> returning something which we then have to copy into a different
> buffer, but that would be a somewhat painful change, so it is probably
> better to first pursue the idea of getting rid of some of the other
> copies that happen in more centralized places (e.g. printtup).

For those I think the allocator overhead is the bigger issue than the
memcpy itself. I wonder how much we could transparently hide in
pq_begintypsend()/pq_endtypsend().


> I
> wonder if we could replace the whole
> pq_beginmessage...()/pq_send....()/pq_endmessage...() system with
> something a bit better-designed. For instance, suppose we get rid of
> the idea that the caller supplies the buffer, and we move the
> responsibility for error recovery into the pqcomm layer. So you do
> something like:
> 
> my_message = xyz_beginmessage('D');
> xyz_sendint32(my_message, 42);
> xyz_endmessage(my_message);
> 
> Maybe what happens here under the hood is we keep a pool of free
> message buffers sitting around, and you just grab one and put your
> data into it.

Why do we need multiple buffers? ISTM we don't want to just send
messages at endmsg() time, because that implies unnecessary syscall
overhead. Nor do we want to imply the overhead of the copy from the
message buffer to the network buffer.

To me that seems to imply that the best approach would be to have
PqSendBuffer be something stringbuffer like, and have pg_beginmessage()
record the starting position of the current message somewhere
(->cursor?). When an error is thrown, we reset the position to be where
the in-progress message would have begun.
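
I.e. roughly (sketch):

    static StringInfoData PqSendBuffer;
    static int  PqCurrentMsgStart = -1;    /* -1: no message in progress */

    static void
    pq_beginmessage_direct(char msgtype)
    {
        Assert(PqCurrentMsgStart == -1);
        PqCurrentMsgStart = PqSendBuffer.len;
        appendStringInfoCharMacro(&PqSendBuffer, msgtype);
        /* ... length placeholder etc. ... */
    }

    static void
    pq_discard_partial_message(void)
    {
        /* error path: reset to where the in-progress message began */
        if (PqCurrentMsgStart >= 0)
            PqSendBuffer.len = PqCurrentMsgStart;
        PqCurrentMsgStart = -1;
    }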

I've previously outlined a slightly more complicated scheme, where we
have "proxy" stringinfos that point into another stringinfo, instead of
their own buffer. And know how to resize the "outer" buffer when
needed. That'd have some advantages, but I'm not sure it's really
needed.


There's some disadvantages with what I describe above, in particular
when dealing with send() sending only parts of our network buffer. We
couldn't cheaply reuse the already sent memory in that case.

I've before wondered / suggested that we should have StringInfos not
insist on having one consecutive buffer (which obviously implies needing
to copy contents when growing). Instead it should have a list of buffers
containing chunks of the data, and never copy contents around while the
string is being built. We'd only allocate a buffer big enough for all
data when the caller actually wants to have all the resulting data in
one string (rather than using an API that can iterate over chunks).

For the network buffer case that'd allow us to reuse the earlier buffers
even in the "partial send" case. And more generally it'd allow us to be
less wasteful with buffer sizes, and perhaps even have a small "inline"
buffer inside StringInfoData avoiding unnecessary memory allocations in
a lot of cases where the common case is only a small amount of data
being used. And I think the overhead while appending data to such a
stringinfo should be neglegible, because it'd just require the exact
same checks we already have to do for enlargeStringInfo().
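
The data structure might look roughly like (sketch):

    typedef struct StringInfoChunk
    {
        struct StringInfoChunk *next;
        int         len;
        int         maxlen;
        char        data[FLEXIBLE_ARRAY_MEMBER];
    } StringInfoChunk;

    typedef struct ChunkedStringInfo
    {
        StringInfoChunk *head;    /* list of chunks, never reallocated */
        StringInfoChunk *tail;    /* chunk currently being appended to */
        int         total_len;
        /* plus, possibly, a small inline first chunk so that short
         * strings need no allocation at all */
    } ChunkedStringInfo;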


> (4) seems unavoidable AFAIK.

Not entirely. Linux can do zero-copy sends, but it does require somewhat
complicated black magic rituals. Including more complex buffer
management for the application, because the memory containing the
to-be-sent data cannot be reused until the kernel notifies that it's
done with the buffer.

See https://www.kernel.org/doc/html/latest/networking/msg_zerocopy.html
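
The basic shape is (sketch; completion handling omitted, see the doc):

    int         one = 1;

    setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
    send(fd, buf, len, MSG_ZEROCOPY);

    /*
     * buf must stay untouched until the kernel posts a completion for
     * this send on the socket's error queue (recvmsg with MSG_ERRQUEUE).
     */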

That might be something worth pursuing in the future (since it, I think,
basically avoids spending any cpu cycles on copying data around in the
happy path, relying on DMA instead), but I think for now there's much
bigger fish to fry.

I am hoping that somebody will write a nicer abstraction for zero-copy
sends using io_uring. avoiding the need of a separate completion queue,
by simply only signalling completion for the sendmsg operation once the
buffer isn't needed anymore. There's no corresponding completion logic
for normal sendmsg() calls, so it makes sense that something had to be
invented before something like io_uring existed.

Greetings,

Andres Freund



Re: Why is pq_begintypsend so slow?

From: Robert Haas
Date:
On Wed, Jun 3, 2020 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:
> Why do we need multiple buffers? ISTM we don't want to just send
> messages at endmsg() time, because that implies unnecessary syscall
> overhead. Nor do we want to imply the overhead of the copy from the
> message buffer to the network buffer.

It would only matter if there are multiple messages being constructed
at the same time, and that's probably not common, but maybe there's
some way it can happen. It doesn't seem like it really costs anything
to allow for it, and it might be useful sometime. For instance,
consider your idea of using Linux black magic to do zero-copy sends.
Now you either need multiple buffers, or you need one big buffer that
you can recycle a bit at a time.

> To me that seems to imply that the best approach would be to have
> PqSendBuffer be something stringbuffer like, and have pg_beginmessage()
> record the starting position of the current message somewhere
> (->cursor?). When an error is thrown, we reset the position to be where
> the in-progress message would have begun.

Yeah, I thought about that, but then how do you detect the case where
two different people try to undertake message construction at the same
time?

Like, with the idea I was proposing, you could still decide to limit
yourself to 1 buffer at the same time, and just elog() if someone
tries to allocate a second buffer when you've already reached the
maximum number of allocated buffers (i.e. one). But if you just have
one buffer in a global variable and everybody writes into it, you
might not notice if some unrelated code writes data into that buffer
in the middle of someone else's message construction. Doing it the way
I proposed, writing data requires passing a buffer pointer, so you can
be sure that somebody had to get the buffer from somewhere... and any
rules you want to enforce can be enforced at that point.

> I've before wondered / suggested that we should have StringInfos not
> insist on having one consecutive buffer (which obviously implies needing
> to copy contents when growing). Instead it should have a list of buffers
> containing chunks of the data, and never copy contents around while the
> string is being built. We'd only allocate a buffer big enough for all
> data when the caller actually wants to have all the resulting data in
> one string (rather than using an API that can iterate over chunks).

It's a thought. I doubt it's worth it for small amounts of data, but
for large amounts it might be. On the other hand, a better idea still
might be to size the buffer correctly from the start...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why is pq_begintypsend so slow?

From: Andres Freund
Date:
Hi,

On 2020-06-09 11:46:09 -0400, Robert Haas wrote:
> On Wed, Jun 3, 2020 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:
> > Why do we need multiple buffers? ISTM we don't want to just send
> > messages at endmsg() time, because that implies unnecessary syscall
> > overhead. Nor do we want to imply the overhead of the copy from the
> > message buffer to the network buffer.
> 
> It would only matter if there are multiple messages being constructed
> at the same time, and that's probably not common, but maybe there's
> some way it can happen.

ISTM that it'd be pretty broken if it could happen. We cannot have two
different parts of the system send messages to the client
independently. The protocol is pretty stateful...


> > To me that seems to imply that the best approach would be to have
> > PqSendBuffer be something stringbuffer like, and have pg_beginmessage()
> > record the starting position of the current message somewhere
> > (->cursor?). When an error is thrown, we reset the position to be where
> > the in-progress message would have begun.
> 
> Yeah, I thought about that, but then how you detect the case where two
> different people try to undertake message construction at the same
> time?

Set a boolean and assert out if one already is in progress? We'd need
some state to know where to reset the position to on error anyway.


> Like, with the idea I was proposing, you could still decide to limit
> yourself to 1 buffer at the same time, and just elog() if someone
> tries to allocate a second buffer when you've already reached the
> maximum number of allocated buffers (i.e. one). But if you just have
> one buffer in a global variable and everybody writes into it, you
> might not notice if some unrelated code writes data into that buffer
> in the middle of someone else's message construction. Doing it the way
> I proposed, writing data requires passing a buffer pointer, so you can
> be sure that somebody had to get the buffer from somewhere... and any
> rules you want to enforce can be enforced at that point.

I'd hope that we'd encapsulate the buffer management into file local
variables in pqcomm.c or such, and that code outside of that cannot
access the out buffer directly without using the appropriate helpers.


> > I've before wondered / suggested that we should have StringInfos not
> > insist on having one consecutive buffer (which obviously implies needing
> > to copy contents when growing). Instead it should have a list of buffers
> > containing chunks of the data, and never copy contents around while the
> > string is being built. We'd only allocate a buffer big enough for all
> > data when the caller actually wants to have all the resulting data in
> > one string (rather than using an API that can iterate over chunks).
> 
> It's a thought. I doubt it's worth it for small amounts of data, but
> for large amounts it might be. On the other hand, a better idea still
> might be to size the buffer correctly from the start...

I think those are complementary. I do agree that it's useful to size
stringinfos more appropriately immediately (there's an upthread patch
adding a version of initStringInfo() that does so, quite useful for
small stringinfos in particular). But there's enough cases where that's
not really knowable ahead of time that I think it'd be quite useful to
have support for the type of buffer I describe above.

Greetings,

Andres Freund



Re: Why is pq_begintypsend so slow?

From: Robert Haas
Date:
On Tue, Jun 9, 2020 at 3:23 PM Andres Freund <andres@anarazel.de> wrote:
> ISTM that it'd be pretty broken if it could happen. We cannot have two
> different parts of the system send messages to the client
> independently. The protocol is pretty stateful...

There's a difference between building messages concurrently and
sending them concurrently.

> Set a boolean and assert out if one already is in progress? We'd need
> some state to know where to reset the position to on error anyway.

Sure, that's basically just different notation for the same thing. I
might prefer my notation over yours, but you might prefer the reverse.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why is pq_begintypsend so slow?

From: Robert Haas
Date:
On Tue, Jun 2, 2020 at 9:56 PM Andres Freund <andres@anarazel.de> wrote:
> I don't know what the best non-gross solution for the overhead of the
> out/send functions is. There seems to be at least the following
> major options (and a lots of variants thereof):
>
> 1) Just continue to incur significant overhead for every datum
> 2) Accept the uglyness of passing in a buffer via
>    FunctionCallInfo->context. Change nearly all in-core send functions
>    over to that.
> 3) Pass string buffer through an new INTERNAL argument to send/output
>    function, allow both old/new style send functions. Use a wrapper
>    function to adapt the "old style" to the "new style".
> 4) Like 3, but make the new argument optional, and use ad-hoc
>    stringbuffer if not provided. I don't like the unnecessary branches
>    this adds.

I ran into this problem in another context today while poking at some
pg_basebackup stuff. There's another way of solving this problem which
I think we should consider: just get rid of the per-row stringinfo and
push the bytes directly from wherever they are into PqSendBuffer. Once
we start doing this, we can't error out, because internal_flush()
might've been called, sending a partial message to the client. Trying
to now switch to sending an ErrorResponse will break protocol sync.
But it seems possible to avoid that. Just call all of the output
functions first, and also do any required encoding conversions
(pq_sendcountedtext -> pg_server_to_client). Then, do a bunch of
pq_putbytes() calls to shove the message out -- there's the small
matter of an assertion failure, but whatever. This effectively
collapses two copies into one. Or alternatively, build up an array of
iovecs and then have a variant of pq_putmessage(), like
pq_putmessage_iovec(), that knows what to do with them.
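For instance (pq_putmessage_iovec() being hypothetical, with struct
iovec from <sys/uio.h>):

    struct iovec iov[2];

    iov[0].iov_base = datum_bytes;    /* bytes left where the output
                                       * function put them */
    iov[0].iov_len = datum_len;
    iov[1].iov_base = more_bytes;
    iov[1].iov_len = more_len;
    pq_putmessage_iovec('D', iov, 2);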

One advantage of this approach is that it approximately doubles the
size of the DataRow message we can send. We're currently limited to
<1GB because of palloc, but the wire protocol just needs it to be <2GB
so that a signed integer does not overflow. It would be nice to buy
more than a factor of two here, but that would require a wire protocol
change, and 2x is not bad.

Another advantage of this approach is that it doesn't require passing
StringInfos all over the place. For the use case that I was looking
at, that appears awkward. I'm not saying I couldn't make it work, but
it wouldn't be my first choice. Right now, I've got data bubbling down
a chain of handlers until it eventually gets sent off to the client;
with your approach, I think I'd need to bubble buffers up and then
bubble data down, which seems quite a bit more complex.

A disadvantage of this approach is that we still end up doing three
copies: one from the datum to the per-datum StringInfo, a second into
PqSendBuffer, and a third from there to the kernel. However, we could
probably improve on this. Whenever we internal_flush(), consider
whether the chunk of data we're in the process of copying (the current
call to pq_putbytes(), or the current iovec) has enough bytes
remaining to completely refill the buffer. If so, secure_write() a
buffer's worth of bytes (or more) directly, bypassing PqSendBuffer.
That way, we avoid individual system calls (or calls to OpenSSL or
GSS) for small numbers of bytes, but we also avoid extra copying when
transmitting larger amounts of data.

Even with that optimization, this still seems like it could end up
being less efficient than your proposal (surprise, surprise). If we've
got a preallocated buffer which we won't be forced to resize during
message construction -- and for DataRow messages we can get there just
by keeping the buffer around, so that we only need to reallocate when
we see a larger message than we've ever seen before -- and we write
all the data directly into that buffer and then send it from there
straight to the kernel, we only ever do 2 copies, whereas what I'm
proposing sometimes does 3 copies and sometimes only 2.

While I admit that's not great, it seems likely to still be a
significant win over what we have now, and it's a *lot* less invasive
than your proposal. Not only does your approach require changing all
of the type-output and type-sending functions inside and outside core
to use this new model, admittedly with the possibility of backward
compatibility, but it also means that we could need similarly invasive
changes in any other place that wants to use this new style of message
construction. You can't write any data anywhere that you might want to
later incorporate into a protocol message unless you write it into a
StringInfo; and not only that, but you have to be able to get the
right amount of data into the right place in the StringInfo right from
the start. I think that in some cases that will require fairly complex
orchestration.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why is pq_begintypsend so slow?

From: Robert Haas
Date:
Here is some review for the first few patches in this series.

I am generally in favor of applying 0001-0003 no matter what else we
decide to do here. However, as might be expected, I have a few
questions and comments.

Regarding 0001:

I dislike the name initStringInfoEx() because we do not use the "Ex"
convention for function names anywhere in the code base. We do
sometimes use "extended", so this could be initStringInfoExtended(),
perhaps, or something more specific, like initStringInfoWithLength().

Regarding the FIXME in that function, I suggest that this should be
the caller's job. Otherwise, there will probably be some caller which
doesn't want the add-one behavior, and then said caller will subtract
one to compensate, and that will be silly.

I am not familiar with pg_restrict and don't entirely understand the
motivation for it here. I suspect you have done some experiments and
figured out that it produces better code, but it would be interesting
to hear more about how you got there. Perhaps there could even be some
brief comments about it. Locutions like this are particularly
confusing to me:

+static inline void
+resetStringInfo(StringInfoData *pg_restrict str)
+{
+       *(char *pg_restrict) (str->data) = '\0';
+       str->len = 0;
+       str->cursor = 0;
+}

I don't understand how this can be saving anything. I think the
restrict definitions here mean that str->data does not overlap with
str itself, but considering that these are unconditional stores, so
what? If the code were doing something like memset(str->data, 0,
str->len) then I'd get it: it might be useful to know for optimization
purposes that the memset isn't overwriting str->len. But what code can
we produce for this that wouldn't be valid if str->data = &str? I
assume this is my lack of understanding rather than an actual problem
with the patch, but I would be grateful if you could explain.

It is easier to see why the pg_restrict stuff you've introduced into
appendBinaryStringInfoNT is potentially helpful: e.g. in
appendBinaryStringInfoNT, it promises that memcpy can't clobber
str->len, so the compiler is free to reorder without changing the
results. Or so I imagine. But then the one in appendBinaryStringInfo()
confuses me again: if str->data is already declared as a restricted
pointer, then why do we need to cast str->data + str->len to be
restricted also?

In appendStringInfoChar, why do we need to cast to restrict twice? Can
we not just do something like this:

char *pg_restrict ep = str->data+str->len;
ep[0] = ch;
ep[1] = '\0';

Regarding 0002:

Totally mechanical, seems fine.

Regarding 0003:

For the same reasons as above, I suggest renaming pq_begintypsend_ex()
to pq_begintypsend_extended() or pq_begintypsend_with_length() or
something of that sort, rather than using _ex.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why is pq_begintypsend so slow?

From
Andres Freund
Date:
Hi,

On 2020-07-31 11:14:46 -0400, Robert Haas wrote:
> Here is some review for the first few patches in this series.

Thanks!


> I am generally in favor of applying 0001-0003 no matter what else we
> decide to do here. However, as might be expected, I have a few
> questions and comments.
> 
> Regarding 0001:
> 
> I dislike the name initStringInfoEx() because we do not use the "Ex"
> convention for function names anywhere in the code base. We do
> sometimes use "extended", so this could be initStringInfoExtended(),
> perhaps, or something more specific, like initStringInfoWithLength().

I dislike the length of the function name, but ...


> Regarding the FIXME in that function, I suggest that this should be
> the caller's job. Otherwise, there will probably be some caller which
> doesn't want the add-one behavior, and then said caller will subtract
> one to compensate, and that will be silly.

Fair point.


> I am not familiar with pg_restrict and don't entirely understand the
> motivation for it here. I suspect you have done some experiments and
> figured out that it produces better code, but it would be interesting
> to hear more about how you got there. Perhaps there could even be some
> brief comments about it. Locutions like this are particularly
> confusing to me:
> 
> +static inline void
> +resetStringInfo(StringInfoData *pg_restrict str)
> +{
> +       *(char *pg_restrict) (str->data) = '\0';
> +       str->len = 0;
> +       str->cursor = 0;
> +}

The restrict tells the compiler that 'str' and 'str->data' won't be
pointing to the same memory, which can simplify the code it
generates.  E.g. it'll allow the compiler to keep str->data in a
register, instead of reloading it for the next
appendStringInfo*. Without the restrict it can't, because the '\0'
store through str->data could otherwise overwrite parts of ->data
itself, if ->data pointed into the StringInfo. Similarly, in other
cases the store to str->len could overwrite data reachable through
str->data.

Part of the reason we need to add the markers is that we compile with
-fno-strict-aliasing. But even if we didn't, this case wouldn't be
solved without an explicit marker, because char * is allowed to alias
anything...

Besides keeping ->data in a register, the restrict can also let the
compiler entirely elide the null byte write in some cases, e.g. because
the resetStringInfo() is followed by an appendStringInfoString(, "constant").
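
As a stripped-down standalone illustration (not the actual stringinfo
code, and with 'restrict' spelled directly rather than via pg_restrict):

#include <string.h>

typedef struct
{
    char   *data;
    int     len;
    int     maxlen;
} Str;

static inline void
reset(Str *restrict str)
{
    /*
     * Without the restrict qualifiers, this store could legally modify
     * str->data or str->len themselves, forcing the compiler to reload
     * them afterwards.
     */
    *(char *restrict) str->data = '\0';
    str->len = 0;
}

static inline void
append_literal(Str *restrict str, const char *restrict s, int n)
{
    memcpy(str->data + str->len, s, n + 1); /* includes the trailing '\0' */
    str->len += n;
}

void
demo(Str *restrict str)
{
    reset(str);

    /*
     * Once both calls are inlined, the compiler may keep str->data and
     * str->len in registers across them, and may drop reset()'s '\0'
     * store entirely, since append_literal() overwrites that byte.
     */
    append_literal(str, "constant", 8);
}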


> I don't understand how this can be saving anything. I think the
> restrict definitions here mean that str->data does not overlap with
> str itself, but considering that these are unconditional stores, so
> what? If the code were doing something like memset(str->data, 0,
> str->len) then I'd get it: it might be useful to know for optimization
> purposes that the memset isn't overwriting str->len. But what code can
> we produce for this that wouldn't be valid if str->data = &str? I
> assume this is my lack of understanding rather than an actual problem
> with the patch, but I would be grateful if you could explain.

I hope the above makes this make sense now? It's about subsequent uses of
the StringInfo, rather than the body of resetStringInfo itself.


> It is easier to see why the pg_restrict stuff you've introduced into
> appendBinaryStringInfoNT is potentially helpful: e.g. in
> appendBinaryStringInfoNT, it promises that memcpy can't clobber
> str->len, so the compiler is free to reorder without changing the
> results. Or so I imagine. But then the one in appendBinaryStringInfo()
> confuses me again: if str->data is already declared as a restricted
> pointer, then why do we need to cast str->data + str->len to be
> restricted also?

But str->data isn't declared restricted without the explicit use of
restrict? str is restrict'ed, but it doesn't apply "recursively" to all
pointers contained therein.


> In appendStringInfoChar, why do we need to cast to restrict twice? Can
> we not just do something like this:
> 
> char *pg_restrict ep = str->data+str->len;
> ep[0] = ch;
> ep[1] = '\0';

I don't think that'd tell the compiler that this couldn't overlap with
str itself? A single 'restrict' can never (?) help, you need *two*
things that are marked as not overlapping in any way.


Greetings,

Andres Freund



Re: Why is pq_begintypsend so slow?

From
Robert Haas
Date:
On Fri, Jul 31, 2020 at 11:50 AM Andres Freund <andres@anarazel.de> wrote:
> I hope the above makes this make sense now? It's about subsequent uses of
> the StringInfo, rather than the body of resetStringInfo itself.

That does make sense, except that
https://en.cppreference.com/w/c/language/restrict says "During each
execution of a block in which a restricted pointer P is declared
(typically each execution of a function body in which P is a function
parameter), if some object that is accessible through P (directly or
indirectly) is modified, by any means, then all accesses to that
object (both reads and writes) in that block must occur through P
(directly or indirectly), otherwise the behavior is undefined." So my
interpretation of this was that it couldn't really affect what
happened outside of the function itself, even if the compiler chose to
perform inlining. But I think I see what you're saying: the *inference*
is only valid with respect to restrict pointers in a particular
function, but what can be optimized as a result of that inference may
be something further afield, if inlining is performed. Perhaps we
could add a comment about this, e.g.

Marking these pointers with pg_restrict tells the compiler that str
and str->data can't overlap, which may allow the compiler to optimize
better when this code is inlined. For example, it may be possible to
keep str->data in a register across consecutive appendStringInfoString
operations.

Since pg_restrict is not widely used, I think it's worth adding this
kind of annotation, lest other hackers get confused. I'm probably not
the only one who isn't on top of this.

> > In appendStringInfoChar, why do we need to cast to restrict twice? Can
> > we not just do something like this:
> >
> > char *pg_restrict ep = str->data+str->len;
> > ep[0] = ch;
> > ep[1] = '\0';
>
> I don't think that'd tell the compiler that this couldn't overlap with
> str itself? A single 'restrict' can never (?) help, you need *two*
> things that are marked as not overlapping in any way.

But what's the difference between:

+       *(char *pg_restrict) (str->data + str->len) = ch;
+       str->len++;
+       *(char *pg_restrict) (str->data + str->len) = '\0';

And:

char *pg_restrict ep = str->data+str->len;
ep[0] = ch;
ep[1] = '\0';
++str->len;

Whether or not str itself is marked restricted is another question;
what I'm talking about is why we need to repeat (char *pg_restrict)
(str->data + str->len).

I don't have any further comment on the remainder of your reply.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why is pq_begintypsend so slow?

From
Andres Freund
Date:
Hi,

On 2020-07-31 12:28:04 -0400, Robert Haas wrote:
> On Fri, Jul 31, 2020 at 11:50 AM Andres Freund <andres@anarazel.de> wrote:
> > I hope the above makes this make sense now? It's about subsequent uses of
> > the StringInfo, rather than the body of resetStringInfo itself.
>
> That does make sense, except that
> https://en.cppreference.com/w/c/language/restrict says "During each
> execution of a block in which a restricted pointer P is declared
> (typically each execution of a function body in which P is a function
> parameter), if some object that is accessible through P (directly or
> indirectly) is modified, by any means, then all accesses to that
> object (both reads and writes) in that block must occur through P
> (directly or indirectly), otherwise the behavior is undefined." So my
> interpretation of this was that it couldn't really affect what
> happened outside of the function itself, even if the compiler chose to
> perform inlining. But I think I see what you're saying: the *inference*
> is only valid with respect to restrict pointers in a particular
> function, but what can be optimized as a result of that inference may
> be something further afield, if inlining is performed.

Right. There's two aspects:

1) By looking at the function, with the restrict, the compiler can infer
   more about the behaviour of the function. E.g. knowing that -> len
   has a specific value, or that ->data[n] does. That information then
   can be used together with subsequent operations, e.g. avoiding a
   re-read of ->len. That could work in some cases even if subsequent
   operations were *not* marked up with restrict.

2) The restrict signals to the compiler that we guarantee (i.e. it would
   be undefined behaviour if not) that the pointers do not
   overlap. Which means it can assume that in some of the calling code
   as well, if it can analyze that ->data isn't changed, for example.


> Perhaps we could add a comment about this, e.g.
> Marking these pointers with pg_restrict tells the compiler that str
> and str->data can't overlap, which may allow the compiler to optimize
> better when this code is inlined. For example, it may be possible to
> keep str->data in a register across consecutive appendStringInfoString
> operations.
>
> Since pg_restrict is not widely used, I think it's worth adding this
> kind of annotation, lest other hackers get confused. I'm probably not
> the only one who isn't on top of this.

Would it make more sense to have a bit of an explanation at
pg_restrict's definition, instead of having it at (eventually) multiple
places?


> > > In appendStringInfoChar, why do we need to cast to restrict twice? Can
> > > we not just do something like this:
> > >
> > > char *pg_restrict ep = str->data+str->len;
> > > ep[0] = ch;
> > > ep[1] = '\0';
> >
> > I don't think that'd tell the compiler that this couldn't overlap with
> > str itself? A single 'restrict' can never (?) help, you need *two*
> > things that are marked as not overlapping in any way.
>
> But what's the difference between:
>
> +       *(char *pg_restrict) (str->data + str->len) = ch;
> +       str->len++;
> +       *(char *pg_restrict) (str->data + str->len) = '\0';
>
> And:
>
> char *pg_restrict ep = str->data+str->len;
> ep[0] = ch;
> ep[1] = '\0';
> ++str->len;
>
> Whether or not str itself is marked restricted is another question;
> what I'm talking about is why we need to repeat (char *pg_restrict)
> (str->data + str->len).

Ah, I misunderstood. Yea, there's no reason not to do that.

Greetings,

Andres Freund



Re: Why is pq_begintypsend so slow?

From
Robert Haas
Date:
On Fri, Jul 31, 2020 at 1:00 PM Andres Freund <andres@anarazel.de> wrote:
> > Perhaps we could add a comment about this, e.g.
> > Marking these pointers with pg_restrict tells the compiler that str
> > and str->data can't overlap, which may allow the compiler to optimize
> > better when this code is inlined. For example, it may be possible to
> > keep str->data in a register across consecutive appendStringInfoString
> > operations.
> >
> > Since pg_restrict is not widely used, I think it's worth adding this
> > kind of annotation, lest other hackers get confused. I'm probably not
> > the only one who isn't on top of this.
>
> Would it make more sense to have a bit of an explanation at
> pg_restrict's definition, instead of having it at (eventually) multiple
> places?

I think, at least for the first few, it might be better to have a more
specific explanation at the point of use, as it may be easier to
understand in specific cases than in general. I imagine this only
really makes sense for places that are pretty hot.

> Ah, I misunderstood. Yea, there's no reason not to do that.

OK, then I vote for that version, as I think it looks nicer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why is pq_begintypsend so slow?

From
Andres Freund
Date:
Hi,

I was reminded of the patchset I had posted in this thread by
https://postgr.es/m/679d5455cbbb0af667ccb753da51a475bae1eaed.camel%40cybertec.at

On 2020-07-31 13:35:43 -0400, Robert Haas wrote:
> On Fri, Jul 31, 2020 at 1:00 PM Andres Freund <andres@anarazel.de> wrote:
> > > Perhaps we could add a comment about this, e.g.
> > > Marking these pointers with pg_restrict tells the compiler that str
> > > and str->data can't overlap, which may allow the compiler to optimize
> > > better when this code is inlined. For example, it may be possible to
> > > keep str->data in a register across consecutive appendStringInfoString
> > > operations.
> > >
> > > Since pg_restrict is not widely used, I think it's worth adding this
> > > kind of annotation, lest other hackers get confused. I'm probably not
> > > the only one who isn't on top of this.
> >
> > Would it make more sense to have a bit of an explanation at
> > pg_restrict's definition, instead of having it at (eventually) multiple
> > places?
> 
> I think, at least for the first few, it might be better to have a more
> specific explanation at the point of use, as it may be easier to
> understand in specific cases than in general. I imagine this only
> really makes sense for places that are pretty hot.

Whenever I looked at adding these comments, it felt wrong. We end up with
repetitive boilerplate stuff as quite a few functions use it. I've thus not
addressed this aspect in the attached rebased version.  Perhaps a compromise
would be to add such a comment to the top of stringinfo.h?


> > Ah, I misunderstood. Yea, there's no reason not to do that.
> 
> OK, then I vote for that version, as I think it looks nicer.

Done.

Greetings,

Andres Freund

Attachments

Re: Why is pq_begintypsend so slow?

From
Sutou Kouhei
Date:
Hi,

In <20240218015955.rmw5mcmobt5hbene@awork3.anarazel.de>
  "Re: Why is pq_begintypsend so slow?" on Sat, 17 Feb 2024 17:59:55 -0800,
  Andres Freund <andres@anarazel.de> wrote:

> v3-0008-wip-make-in-out-send-recv-calls-in-copy.c-cheaper.patch

It seems that this is an alternative approach to [1].

[1]
https://www.postgresql.org/message-id/flat/20240215.153421.96888103784986803.kou%40clear-code.com#34df359e6d255795d16814ce138cc995


+typedef struct CopyInAttributeInfo
+{
+    AttrNumber    num;
+    const char *name;
+
+    Oid            typioparam;
+    int32        typmod;
+
+    FmgrInfo    in_finfo;
+    union
+    {
+        FunctionCallInfoBaseData fcinfo;
+        char        fcinfo_data[SizeForFunctionCallInfo(3)];
+    }            in_fcinfo;

Do we need one FunctionCallInfoBaseData for each attribute?
How about sharing one FunctionCallInfoBaseData among all
attributes, like [1]?


@@ -956,20 +956,47 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
 
                 values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]);
             }
-
-            /*
-             * If ON_ERROR is specified with IGNORE, skip rows with soft
-             * errors
-             */
-            else if (!InputFunctionCallSafe(&in_functions[m],
-                                            string,
-                                            typioparams[m],
-                                            att->atttypmod,
-                                            (Node *) cstate->escontext,
-                                            &values[m]))

Inlining InputFunctionCallSafe() here to use a pre-initialized
fcinfo will decrease maintainability, because we abstract the
InputFunctionCall family in fmgr.c. If we define an
InputFunctionCall variant here, we need to change both fmgr.c and
this code whenever the InputFunctionCall family is changed.
How about defining the details in fmgr.c and calling them here
instead, like [1]?

+                fcinfo->args[0].value = CStringGetDatum(string);
+                fcinfo->args[0].isnull = false;
+                fcinfo->args[1].value = ObjectIdGetDatum(attr->typioparam);
+                fcinfo->args[1].isnull = false;
+                fcinfo->args[2].value = Int32GetDatum(attr->typmod);
+                fcinfo->args[2].isnull = false;

I think that "fcinfo->isnull = false;" is also needed like
[1].

@@ -1966,7 +1992,7 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo,
     if (fld_size == -1)
     {
         *isnull = true;
-        return ReceiveFunctionCall(flinfo, NULL, typioparam, typmod);
+        return ReceiveFunctionCall(fcinfo->flinfo, NULL, attr->typioparam, attr->typmod);

Why pre-initialized fcinfo isn't used here?


Thanks,
-- 
kou



Re: Why is pq_begintypsend so slow?

From
Andres Freund
Date:
Hi,

On 2024-02-18 17:38:09 +0900, Sutou Kouhei wrote:
> In <20240218015955.rmw5mcmobt5hbene@awork3.anarazel.de>
>   "Re: Why is pq_begintypsend so slow?" on Sat, 17 Feb 2024 17:59:55 -0800,
>   Andres Freund <andres@anarazel.de> wrote:
> 
> > v3-0008-wip-make-in-out-send-recv-calls-in-copy.c-cheaper.patch
> 
> It seems that this is an alternative approach of [1].

Note that what I posted was a very lightly polished rebase of a ~4 year old
patchset.

> [1]
https://www.postgresql.org/message-id/flat/20240215.153421.96888103784986803.kou%40clear-code.com#34df359e6d255795d16814ce138cc995
> 
> 
> +typedef struct CopyInAttributeInfo
> +{
> +    AttrNumber    num;
> +    const char *name;
> +
> +    Oid            typioparam;
> +    int32        typmod;
> +
> +    FmgrInfo    in_finfo;
> +    union
> +    {
> +        FunctionCallInfoBaseData fcinfo;
> +        char        fcinfo_data[SizeForFunctionCallInfo(3)];
> +    }            in_fcinfo;
> 
> Do we need one FunctionCallInfoBaseData for each attribute?
> How about sharing one FunctionCallInfoBaseData among all
> attributes, like [1]?

That makes no sense to me. You're throwing away most of the possible gains by
having to update the FunctionCallInfo fields on every call. You're saving
negligible amounts of memory at a substantial runtime cost.
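
The intended pattern is to pay the setup cost once per attribute and
touch only the argument slots per row. A sketch, assuming the
CopyInAttributeInfo struct quoted above (soft-error handling elided):

/* Once per attribute, during COPY startup: */
FunctionCallInfo fcinfo = &attr->in_fcinfo.fcinfo;

InitFunctionCallInfoData(*fcinfo, &attr->in_finfo, 3,
                         InvalidOid, (Node *) cstate->escontext, NULL);

/*
 * Per row, flinfo and context stay untouched; only the argument slots
 * are written before invoking the input function:
 */
fcinfo->args[0].value = CStringGetDatum(string);
fcinfo->args[0].isnull = false;
fcinfo->args[1].value = ObjectIdGetDatum(attr->typioparam);
fcinfo->args[1].isnull = false;
fcinfo->args[2].value = Int32GetDatum(attr->typmod);
fcinfo->args[2].isnull = false;
values[m] = FunctionCallInvoke(fcinfo);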


> @@ -956,20 +956,47 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
>  
>                  values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]);
>              }
> -
> -            /*
> -             * If ON_ERROR is specified with IGNORE, skip rows with soft
> -             * errors
> -             */
> -            else if (!InputFunctionCallSafe(&in_functions[m],
> -                                            string,
> -                                            typioparams[m],
> -                                            att->atttypmod,
> -                                            (Node *) cstate->escontext,
> -                                            &values[m]))
> 
> Inlining InputFunctionCallSafe() here to use a pre-initialized
> fcinfo will decrease maintainability, because we abstract the
> InputFunctionCall family in fmgr.c. If we define an
> InputFunctionCall variant here, we need to change both fmgr.c and
> this code whenever the InputFunctionCall family is changed.
> How about defining the details in fmgr.c and calling them here
> instead, like [1]?

I'm not sure I buy that that is a problem. It's not like my approach was
actually bypassing fmgr abstractions altogether - instead it just used the
lower level APIs, because it's a performance sensitive area.


> @@ -1966,7 +1992,7 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo,
>      if (fld_size == -1)
>      {
>          *isnull = true;
> -        return ReceiveFunctionCall(flinfo, NULL, typioparam, typmod);
> +        return ReceiveFunctionCall(fcinfo->flinfo, NULL, attr->typioparam, attr->typmod);
> 
> Why pre-initialized fcinfo isn't used here?

Because it's a prototype and because I don't think it's a common path.

Greetings,

Andres Freund



Re: Why is pq_begintypsend so slow?

From
Michael Paquier
Date:
On Sun, Feb 18, 2024 at 12:09:06PM -0800, Andres Freund wrote:
> On 2024-02-18 17:38:09 +0900, Sutou Kouhei wrote:
>> @@ -1966,7 +1992,7 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo,
>>      if (fld_size == -1)
>>      {
>>          *isnull = true;
>> -        return ReceiveFunctionCall(flinfo, NULL, typioparam, typmod);
>> +        return ReceiveFunctionCall(fcinfo->flinfo, NULL, attr->typioparam, attr->typmod);
>>
>> Why pre-initialized fcinfo isn't used here?
>
> Because it's a prototype and because I don't think it's a common path.

0008 and 0010 (a bit) are the only patches of the set that touch some
of the areas that would be impacted by the refactoring to use
callbacks in the COPY code.  Still, I don't see anything that could
not be changed in what's updated here, the one-row callback in COPY
FROM being the most touched.  So I don't quite see why each effort
could not happen on its own?

Or Andres, do you think that any improvements you've been proposing in
this area should happen before we consider refactoring the COPY code
to plug in the callbacks?  I'm a bit confused by the situation, TBH.
--
Michael

Attachments

Re: Why is pq_begintypsend so slow?

From
Sutou Kouhei
Date:
Hi,

In <20240218200906.zvihkrs46yl2juzf@awork3.anarazel.de>
  "Re: Why is pq_begintypsend so slow?" on Sun, 18 Feb 2024 12:09:06 -0800,
  Andres Freund <andres@anarazel.de> wrote:

>> [1]
https://www.postgresql.org/message-id/flat/20240215.153421.96888103784986803.kou%40clear-code.com#34df359e6d255795d16814ce138cc995

>> Do we need one FunctionCallInfoBaseData for each attribute?
>> How about sharing one FunctionCallInfoBaseData among all
>> attributes, like [1]?
> 
> That makes no sense to me. You're throwing away most of the possible gains by
> having to update the FunctionCallInfo fields on every call. You're saving
> > negligible amounts of memory at a substantial runtime cost.

The number of updated fields is the same in your approach and in
[1]:

Your approach: 6 (I think that "fcinfo->isnull = false" is
needed though.)

+                fcinfo->args[0].value = CStringGetDatum(string);
+                fcinfo->args[0].isnull = false;
+                fcinfo->args[1].value = ObjectIdGetDatum(attr->typioparam);
+                fcinfo->args[1].isnull = false;
+                fcinfo->args[2].value = Int32GetDatum(attr->typmod);
+                fcinfo->args[2].isnull = false;

[1]: 6 (including "fcinfo->isnull = false")

+    fcinfo->flinfo = flinfo;
+    fcinfo->context = escontext;
+    fcinfo->isnull = false;
+    fcinfo->args[0].value = CStringGetDatum(str);
+    fcinfo->args[1].value = ObjectIdGetDatum(typioparam);
+    fcinfo->args[2].value = Int32GetDatum(typmod);


>> Inlining InputFunctionCallSafe() here to use a pre-initialized
>> fcinfo will decrease maintainability, because we abstract the
>> InputFunctionCall family in fmgr.c. If we define an
>> InputFunctionCall variant here, we need to change both fmgr.c and
>> this code whenever the InputFunctionCall family is changed.
>> How about defining the details in fmgr.c and calling them here
>> instead, like [1]?
> 
> I'm not sure I buy that that is a problem. It's not like my approach was
> actually bypassing fmgr abstractions altogether - instead it just used the
> lower level APIs, because it's a performance sensitive area.

[1] provides some optimized abstractions, which are
implemented with lower level APIs, without breaking the
abstractions.
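
For illustration, such an abstraction could have roughly this shape
(hypothetical names, not the actual API from [1]; assumes fmgr.h):

typedef struct PreparedInputCall
{
    FmgrInfo    flinfo;
    union
    {
        FunctionCallInfoBaseData fcinfo;
        char        fcinfo_data[SizeForFunctionCallInfo(3)];
    }           fc;
} PreparedInputCall;

/*
 * fmgr.c owns the details: bind the input function, typioparam and
 * typmod once per attribute...
 */
extern void PrepareInputFunctionCall(PreparedInputCall *pic, Oid infunc,
                                     Oid typioparam, int32 typmod,
                                     Node *escontext);

/*
 * ...then per row only the string varies, and callers never touch the
 * fcinfo internals directly.
 */
extern bool PreparedInputFunctionCallSafe(PreparedInputCall *pic,
                                          const char *str, Datum *result);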

Note that I don't have a strong opinion on how to implement
this optimization. If other developers think this approach
makes sense for this optimization, I don't object to it.

>> @@ -1966,7 +1992,7 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo,
>>      if (fld_size == -1)
>>      {
>>          *isnull = true;
>> -        return ReceiveFunctionCall(flinfo, NULL, typioparam, typmod);
>> +        return ReceiveFunctionCall(fcinfo->flinfo, NULL, attr->typioparam, attr->typmod);
>> 
>> Why pre-initialized fcinfo isn't used here?
> 
> Because it's a prototype and because I don't think it's a common path.

How about adding a comment explaining why we don't need to
optimize this case?


As I said above, I don't have a strong opinion on how to implement
this optimization. It seems that you like your approach, so I withdraw
[1]. Could you complete this optimization? Can we proceed with making
COPY format extensible without it? It seems that it'll take a little
time to complete this optimization because your patch is still WIP,
and it seems that you can work on it after making COPY format
extensible.


Thanks,
-- 
kou



Re: Why is pq_begintypsend so slow?

From
Andres Freund
Date:
Hi,

On 2024-02-19 10:02:52 +0900, Sutou Kouhei wrote:
> In <20240218200906.zvihkrs46yl2juzf@awork3.anarazel.de>
>   "Re: Why is pq_begintypsend so slow?" on Sun, 18 Feb 2024 12:09:06 -0800,
>   Andres Freund <andres@anarazel.de> wrote:
> 
> >> [1]
https://www.postgresql.org/message-id/flat/20240215.153421.96888103784986803.kou%40clear-code.com#34df359e6d255795d16814ce138cc995
> 
> >> Do we need one FunctionCallInfoBaseData for each attribute?
> >> How about sharing one FunctionCallInfoBaseData among all
> >> attributes, like [1]?
> > 
> > That makes no sense to me. You're throwing away most of the possible gains by
> > having to update the FunctionCallInfo fields on every call. You're saving
> > negligible amounts of memory at a substantial runtime cost.
> 
> The number of updated fields is the same in your approach and in
> [1]:
> 
> Your approach: 6 (I think that "fcinfo->isnull = false" is
> needed though.)
> 
> +                fcinfo->args[0].value = CStringGetDatum(string);
> +                fcinfo->args[0].isnull = false;
> +                fcinfo->args[1].value = ObjectIdGetDatum(attr->typioparam);
> +                fcinfo->args[1].isnull = false;
> +                fcinfo->args[2].value = Int32GetDatum(attr->typmod);
> +                fcinfo->args[2].isnull = false;
> 
> [1]: 6 (including "fcinfo->isnull = false")
> 
> +    fcinfo->flinfo = flinfo;
> +    fcinfo->context = escontext;
> +    fcinfo->isnull = false;
> +    fcinfo->args[0].value = CStringGetDatum(str);
> +    fcinfo->args[1].value = ObjectIdGetDatum(typioparam);
> +    fcinfo->args[2].value = Int32GetDatum(typmod);

If you want to do so, you can elide the isnull assignments in my
approach just as well as in yours. Assigning not just the value but
also flinfo and context is overhead.  But you can't elide assigning
flinfo and context, which is why reusing one FunctionCallInfo isn't
going to win.

I don't think you necessarily need to assign fcinfo->isnull on every
call; these functions aren't allowed to return NULL, IIRC. And if they
did, we'd error out, so it could only happen once.


> As I said above, I don't have a strong opinion on how to implement
> this optimization. It seems that you like your approach, so I withdraw
> [1]. Could you complete this optimization? Can we proceed with making
> COPY format extensible without it? It seems that it'll take a little
> time to complete this optimization because your patch is still WIP,
> and it seems that you can work on it after making COPY format
> extensible.

I don't think optimizing this aspect needs to block making copy extensible.

I don't know how much time/energy I'll have to focus on this in the near
term. I really just reposted this because the earlier patches were relevant
for the discussion in another thread.  If you want to pick the COPY part up,
feel free.

Greetings,

Andres Freund



Re: Why is pq_begintypsend so slow?

From
Sutou Kouhei
Date:
Hi,

In <20240219195351.5vy7cdl3wxia66kg@awork3.anarazel.de>
  "Re: Why is pq_begintypsend so slow?" on Mon, 19 Feb 2024 11:53:51 -0800,
  Andres Freund <andres@anarazel.de> wrote:

>> As I said above, I don't have a strong opinion on how to implement
>> this optimization. It seems that you like your approach, so I withdraw
>> [1]. Could you complete this optimization? Can we proceed with making
>> COPY format extensible without it? It seems that it'll take a little
>> time to complete this optimization because your patch is still WIP,
>> and it seems that you can work on it after making COPY format
>> extensible.
> 
> I don't think optimizing this aspect needs to block making copy extensible.

OK. I'll work on making copy extensible without this
optimization.

> I don't know how much time/energy I'll have to focus on this in the near
> term. I really just reposted this because the earlier patches were relevant
> for the discussion in another thread.  If you want to pick the COPY part up,
> feel free.

OK. I may work on this after I complete making copy
extensible, if you haven't completed it by then.


Thanks,
-- 
kou