Re: Why is pq_begintypsend so slow?
| From | Andres Freund |
|---|---|
| Subject | Re: Why is pq_begintypsend so slow? |
| Date | |
| Msg-id | 20200603015559.qc7ry5jqmoxlpu4y@alap3.anarazel.de |
| In reply to | Re: Why is pq_begintypsend so slow? (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses | Re: Why is pq_begintypsend so slow? (3 replies) |
| List | pgsql-hackers |
Hi,

On 2020-05-18 12:38:05 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> FWIW, I've also observed, in another thread (the node func generation
> >> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
> >> when inlining some of its callers. Moving e.g. appendStringInfo() inline
> >> allows the compiler to sometimes optimize away the strlen. But if
> >> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
> >> unconditionally, successive appends cannot optimize away memory accesses
> >> for ->len/->data.
>
> > With a set of patches doing so, int4send itself is not a significant
> > factor for my test benchmark [1] anymore.
>
> This thread seems to have died out, possibly because the last set of
> patches that Andres posted was sufficiently complicated and invasive
> that nobody wanted to review it.

Well, I wasn't really planning to try to get that patchset into 13, and
it wasn't in the CF...

> I thought about this again after seeing that [1] is mostly about
> pq_begintypsend overhead

I'm not really convinced that that's the whole problem. Using the
benchmark from upthread, I get (median of three):

master: 1181.581
yours:  1171.445
mine:    598.031

That's a very significant difference, imo. It helps a bit with the
benchmark from your [1], but not that much.

> With 0001, pq_begintypsend drops from being the top single routine
> in a profile of a test case like [1] to being well down the list.
> The next biggest cost compared to text-format output is that
> printtup() itself is noticeably more expensive. A lot of the extra
> cost there seems to be from pq_sendint32(), which is getting inlined
> into printtup(), and there probably isn't much we can do to make that
> cheaper. But eliminating a common subexpression as in 0002 below does
> help noticeably, at least with the rather old gcc I'm using.
I think there's plenty more we can do: First, it's unnecessary to
re-initialize a FunctionCallInfo for every send/recv/out/in call.
Instead we can reuse the same one over and over.

After that, the biggest remaining overhead for Jack's test is the palloc
for the stringinfo, as far as I can see. I've complained about that
before...

I've just hacked up a modification where, for send functions,
fcinfo->context contains a stringinfo set up by printtup/CopyTo. That,
combined with using a FunctionCallInfo set up beforehand, instead of
re-initializing it in every printtup cycle, results in a pretty good
saving, making the binary protocol 20% faster than text in Jack's
testcase. And my lotsaints4 test goes further down to ~410ms (2.5x
faster than where we started).

Now obviously, the hack with passing a StringInfo in ->context is just
that, a hack. A somewhat gross one, even. But I think it pretty clearly
shows the problem and the way out.

I don't know what the best non-gross solution for the overhead of the
out/send functions is. There seem to be at least the following major
options (and lots of variants thereof):

1) Just continue to incur significant overhead for every datum.
2) Accept the ugliness of passing in a buffer via
   FunctionCallInfo->context. Change nearly all in-core send functions
   over to that.
3) Pass the string buffer through a new INTERNAL argument to send/output
   functions, allowing both old-style and new-style send functions. Use
   a wrapper function to adapt the "old style" to the "new style".
4) Like 3, but make the new argument optional, and use an ad-hoc
   stringbuffer if not provided. I don't like the unnecessary branches
   this adds.

The biggest problem after that is that we waste a lot of time memcpying
stuff around repeatedly.
There is:

1) send function: datum -> per-datum stringinfo
2) printtup: per-datum stringinfo -> per-row stringinfo
3) socket_putmessage: per-row stringinfo -> PqSendBuffer
4) send(): PqSendBuffer -> kernel buffer

It's obviously hard to avoid 1) and 4) in the common case, but the
number of other copies seems pretty clearly excessive. If we change the
signature of the out/send functions to always target a string buffer,
we could pretty easily avoid 2), and for out functions we'd no longer
have to redundantly call strlen (as the argument to pq_sendcountedtext),
which seems substantial too.

As I argued before, I think it's unnecessary to have a separate buffer
between 3) and 4). We should construct the outgoing message inside the
send buffer. I still don't understand what "recursion" danger there is;
nothing below printtup should ever send protocol messages, no?

Sometimes there's also 0) in the above, when detoasting a datum...

Greetings,

Andres Freund
Attachments
- v2-0001-stringinfo-Move-more-functions-inline-provide-ini.patch
- v2-0002-stringinfo-Remove-in-core-use-of-appendStringInfo.patch
- v2-0003-pqformat-Move-functions-for-type-sending-inline-a.patch
- v2-0004-WIP-Annotate-palloc-with-malloc-and-other-compile.patch
- v2-0005-Move-int4-int8-send-to-use-pq_begintypsend_ex.patch
- v2-0006-copy-Use-appendBinaryStringInfoNT-for-sending-bin.patch
- v2-0007-wip-make-send-recv-calls-in-printtup.c-copy.c-che.patch
- v2-0008-inline-pq_sendbytes-stop-maintaining-trailing-nul.patch
- v2-0009-heavy-wip-Allow-string-buffer-reuse-in-send-funct.patch