Обсуждение: [HACKERS] More efficient truncation of pg_stat_activity query strings
Hi, I've recently seen a benchmark in which pg_mbcliplen() showed up prominently. Which it will basically in any benchmark with longer query strings, but fast queries. That's not that uncommon. I wonder if we could avoid the cost of pg_mbcliplen() from within pgstat_report_activity(), by moving some of the cost to the read side. pgstat values are obviously read far less frequently in nearly all cases that are performance relevant. Therefore I wonder if we couldn't just store a querystring that's essentially just a memcpy()ed prefix, and do a pg_mbcliplen() on the read side. I think that should work because all *server side* encodings store character lengths in the *first* byte of a multibyte character (at least one clientside encoding, gb18030, doesn't behave that way). That'd necessitate an added memory copy in pg_stat_get_activity(), but that seems fairly harmless. Faults in my thinking? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
> read side. I think that should work because all *server side* encodings > store character lengths in the *first* byte of a multibyte character What do you mean? I don't see such data in a multibyte string. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-09-12 16:53:49 +0900, Tatsuo Ishii wrote: > > read side. I think that should work because all *server side* encodings > > store character lengths in the *first* byte of a multibyte character > > What do you mean? I don't see such data in a multibyte string. Check the information the pg_*_mblen use / how the relevant encodings work. Will be something like int pg_utf_mblen(const unsigned char *s) {int len; if ((*s & 0x80) == 0) len = 1;else if ((*s & 0xe0) == 0xc0) len = 2;else if ((*s & 0xf0) == 0xe0) len = 3;else if((*s & 0xf8) == 0xf0) len = 4; #ifdef NOT_USEDelse if ((*s & 0xfc) == 0xf8) len = 5;else if ((*s & 0xfe) == 0xfc) len = 6; #endifelse len = 1;return len; } As you can see, only the first character (*s) is accessed to determine the length/width of the multibyte-character. That's afaict the case for all server-side encodings. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
> Check the information the pg_*_mblen use / how the relevant encodings > work. Will be something like > int > pg_utf_mblen(const unsigned char *s) > { > int len; > > if ((*s & 0x80) == 0) > len = 1; > else if ((*s & 0xe0) == 0xc0) > len = 2; > else if ((*s & 0xf0) == 0xe0) > len = 3; > else if ((*s & 0xf8) == 0xf0) > len = 4; > #ifdef NOT_USED > else if ((*s & 0xfc) == 0xf8) > len = 5; > else if ((*s & 0xfe) == 0xfc) > len = 6; > #endif > else > len = 1; > return len; > } > > As you can see, only the first character (*s) is accessed to determine > the length/width of the multibyte-character. That's afaict the case for > all server-side encodings. So your idea is just storing cmd_str into st_activity, which might be clipped in the middle of multibyte character. And the reader of the string will call pg_mbclipen() when it needs to read the string. Yes, I think it works except for gb18030 by the reason you said. However, if we could have variants of mblen functions with additional parameter: input string length, then we could remove this inconstancy. I don't know if this is overkill or not, though. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 12, 2017 at 3:19 AM, Andres Freund <andres@anarazel.de> wrote: > Therefore I wonder if we couldn't just store a querystring that's > essentially just a memcpy()ed prefix, and do a pg_mbcliplen() on the > read side. I think that should work because all *server side* encodings > store character lengths in the *first* byte of a multibyte character > (at least one clientside encoding, gb18030, doesn't behave that way). > > That'd necessitate an added memory copy in pg_stat_get_activity(), but > that seems fairly harmless. Interesting idea. I was (ha, ha, what a coincidence) also thinking about this problem and was wondering if we couldn't be a lot smarter about pg_mbcliplen(). I mean, pg_mbcliplen is basically just being used here to trim away any partial character that would have to get chopped off to fit within the length limit. But right now it's scanning the whole string to do this, which is unnecessary. At least for UTF-8, we could do that much more directly: if the string is short enough, stop, else, look at cmd_str[pgstat_track_activity_query_size]. If that character has (c & 0xc0) != 0x80, write a '\0' and stop; else, back up until you find a character that for which that continuation holds, write a '\0', and stop. This kind of approach only works if we have a definitive test for whether something is a "continuation character" which probably isn't true in all encodings, but maybe it's still worth considering. Your idea is probably a lot simpler to implement, though, and I definitely agree that shifting the work from the write side to the read side makes sense. Updating pg_stat_activity is a lot more common than reading it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Sep 12, 2017 at 3:19 AM, Andres Freund <andres@anarazel.de> wrote: >> Therefore I wonder if we couldn't just store a querystring that's >> essentially just a memcpy()ed prefix, and do a pg_mbcliplen() on the >> read side. > Interesting idea. I was (ha, ha, what a coincidence) also thinking > about this problem and was wondering if we couldn't be a lot smarter > about pg_mbcliplen(). ... > for UTF-8, we could do that much more directly: if the string is short > enough, stop, else, look at cmd_str[pgstat_track_activity_query_size]. > If that character has (c & 0xc0) != 0x80, write a '\0' and stop; else, > back up until you find a character that for which that continuation > holds, write a '\0', and stop. This kind of approach only works if we > have a definitive test for whether something is a "continuation > character" which probably isn't true in all encodings, but maybe it's > still worth considering. Offhand I think it doesn't work in anything but UTF8. A way that would work in all encodings is to back up to the first non-high-bit-set byte and then mbcliplen from there. Probably, there's enough ASCII in typical SQL commands that this would often be a win. > Your idea is probably a lot simpler to implement, though, and I > definitely agree that shifting the work from the write side to the > read side makes sense. Updating pg_stat_activity is a lot more common > than reading it. +1. I kinda doubt that it is worth optimizing further than that, really. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-09-12 00:19:48 -0700, Andres Freund wrote: > Hi, > > I've recently seen a benchmark in which pg_mbcliplen() showed up > prominently. Which it will basically in any benchmark with longer query > strings, but fast queries. That's not that uncommon. > > I wonder if we could avoid the cost of pg_mbcliplen() from within > pgstat_report_activity(), by moving some of the cost to the read > side. pgstat values are obviously read far less frequently in nearly all > cases that are performance relevant. > > Therefore I wonder if we couldn't just store a querystring that's > essentially just a memcpy()ed prefix, and do a pg_mbcliplen() on the > read side. I think that should work because all *server side* encodings > store character lengths in the *first* byte of a multibyte character > (at least one clientside encoding, gb18030, doesn't behave that way). > > That'd necessitate an added memory copy in pg_stat_get_activity(), but > that seems fairly harmless. > > Faults in my thinking? Here's a patch that implements that idea. Seems to work well. I'm a bit loathe to add proper regression tests for this, seems awfully dependent on specific track_activity_query_size settings. I did confirm using gdb that I see incomplete characters before pgstat_clip_activity(), but not after. I've renamed st_activity to st_activity_raw to increase the likelihood that potential external users of st_activity notice and adapt. Increases the noise, but imo to a very bareable amount. Don't feel strongly though. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
On Thu, Sep 14, 2017 at 11:33 AM, Andres Freund <andres@anarazel.de> wrote: > On 2017-09-12 00:19:48 -0700, Andres Freund wrote: >> Hi, >> >> I've recently seen a benchmark in which pg_mbcliplen() showed up >> prominently. Which it will basically in any benchmark with longer query >> strings, but fast queries. That's not that uncommon. >> >> I wonder if we could avoid the cost of pg_mbcliplen() from within >> pgstat_report_activity(), by moving some of the cost to the read >> side. pgstat values are obviously read far less frequently in nearly all >> cases that are performance relevant. >> >> Therefore I wonder if we couldn't just store a querystring that's >> essentially just a memcpy()ed prefix, and do a pg_mbcliplen() on the >> read side. I think that should work because all *server side* encodings >> store character lengths in the *first* byte of a multibyte character >> (at least one clientside encoding, gb18030, doesn't behave that way). >> >> That'd necessitate an added memory copy in pg_stat_get_activity(), but >> that seems fairly harmless. >> >> Faults in my thinking? > > Here's a patch that implements that idea. Seems to work well. I'm a > bit loathe to add proper regression tests for this, seems awfully > dependent on specific track_activity_query_size settings. I did confirm > using gdb that I see incomplete characters before > pgstat_clip_activity(), but not after. > > I've renamed st_activity to st_activity_raw to increase the likelihood > that potential external users of st_activity notice and adapt. Increases > the noise, but imo to a very bareable amount. Don't feel strongly > though. > Hello, The patch looks good to me. I've done some regression testing with a custom script on my local system. The script contains the following statement: SELECT 'aaa..<repeated 600 times>' as col; Test 1 ----------------------------------- duration: 300 seconds clients/threads: 1 On HEAD TPS: 13181 + 9.30% 0.27% postgres postgres [.] pgstat_report_activity + 8.88% 0.02% postgres postgres [.] pg_mbcliplen + 7.76% 4.77% postgres postgres [.] pg_encoding_mbcliplen + 4.06% 4.06% postgres postgres [.] pg_utf_mblen With the patch TPS:13628 (+3.39%) + 0.36% 0.21% postgres postgres [.] pgstat_report_activity Test 2 ----------------------------------- duration: 300 seconds clients/threads: 8 On HEAD TPS: 53084 + 12.17% 0.20% postgres postgres [.] pgstat_report_activity + 11.83% 0.02% postgres postgres [.] pg_mbcliplen + 11.19% 8.03% postgres postgres [.] pg_encoding_mbcliplen + 3.74% 3.73% postgres postgres [.] pg_utf_mblen With the patch TPS: 63949 (+20.4%) + 0.40% 0.25% postgres postgres [.] pgstat_report_activity This shows the significance of this patch in terms of performance improvement of pgstat_report_activity. Is there any other tests I should do for the same? -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 14, 2017 at 2:03 AM, Andres Freund <andres@anarazel.de> wrote: > Here's a patch that implements that idea. From the department of cosmetic nitpicking: + * All supported server-encodings allow to determine the length of a make it possible to determine + * multi-byte character from it's first byte (not the case for client its this is not the case + * encodings, see GB18030). As st_activity always is stored using server is always stored using a server + * pgstat_clip_activity() to trunctae correctly. truncate -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-09-15 08:25:09 -0400, Robert Haas wrote: > On Thu, Sep 14, 2017 at 2:03 AM, Andres Freund <andres@anarazel.de> wrote: > > Here's a patch that implements that idea. > > From the department of cosmetic nitpicking: > > + * All supported server-encodings allow to determine the length of a > > make it possible to determine > > + * multi-byte character from it's first byte (not the case for client > > its > > this is not the case > > + * encodings, see GB18030). As st_activity always is stored using server > > is always stored using a server > > + * pgstat_clip_activity() to trunctae correctly. Version correcting these is attached. Thanks! Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Andres Freund <andres@anarazel.de> writes: > Version correcting these is attached. Thanks! I'd vote for undoing the s/st_activity/st_activity_raw/g business. That may have been useful while writing the patch, to ensure you found all the references; but it's just creating a lot of unnecessary delta in the final code, with the attendant hazards for back-patches. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-09-15 16:45:47 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Version correcting these is attached. Thanks! > > I'd vote for undoing the s/st_activity/st_activity_raw/g business. > That may have been useful while writing the patch, to ensure you > found all the references; but it's just creating a lot of unnecessary > delta in the final code, with the attendant hazards for back-patches. I was wondering about that too (see [1]). My only concern is that some extensions out there might be accessing the string expecting it to be properly truncated. The rename at least forces them to look for the new name... I'm slightly in favor of keeping the rename, it doesn't seem likely to cause a lot of backpatch pain. Regards, Andres Freund [1] http://archives.postgresql.org/message-id/20170914060329.j7lt7oyyzquxiut6%40alap3.anarazel.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 15, 2017 at 4:49 PM, Andres Freund <andres@anarazel.de> wrote: > On 2017-09-15 16:45:47 -0400, Tom Lane wrote: >> Andres Freund <andres@anarazel.de> writes: >> > Version correcting these is attached. Thanks! >> >> I'd vote for undoing the s/st_activity/st_activity_raw/g business. >> That may have been useful while writing the patch, to ensure you >> found all the references; but it's just creating a lot of unnecessary >> delta in the final code, with the attendant hazards for back-patches. > > I was wondering about that too (see [1]). My only concern is that some > extensions out there might be accessing the string expecting it to be > properly truncated. The rename at least forces them to look for the new > name... I'm slightly in favor of keeping the rename, it doesn't seem > likely to cause a lot of backpatch pain. I tend to agree with you, but it's not a huge deal either way. Even if somebody fails to update third-party code that touches this, a lot of times it'll work anyway. But that very fact is, of course, why it would be slightly better to break it explicitly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Hi, On 2017-09-15 17:43:35 +0530, Kuntal Ghosh wrote: > The patch looks good to me. I've done some regression testing with a > custom script on my local system. The script contains the following > statement: > SELECT 'aaa..<repeated 600 times>' as col; > > Test 1 > ----------------------------------- > duration: 300 seconds > clients/threads: 1 > With the patch TPS:13628 (+3.39%) > + 0.36% 0.21% postgres postgres [.] pgstat_report_activity > > Test 2 > ----------------------------------- > duration: 300 seconds > clients/threads: 8 > With the patch TPS: 63949 (+20.4%) > + 0.40% 0.25% postgres postgres [.] pgstat_report_activity > > This shows the significance of this patch in terms of performance > improvement of pgstat_report_activity. Is there any other tests I > should do for the same? Thanks for the test! I think this looks good, no further tests necessary. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers