Обсуждение: [HACKERS] More efficient truncation of pg_stat_activity query strings

Поиск
Список
Период
Сортировка

[HACKERS] More efficient truncation of pg_stat_activity query strings

От
Andres Freund
Дата:
Hi,

I've recently seen a benchmark in which pg_mbcliplen() showed up
prominently. Which it will basically in any benchmark with longer query
strings, but fast queries. That's not that uncommon.

I wonder if we could avoid the cost of pg_mbcliplen() from within
pgstat_report_activity(), by moving some of the cost to the read
side. pgstat values are obviously read far less frequently in nearly all
cases that are performance relevant.

Therefore I wonder if we couldn't just store a querystring that's
essentially just a memcpy()ed prefix, and do a pg_mbcliplen() on the
read side.  I think that should work because all *server side* encodings
store character lengths in the *first* byte of a multibyte character
(at least one clientside encoding, gb18030, doesn't behave that way).

That'd necessitate an added memory copy in pg_stat_get_activity(), but
that seems fairly harmless.

Faults in my thinking?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] More efficient truncation of pg_stat_activity querystrings

От
Tatsuo Ishii
Дата:
> read side.  I think that should work because all *server side* encodings
> store character lengths in the *first* byte of a multibyte character

What do you mean? I don't see such data in a multibyte string.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] More efficient truncation of pg_stat_activity querystrings

От
Andres Freund
Дата:
On 2017-09-12 16:53:49 +0900, Tatsuo Ishii wrote:
> > read side.  I think that should work because all *server side* encodings
> > store character lengths in the *first* byte of a multibyte character
> 
> What do you mean? I don't see such data in a multibyte string.

Check the information the pg_*_mblen use / how the relevant encodings
work. Will be something like
int
pg_utf_mblen(const unsigned char *s)
{int            len;
if ((*s & 0x80) == 0)    len = 1;else if ((*s & 0xe0) == 0xc0)    len = 2;else if ((*s & 0xf0) == 0xe0)    len = 3;else
if((*s & 0xf8) == 0xf0)    len = 4;
 
#ifdef NOT_USEDelse if ((*s & 0xfc) == 0xf8)    len = 5;else if ((*s & 0xfe) == 0xfc)    len = 6;
#endifelse    len = 1;return len;
}

As you can see, only the first character (*s) is accessed to determine
the length/width of the multibyte-character.  That's afaict the case for
all server-side encodings.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] More efficient truncation of pg_stat_activity querystrings

От
Tatsuo Ishii
Дата:
> Check the information the pg_*_mblen use / how the relevant encodings
> work. Will be something like
> int
> pg_utf_mblen(const unsigned char *s)
> {
>     int            len;
> 
>     if ((*s & 0x80) == 0)
>         len = 1;
>     else if ((*s & 0xe0) == 0xc0)
>         len = 2;
>     else if ((*s & 0xf0) == 0xe0)
>         len = 3;
>     else if ((*s & 0xf8) == 0xf0)
>         len = 4;
> #ifdef NOT_USED
>     else if ((*s & 0xfc) == 0xf8)
>         len = 5;
>     else if ((*s & 0xfe) == 0xfc)
>         len = 6;
> #endif
>     else
>         len = 1;
>     return len;
> }
> 
> As you can see, only the first character (*s) is accessed to determine
> the length/width of the multibyte-character.  That's afaict the case for
> all server-side encodings.

So your idea is just storing cmd_str into st_activity, which might be
clipped in the middle of multibyte character. And the reader of the
string will call pg_mbclipen() when it needs to read the string. Yes,
I think it works except for gb18030 by the reason you said.

However, if we could have variants of mblen functions with additional
parameter: input string length, then we could remove this inconstancy.
I don't know if this is overkill or not, though.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] More efficient truncation of pg_stat_activity query strings

От
Robert Haas
Дата:
On Tue, Sep 12, 2017 at 3:19 AM, Andres Freund <andres@anarazel.de> wrote:
> Therefore I wonder if we couldn't just store a querystring that's
> essentially just a memcpy()ed prefix, and do a pg_mbcliplen() on the
> read side.  I think that should work because all *server side* encodings
> store character lengths in the *first* byte of a multibyte character
> (at least one clientside encoding, gb18030, doesn't behave that way).
>
> That'd necessitate an added memory copy in pg_stat_get_activity(), but
> that seems fairly harmless.

Interesting idea.  I was (ha, ha, what a coincidence) also thinking
about this problem and was wondering if we couldn't be a lot smarter
about pg_mbcliplen().  I mean, pg_mbcliplen is basically just being
used here to trim away any partial character that would have to get
chopped off to fit within the length limit.  But right now it's
scanning the whole string to do this, which is unnecessary.  At least
for UTF-8, we could do that much more directly: if the string is short
enough, stop, else, look at cmd_str[pgstat_track_activity_query_size].
If that character has (c & 0xc0) != 0x80, write a '\0' and stop; else,
back up until you find a character that for which that continuation
holds, write a '\0', and stop.  This kind of approach only works if we
have a definitive test for whether something is a "continuation
character" which probably isn't true in all encodings, but maybe it's
still worth considering.

Your idea is probably a lot simpler to implement, though, and I
definitely agree that shifting the work from the write side to the
read side makes sense.  Updating pg_stat_activity is a lot more common
than reading it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] More efficient truncation of pg_stat_activity query strings

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Sep 12, 2017 at 3:19 AM, Andres Freund <andres@anarazel.de> wrote:
>> Therefore I wonder if we couldn't just store a querystring that's
>> essentially just a memcpy()ed prefix, and do a pg_mbcliplen() on the
>> read side.

> Interesting idea.  I was (ha, ha, what a coincidence) also thinking
> about this problem and was wondering if we couldn't be a lot smarter
> about pg_mbcliplen().  ...

> for UTF-8, we could do that much more directly: if the string is short
> enough, stop, else, look at cmd_str[pgstat_track_activity_query_size].
> If that character has (c & 0xc0) != 0x80, write a '\0' and stop; else,
> back up until you find a character that for which that continuation
> holds, write a '\0', and stop.  This kind of approach only works if we
> have a definitive test for whether something is a "continuation
> character" which probably isn't true in all encodings, but maybe it's
> still worth considering.

Offhand I think it doesn't work in anything but UTF8.  A way that would
work in all encodings is to back up to the first non-high-bit-set
byte and then mbcliplen from there.  Probably, there's enough ASCII
in typical SQL commands that this would often be a win.

> Your idea is probably a lot simpler to implement, though, and I
> definitely agree that shifting the work from the write side to the
> read side makes sense.  Updating pg_stat_activity is a lot more common
> than reading it.

+1.  I kinda doubt that it is worth optimizing further than that,
really.
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] More efficient truncation of pg_stat_activity querystrings

От
Andres Freund
Дата:
On 2017-09-12 00:19:48 -0700, Andres Freund wrote:
> Hi,
> 
> I've recently seen a benchmark in which pg_mbcliplen() showed up
> prominently. Which it will basically in any benchmark with longer query
> strings, but fast queries. That's not that uncommon.
> 
> I wonder if we could avoid the cost of pg_mbcliplen() from within
> pgstat_report_activity(), by moving some of the cost to the read
> side. pgstat values are obviously read far less frequently in nearly all
> cases that are performance relevant.
> 
> Therefore I wonder if we couldn't just store a querystring that's
> essentially just a memcpy()ed prefix, and do a pg_mbcliplen() on the
> read side.  I think that should work because all *server side* encodings
> store character lengths in the *first* byte of a multibyte character
> (at least one clientside encoding, gb18030, doesn't behave that way).
> 
> That'd necessitate an added memory copy in pg_stat_get_activity(), but
> that seems fairly harmless.
> 
> Faults in my thinking?

Here's a patch that implements that idea.  Seems to work well.  I'm a
bit loathe to add proper regression tests for this, seems awfully
dependent on specific track_activity_query_size settings.  I did confirm
using gdb that I see incomplete characters before
pgstat_clip_activity(), but not after.

I've renamed st_activity to st_activity_raw to increase the likelihood
that potential external users of st_activity notice and adapt. Increases
the noise, but imo to a very bareable amount. Don't feel strongly
though.

Greetings,

Andres Freund

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] More efficient truncation of pg_stat_activity query strings

От
Kuntal Ghosh
Дата:
On Thu, Sep 14, 2017 at 11:33 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-09-12 00:19:48 -0700, Andres Freund wrote:
>> Hi,
>>
>> I've recently seen a benchmark in which pg_mbcliplen() showed up
>> prominently. Which it will basically in any benchmark with longer query
>> strings, but fast queries. That's not that uncommon.
>>
>> I wonder if we could avoid the cost of pg_mbcliplen() from within
>> pgstat_report_activity(), by moving some of the cost to the read
>> side. pgstat values are obviously read far less frequently in nearly all
>> cases that are performance relevant.
>>
>> Therefore I wonder if we couldn't just store a querystring that's
>> essentially just a memcpy()ed prefix, and do a pg_mbcliplen() on the
>> read side.  I think that should work because all *server side* encodings
>> store character lengths in the *first* byte of a multibyte character
>> (at least one clientside encoding, gb18030, doesn't behave that way).
>>
>> That'd necessitate an added memory copy in pg_stat_get_activity(), but
>> that seems fairly harmless.
>>
>> Faults in my thinking?
>
> Here's a patch that implements that idea.  Seems to work well.  I'm a
> bit loathe to add proper regression tests for this, seems awfully
> dependent on specific track_activity_query_size settings.  I did confirm
> using gdb that I see incomplete characters before
> pgstat_clip_activity(), but not after.
>
> I've renamed st_activity to st_activity_raw to increase the likelihood
> that potential external users of st_activity notice and adapt. Increases
> the noise, but imo to a very bareable amount. Don't feel strongly
> though.
>
Hello,

The patch looks good to me. I've done some regression testing with a
custom script on my local system. The script contains the following
statement:
SELECT 'aaa..<repeated 600 times>' as col;

Test 1
-----------------------------------
duration: 300 seconds
clients/threads: 1

On HEAD TPS: 13181
+    9.30%     0.27%  postgres  postgres             [.] pgstat_report_activity
+    8.88%     0.02%  postgres  postgres             [.] pg_mbcliplen
+    7.76%     4.77%  postgres  postgres             [.] pg_encoding_mbcliplen
+    4.06%     4.06%  postgres  postgres             [.] pg_utf_mblen

With the patch TPS:13628 (+3.39%)
+    0.36%     0.21%  postgres  postgres  [.] pgstat_report_activity

Test 2
-----------------------------------
duration: 300 seconds
clients/threads: 8

On HEAD TPS: 53084
+   12.17%     0.20%  postgres  postgres             [.]
pgstat_report_activity
+   11.83%     0.02%  postgres  postgres             [.] pg_mbcliplen
+   11.19%     8.03%  postgres  postgres             [.] pg_encoding_mbcliplen
+    3.74%     3.73%  postgres  postgres             [.] pg_utf_mblen

With the patch TPS: 63949 (+20.4%)
+    0.40%     0.25%  postgres  postgres  [.] pgstat_report_activity

This shows the significance of this patch in terms of performance
improvement of pgstat_report_activity. Is there any other tests I
should do for the same?


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] More efficient truncation of pg_stat_activity query strings

От
Robert Haas
Дата:
On Thu, Sep 14, 2017 at 2:03 AM, Andres Freund <andres@anarazel.de> wrote:
> Here's a patch that implements that idea.

From the department of cosmetic nitpicking:

+     * All supported server-encodings allow to determine the length of a

make it possible to determine

+     * multi-byte character from it's first byte (not the case for client

its

this is not the case

+     * encodings, see GB18030). As st_activity always is stored using server

is always stored using a server

+     * pgstat_clip_activity() to trunctae correctly.

truncate

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] More efficient truncation of pg_stat_activity querystrings

От
Andres Freund
Дата:
On 2017-09-15 08:25:09 -0400, Robert Haas wrote:
> On Thu, Sep 14, 2017 at 2:03 AM, Andres Freund <andres@anarazel.de> wrote:
> > Here's a patch that implements that idea.
> 
> From the department of cosmetic nitpicking:
> 
> +     * All supported server-encodings allow to determine the length of a
> 
> make it possible to determine
> 
> +     * multi-byte character from it's first byte (not the case for client
> 
> its
> 
> this is not the case
> 
> +     * encodings, see GB18030). As st_activity always is stored using server
> 
> is always stored using a server
> 
> +     * pgstat_clip_activity() to trunctae correctly.

Version correcting these is attached. Thanks!

Greetings,

Andres Freund

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] More efficient truncation of pg_stat_activity query strings

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> Version correcting these is attached. Thanks!

I'd vote for undoing the s/st_activity/st_activity_raw/g business.
That may have been useful while writing the patch, to ensure you
found all the references; but it's just creating a lot of unnecessary
delta in the final code, with the attendant hazards for back-patches.
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] More efficient truncation of pg_stat_activity querystrings

От
Andres Freund
Дата:
On 2017-09-15 16:45:47 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Version correcting these is attached. Thanks!
> 
> I'd vote for undoing the s/st_activity/st_activity_raw/g business.
> That may have been useful while writing the patch, to ensure you
> found all the references; but it's just creating a lot of unnecessary
> delta in the final code, with the attendant hazards for back-patches.

I was wondering about that too (see [1]). My only concern is that some
extensions out there might be accessing the string expecting it to be
properly truncated. The rename at least forces them to look for the new
name...  I'm slightly in favor of keeping the rename, it doesn't seem
likely to cause a lot of backpatch pain.

Regards,

Andres Freund


[1] http://archives.postgresql.org/message-id/20170914060329.j7lt7oyyzquxiut6%40alap3.anarazel.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] More efficient truncation of pg_stat_activity query strings

От
Robert Haas
Дата:
On Fri, Sep 15, 2017 at 4:49 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-09-15 16:45:47 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>> > Version correcting these is attached. Thanks!
>>
>> I'd vote for undoing the s/st_activity/st_activity_raw/g business.
>> That may have been useful while writing the patch, to ensure you
>> found all the references; but it's just creating a lot of unnecessary
>> delta in the final code, with the attendant hazards for back-patches.
>
> I was wondering about that too (see [1]). My only concern is that some
> extensions out there might be accessing the string expecting it to be
> properly truncated. The rename at least forces them to look for the new
> name...  I'm slightly in favor of keeping the rename, it doesn't seem
> likely to cause a lot of backpatch pain.

I tend to agree with you, but it's not a huge deal either way.  Even
if somebody fails to update third-party code that touches this, a lot
of times it'll work anyway.  But that very fact is, of course, why it
would be slightly better to break it explicitly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] More efficient truncation of pg_stat_activity querystrings

От
Andres Freund
Дата:
Hi,

On 2017-09-15 17:43:35 +0530, Kuntal Ghosh wrote:
> The patch looks good to me. I've done some regression testing with a
> custom script on my local system. The script contains the following
> statement:

> SELECT 'aaa..<repeated 600 times>' as col;
> 
> Test 1
> -----------------------------------
> duration: 300 seconds
> clients/threads: 1

> With the patch TPS:13628 (+3.39%)
> +    0.36%     0.21%  postgres  postgres  [.] pgstat_report_activity
> 
> Test 2
> -----------------------------------
> duration: 300 seconds
> clients/threads: 8

> With the patch TPS: 63949 (+20.4%)
> +    0.40%     0.25%  postgres  postgres  [.] pgstat_report_activity
> 
> This shows the significance of this patch in terms of performance
> improvement of pgstat_report_activity. Is there any other tests I
> should do for the same?

Thanks for the test! I think this looks good, no further tests
necessary.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers