Обсуждение: Improve handling of pg_stat_statements handling of bind "IN" variables

Поиск
Список
Период
Сортировка

Improve handling of pg_stat_statements handling of bind "IN" variables

От
Pavel Trukhanov
Дата:
Hi Hackers

I would like to embark on a journey to try to implement this issue I
found on TODO list –
https://www.postgresql.org/message-id/flat/CAM3SWZSpdPB3uErnXWMt3q74y0r%2B84ZsOt2U3HKKes_V7O%2B0Qg%40mail.gmail.com
In short: pgss distinguishes "SELECT * WHERE id IN (1, 2)" and "SELECT
* WHERE id IN (1, 2, 3)" as two separate queryId's, resulting in
separate entries in pgss. While in practice in most cases it should be
considered as the same thing.

Though it was added in TODO by Bruce Momjian some time ago, I
personally have been annoyed by this issue, because we use pgss as a
data source in our monitoring system okmeter.io – so we've been using
some work arounds for this in our system.

The way AFAIU it is suggested to be handled in the previous thread is
to not jumble ArrayExpr recursively and just treat it as "some list of
zero or more nodes".
I have already lurked around related code, but I have stumbled on some
problems with the way I see I can implement this.

So I want to ask for advice and maybe even guidance because I'm new to
PG internals and not a regular in C coding.

1. ArrayExpr
ArrayExpr is used to represent not only "IN" clauses, but also for
example "SELECT ARRAY[1, 2, 3]" and maybe some other cases I didn't
think of.
That brings the question whether "IN (...)" should be handled
separately from actual usage of ARRAY.
Or it is okay for any ARRAY to be jumbled w/o respect to number of
entries in it?
With that, "SELECT ARRAY[1, 2]" becomes undistinguishable from "SELECT
ARRAY[1, 2, 3]" etc in pgss.

I'm asking this because I'm not sure if it would be okay to handle
both cases in the same way.
For example "SELECT ARRAY[1, 2, a] FROM table" and "SELECT ARRAY[b]
FROM table" might end up in the same pgss entry.

While a separate handling for "IN (...)" seems to require lots of
changes – starting from parser (new parser node type) and further.
How should I proceed?

2 Weird arrays - with Consts and Params or const expressions or
different types etc
SELECT * FROM test WHERE a IN (1, $1)
SELECT * FROM test WHERE a IN (1, 3+1)
SELECT * FROM test WHERE a IN (1, 2.1)
SELECT * FROM test WHERE a IN (1.1, 2.1)  etc.
How those should be handled?
Should those be indistinguishable from "IN ($1, $2, $3)" as well?
Or such non realistic usage examples are negligible and no one cares
what happens with them?

3 Tests in pgss.sql/out and Vars
I would like someone to point me in a direction of how could I
implement a test that will query
"SELECT * FROM test WHERE a IN ($1, $2, $3, ...)" with params, not
consts, because I think this is the most common case actually.
And existing tests only check consts in "IN" list.
I don't see a way to implement such a test with the existing test
infrastructure.
Though if that might be considered out of the scope for this TODO, it
would be okay with me.


I would appreciate any feedback.
---
Pavel Trukhanov



Re: Improve handling of pg_stat_statements handling of bind "IN" variables

От
Dmitry Dolgov
Дата:
> On Thu, Jun 13, 2019 at 1:38 PM Pavel Trukhanov <pavel.trukhanov@gmail.com> wrote:
>
> Hi Hackers
>
> I would like to embark on a journey to try to implement this issue I
> found on TODO list –
> https://www.postgresql.org/message-id/flat/CAM3SWZSpdPB3uErnXWMt3q74y0r%2B84ZsOt2U3HKKes_V7O%2B0Qg%40mail.gmail.com
> In short: pgss distinguishes "SELECT * WHERE id IN (1, 2)" and "SELECT
> * WHERE id IN (1, 2, 3)" as two separate queryId's, resulting in
> separate entries in pgss. While in practice in most cases it should be
> considered as the same thing.
>
> Though it was added in TODO by Bruce Momjian some time ago, I
> personally have been annoyed by this issue, because we use pgss as a
> data source in our monitoring system okmeter.io – so we've been using
> some work arounds for this in our system.

Thanks! One more time stumbled upon it just now, so I agree it would be nice.

> 1. ArrayExpr
> ArrayExpr is used to represent not only "IN" clauses, but also for
> example "SELECT ARRAY[1, 2, 3]" and maybe some other cases I didn't
> think of.
> That brings the question whether "IN (...)" should be handled
> separately from actual usage of ARRAY.

If I understand correctly, "IN (...)" is reprecented by A_Expr with kind
AEXPR_IN, and only in transformAExprIn converted to ArrayExpr if possible. So
probably it doesn't makes sense to introduce another one.

> For example "SELECT ARRAY[1, 2, a] FROM table" and "SELECT ARRAY[b]
> FROM table" might end up in the same pgss entry.

What are a, b here, parameters?

> 3 Tests in pgss.sql/out and Vars
> I would like someone to point me in a direction of how could I
> implement a test that will query
> "SELECT * FROM test WHERE a IN ($1, $2, $3, ...)" with params, not
> consts

Wouldn't a prepared statement work? It will create an ArrayExpr with Params
inside.



Re: Improve handling of pg_stat_statements handling of bind "IN" variables

От
Pavel Trukhanov
Дата:
Thanks for your reply

> If I understand correctly, "IN (...)" is reprecented by A_Expr with kind
> AEXPR_IN, and only in transformAExprIn converted to ArrayExpr if possible.
That seems to be correct, yes, thank you.

> So probably it doesn't makes sense to introduce another one.
Though I might've used wrong words to describe my holdback here, what
I meant is that I'll need to create new node type (in primnodes.h?)
for IN-list, that will allow to differentiate it from direct "ARRAY"
usage.
This will require changes to parse_expr.c, execExpr.c, etc, which
seems to be overkill for such issue IMO, hence the question.
Please advise.

> > For example "SELECT ARRAY[1, 2, a] FROM table" and "SELECT ARRAY[b]
> > FROM table" might end up in the same pgss entry.
>
> What are a, b here, parameters?

Here a and b are table columns.
I couldn't come up with other examples of ARRAY usage, would
appreciate any suggestions.


> > 3 Tests in pgss.sql/out and Vars
> > I would like someone to point me in a direction of how could I
> > implement a test that will query
> > "SELECT * FROM test WHERE a IN ($1, $2, $3, ...)" with params, not
> > consts
>
> Wouldn't a prepared statement work? It will create an ArrayExpr with Params
> inside.

Thanks for the tip. It seems to work, at least it looks like it.



Re: Improve handling of pg_stat_statements handling of bind "IN" variables

От
Tom Lane
Дата:
Pavel Trukhanov <pavel.trukhanov@gmail.com> writes:
> Though I might've used wrong words to describe my holdback here, what
> I meant is that I'll need to create new node type (in primnodes.h?)
> for IN-list, that will allow to differentiate it from direct "ARRAY"
> usage.
> This will require changes to parse_expr.c, execExpr.c, etc, which
> seems to be overkill for such issue IMO, hence the question.

I do not think you need new expression infrastructure.  IMO what's going
on here is that we're indulging in premature optimization in the parser.
It would be better from a structural standpoint if the output of parse
analysis were closer to what the user wrote, and then the business of
separating Vars from Consts and reducing the Consts to an array were
handled in the planner's expression preprocessing phase.

So maybe what you should be thinking about is a preliminary patch that's
mostly in the nature of refactoring, to move that processing to where
it should be.

Of course, life is never quite that simple; there are at least two
issues you'd have to think about.

* The parsing phase is responsible for determining the semantics of
the query, in particular resolving the data types of the IN-list items
and choosing the comparison operators that will be used.  The planner
is not allowed to rethink that.  What I'm not clear about offhand is
whether the existing coding in parse analysis might lead to different
choices of data types/operators than a more straightforward approach
does.  If so, we'd have to decide whether that's okay.

* Postponing this work might make things slower overall, which wouldn't
matter too much for short IN-lists, but you can bet that people who
throw ten-thousand-entry IN-lists at us will notice.  So you'd need to
keep an eye on efficiency and make sure you don't end up repeating
similar processing over and over.

            regards, tom lane



Re: Improve handling of pg_stat_statements handling of bind "IN" variables

От
Pavel Trukhanov
Дата:
Thanks for the feedback.

I thought once again about separating IN from ARRAY, with refactoring
etc as Tom suggested, and now I don't think it's worth it to do so.
I've tried to implement that, and besides that it will require to
change things in every part of query processing pipeline, it seems
that most of the times I will have to repeat (copy/paste) for IN case
all the code that now works in for ARRAY. At first I though there will
be simplifications, that will justify such refactoring - i.e. I
thought I could at least drop "multidims" bool that tells ARRAY[] from
ARRAY[ARRAY[]]. But it turns out it's not the case – one can write
something like "x IN (ARRAY[1], ARRAY[1,2])" that will result in
multidim IN-list array.

So I don't think there's actually enough benefit to split those two apart.

Now I want to do this: just add a meta info (basically a bool field)
to the ArrayExpr struct, so on later stages we could tell if that's an
ArrayExpr of an ARRAY or of an IN list. Plus to add ignoring updating
Jumble for expression subtree within IN-list array.

If that approach doesn't seem too bad to anyone, I would like to go
forward and submit a patch – it seems pretty straightforward to
implement that.

Thoughts?

Thank you.
 ---
Pasha Trukhanov



Re: Improve handling of pg_stat_statements handling of bind "IN" variables

От
Greg Stark
Дата:


On Sat., Jun. 15, 2019, 12:29 p.m. Pavel Trukhanov, <pavel.trukhanov@gmail.com> wrote:

So I don't think there's actually enough benefit to split those two apart.

Now I want to do this: just add a meta info (basically a bool field)
to the ArrayExpr struct, so on later stages we could tell if that's an
ArrayExpr of an ARRAY or of an IN list. Plus to add ignoring updating
Jumble for expression subtree within IN-list array.

If that approach doesn't seem too bad to anyone, I would like to go
forward and submit a patch – it seems pretty straightforward to
implement that.

So what would this do for someone who explicitly writes:

WHERE col = ANY ?

and pass an array?

Re: Improve handling of pg_stat_statements handling of bind "IN" variables

От
Greg Stark
Дата:


On Sat., Jun. 15, 2019, 8:30 p.m. Greg Stark, <stark@mit.edu> wrote:


So what would this do for someone who explicitly writes:

WHERE col = ANY ?

and pass an array?

Actually thinking about this for two more seconds the question is what it would do with a query like

WHERE col = ANY '1,2,3'::integer[]

Or 

WHERE col = ANY ARRAY[1,2,3]

Whichever the language binding that is failing to do parameterized queries is doing to emulate them.

Re: Improve handling of pg_stat_statements handling of bind "IN" variables

От
Tom Lane
Дата:
Greg Stark <stark@mit.edu> writes:
> Actually thinking about this for two more seconds the question is what it
> would do with a query like
> WHERE col = ANY '1,2,3'::integer[]
> Or
> WHERE col = ANY ARRAY[1,2,3]
> Whichever the language binding that is failing to do parameterized queries
> is doing to emulate them.

Yeah, one interesting question is whether this is actually modeling
what happens with real-world applications --- are they sending Consts,
or Params?

I resolutely dislike the idea of marking arrays that came from IN
differently from other ones; that's just a hack and it's going to give
rise to unexplainable behavioral differences for logically-equivalent
queries.

One idea that comes to me after looking at the code involved is that
it might be an improvement across-the-board if transformAExprIn were to
reduce the generated ArrayExpr to an array Const immediately, in cases
where all the inputs are Consts.  That is going to happen anyway come
plan time, so it'd have zero impact on semantics or query performance.
Doing it earlier would cost nothing, and could even be a net win, by
reducing per-parse-node overhead in places like the rewriter.

The advantage for the problem at hand is that a Const that's an array
of 2 elements is going to look the same as a Const that's any other
number of elements so far as pg_stat_statements is concerned.

This doesn't help of course in cases where the values aren't all
Consts.  Since we eliminated Vars already, the main practical case
would be that they're Params, leading us back to the previous
question of whether apps are binding queries with different numbers
of parameter markers in an IN, and how hard pg_stat_statements should
try to fuzz that if they are.

Then, to Greg's point, there's a question of whether transformArrayExpr
should do likewise, ie try to produce an array Const immediately.
I'm a bit less excited about that, but consistency suggests that
we should have it act the same as the IN case.

            regards, tom lane



Re: Improve handling of pg_stat_statements handling of bind "IN" variables

От
Pavel Trukhanov
Дата:
Thanks for your input.

As for real-world applications – being a founder of a server monitoring saas (okmeter) I have access to stats on hundreds of postgres installations.

It shows that IN with a variable number of params is ~7 times more used than ANY(array).


On Wed, Jun 26, 2019 at 11:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Greg Stark <stark@mit.edu> writes:
> Actually thinking about this for two more seconds the question is what it
> would do with a query like
> WHERE col = ANY '1,2,3'::integer[]
> Or
> WHERE col = ANY ARRAY[1,2,3]
> Whichever the language binding that is failing to do parameterized queries
> is doing to emulate them.

Yeah, one interesting question is whether this is actually modeling
what happens with real-world applications --- are they sending Consts,
or Params?

I resolutely dislike the idea of marking arrays that came from IN
differently from other ones; that's just a hack and it's going to give
rise to unexplainable behavioral differences for logically-equivalent
queries.

One idea that comes to me after looking at the code involved is that
it might be an improvement across-the-board if transformAExprIn were to
reduce the generated ArrayExpr to an array Const immediately, in cases
where all the inputs are Consts.  That is going to happen anyway come
plan time, so it'd have zero impact on semantics or query performance.
Doing it earlier would cost nothing, and could even be a net win, by
reducing per-parse-node overhead in places like the rewriter.

The advantage for the problem at hand is that a Const that's an array
of 2 elements is going to look the same as a Const that's any other
number of elements so far as pg_stat_statements is concerned.

This doesn't help of course in cases where the values aren't all
Consts.  Since we eliminated Vars already, the main practical case
would be that they're Params, leading us back to the previous
question of whether apps are binding queries with different numbers
of parameter markers in an IN, and how hard pg_stat_statements should
try to fuzz that if they are.

Then, to Greg's point, there's a question of whether transformArrayExpr
should do likewise, ie try to produce an array Const immediately.
I'm a bit less excited about that, but consistency suggests that
we should have it act the same as the IN case.

                        regards, tom lane

Re: Improve handling of pg_stat_statements handling of bind "IN" variables

От
Dmitry Dolgov
Дата:
> On Thu, Oct 3, 2019 at 3:33 AM Pavel Trukhanov <pavel.trukhanov@gmail.com> wrote:
>
>> On Wed, Jun 26, 2019 at 11:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Greg Stark <stark@mit.edu> writes:
>> > Actually thinking about this for two more seconds the question is what it
>> > would do with a query like
>> > WHERE col = ANY '1,2,3'::integer[]
>> > Or
>> > WHERE col = ANY ARRAY[1,2,3]
>> > Whichever the language binding that is failing to do parameterized queries
>> > is doing to emulate them.
>>
>> Yeah, one interesting question is whether this is actually modeling
>> what happens with real-world applications --- are they sending Consts,
>> or Params?
>>
>> I resolutely dislike the idea of marking arrays that came from IN
>> differently from other ones; that's just a hack and it's going to give
>> rise to unexplainable behavioral differences for logically-equivalent
>> queries.
>
> Thanks for your input.
>
> As for real-world applications – being a founder of a server monitoring saas
> (okmeter) I have access to stats on hundreds of postgres installations.
>
> It shows that IN with a variable number of params is ~7 times more used than
> ANY(array).

Hi,

I would like to do some archaeology and inquire about this thread, since
unfortunately there was no patch presented as far as I see.

IIUC the ideas suggested in this thread are evolving mostly about modifying
parser:

> On Fri, Jun 14, 2019 at 2:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I do not think you need new expression infrastructure. IMO what's going on
> here is that we're indulging in premature optimization in the parser. It
> would be better from a structural standpoint if the output of parse analysis
> were closer to what the user wrote, and then the business of separating Vars
> from Consts and reducing the Consts to an array were handled in the planner's
> expression preprocessing phase.
>
> So maybe what you should be thinking about is a preliminary patch that's
> mostly in the nature of refactoring, to move that processing to where it
> should be.
>
> Of course, life is never quite that simple; there are at least two
> issues you'd have to think about.
>
> * The parsing phase is responsible for determining the semantics of
> the query, in particular resolving the data types of the IN-list items
> and choosing the comparison operators that will be used.  The planner
> is not allowed to rethink that.  What I'm not clear about offhand is
> whether the existing coding in parse analysis might lead to different
> choices of data types/operators than a more straightforward approach
> does.  If so, we'd have to decide whether that's okay.
>
> * Postponing this work might make things slower overall, which wouldn't
> matter too much for short IN-lists, but you can bet that people who
> throw ten-thousand-entry IN-lists at us will notice.  So you'd need to
> keep an eye on efficiency and make sure you don't end up repeating
> similar processing over and over.

This puzzles me, since the original issue sounds like a "representation"
problem, when we want to calculate jumble hash in a way that obviously
repeating parameters or constants are hashed into one value. I see the point in
ideas like this:

>> One idea that comes to me after looking at the code involved is that
>> it might be an improvement across-the-board if transformAExprIn were to
>> reduce the generated ArrayExpr to an array Const immediately, in cases
>> where all the inputs are Consts.  That is going to happen anyway come
>> plan time, so it'd have zero impact on semantics or query performance.
>> Doing it earlier would cost nothing, and could even be a net win, by
>> reducing per-parse-node overhead in places like the rewriter.
>>
>> The advantage for the problem at hand is that a Const that's an array
>> of 2 elements is going to look the same as a Const that's any other
>> number of elements so far as pg_stat_statements is concerned.
>>
>> This doesn't help of course in cases where the values aren't all
>> Consts.  Since we eliminated Vars already, the main practical case
>> would be that they're Params, leading us back to the previous
>> question of whether apps are binding queries with different numbers
>> of parameter markers in an IN, and how hard pg_stat_statements should
>> try to fuzz that if they are.
>>
>> Then, to Greg's point, there's a question of whether transformArrayExpr
>> should do likewise, ie try to produce an array Const immediately.
>> I'm a bit less excited about that, but consistency suggests that
>> we should have it act the same as the IN case.

Interestingly enough, something similar was already mentioned in [1]. But no
one jumped into this, probably due to its relative complexity, lack of personal
time resources or not clear way to handle Params (I'm actually not sure about
the statistics for Consts vs Params myself and need to check this, but can
easily imagine both could be an often problem).

Another idea also was mentioned in [1]:

> I wonder whether we could improve this by arranging things so that both
> Consts and Params contribute zero to the jumble hash, and a list of these
> things also contributes zero, regardless of the length of the list.

Taking everything into account, is there anything particularly wrong about
approach of squashing down lists of constants/parameters in pg_stat_statements
itself? This sounds simpler, and judging from my experiments even preventing
jumbling of ArrayExpr and rte values constants of the same type with a position
index above some threshold will already help a lot in many cases that I
observe.

[1]:
https://www.postgresql.org/message-id/flat/CAM3SWZSpdPB3uErnXWMt3q74y0r%2B84ZsOt2U3HKKes_V7O%2B0Qg%40mail.gmail.com



Re: Improve handling of pg_stat_statements handling of bind "IN" variables

От
Pavel Trukhanov
Дата:
Hey, let me know if there's any way I can help.

I would argue that making even a small improvement here would be beneficial to many.

On Tue, Jul 21, 2020 at 11:59 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Thu, Oct 3, 2019 at 3:33 AM Pavel Trukhanov <pavel.trukhanov@gmail.com> wrote:
>
>> On Wed, Jun 26, 2019 at 11:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Greg Stark <stark@mit.edu> writes:
>> > Actually thinking about this for two more seconds the question is what it
>> > would do with a query like
>> > WHERE col = ANY '1,2,3'::integer[]
>> > Or
>> > WHERE col = ANY ARRAY[1,2,3]
>> > Whichever the language binding that is failing to do parameterized queries
>> > is doing to emulate them.
>>
>> Yeah, one interesting question is whether this is actually modeling
>> what happens with real-world applications --- are they sending Consts,
>> or Params?
>>
>> I resolutely dislike the idea of marking arrays that came from IN
>> differently from other ones; that's just a hack and it's going to give
>> rise to unexplainable behavioral differences for logically-equivalent
>> queries.
>
> Thanks for your input.
>
> As for real-world applications – being a founder of a server monitoring saas
> (okmeter) I have access to stats on hundreds of postgres installations.
>
> It shows that IN with a variable number of params is ~7 times more used than
> ANY(array).

Hi,

I would like to do some archaeology and inquire about this thread, since
unfortunately there was no patch presented as far as I see.

IIUC the ideas suggested in this thread are evolving mostly about modifying
parser:

> On Fri, Jun 14, 2019 at 2:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I do not think you need new expression infrastructure. IMO what's going on
> here is that we're indulging in premature optimization in the parser. It
> would be better from a structural standpoint if the output of parse analysis
> were closer to what the user wrote, and then the business of separating Vars
> from Consts and reducing the Consts to an array were handled in the planner's
> expression preprocessing phase.
>
> So maybe what you should be thinking about is a preliminary patch that's
> mostly in the nature of refactoring, to move that processing to where it
> should be.
>
> Of course, life is never quite that simple; there are at least two
> issues you'd have to think about.
>
> * The parsing phase is responsible for determining the semantics of
> the query, in particular resolving the data types of the IN-list items
> and choosing the comparison operators that will be used.  The planner
> is not allowed to rethink that.  What I'm not clear about offhand is
> whether the existing coding in parse analysis might lead to different
> choices of data types/operators than a more straightforward approach
> does.  If so, we'd have to decide whether that's okay.
>
> * Postponing this work might make things slower overall, which wouldn't
> matter too much for short IN-lists, but you can bet that people who
> throw ten-thousand-entry IN-lists at us will notice.  So you'd need to
> keep an eye on efficiency and make sure you don't end up repeating
> similar processing over and over.

This puzzles me, since the original issue sounds like a "representation"
problem, when we want to calculate jumble hash in a way that obviously
repeating parameters or constants are hashed into one value. I see the point in
ideas like this:

>> One idea that comes to me after looking at the code involved is that
>> it might be an improvement across-the-board if transformAExprIn were to
>> reduce the generated ArrayExpr to an array Const immediately, in cases
>> where all the inputs are Consts.  That is going to happen anyway come
>> plan time, so it'd have zero impact on semantics or query performance.
>> Doing it earlier would cost nothing, and could even be a net win, by
>> reducing per-parse-node overhead in places like the rewriter.
>>
>> The advantage for the problem at hand is that a Const that's an array
>> of 2 elements is going to look the same as a Const that's any other
>> number of elements so far as pg_stat_statements is concerned.
>>
>> This doesn't help of course in cases where the values aren't all
>> Consts.  Since we eliminated Vars already, the main practical case
>> would be that they're Params, leading us back to the previous
>> question of whether apps are binding queries with different numbers
>> of parameter markers in an IN, and how hard pg_stat_statements should
>> try to fuzz that if they are.
>>
>> Then, to Greg's point, there's a question of whether transformArrayExpr
>> should do likewise, ie try to produce an array Const immediately.
>> I'm a bit less excited about that, but consistency suggests that
>> we should have it act the same as the IN case.

Interestingly enough, something similar was already mentioned in [1]. But no
one jumped into this, probably due to its relative complexity, lack of personal
time resources or not clear way to handle Params (I'm actually not sure about
the statistics for Consts vs Params myself and need to check this, but can
easily imagine both could be an often problem).

Another idea also was mentioned in [1]:

> I wonder whether we could improve this by arranging things so that both
> Consts and Params contribute zero to the jumble hash, and a list of these
> things also contributes zero, regardless of the length of the list.

Taking everything into account, is there anything particularly wrong about
approach of squashing down lists of constants/parameters in pg_stat_statements
itself? This sounds simpler, and judging from my experiments even preventing
jumbling of ArrayExpr and rte values constants of the same type with a position
index above some threshold will already help a lot in many cases that I
observe.

[1]: https://www.postgresql.org/message-id/flat/CAM3SWZSpdPB3uErnXWMt3q74y0r%2B84ZsOt2U3HKKes_V7O%2B0Qg%40mail.gmail.com