Обсуждение: pg_stat_statements normalization: re-review
Hello again, I'm reviewing the revised version of pg_stat_statements again. In particular, this new version has done away with the mechanical but invasive change to the lexing that preserved *both* the position and length of constant values. (See http://archives.postgresql.org/message-id/CAEYLb_WGeFCT7MfJ8FXf-CR6BSE6Lbn+O1VX3+OGrc4Bscn4=A@mail.gmail.com) I did the review front-matter again (check against a recent version, make sure it does what it says it'll do, et al..) and did trigger an assertion failure that seems to be an oversight, already reported to Peter. I found that oversight by running the SQLAlchemy Python query-generation toolkit's tests, which are voluminous. The functionality is seemingly mostly unchanged. What I'm going to do here is focus on the back-end source changes that are not part of the contrib. The size of the changes are much reduced, but their semantics are also somewhat more complex...so, here goes. Conclusion first: * The small changes to hashing are probably not strictly required, unless collisions are known to get terrible. * The hook to analyze is pretty straight-forward and seem like the other hooks * The addition of a field to scribble upon in the Query and Plan nodes is something I'd like to understand better, as these leave me with a bit of disquiet. What follows is in much more detail, and broken up by functionality and file grouping in the backend. src/include/access/hash.h | 4 ++-src/backend/access/hash/hashfunc.c | 21 ++++++++++--------- This adjusts the hash_any procedure to support returning two possible precisions (64 bit and 32 bit) by tacking on a Boolean flag to make the precision selectable. The hash_any operator is never used directly, and instead via macro, and a second macro to support 64-bit precision has been added. It seems a bit wonky to me to use a flag to select this rather than encapsulating the common logic in a procedure and just break this up into three symbols, though. I think the longer precision is used to try to get fewer collisions with not too much slowdown. Although it may meaningfully improve the quality of the hashing for pg_stat_statements or living without the extra hashing bits might lead to unusable results (?), per the usual semantics of hashing it is probably not *strictly* necessary. A smidgen of avoidable formatting niggles (> 80 columns) are in this section. src/backend/nodes/copyfuncs.c | 5 ++++src/backend/nodes/equalfuncs.c | 4 +++src/backend/nodes/outfuncs.c | 10 +++++++++src/backend/optimizer/plan/planner.c | 1 +src/backend/nodes/readfuncs.c | 12 +++++++++++src/include/nodes/parsenodes.h | 3 ++src/include/nodes/plannodes.h | 2 + These have all been changed in the usual manner to support one new field, the queryId, on the toplevel Plan and Query nodes. The queryId is 64-bit field copied from queries to plans to tie together plans "to be used by plugins" (quoth the source) as they flow through the system. pg_stat_statements fills them with the 64-bit hash of the query text -- a reasonable functionality to possess, I think, but the abstraction seems iffy: plugins cannot compose well if they all need to use the queryId for different reasons. Somehow I get the feeling that this field can be avoided or its use case can be satisfied in a more satisfying way. Mysteriously, in the Query representation the symbol uses an underscored-notation (query_id) and in the planner it uses a camelcase one, queryId. Overall, the other fields in those structs all use camelcase, so I'd recommend normalizing it. src/backend/parser/analyze.c | 37 ++++++++++++++++++++++++++++++---src/include/parser/analyze.h | 12+++++++++++ These changes implement hooks for the once-non-hookable symbols parse_analyze_varparams and parse_analyze, in seemingly the same way they are implemented in other hooks in the system. These are neatly symmetrical with the planner hook. I only ask if there is a way to have one hook and not two, but I suppose that would be a similar question as "why is are there two ways to symbols to enter into parsing, and not one". -- fdr
On 23 February 2012 09:58, Daniel Farina <daniel@heroku.com> wrote:
> What I'm going to do here is focus on the back-end source changes that
> are not part of the contrib.  The size of the changes are much
> reduced, but their semantics are also somewhat more complex...so, here
> goes.  Conclusion first:
>
> * The small changes to hashing are probably not strictly required,
> unless collisions are known to get terrible.
I imagine that collisions would be rather difficult to demonstrate at
all with a 32-bit value.
> * The addition of a field to scribble upon in the Query and Plan nodes
> is something I'd like to understand better, as these leave me with a
> bit of disquiet.
>
> What follows is in much more detail, and broken up by functionality
> and file grouping in the backend.
>
>  src/include/access/hash.h            |    4 ++-
>  src/backend/access/hash/hashfunc.c   |   21 ++++++++++---------
>
> This adjusts the hash_any procedure to support returning two possible
> precisions (64 bit and 32 bit) by tacking on a Boolean flag to make
> the precision selectable.  The hash_any operator is never used
> directly, and instead via macro, and a second macro to support 64-bit
> precision has been added.  It seems a bit wonky to me to use a flag to
> select this rather than encapsulating the common logic in a procedure
> and just break this up into three symbols, though.  I think the longer
> precision is used to try to get fewer collisions with not too much
> slowdown.  Although it may meaningfully improve the quality of the
> hashing for pg_stat_statements or living without the extra hashing
> bits might lead to unusable results (?), per the usual semantics of
> hashing it is probably not *strictly* necessary.
It might be unnecessary. It's difficult to know where to draw the line.
> A smidgen of avoidable formatting niggles (> 80 columns) are in this section.
>
>  src/backend/nodes/copyfuncs.c        |    5 ++++
>  src/backend/nodes/equalfuncs.c       |    4 +++
>  src/backend/nodes/outfuncs.c         |   10 +++++++++
>  src/backend/optimizer/plan/planner.c |    1 +
>  src/backend/nodes/readfuncs.c        |   12 +++++++++++
>  src/include/nodes/parsenodes.h       |    3 ++
>  src/include/nodes/plannodes.h        |    2 +
I'll look into those.
> These have all been changed in the usual manner to support one new
> field, the queryId, on the toplevel Plan and Query nodes.  The queryId
> is 64-bit field copied from queries to plans to tie together plans "to
> be used by plugins" (quoth the source) as they flow through the
> system.  pg_stat_statements fills them with the 64-bit hash of the
> query text -- a reasonable functionality to possess, I think, but the
> abstraction seems iffy: plugins cannot compose well if they all need
> to use the queryId for different reasons.  Somehow I get the feeling
> that this field can be avoided or its use case can be satisfied in a
> more satisfying way.
Maybe the answer here is to have a simple mechanism for modules to
claim the exclusive right to be able to set that flag?
> Mysteriously, in the Query representation the symbol uses an
> underscored-notation (query_id) and in the planner it uses a camelcase
> one, queryId.  Overall, the other fields in those structs all use
> camelcase, so I'd recommend normalizing it.
Fair enough.
There is one other piece of infrastructure that I think is going to
need to go into core that was not in the most recent revision. I
believe that it can be justified independently of this work.
Basically, there are some very specific scenarios in which the
location of Const nodes is incorrect, because the core system makes
that location the same location as another coercion-related node. Now,
this actually doesn't matter to the positioning of the caret in most
error messages, as they usually ought to be the same anyway, as in
this scenario:
select 'foo'::integer;
Here, both the Const and coercion node's location start from the
SCONST token - no problem there.
However, consider this query:
select cast('foo' as integer);
and this query:
select integer 'foo';
Now, the location of the Const and Coercion nodes is *still* the same,
but starting from the location of the "cast" in the first example and
"integer" in the second.
I believe that this is a bug. They should have different locations
here, so that I can always rely on the location of a Const having a
single token that is one of only a handful of possible CONST tokens,
as I currently can in all other scenarios.
Of course, this didn't particularly matter before now, as the Const
location field only dictated caret position in messages generated from
outside the parser. Any error message should be blaming the Coercion
node if there's a problem with coercion, so there's no reason why this
needs to break any error context messages. If a message wants to blame
a Const on something, surely it should have a caret placed in the
right location - the location of the const token. If it wants to blame
the Coercion node on something, that won't change, so the context
message will be the same as before.
It's my intention that the next revision, which I'll get out in the
next couple of days, will have these additional changes to the core
system, and will be able to run the sqlalchemy test suite without that
assertion failure - I might get around to running a variety of tests
from different popular frameworks/ORMs.
I had hoped that some committer would agree that this was a useful
change even without this patch, and go ahead and commit the necessary
changes, but I can appreciate that everyone is quite busy at this time
of the cycle and so I'm not relying on that happening.
--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
			
		On 23 February 2012 11:09, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 23 February 2012 09:58, Daniel Farina <daniel@heroku.com> wrote: >> * The small changes to hashing are probably not strictly required, >> unless collisions are known to get terrible. > > I imagine that collisions would be rather difficult to demonstrate at > all with a 32-bit value. Assuming that the hash function exhibits a perfectly uniform distribution of values (FWIW, hash_any is said to exhibit the avalanche effect), the birthday problem provides a mathematical basis for estimating the probability of some 2 queries being alike. Assuming a population of 1,000 queries are in play at any given time (i.e. the default value of pg_stat_statements.max) and 2 ^ 32 "days of the year", that puts the probability of a collision at a vanishingly small number. I cannot calculate the number with Gnome calculator. Let's pretend that 2 ^ 32 is exactly 42 million, rather than approximately 4.2 *billion*. Even then, the probability of collision is a minuscule 0.000001857 . I'd have to agree that a uint32 hash is quite sufficient here. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Feb 23, 2012 at 3:09 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> These have all been changed in the usual manner to support one new
>> field, the queryId, on the toplevel Plan and Query nodes.  The queryId
>> is 64-bit field copied from queries to plans to tie together plans "to
>> be used by plugins" (quoth the source) as they flow through the
>> system.  pg_stat_statements fills them with the 64-bit hash of the
>> query text -- a reasonable functionality to possess, I think, but the
>> abstraction seems iffy: plugins cannot compose well if they all need
>> to use the queryId for different reasons.  Somehow I get the feeling
>> that this field can be avoided or its use case can be satisfied in a
>> more satisfying way.
>
> Maybe the answer here is to have a simple mechanism for modules to
> claim the exclusive right to be able to set that flag?
As I see it, the queryId as the most contentious internals change in
the patch, as queryId is not really an opaque id tracking a query's
progression, but rather a bit of extra space for query jumbles, and
query jumbles only, as far as I can tell: thus, not really a generally
useful backend service, but rather specifically for pg_stat_statements
-- that may not take such a special field off the table (more
opinionated people will probably comment on that), but it's a pretty
awkward kind of inclusion.
I'm posting an experimental patch in having the backend define state
whose only guarantee is to be equal between a PlannedStmt and a Query
node that derived it, and its effect on pg_stat_statements.  (the
latter not included in this email, but available via git[0])  The
exact mechanism I use is pretty questionable, but my skimming of the
entry points of the code suggests it might work: I use the pointer to
the Query node from the PlannedStmt, which is only unique as long as
the Query (head) node remains allocated while the PlannedStmt is also
alive.  A big if, but able to be verified with assertions in debug
mode ("does the pointer see poisoned memory?") and low overhead
presuming one had to allocate memory already.  However, the
abstraction is as such that if the key generation needs to change
compliant consumers should be fine.
At ExecutorFinish (already hookable) all NodeKeys remembered by an
extension should be invalidated, as that memory is free and ready to
be used again.
As Query node availability from the PlannedStmt is not part of the
node for a reason, it is rendered as an opaque uintptr_t, and hidden
behind a type that only is intended to define equality and
"definedness" (!= NULL is the implementation).
The clear penalty is that the extension must be able to map from the
arbitrary, backend-assigned queryId (which I have renamed nodeKey just
to make it easier to discuss) to its own internal value it cares
about: the query jumble.  I used a very simple, fixed-size association
array with a small free-space location optimization, taking advantage
of the property in pg_stat_statements can simply not process a query
if it doesn't want to (but should process most of them so one can get
a good sense of what is going on).
A credible(?) alternative I have thought of is to instead expose a
memory context for use by extensions on the interesting nodes; in
doing so extensions can request space and store whatever they need.
The part that I feel might be tricky is figuring out a reasonable and
appropriate time to free that 'extension-context' and what context
that context should live under.  But I'd be wiling to try it with
haste if it sounds a lot more credible or useful than what I've got.
--
fdr
[0]: https://github.com/fdr/postgres/tree/wrench-pgss
			
		Вложения
On Fri, Feb 24, 2012 at 3:14 AM, Daniel Farina <daniel@heroku.com> wrote: > At ExecutorFinish (already hookable) all NodeKeys remembered by an > extension should be invalidated, as that memory is free and ready to > be used again. I think this statement is false; I thought it was true because *not* invalidating gives me correct-appearing but unsound results. Firstly, it should be ExecutorEnd, not ExecutorFinish, but this doesn't work because those run in the exec_simple_query path twice: once in the planner and then again in PortalCleanup, it's important that invalidation only occur at the end, and not twice. So there's a problem of a kind resulting from this experiment. -- fdr